Chapter 1
Computer Systems Concepts and Processor Architecture
N. Alexandridis Computer Systems Architecture: Microprocessor-Based Designs
Figure 1.1: Advances in processor and memory chip characteristics in the first 25 years.
Figure 1.1 depicts the chronological evolution of the major characteristics of processor
and memory chips in these first 25 years. (The numbers given in parentheses are the
exceptional, rather than the usual, cases.) We notice that the microprocessor chip itself
has grown, with a larger number of pins (from 16 input/output pins in 1972 to more than
550 pins in 1996) and a higher integration level, i.e., the number of transistors packed
on a single processor chip (from 2K transistors in 1972 to more than 7M transistors in
1996). A larger number of transistors permits the design of wider-wordlength2
microprocessors (64-bit wordlengths in 1996), while a larger number of pins permits wider
external buses to interface the processor with other components of the system (for
example, wider address buses of 36-bit width and wider data buses of 128-bit width). The
integration level of the memory chip has increased even more dramatically, from only 1K
bits per memory chip in 1972 to 64M bits in 1996. Finally, the processor clock cycles per
instruction, or CPIP (an important figure of merit that provides insight into processor
performance, discussed below), has decreased from around 20 processor clock cycles in
1972 (i.e., it took 20 processor clock cycles to completely execute an instruction in
1972) to 0.25 clock cycles in 1996 superscalar processors (which means that such a
processor can complete four instructions every processor clock cycle).

1
Which means a system “clock cycle time” of 1/300 MHz = 3.33 ns.
The performance of the computer system is the metric on which comparative evaluations
and purchasing decisions are based. Usually, buying the best performance possible
offers the longest useful life and increases the chances that the system will also be
able to handle the more complex programs of tomorrow. Measuring performance, however, is
not always simple; furthermore, performance is very tightly related to the type of jobs the
computer system is to be used for (i.e., how the computer is to be used).
When we refer to the performance of a computer system, we usually mean the
“time” it takes to execute a program or task; i.e., the execution time (also called
“turnaround time”, “elapsed time”, or “response time”). However, if we examine this
“execution time” more closely, we observe that it is made up of a number of time components
including: the “CPU or processor time” (the time the processor spends in executing both
user and system tasks), the “memory time” (the time spent for accesses to external main
memory), time periods that have to do with user-task disk-accesses and waiting for I/O
(during which time the processor is used by other tasks in a multitasking environment),
and all the “operating system (OS) overhead” (time spent executing OS instructions.) To
simplify matters here, we will neglect all I/O and OS issues and focus on part of the
“system” made up of the processor and main memory only. We can then define the
processor time TP as the user-CPU time (the time spent by the processor on internal
operations for the user task), and the system time TS as the sum of the processor time TP
and the time spent by the task on main-memory accesses.
System performance then is measured by the total execution time TS the computer system
needs to execute a task, expressed as:

    TS (total system time per task) = (total system clock cycles for a task) * (system clock cycle time CS)
2
A microprocessor with a “wordlength equal to n” is also referred to as an n-bit
microprocessor, defined as a processor with n-bit-wide internal data registers and an
n-bit-wide ALU (arithmetic logic unit) which carries out operations on n-bit input
operands.
The “total system clock cycles for a task” term in the above equation equals:

    (total system clock cycles for a task) = (NI) * (CPIS)
where CPIS is the average number of system clock cycles per instruction over all
instruction types. Different machine instructions will require different numbers of system
clock cycles to execute, but for a computer system with a given instruction set we can
calculate its average number of system clock cycles per instruction or CPIS over all
instruction types, provided we know their frequencies of appearance in a task, as follows
[21]:
CPIS = (total system clock cycles for a task) / (number of instructions, NI, in the task)

     = Σ (i = 1 to n) [CPIi * NIi] / NI

     = Σ (i = 1 to n) [CPIi * (NIi / NI)]                                    (1.3)
where:
NIi represents the number of times instruction i is executed in a program, and
CPIi represents an average number of system clock cycles for instruction i,
measured from a large amount of program code over a long period of time.
The system CPIS in Equation 1.3 is calculated by multiplying each individual CPIi by the
fraction of occurrences of that instruction in a program (NI i / NI).
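Equation 1.3 is easy to check with a small calculation. The sketch below uses a hypothetical instruction mix (the counts and per-instruction CPIi values are made up for illustration) to compute the weighted-average CPIS:

```python
# Equation 1.3 as a calculation: CPIS is the weighted average of the per-instruction
# CPIi values, weighted by each instruction's frequency of occurrence (NIi / NI).
# The instruction mix below is hypothetical, for illustration only.

def average_cpi(mix):
    """mix: list of (count NIi, cycles CPIi) pairs, one per instruction type."""
    total_instructions = sum(ni for ni, _ in mix)        # NI
    total_cycles = sum(ni * cpi for ni, cpi in mix)      # total clock cycles for the task
    return total_cycles / total_instructions             # CPIS

# Hypothetical task: 50,000 ALU operations at 1 cycle each, 30,000 loads/stores
# at 3 cycles, and 20,000 branches at 2 cycles.
mix = [(50_000, 1), (30_000, 3), (20_000, 2)]
print(average_cpi(mix))   # 1.8 system clock cycles per instruction
```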
Combining the two expressions above, the total system execution time becomes:

    TS (total system time per task) = (NI) * (CPIS) * (CS)                   (1.4)

where:
NI = number of instructions executed for a task (the “instruction path length”)
CPIS = average number of system clock cycles per instruction
CS = system clock cycle time (or, simply, clock cycle)3
3
The system clock cycle equals the inverse of the system clock frequency applied as input
to the microprocessor chip; thus, from Eq. 1.4 it is concluded that it is erroneous to
compare computer performance by using only the processor’s megahertz rating or clock
speed. For example, a Pentium processor running at 75 MHz easily outperforms an Intel
486DX2 processor running at the higher frequency of 100 MHz.
Overall system performance increases by reducing one or more of the terms in Equation
1.4. Designing a more balanced instruction set and using optimized compiler
technologies can reduce the term NI. The system clock cycle CS depends upon the
overall system architecture, balanced system implementation assumptions, processor-
memory bandwidth, and processor speed. The term CPIS includes both a number of
“processor clock cycles” (during which the processor executes internal operations4
required by the instruction) and a number of “external memory references” (during which
references to main memory are performed5 as needed by the instruction). Since during
such accesses to main memory the internal pipeline of the processor usually “stalls”
waiting for the operand, these cycles are also referred to as the “processor stall cycles”.
Thus,

    CPIS = CPIP + (average number of processor stall cycles per instruction)  (1.5)
Because most processors nowadays include on-chip instruction and data caches – and,
therefore, external accesses to memory are performed only on “cache misses”6 –
Equation 1.5 can also be expressed as:
CPIS = the “processor CPI” + the “CPI due to cache misses” (the finite cache effect)

     = CPIP + (number of memory accesses per instruction due to cache misses, M) *
              (“cache miss penalty” P, in cycles)

     = CPIP + (M * P)                                                        (1.6)
The second term (M*P) in the above Equation 1.6 (whose reduction will increase system
performance because it will lower CPIS and therefore also lower TS of Equation 1.4) is
heavily dependent upon system rather than processor characteristics (such as the duration
of a bus cycle, processor-to-memory bandwidth, delays of any interface components
between the processor and memory, the actual access time of main memory). All these
are described in detail in later Chapters.
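The finite cache effect of Equation 1.6 can be sketched numerically. The miss rate and miss penalty below are hypothetical values, chosen only to show how much the (M * P) term inflates CPIS:

```python
# Equation 1.6: CPIS = CPIP + M * P, where M is the number of memory accesses per
# instruction that miss in the cache and P is the miss penalty in clock cycles.
# All numbers are hypothetical, for illustration only.

def system_cpi(cpi_p, misses_per_instruction, miss_penalty_cycles):
    return cpi_p + misses_per_instruction * miss_penalty_cycles

# A processor with an ideal CPIP of 1.0, 0.05 cache misses per instruction,
# and a 10-cycle miss penalty:
print(system_cpi(1.0, 0.05, 10))   # 1.5 -- cache misses add 50% to the CPI
```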
4
Such as instruction decode, internal register-to-register transfers, etc.
5
Such as starting various “bus cycles” to external memory in order to either fetch an
instruction or operand, or transfer results to memory. As we will see later, a “bus
cycle” usually requires a number of “system clock cycles”.
6
Details on caches and cache misses are given in Chapter 5.
Processor performance is measured by the total time the processor spends in executing
internal operations for a user task. If we now use the “processor clock” instead of the
“system” clock, the processor performance can be given by a formula analogous to that
of the system performance Equation 1.4 as follows:
TP (total processor or CPU time per task) = (NI) * (CPIP) * (CP ) (1.7)
TP, the total internal execution time, is called the processor time or CPU time; it
does not include the time the CPU waits for memory or I/O activities. CP is the internal
“processor clock cycle” (which may be different from the external system clock cycle
CS). The processor CPI, or CPIP, ignores all system issues and represents the average
number of processor clock cycles per instruction assuming no cache misses (i.e., infinite
caches). For example, a “single-issue” processor (a processor with only one internal
pipelined functional unit to which it issues for execution only one instruction at a time)
will have an ideal CPIP = 1 (i.e., no cache misses and one instruction completed per
processor clock cycle); a “multiple-issue” processor of degree, say, 4 (i.e., four internal
pipelined functional units to which 4 instructions can be issued for execution in parallel)
will have an ideal CPIP = 0.25 (i.e., four instructions will be completed per processor clock
cycle7).
Processor performance can be improved by reducing any of the three factors in
Equation 1.7. CP, the processor clock cycle time, is technology-driven and depends on
the VLSI design of the chip. This cycle time is chosen long enough to allow execution of
every basic operation (or microoperation) in only one processor clock cycle; other, more
complex operations will require multiple processor clock cycles for their execution. NI,
the number of instructions per task, is a function of the instruction set design and how
effective an optimized compiler is. Finally, CPIP, the average clock cycles per
instruction, is a direct function of the processor’s internal architecture (instruction
pipelining, instruction issue ability, etc.) and the efficiency of instruction scheduling.
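The three factors of Equation 1.7 can be combined in a short calculation. The task size and clock frequency below are hypothetical:

```python
# Equation 1.7 in code: TP = NI * CPIP * CP. Since CP = 1 / (clock frequency),
# we compute total cycles first and then divide by the frequency.
# The task size and clock rate are hypothetical, for illustration only.

def processor_time(num_instructions, cpi_p, clock_hz):
    total_cycles = num_instructions * cpi_p    # NI * CPIP
    return total_cycles / clock_hz             # dividing by f applies CP = 1/f

# A hypothetical task of 10 million instructions on a 100-MHz processor with CPIP = 1:
t = processor_time(10_000_000, 1.0, 100_000_000)
print(t)   # 0.1 -- the task takes 0.1 seconds of pure CPU time
```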
In general, the speed of the processor itself (which determines the overall system
performance) can be improved by increasing one or more of the following characteristics:
1. Processor (or CPU) wordlength: By making the processor wordlength wider, more
bits can be processed internally in parallel.
2. Processor clock: By using newer, faster technologies, the frequency of the
processor’s internal clock can increase to speed up program execution by the
processor.
3. Level of integration: By packing more functional units on the processor chip (such
as floating-point units, caches, memory management units, etc.), their
interconnection distances and the need for off-chip signaling are decreased,
contributing to an increase in the processor’s operating speed.
4. Architectural advances: Finally, internal architectural advances have been
incorporated in the processor, such as pipelined RISC architectures and
superpipelined or superscalar implementations (to be discussed later in the
Chapter), which effectively reduce the average number of processor clock cycles
it takes to execute an instruction (i.e., reduce the CPIP) and, therefore,
increase processor performance.

7
The actual CPIP is modeled by simulating the structure of the internal pipeline(s) and
measuring instruction-stream execution cycles. For “multiple-issue” processors (i.e.,
superscalar processors), instead of the CPI it is more convenient to use its reciprocal,
the IPC (instructions per clock cycle).
Another metric commonly used for “processor performance” is MIPS (million
instructions per second), which for a given task is approximated by the following formula:

    MIPS = NI / (TP * 10^6) = (processor clock frequency) / (CPIP * 10^6)    (1.8)

This MIPS metric, however, is not a very accurate measure for comparing performance
among computers and should be used with caution [1].
--------------------------------------------------------------------
Example 1-1: Comparing MIPS of various processors
Using Equation 1.8, consider three different processors, each operating with a 100-
MHz internal clock.
We say that a processor which under the best-case conditions requires 2 processor clock
cycles to execute an instruction (i.e., has a CPIP of 2) has a peak performance of 50 MIPS; if the
second processor can execute one instruction per processor clock cycle (i.e., CPIP = 1), it has a
peak performance of 100 MIPS; and if the third processor can execute 3 instructions per
processor clock cycle (CPIP = 1/3 ≈ 0.33), it has a peak performance of 300 MIPS.
--------------------------------------------------------------------
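Example 1-1 can be reproduced with a few lines of code applying Equation 1.8 (clock frequency divided by CPIP times 10^6); the 100-MHz clock and the three CPIP values come from the example itself:

```python
# Equation 1.8 restated: MIPS = (clock frequency in Hz) / (CPIP * 10**6).
# The three processors below reproduce the cases of Example 1-1.

def peak_mips(clock_hz, cpi_p):
    return clock_hz / (cpi_p * 1_000_000)

for cpi in (2, 1, 1/3):    # CPIP of 2, 1, and one-third (3 instructions per cycle)
    print(round(peak_mips(100_000_000, cpi)))   # 50, 100, 300
```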
1.1.4 Benchmarks
System performance, however, is truly measured only by executing the actual, specific
application(s) to be run on the system. Since quite often this is not practical or possible,
one uses benchmarks instead.
A benchmark is a software program (or a suite of programs) that measures the
performance of a computer system or of just parts of it. Since there are
many benchmarks, one should be careful to use those that reflect the way the computer is
going to be used and that are representative of the type of applications to be run on the system.
It is also important to use up-to-date benchmarks, which are able to evaluate
all the newer and more sophisticated features of today’s computer systems, such as pre-
emptive multitasking 32-bit operating systems, larger (graphic- and video-intensive)
applications, superscalar microprocessor architectures, multithreaded applications,
multiprocessor (parallel) systems, etc.
The RISC approach (also called “streamlined architecture” [10]) was introduced to
increase the performance of the basic CPU by advocating new, simpler, fixed-length,
register-oriented instruction sets. The basic principle that drives the RISC approach is
that processor performance can increase by keeping instructions simple.
RISC processors have lower CPIP than CISC processors because the RISC
approach uses simpler, fixed-length instruction formats which allow for faster hardwired
instruction decoding and greatly simplify the use of internal instruction pipelining; as a
result, the number of processor clock cycles needed to execute an instruction can (ideally)
be reduced to 1 (i.e., CPIP = 1). CP, the processor clock cycle, is also shorter for a RISC than
for a CISC processor, because of the former’s architectural simplicity and because the
majority of RISC instructions are of the register-to-register type (that do not need to
access external main memory during their execution; all operands are in internal
registers). Thus, the RISC approach improves program performance by reducing the
last two terms in Equation 1.7. The NI, however, is usually larger for RISCs: the ratio
of the number of instructions for a RISC versus those for a CISC processor is, on
average, around 1.8 to 2.
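A back-of-the-envelope comparison along these lines can be sketched with Equation 1.7. All numbers below are hypothetical: a CISC with a CPIP of 4 and a 10-ns clock versus a RISC that executes 1.9 times as many instructions with a CPIP of 1 and an 8-ns clock:

```python
# A back-of-the-envelope RISC-versus-CISC comparison using Equation 1.7
# (TP = NI * CPIP * CP). All the numbers are hypothetical, for illustration only.

def exec_time_ns(ni, cpi_p, clock_ns):
    return ni * cpi_p * clock_ns

cisc = exec_time_ns(1_000_000, 4, 10)    # 1M instructions, CPIP = 4, 10-ns clock
risc = exec_time_ns(1_900_000, 1, 8)     # 1.9x the instructions, CPIP = 1, 8-ns clock
print(cisc / risc)   # ~2.6 -- despite the larger NI, the RISC still finishes faster
```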
In general, however, CISC and RISC are not really two actual, different, specific
architectures; instead, they represent two different approaches or paradigms to processor
(CPU) design that can be utilized by any processor architecture. A good microprocessor
design would have to trade off and select which attributes of RISC and CISC to combine.
As a matter of fact, the two approaches and their execution cycle times are now
beginning to converge toward the design of hybrid CPUs. CISC microprocessor designs
have adopted features which were characteristic of RISCs, such as internal parallelism,
internal Harvard architecture, and pipelined execution units, which allow frequently used
instructions to execute in a single processor clock cycle; on the other hand, RISC
microprocessors have become more complex, with on-chip MMUs (Memory
Management Units) and FPUs (Floating-Point Units), have adopted a higher degree of
internal pipelining and parallelism, and have implemented a number of additional not-so-
simple instructions.
When it comes, however, to the methodology of designing a computer system,
there are not really too many significant differences when using a RISC or a CISC
microprocessor. Figure 1.2 shows the interconnection with memory of a conventional
CISC processor with a single external bus and that of a RISC processor that has separate
external buses for the instructions and data. (In both cases there is another set of bus
lines that make up the “control bus”. The control bus is used by the processor to send
control signals to the other modules of the system and receive from them feedback
signals in order to properly orchestrate the system’s operation.) The rules for interfacing
various units to the processor are almost the same except that in the case of processors
with external Harvard architecture, data and instruction memories are separate and their
speed requirements may be different from those of a conventional system configuration.
As we will see in the next Section, most of the remaining control signals are similar for
RISCs and CISCs. It will also become apparent at the end of the Chapter that a more
interesting question is not CISCs versus RISCs but which of the following processor
implementations is preferable: “superpipelined”, “superscalar”, “very long instruction
word”, or any combination of the above.
As far as the design of a computer system goes, in this textbook we will treat both
RISC and CISC processors in a unified fashion.
[Fig. 1.2a: Conventional “processor bus”: a processor (CPU) connected to memory
unit(s) over a physical address bus and a data bus, which together form the processor
external physical bus (or the “memory bus”).]
Figure 1.3 shows a RISC processor with one external bus and a separate
ICACHE (instruction cache) and DCACHE (data cache). The processor contains on-
chip all the hardware logic needed to control these external caches (the cache controller),
and these caches are then mainly static-RAM chips connected directly to processor I/O
pins (to a cache bus.) Main memory is connected using a separate local bus; local bus
control logic (a bus “bridge”) converts the local bus to the bus connected directly to the
processor’s pins.
Finally, Figure 1.3c shows a RISC processor with external Harvard architecture
that defines two external buses: one for data (the data bus, which includes lines for
transferring the address of the data and separate lines to transfer the data itself) and one
for instructions (the instruction bus, which includes lines for transferring the address of
8
A “unified” or “integrated” cache is a cache that stores both data/operands and
instructions.
the instruction and separate lines to transfer the instruction itself); these buses require
their own separate sets of synchronization signals. The two buses of the Harvard
architecture are used solely to support the LOAD and STORE instructions. If we assume
that a LOAD/STORE takes two system clock cycles to execute (one to calculate the
memory address and one to do the actual transfer to/from memory), during the second
clock cycle the prefetch unit of a non-Harvard-architecture processor would not be able
to overlap the fetching of the next instruction with the data transfer of the
LOAD/STORE, because the single data bus lines are already busy (doing the transfer of
data to/from memory.)
Figure 1.4 shows the external view of CISC and RISC microprocessor chips and their
I/O signals. Their detailed explanation and use in examples will be given in Chapter 2.
In this Section we only present an outline of the address and data buses and
some of the most common control signals in Figure 1.4. The input/output pins of the
microprocessor altogether define what is called the processor bus or, sometimes, the
component-level bus.
[Figure 1.3: A processor (CPU) with an on-chip L1 cache, MMU, and L2 cache controller,
connected over a dedicated cache bus to an external L2 cache; a “local bus” connects the
physical instruction bus and the data bus to the instruction cache (ICACHE), the data
cache (DCACHE), and the data and instruction memories.]
9
That’s why earlier microprocessors, which could not provide enough pins, were forced to
multiplex their external bus lines.
[Figure 1.4: External views of (a) a CISC processor and (b) a RISC processor, showing
their bus arbitration signals, interrupt signals (including the “INTA” interrupt
acknowledge), and the *DTACK (Data Transfer Acknowledge) signal.]
--------------------------------------------------------------------
Example 1-2: Variations on external address and data buses
One distinction among processor address buses has to do with how many bits of the
actual physical address the processor places on its external address bus. Some processors issue
on the address bus a byte-address along with signals to indicate the width of the operand to be
transferred, while others issue the most significant part of the byte-address (representing for
example a 32-bit doubleword-address or a 64-bit quadword-address) along with “byte-enable10”
control signals that specify the byte-section(s) of memory to be activated (enabled) during the
current data transfer.
10
As we will see later in more detail, a “byte-enable” signal selects one “byte-section”
of a memory: for example, a “16-bit memory” is composed of two “byte-sections” that
require two byte-enables; a “32-bit memory” of four “byte-sections” that require four
byte-enables; etc.
11
The pound symbol (#) after a signal designates negative logic and corresponds to the
overbar used over the signal. Quite often, usually for industry-standard “system buses,”
instead of the pound symbol (#) the star (*) or the slash (/) symbol is used. In this
textbook we will be using mainly the pound symbol (#) and the overbar interchangeably.
12
The general term “word” does not have a universally accepted definition; some processors
call the 16-bit quantity a word, while others call the 32-bit quantity a word (and the
16-bit quantity a half-word). Throughout this textbook, we will use the term word to mean
16 bits and the term doubleword to mean 32 bits.
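The byte-enable scheme of Example 1-2 can be sketched for a 64-bit data bus (eight byte-sections, BE0#-BE7#). The addressing details below are illustrative rather than any specific processor's; the active-low convention is used (0 = enabled):

```python
# A sketch of "byte-enable" generation for a 64-bit data bus (eight byte-sections,
# BE0#..BE7#), in the style described above. Active-low convention: 0 = enabled.
# The addressing scheme is illustrative, not a specific processor's.

def byte_enables(byte_address, operand_bytes):
    """Return the eight BE# levels for a transfer within one 8-byte line."""
    offset = byte_address % 8      # operand's position within the 8-byte line
    assert offset + operand_bytes <= 8, "transfer must not cross the 8-byte line"
    return [0 if offset <= i < offset + operand_bytes else 1 for i in range(8)]

# A 4-byte operand at byte address 0x102 (offset 2) enables BE2#..BE5#:
print(byte_enables(0x102, 4))   # [1, 1, 0, 0, 0, 0, 1, 1]
```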
[Figure 1.5: External buses of representative microprocessors. The 32-bit Intel
PentiumPro: a 64-bit data bus and an address bus to the level-2 cache; on the system
bus, a 36-bit address bus with eight byte-enables (BE0#-BE7#), a 64-bit data bus
(D63-D0), and 8 data-parity lines. The 64-bit MIPS R4000 and R10000: a 128-bit data
bus and an address bus to the level-2 cache; a 64-bit multiplexed AD (address/data)
system bus with a 9-bit SysCmd (operand size) field and 8 data-parity lines. The
64-bit Alpha, UltraSparc, and 620: a 128-bit data bus and an address bus to the
level-2 cache; a separate address bus, a 128-bit system data bus, and 16 data-parity
lines. The 64-bit HP PA-8000: 128-bit buses to the level-1 ICACHE and DCACHE; a
64-bit multiplexed system bus (data bus: 64 bits; address bus: 64 bits).]
At the beginning of a bus cycle, the processor issues a set of signals called the bus
cycle identifiers (some microprocessors call them “status” or “function code” signals) to
inform the rest of the computer modules of the type of bus cycle the processor has
initiated. Figure 1.6a shows the most common types of bus cycles, and Figure 1.6b, the
“bus cycle identifying” signals of some processors. Notice that the Intel Pentium
processor identifies the bus cycle by a combination of several control signals (the M/IO#,
D/C#, W/R#, CACHE#, and KEN#), while the Motorola processors use three output
“function code” signals (the FC2-FC0).
c) Synchronization signals.
Synchronization signals are needed to synchronize the operation of the processor
with the other modules of the computer system. Output synchronization signals (issued
by the processor in the form of an “address strobe” or “data strobe”) indicate when
address or data are valid on their respective bus lines; input synchronization signals
(received by the processor in the form of a “ready” or “data transfer acknowledge”)
notify it that the addressed memory or I/O port has responded. Processors that operate
asynchronously with their slave devices13 use a pair of “handshake signals” to accomplish
their synchronization with these devices; for example, a processor will send to memory
the “address strobe” signal to indicate that it has placed an address on the address bus,
and the memory will send back to the processor a feedback signal “data transfer
acknowledge” to tell the processor that memory has finished its requested transaction
(either placed data on the data bus or received the data from the data bus). Some
microprocessors use only one input pin for the data transfer acknowledge signal (like the
DTACK# pin of the 16-bit Motorola 68000), whose absence means that the slave module
has not yet finished its operation. Other microprocessors use two such input pins (like
the DSACK0# and DSACK1# pins of the 32-bit Motorola 680x0 processors); this pair of
signals, in addition to synchronizing the processor with the slave module, also informs
the processor of the width of the responding slave module (e.g., whether it is a 32-bit
memory port or a 16-bit I/O port).
13
That is, each may have its own separate clock source, and each may operate at a
different clock speed.
A number of processors have the capability to dynamically (at run time) adjust the width
of their external data bus, according to the size (width) of the memory or I/O port they
communicate with. A processor issues an operand size indicator14 (like the Motorola
SIZ0 and SIZ1 pair of signals) to indicate the size of the operand to be transferred: for
example, 00 means 4 bytes (one doubleword), 10 means 2 bytes, 01 means 1 byte, and 11
means 3 bytes. When processors do not permit 3-byte transfers in one bus cycle, the
value 11 can be used to mean that the processor requests the memory slave to send a
“long” cache line (of 16 or 32 bytes, depending on the processor).
The port size indicator is a feedback the processor receives indicating the size of
the port currently communicating with it. As we said above, the Motorola products
interpret the binary values on their input pins DSACK0# and DSACK1#; other
processors have one input pin for each port width, e.g., the Intel 486 (which also has a
built in “dynamic bus sizing capability”) uses two input pins BS16# and BS8# for 16-bit
and 8-bit ports, respectively.
14
Some processors (e.g., MIPS) call it the “data identifier”.
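The operand-size and port-size encodings described above can be sketched as small lookup tables. The mappings follow the text (Motorola-style SIZ1/SIZ0 and active-low DSACK1#/DSACK0#); treat them as an illustration, not a datasheet:

```python
# A sketch of the encodings described above. The SIZ pair gives the operand size
# the processor wants to transfer; the active-low DSACK1#/DSACK0# pair reports the
# width of the responding port. Values follow the text; illustrative only.

SIZ_BYTES = {       # "SIZ1 SIZ0" -> operand size in bytes
    "00": 4,        # doubleword
    "10": 2,        # word
    "01": 1,        # byte
    "11": 3,        # 3 bytes (or a "long" cache-line request on some processors)
}

def port_width_bits(dsack1_n, dsack0_n):
    """Decode active-low DSACK1#/DSACK0# into the responding port's width in bits."""
    if (dsack1_n, dsack0_n) == (1, 1):
        return None    # neither asserted: slave not ready, processor inserts wait states
    return {(0, 0): 32, (0, 1): 16, (1, 0): 8}[(dsack1_n, dsack0_n)]

print(SIZ_BYTES["10"])         # 2 -- a word (2-byte) transfer
print(port_width_bits(0, 1))   # 16 -- a 16-bit port responded
```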
[Figure 1.6: Bus cycles, and Intel and Motorola “bus cycle identifiers”. The Intel
table lists, for each cycle description and number of transfers, the values of the
M/IO#, D/C#, W/R#, CACHE#, and KEN# signals; the Motorola table lists the FC2-FC0
function-code values, with 000 undefined/reserved and 111 designating CPU space.]
would mean access to code.) For small systems, the read- and write-type control signals
that the processor itself issues are sufficient to be directly connected to and drive the local
memory and I/O ports. For larger systems, external circuits in the form of “bus
controllers” may be required to amplify the processor’s output signals and convert them
to more powerful system-wide read and write signals.
All processors also have a reset input pin, which is the highest-priority interrupt
that resets the processor to a known internal state. The reset input has different effects on
the various processors: in almost all cases, when a reset occurs, the processor address and
data buses go to a high impedance state (i.e., the processor electrically disconnects itself
from the bus lines), all output control signals go to the inactive state, the interrupt system
is disabled so as not to accept further external interrupts, and the current bus cycle ends. For
those processors that have both “user” and “supervisor” modes of operation, the
supervisor mode is entered and the appropriate value (called the “reset vector”) is loaded
into the CPU to update the program counter. (More details on user/supervisor modes and
interrupts are given in Chapter 7).
A ready input signal is used by processors that operate in synchronism15 with the
slave devices and is used to accommodate memory and I/O devices that cannot transfer
data at the processor’s fast bandwidth. When the processor samples this READY signal
and finds it asserted, it knows that the action requested has been completed and that the
addressed memory or I/O device has placed data on, or accepted data from, the data bus.
This allows the processor to end the current bus cycle and advance to the next bus
transfer. If it samples the READY signal not asserted, the processor then enters a wait
state. While in a wait state, the processor still has control of the local bus.
“memory interface logic” will be discussed later in Chapter 3. The “processor interface
logic” discussed here, is circuitry that may be required for several reasons: to demultiplex
and/or buffer the CPU’s local address and data lines, to interface the processor with other
modules of the computer system by providing them with the appropriate control signals,
or to receive from them feedback signals and apply them in turn to the processor’s input
pins. The complexity of this interface circuitry and the total number of interface
components depend upon the specific processor and the size and complexity of the final
computer system.
Figure 1.7b gives the most common interface components of a processor; not all
of them are needed in each design. We discuss below the clock generator, the address
latches and address decoder, the data transceivers, the bus controller, and the byte-enable
circuitry. The remaining interface components are discussed in later Chapters.
Bus drivers (or bus buffers) are required to place information on the bus. A
semiconductor device driving the bus may have either open collector or tri-state (3ST)
outputs. Devices with open-collector outputs have only two output states: logic 0 (zero),
i.e., the gate pulls the output line to logic 0 (zero) using an active circuit element (the
gate’s internal output transistor), and logic 1 (one), i.e. the output line is pulled back to
logic 1 (one) by a passive circuit element (an external pull-up resistor)16. Tri-state
devices, on the other hand, are the ones used in bus-based computer architectures: they
prevent excessive loading or driving of the bus lines and allow many devices to be
connected to the bus. Devices connected to such a bus must have tri-state outputs: the
first two states are
the logic 0 and 1 (0.8 and 3.5 volts for TTL); the third state is a high-impedance state or
open circuit17.
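The tri-state behavior described above can be modeled as a toy in software: each driver outputs 0, 1, or high-impedance (None), and at most one driver may be enabled at a time. This is only a sketch of the electrical idea:

```python
# A toy model of a shared bus line with tri-state drivers. Each driver outputs
# 0, 1, or None (the high-impedance state). At most one driver may be enabled;
# the model flags contention if two drivers try to drive the line at once.

def resolve_bus(driver_outputs):
    """driver_outputs: list of 0, 1, or None (high-Z), one per connected device."""
    active = [v for v in driver_outputs if v is not None]
    if len(active) > 1:
        raise ValueError("bus contention: more than one driver enabled")
    return active[0] if active else None   # None = bus floats (all drivers high-Z)

print(resolve_bus([None, 1, None]))   # 1 -- only the second device drives the bus
print(resolve_bus([None, None]))      # None -- every driver is high-Z, bus floats
```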
Address latches (registers with D-type flip-flops) are used to latch the address
and hold it as long as required. (We say that the latch operates in a “transparent mode”
when the strobe remains active or its “output enable” OE# pin is grounded to logic 0, i.e.
enabled.) If the processor bus is multiplexed, the external latches are strobed by a
processor signal (called “address latch enable” ALE or “address strobe” AS) issued at the
beginning of each bus cycle to demultiplex the bus and provide at the output of these
latches a buffered address which remains valid for the duration of the whole bus cycle.
16
Devices that have open-collector outputs allow the logical AND among their outputs simply by having
these outputs tied together. For this reason, this connection is also called wired-AND or AND-tied.
When negative logic is used, in which the less-positive level corresponds to logic 1, the open-collector
driven bus lines perform the wired-OR function.
17
A tri-state device has both an active pull-up transistor and an active pull-down transistor in its output
[to define the logic 1 (one) and logic 0 (zero) states], but an extra third input terminal is used to disable
the output. When the output is disabled, it is said that the output floats. This third state is often called
the high-Z or high-impedance state. When a device is placed in its high-impedance state, it is considered
to be electrically disconnected from the bus. Thus, many tri-state devices (drivers) may be connected to
the same bus, with their respective outputs forming the logical OR with each other. For this reason this
connection is also called wire-ORed, OR-tied, or bus-configuration. Appropriate control signals must be
applied to select only one of them to drive the bus, while holding all other drivers “disconnected” from
the bus.
[Figure 1.7b: Processor interface components. A bus-master processor (the CPU with its interface logic) and a slave memory module (storage plus memory-interface logic) communicate over the memory bus. The interface logic comprises a clock generator; address latches (strobed by ALE or AS) driving the buffered address bus, with an address decoder producing chip selects; data transceivers driving the data bus; a bus controller converting status/control signals into bus commands; a byte-enable circuit producing the “byte-enables”; burst logic; and a bus arbiter driving the bus arbitration lines.]
[Figure 1.7c: Bus controller, address latches, and data transceivers in an example configuration. A clock generator supplies CLK, READY, and RESET to the microprocessor CPU; a bus controller converts the processor’s status/control signals into memory and I/O command signals (MemR, MemW, I/OR, I/OW, INTA) and the DEN and DT/R transceiver controls; three address latches (strobed by ALE, with OE grounded) demultiplex AD15–AD0 and A19–A16 into a 20-bit buffered address bus spanning a 1-megabyte address space; two data transceivers (steered by DT/R, enabled by DEN) buffer the 16-bit data bus D15–D0.]
When memory or I/O devices are connected directly to the multiplexed local data
bus, it is essential that they be prevented from corrupting the information (usually, an
address) present on this bus during the first clock cycle of the bus cycle. Most often,
interfacing requirements become simpler if the data bus is buffered. Buffering the data
bus also offers increased current capability and capacitive load immunity. For the bi-
directional data bus, this buffering is accomplished by using external bi-directional bus
drivers (transmitters) and receivers, called data transceivers. The direction of the data
transceivers’ operation is controlled by a signal the processor issues (e.g., DT/R# = H
makes the transceiver act as a transmitter, DT/R# = L as a receiver).
Address decoders are used to receive some address bits off the address bus,
decode them, and generate some control signals needed by memory and I/O (for example
“bank selects” and “chip-select” signals to be discussed in Chapter 3.) Address decoders
are also used to identify address ranges and determine whether the current access is for a
device connected to the local processor bus or to the global system bus. Decoders may
also be needed to decode status or function code signals (to identify the type of bus cycle
the processor starts).
The bus controller receives status/control signals from the
microprocessor itself, decodes them, and converts them into system-wide “command
signals”. Figure 1.7c shows the bus controller, the address latches, and the data
transceivers in an example configuration.
Finally, the byte-enable circuit in Figure 1.7b contains logic to generate the
“byte-enable” signals for main memory; an external circuit is needed for processors that
do not themselves issue these “byte-enables”. As Figure 1.7b shows, microprocessors
like the Motorola 68030 require this external logic to combine the processor’s output
signals A1,A0 and SIZ1,SIZ0 and generate the four “byte-enables” (called by Motorola
DBBE4-DBBE1) needed by memory.
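The logic involved can be sketched as follows. This is a hedged illustration of the idea only, not Motorola's actual SIZ1/SIZ0 truth table (which is in the 68030 manual): from the low address bits and the transfer size in bytes, derive which of the four byte lanes of a 32-bit bus the transfer touches.

```python
# Hypothetical sketch of byte-enable generation for a 32-bit data bus.

def byte_enables(a1a0, size):
    """Given the two low address bits (0-3) and a transfer size in bytes,
    return (BE3, BE2, BE1, BE0); True means that byte lane is enabled."""
    lanes = [False] * 4
    for i in range(size):
        lane = a1a0 + i
        if lane > 3:
            break            # remainder of the transfer needs another bus cycle
        lanes[lane] = True
    return tuple(reversed(lanes))    # ordered (BE3, BE2, BE1, BE0)

assert byte_enables(0, 4) == (True, True, True, True)    # aligned longword
assert byte_enables(2, 2) == (True, True, False, False)  # word in upper lanes
assert byte_enables(1, 1) == (False, False, True, False) # single byte, lane 1
```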
Functionally, the modules of a bus-based processor system are divided into two broad
classes: bus master modules and bus slave modules. Bus master modules are those that
can gain control of the bus and initiate data transfers by driving the address and control
lines. To perform these tasks, the bus master is equipped with either a processor CPU or
similar logic that makes it capable of initiating bus cycles to transfer data over the bus.
The master informs all other modules in the system of the type of bus cycle it starts and
qualifies a valid address on the address bus by issuing proper “status signals” or “function
code signals” over the control lines. A module acting as a bus slave cannot start bus
cycles; it only monitors the bus activities and, if addressed during a particular bus cycle,
it either accepts data from the data bus or places data on it. A slave module is not capable
of controlling the bus. Therefore, a slave always receives and decodes the address and
control signals to determine whether or not it should respond to the current bus cycle
started by the master. All data transfer activities between a bus master and a bus slave
are carried out in terms of “bus cycles” (explained in Section 1.5.2).
Table 1-1 Microprocessor-based system platforms
UltraSPARC: used in AUSPEX NetServer
SuperSPARC: used in Cray CS6400 Enterprise Server
HP PA-RISC 7200: used in Convex Exemplar SPP series
HP PA-8000: used in HP Exemplar (S-Class, X-Class)
MIPS R10000: used in SGI Power Challenge, Pyramid, Concurrent, Tandem
DEC Alpha 21164: used in Digital’s high-performance computers
Intel Pentium Pro: used in Intel TFLOPS, advanced workstations
IBM PowerPC 620: used in IBM RS/6000 SP2 (Scalable Power Parallel System)
Table 1-1 lists current system platforms based on the latest microprocessor chips in
Figure 1.5d.
Figure 1.8 shows the block diagram of representative computer system configurations:
Figure 1.8a represents a small system with only one bus (the processor bus), Figure 1.8b
shows a single-board configuration with a local bus, Figure 1.8c a larger multiboard
system, and Figure 1.8d a multiprocessor system. In addition to the processor itself, a
computer system must contain as a minimum the following types of components
(discussed below in more detail): (1) a clock generator, to supply clock signals to the
processor and the other components of the system18; (2) main memory slave units,
usually composed of dynamic RAM (DRAM) memory chips to store the program code
and the operands/data, with an optional error detection-correction unit (EDCU) to
increase the reliability of the information exchanged; and (3) some kind of Input/Output
interface units to connect external devices to the computer system, such as hard-drives,
printers, modems, network adaptors, etc. Sometimes, other optional coprocessor chips
may be attached to the processor, such as communications or multimedia coprocessors to
operate in parallel with the CPU and execute operations that the processor does not
support. Finally, a system may also incorporate some other special-purpose chips, such
as MMUs (memory management units) to implement virtual memory and handle task
switching, or cache memories (composed of static RAM chips and cache controllers) to
increase overall system throughput by supplying information to the processor faster than
main memory can. (As we will see later, most of the high-performance microprocessors have
integrated the MMU and at least a first-level cache on the processor chip itself.)
a) The processor.
Processors nowadays are single-chip devices, called microprocessors, although some may
consist of more than one chip (referred to as a “chip set”). The processor
communicates with memory and I/O subsystems (units) by sending addresses to them
(over the address bus) and sending to them or receiving from them data (over the data
bus). All control signals issued by the processor CPU travel over the control bus.
Each processor has a wordlength, which is characterized by the width of the
processor’s internal data registers and the width of its arithmetic/logic unit (ALU); this
sometimes is also referred to as its “internal architecture.” Although the widths of a
processor’s internal and external buses may or may not be the same, in this textbook,
unless explicitly specified otherwise, we will assume that an n-bit processor also has an
n-bit external data bus.
18
In some cases, the processor receives pulses from an external crystal, and it is the processor itself that
generates the appropriate clock signals distributed to the other components of the system.
[Figure 1.8: Representative computer system configurations. (a) A small single-bus system: a processor module (CPU, system clock, L2 cache and cache controller), DRAM main memory with its controller, and an I/O subsystem (DMA controller, graphics controller, etc.) on one processor bus, serving devices such as a keyboard, hard drive, CD-ROM, network interface, graphics, sensors, and actuators. (b) A single-board configuration with a local bus: the processor/cache/memory subsystem sits on the Level-1 CPU or processor bus (e.g., X MHz), bridged to a Level-2 local bus (PCI, VL-Bus, SBus; e.g., X/2 MHz) carrying local-bus devices (display system, CD-ROM, hard drive, etc.) and I/O expansion boards. (c) A larger multiboard system: a local bus bridge (“chip-set”) further connects the local bus to a Level-3 system or expansion bus (VMEbus, Futurebus, Multibus; e.g., X/4 MHz) with its own I/O interface boards, plus Level-4 I/O buses (GPIB, LAN) and an AGP (Accelerated Graphics Port). (d) A multiprocessor system: several processor/master boards share the system bus with shared-system-memory boards and global/system memory and I/O interface controllers and adapters. PCI: Peripheral Component Interconnect; SCSI: Small Computer System Interface; GPIB: General-Purpose Instrument Bus; LAN: Local Area Network.]
The CISC approach follows the classical design of the CPU with its
accompanying complex, wider, variable-length instruction set (which results in a more
complex structure of the processor’s control unit) and its support of many different
addressing modes. The RISC approach aims at increasing the performance of the
processor itself by providing a different internal processor architecture and simpler,
fixed-length, and primarily register-oriented instructions (that simplify the processor’s
control unit and speed up its operation).
Newer and more advanced processors have increased their wordlengths, bus sizes,
addressing capabilities, and input clock frequencies, and have integrated more features on
the processor chip. For example, the major difference between the Intel 32-bit CISC
processors 80386 [4] and 80486 [5] (also referred to as the 386 and 486, respectively) is
that the latter has integrated a cache and an FPU (Floating-Point Unit) on-chip. In the 32-
bit Motorola CISCs, the earlier 68030 [8] had included on the chip an ICACHE
(instruction cache) and a separate DCACHE (data cache) and single MMU (memory
management unit), while the 68040 [9] has integrated on the chip the FPU and has split
the MMU into an IMMU (instruction MMU) and a separate DMMU (data MMU).
Today’s processors have a 64-bit internal architecture with a 128-bit external
data bus; most of them have on-chip MMUs and caches (separate for instructions and
data), FPUs, and some may also include graphics units. Also, all processors have
incorporated pipelined implementations of their execution units, some are superpipelined
processors (like the MIPS R4400), while others are superscalar processors (like the
88110, PowerPC 620, Alpha, and Ultra-Sparc). These terms are explained later in this
Chapter.
b) Memory units.
Whenever we use the word “memory,” we will mean main memory implemented outside
the microprocessor chip, on separate memory chips; for larger systems, memory chips
will be arranged on one or more memory boards. Static RAMs (SRAMs) hold stored
information as long as power is being applied to them. SRAMs are quite often used for
“cache memories”. On the other hand, dynamic RAMs (DRAMs) are cheaper, have
greater densities and require less standby power. However, the DRAM only provides for
temporary storage of data and requires special circuits to refresh its contents periodically.
This periodic refresh requirement makes dynamic memories slower than static memories.
The main memory in Figure 1.2a is referred to as an “n-bit memory” if the maximum
data it can transfer in parallel is n bits per access, and data exchange between the
processor and main memory is done on the n-bit basis (over an n-bit data bus). (The
design of the memory subsystem and its interfacing to the processor are covered in
Chapter 3.)
d) Caches.
A cache memory is a small high-speed memory placed between the processor CPU and
main memory to speed up the rate at which instructions and data are supplied to the CPU
by keeping copies of the most recently used memory items. For example, since
computer programs spend a lot of time executing loops (i.e., exhibit “temporal locality”),
the instructions for the second and subsequent iterations of a loop will be found in the
cache, and therefore the CPU does not need to access main memory to fetch them.
Similarly, data structures such as arrays, vectors, etc., frequently exhibit the property of
“spatial locality”; thus access to nearby data items will also find them in the cache. The
inclusion of a cache memory in the computer system allows the processor to operate at
cache speed much of the time rather than at the slower main-memory speed. In this textbook,
whenever we use the word “cache” by itself we will mean the “cache RAM” used for
storage as well as the “cache controller.” (Caches are covered in detail in Chapter 5.)
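The payoff of the two locality properties can be seen in a back-of-the-envelope simulation. The sketch below is ours (a set of line tags standing in for a cache, with an assumed 16-byte line size): the first pass over a loop body misses once per line, and every later pass hits entirely.

```python
# Toy cache model: only line tags are tracked, 16-byte lines assumed.

LINE = 16

def run(addresses, cache):
    """Count cache hits while touching the given byte addresses."""
    hits = 0
    for a in addresses:
        tag = a // LINE                 # which cache line this byte falls in
        if tag in cache:
            hits += 1
        else:
            cache.add(tag)              # miss: the line is fetched from memory
    return hits

loop_body = list(range(0x100, 0x120, 4))   # 8 sequential instruction fetches
cache = set()
first  = run(loop_body, cache)    # spatial locality: 6 of 8 fetches hit
second = run(loop_body, cache)    # temporal locality: all 8 fetches hit
assert (first, second) == (6, 8)
```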
e) MMUs.
A memory management unit (MMU), aided by systems programs, provides a technique
for handling a larger address space in a flexible fashion on behalf of the user. It does this
by subdividing the total logical/virtual address space into blocks (pages or segments),
defining logical addresses, and translating them at runtime into physical addresses. It
also provides protection and management of virtual and physical address spaces by
checking a number of access attributes, such as user vs. supervisor space, out-of-limits
access, class of ownership (i.e., which tasks are permitted access to that block), privilege
level (the privilege level of the requestor vs. the privilege level of the module to be
accessed), mode of access (read-only vs. read-write, execute only, etc.), etc. MMU
hardware is also used to prevent problems that result when multiple tasks within a given
application contend for limited physical memory, or when users share common data or
employ common programs.
When an MMU is outside the CPU, it is usually connected to the processor bus
and manipulated (i.e., given the attributes of each module, has its registers updated with
new values, etc.) in supervisor mode by special privileged I/O instructions. Most high-
performance processors contain this MMU hardware on their chip along with the CPU.
(Memory management and MMUs are covered in Chapter 6.)
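The runtime translation and attribute checking described above can be sketched as a table lookup. Everything here is an invented illustration (the page size, table contents, and field names are assumptions): a virtual address is split into page number and offset, the page entry supplies the physical frame, and two of the access attributes the text lists are checked before the physical address is formed.

```python
# Hypothetical page-table walk with user/supervisor and read-only checks.

PAGE = 4096                                 # assumed 4 KB pages

# virtual page number -> (physical frame, supervisor_only, writable)
page_table = {
    0x00: (0x40, False, True),
    0x01: (0x41, False, False),             # read-only user page
    0x10: (0x80, True,  True),              # supervisor-only page
}

def translate(vaddr, supervisor, write):
    vpn, offset = divmod(vaddr, PAGE)
    if vpn not in page_table:
        raise MemoryError("page fault: unmapped address")
    frame, sup_only, writable = page_table[vpn]
    if sup_only and not supervisor:
        raise PermissionError("user access to supervisor space")
    if write and not writable:
        raise PermissionError("write to read-only page")
    return frame * PAGE + offset            # physical address

assert translate(0x0123, supervisor=False, write=False) == 0x40123
```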
As it is noted from the configurations in Figure 1.8, a hierarchy of buses is usually used
to interconnect the various components of a computer system. We make here some
introductory comments on the types of buses; their detailed operation is given later in
Chapter 2.
Some devices, like the coprocessors in Figure 1.8a, are directly connected to the
processor CPU (sometimes referred to as close coupling) using what is called the CPU or
processor (external) bus19, which is defined by the I/O pins of the processor CPU. The
other components of the system are interconnected through a second-level bus, also
referred to as the local bus20. (If main memory is connected to this local bus, the bus is
also called the memory bus.) In a number of system designs the local bus and the
19
Also called the component-level bus.
20
One local bus standard is the VL-Bus (VESA Local Bus) created by the Video Electronics Standards
Association (San Jose, Calif.); another local bus standard invented by Intel is the PCI (Peripheral
Component Interconnect).
processor bus may be the same (and in such cases we will be referring to it
interchangeably either as the processor bus or the local bus); if these two buses are
different, then the proper “local bus interface” or “local bus bridge” (usually in the form
of a “chip-set”) is needed between them to adapt one bus to the other.
Every bus is functionally made up of the data bus, the address bus, and the control
bus. The data bus is used to transfer instructions and data (operands/results); as deduced
from the processor examples in Figure 1.5, its width may or may not equal the
wordlength of the processor’s internal architecture. Data transfers may sometimes use
only a portion of the data bus lines, depending upon the width of the operand to be
transferred and the width of the memory or I/O device with which the processor
exchanges data; for example, when a 32-bit processor exchanges data with a 16-bit slave
module, only half of its 32-bit data bus will be involved in this data transfer. The
address bus is used to transfer memory addresses (to select a memory location) or I/O
port numbers (to select an I/O port). For example, if a 32-bit processor has a 32-bit
address then it can directly access up to 4 GB (gigabytes) of main memory. Memory
addressing refers to the fact that the address always specifies the address of the first byte
(the lowest-numbered byte-address) of an operand; this is true regardless of the length of
the operand or its byte-ordering (big/little endian) in memory. Finally, control and timing
signals use the control bus lines to synchronize the operation of the various computer
modules and facilitate their intercommunication activities over the bus lines.
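The addressability arithmetic above is simple enough to verify directly: an n-bit address bus reaches 2^n distinct byte addresses.

```python
# Quick check of address-bus reach: n address lines -> 2**n byte addresses.

def address_space_bytes(width):
    return 2 ** width

assert address_space_bytes(32) == 4 * 2**30    # 32-bit address bus -> 4 GB
assert address_space_bytes(20) == 2**20        # 20 lines -> 1-megabyte space
assert address_space_bytes(36) == 64 * 2**30   # a 36-bit address bus -> 64 GB
```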
While the CPU or processor bus is specific to the individual processor, the system
bus is independent of the central processor; i.e., it has its own set of specifications that
all board manufacturers must satisfy. For that purpose, each board includes the proper
hardware, called the “system bus interface” or “bridge”, which allows the board to
interface to the system bus. The interface bridge logic included on the processor board of
Figure 1.8c is used to convert the local bus to the processor-independent industry system
bus. An example of such a hierarchical bus configuration may have a processor with an n-
bit high-speed external data bus operating at X MHz frequency, a level-2 n/2-bit
intermediate-speed local bus operating at X/2 MHz frequency, and a level-3 standard
n/2- or n/4-bit low-speed system bus operating at X/4 MHz frequency.
Bus characteristics, operation, and timing are important factors in memory and
peripheral device interfacing; they are used to identify the specific address or data placed
on the respective bus lines and the time required to carry out bus transactions. Such
identification is valuable to the system designer/integrator, in order to single-step and
debug the system under development, instruction by instruction, and interface the central
processor with memories and peripherals that have different access times. It is also
valuable to the software designer, who must also be aware of software compatibility
problems arising when a system is built using a certain type of system bus (that may, for
example, be transferring data bytes in its own order21) and a mix of 16-, 32-, or 64-bit
processor boards from different manufacturers, which themselves impose their own
ordering of the most and least significant bytes within a 16-bit word, a 32-bit doubleword
entity22, etc.
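The byte-ordering mismatch just described can be made concrete with a short sketch (the function names are ours): the same 32-bit doubleword laid out in memory under the two orderings, and the byte swap a bus interface might perform between mismatched boards.

```python
# Illustrating big- vs little-endian layout of a 32-bit doubleword.

def to_bytes(value, order):
    """Lay out a 32-bit value as four memory bytes, lowest address first."""
    b = [(value >> (8 * i)) & 0xFF for i in range(4)]   # little-endian lanes
    return b if order == "little" else list(reversed(b))

def swap32(value):
    """Convert a 32-bit value between the two byte orderings."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

assert to_bytes(0x12345678, "little") == [0x78, 0x56, 0x34, 0x12]
assert to_bytes(0x12345678, "big")    == [0x12, 0x34, 0x56, 0x78]
assert swap32(0x12345678) == 0x78563412
```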
c) I/O bus
Finally, a number of peripheral devices may be connected to the computer system using
special I/O buses, such as a SCSI (small computer system interface) bus used by hard
drives, a GPIB (general-purpose instrumentation bus) to interface measuring devices and
instruments, and LAN (local area network) interconnections. Each of these will require
its respective I/O controller or “bus adapter board” shown in Figure 1.8 as part of the
global/system I/O interface module(s).
The evolutionary step in the development is the design of computer systems that
incorporate parallel processing by integrating together a number of processor boards or
nodes. Figure 1.8d shows this more complicated configuration of a bus-based
multiprocessor computer system. Each processing node has its own private DRAM main
memory, I/O resources that no other processor can access (like video monitor, hard
21
The ordering of data bytes on the system bus is referred to as either mad-endian or sad-endian ordering
and is discussed in more detail in Chapter 4.
22
The data ordering in the computer’s memory is referred to as either big-endian or little-endian ordering
and is discussed in more detail in Chapter 2.
drive, and other peripherals), and external local area network interconnects. (An
example of such a system is the Beowulf23 Parallel Workstation [24]). Such a parallel
computer may also provide global (system) memory and I/O that are “shareable”; i.e.,
any processor in the system can request and gain control of the system bus, and use it to
access a global memory location or a global I/O port. In this case, the “system bus
interface” hardware on each processing board will include a separate component, called a
“bus arbiter” (or “bus exchange”), to arbitrate among simultaneous requests of system
bus access and determine which board will be given the shared system bus to execute its
data transfers. (Bus arbitration is covered in more detail in Chapter 4.)
The significant price advantage the RISC processor has, along with its dramatic
improvement in performance, has made it an ideal candidate for designing commercial
bus-based SMP (Symmetric MultiProcessing24) parallel architectures. Examples
include the SGI Power Challenge supercomputer, which uses MIPS R8000/R10000
microprocessors, and the Cray CS6400 Enterprise Server which uses SuperSPARC
microprocessors. The SMP architecture resembles that of Figure 1.8d in which each
processor board is a single microprocessor chip (maybe with some external cache).
Because all these microprocessors use the system bus to share global RAM memory,
there is only one memory space, which simplifies both systems and applications
programming.
The single shared memory and bus, however, contribute to the SMP’s biggest
problem, the “memory-bus bottleneck”, which limits system scalability to not more than
8-16 processors. To overcome this limitation, the HP Exemplar high-performance
servers (PA-8000-based platforms) have replaced the interconnecting system bus with a
crossbar switch which makes the HP Exemplar family architecture scalable up to 512
CPUs.
23
The Beowulf Parallel Workstation comprises 16 PC microprocessor motherboards, each coupled with
a 1.2GB hard drive, 32MB of DRAM memory, and dual 100Mb-per-second Fast Ethernet channels (with
optional monitor and other peripherals in each processing node). This parallel system has peak
performance in excess of 1 GOPS (giga operations per second), half a gigabyte of main memory, and
20GB of secondary storage.
24
This system configuration is also known as tightly coupled or shared-everything.
[Figure 1.9: Internal organization of a scalar processor — an instruction cache (ICACHE) feeding an instruction prefetch, queue, and decode unit; a data register file (RF); an integer execution unit (ALU) and a floating-point execution unit (FPU); a data cache (DCACHE); and the external bus lines over which instructions and data/operands flow.]
The bus interface unit or BIU is responsible for all activities on the processor’s external
bus; it works in an independent fashion from the other internal components of the
processor. It initiates external bus cycles when so requested by the CPU and maximizes
bus utilization by prefetching instructions from memory whenever the processor’s
external bus is free.
The BIU’s hardware includes address latches and drivers, data transceivers, a
prefetcher circuit that prefetches instructions from memory even before they are needed,
hardware to prioritize the various requests for bus cycles that different internal sections of
the processor request, and a bus controller. The bus controller can also receive and
interpret input signals that an external slave device (such as a memory port) sends to it
to identify the device’s width (or size); this way, the processor properly
identifies which data presented at its input pins are valid. (This is how the processor
implements “dynamic bus sizing”, discussed in detail in Chapter 2.) There is internal
multiplexing circuitry to route the data arriving at the processor (for a read cycle, with the
execution of a LOAD or INPUT instruction) to the correct internal destination or, for a write cycle, to
place the data on the correct output data pins. The BIU also contains the necessary
hardware logic to perform error detection and correction on the data transferred over the
external data bus. Finally, the bus controller may operate at the slower speed of the
external bus while the remaining sections inside the processor may operate at their own
faster speed25.
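The consequence of dynamic bus sizing can be sketched in one line of arithmetic. This is our illustration, not any particular processor's algorithm: once the slave has reported its port width, the bus controller breaks the transfer into as many bus cycles as that width requires.

```python
# Hedged sketch: bus cycles needed to move a transfer through a narrow port.

def bus_cycles(transfer_bytes, port_width_bytes):
    """Ceiling of transfer size over port width."""
    return -(-transfer_bytes // port_width_bytes)   # ceiling division

assert bus_cycles(4, 4) == 1   # 32-bit operand through a 32-bit slave port
assert bus_cycles(4, 2) == 2   # ...a 16-bit slave needs two bus cycles
assert bus_cycles(4, 1) == 4   # ...and an 8-bit port needs four
```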
We saw earlier that some processors have an external Harvard architecture which
provides separate ports to allow simultaneous access to both an instruction-memory and
a data-memory module. Such a two-port, nonmultiplexed memory access scheme
requires separate parts in this BIU -- an “instruction interface unit” and a “data interface
unit” -- to individually control these two independent memory accesses.
Finally, depending on the processor’s complexity, the BIU may have the
appropriate hardware to provide interfacing not only to the external main memory bus but
also to other buses. One such development is the processor having an additional separate
bus for connecting external caches (a cache bus). Most often, this cache bus is for an
external “level-2” cache (see the Pentium Pro, MIPS R10000, Alpha 21164, and
UltraSparc in Figure 1.5d); this assumes that the processor includes an on-chip level-1
cache and the BIU contains all the hardware logic needed to control these external level-2
caches. Other processors (e.g., the HP PA-8000) have no on-chip cache at all, but provide
a wide external cache bus to allow interfacing to level-1 instruction and data caches
implemented outside the processor chip. This allows the implementation of very large
level-1 caches which are needed by applications whose data sets are too large to fit into
the smaller on-chip caches. Finally, another development is that of the Alpha and
UltraSparc processors (Figure 1.5d) whose BIUs provide the necessary hardware to
support separate special-purpose I/O buses: Alpha supports the PCI bus, while
UltraSparc supports Sun’s SBus.
The control unit in the processor is used for instruction fetching, decoding,
sequencing, and dispatching to the proper functional units; for example in Figure 1.9 an
integer instruction would be issued to the fixed-point execution unit (FXU), while a
floating-point instruction would be issued to the floating-point unit (FPU). In scalar
processors this instruction issue is done “in-program-order”, one instruction at a time.
Earlier CISC processors used mainly microprogrammed implementations of the control
unit, while RISC processors used only hardwired implementations.
The control unit contains the proper timing circuitry (driven by the processor
clock pulses) to provide both internal and external timed control signals to carry out
internal CPU microoperations and facilitate external data bus transfers. The control unit
also contains a control sequencer to request from the bus interface unit to initiate
external bus cycles.
25
For example, the “double-speed” Intel 486DX2 processor operates internally at twice the clock speed
of the rest of the system: e.g., a “50-MHz 486DX2” (a 486DX2/50) processor operates internally at 50
MHz while the external subsystem still operates at 25 MHz. (Thus, this “50-MHz 486DX2” can still work
with older external slower 25-MHz motherboard designs but has an internal performance which is
substantially better than that of a normal “25-MHz Intel 486” processor.)
To carry out its tasks properly, the control unit must provide for at least the
following four capabilities:
1. Establish the present “processor state” (or microarchitecture state) during each
processor clock cycle.
2. Provide logic to determine the correct “next state”; this next state selection is done
through a proper combination of the present-state information, the processor’s
external inputs, and certain feedback lines, either from within the processor (e.g.,
flags from the ALU) or from external components in the system.
3. Provide a facility to store the information that identifies the current state.
4. Finally, provide some means for translating this state into proper intra-module and
inter-module control signals generated by the processor CPU. Since all
processors synchronize their events with single- or multiple-phase input clocks,
control signals are issued in synchronism with precise clock pulses. The input
clock source, therefore, synchronizes all state transitions of the system.
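The four capabilities just listed are exactly those of a synchronous finite-state machine, which can be sketched as follows. The three-state bus cycle, its inputs, and its output signals are invented for illustration, not taken from any real processor.

```python
# Toy control unit: a clocked state machine with an invented bus-cycle FSM.

NEXT = {                           # (present state, input) -> next state (2)
    ("IDLE", "request"): "ADDR",
    ("IDLE", "none"):    "IDLE",
    ("ADDR", "none"):    "DATA",
    ("DATA", "ready"):   "IDLE",   # slave asserted READY: cycle completes
    ("DATA", "none"):    "DATA",   # wait states until READY arrives
}
OUTPUTS = {"IDLE": [], "ADDR": ["ALE"], "DATA": ["DEN"]}

class ControlUnit:
    def __init__(self):
        self.state = "IDLE"                    # stored present state (1, 3)

    def tick(self, inp):                       # one clock pulse
        self.state = NEXT[(self.state, inp)]   # next-state selection
        return OUTPUTS[self.state]             # issued control signals (4)

cu = ControlUnit()
assert cu.tick("request") == ["ALE"]   # address phase: strobe the latches
assert cu.tick("none")    == ["DEN"]   # data phase: enable the transceivers
assert cu.tick("none")    == ["DEN"]   # wait state: slave not ready yet
assert cu.tick("ready")   == []        # READY sampled: back to idle
```

Every transition happens only on a clock tick, mirroring the point above that all state transitions are synchronized by the input clock.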
The processor’s execution unit (EXE) is responsible for executing logic instructions
(such as AND, OR, shift, etc.), integer fixed-point arithmetic instructions, or floating-
point arithmetic instructions. To calculate the “effective address” of an operand
or of a result (needed during an instruction execution), some processors use their single
arithmetic ALU, while others have a separate ALU dedicated only to that purpose. The
EXE unit includes the following hardware: the necessary arithmetic/logic units (adders
and a barrel shifter) and hardware multipliers/dividers; general-purpose fixed- and
floating-point registers (used to hold the operands and results of operations); special
registers (that hold intermediate results); and finally, the required local control circuitry.
Integer fixed-point instructions are executed in what is referred to as the FXU section of
the EXE unit, while floating-point instructions are executed in the FPU section. Usually,
an “n-bit processor” has an n-bit ALU (which operates in parallel on n-bit wide input
operands).
Quite often, the FXU has its own “integer register file”, and the FPU its own
separate “FPU register file”. (FPU registers have at least twice the width of integer
registers). CISC and RISC processors differ in their typical “register file” in that RISC
chips generally have a much larger number of registers to hold more operands.
Increasing the number of internal registers increases (1) the requirements for on-chip real
estate needed to implement them, (2) the width of the instructions (to be able to specify
the added registers), and (3) the additional hardware needed in the form of multiplexers
for storing into, or reading data from, these registers. On the other hand, a large number of
registers provides better support for the larger number of register-to-register instructions
executed by RISCs. Usually, an “n-bit processor” has n-bit internal integer data
registers; “32-bit processors” have 32-bit data registers (with capabilities of addressing
an 8- or 16 bit quantity within a register, or concatenating two registers to hold 64-bit
quantities), “64-bit processors” have 64-bit integer data registers, etc.
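The sub-register addressing and register-pairing conventions described above can be illustrated with a short Python sketch; the register values are purely illustrative, not taken from any particular processor:

```python
# Sketch of the register-width conventions described above: addressing an
# 8- or 16-bit quantity inside a 32-bit data register, and concatenating
# two 32-bit registers to hold a 64-bit quantity.

def low_byte(reg32):
    """The 8-bit quantity in the low byte of a 32-bit register."""
    return reg32 & 0xFF

def low_word(reg32):
    """The 16-bit quantity in the low half of a 32-bit register."""
    return reg32 & 0xFFFF

def concat64(hi32, lo32):
    """Two 32-bit registers concatenated into one 64-bit quantity."""
    return ((hi32 & 0xFFFFFFFF) << 32) | (lo32 & 0xFFFFFFFF)

r1, r2 = 0x12345678, 0x9ABCDEF0
print(hex(low_byte(r1)))      # 0x78
print(hex(low_word(r1)))      # 0x5678
print(hex(concat64(r1, r2)))  # 0x123456789abcdef0
```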
Almost all processors have on-chip two separate types of cache memories: an
“instruction cache or ICACHE” which contains the instructions most likely to be fetched
by the control unit for decode and execute, and a “data cache or DCACHE” from which
operands are loaded and to which results of internal operations are stored26.
(Caches are discussed in detail in Chapter 5).
26 This is the “split cache” approach. Some processors have followed the “unified” or “integrated cache” approach, with only a single on-chip cache used for both instructions and data.
In this section we present the internal organization of the 32-bit Intel 486
microprocessor as a representative scalar CISC processor. More details on the 486
internal registers or on the internal organization of other CISC processor examples can be
found in Appendix A.
------------------------------------------------------------
Example 1- 3: The Intel 486 scalar microprocessor
Figure 1.10 shows the internal organization of the Intel 486 microprocessor with its two
functional units: the integer ALU and a rudimentary FPU. Only one decoded instruction is
issued at a time to one of these two functional units.
The 486 is a 32-bit CISC scalar microprocessor, with a 32-bit ALU and 32-bit data
registers. Its “bus interface” unit contains the appropriate hardware to manage the external bus
lines and keep the buses busy. It can initiate both simple and complex bus cycles and support
variable data bus widths (both discussed in detail in Chapter 2), and the 486 contains all
the logic to perform on-chip error detection on the parity-encoded data transferred over the data
bus.
The chip contains a single 8-Kbyte “unified” cache, for both instructions and data. As a
first step, the 486 fetches the instruction from the on-chip cache. However, since a “cache line” is
16 bytes long, most instructions do not require this stage (because they have already been
prefetched with the previous access to the cache). This step, however, is always required at the
target of a branch instruction. Then, a first decoding is performed by processing up to 3
instruction bytes. At this stage, the length of the instruction is determined along with actions that
are to be performed for the effective “address generation”. (A small number of instructions may
require 2 clock cycles for this first decoding stage). The “address generation” stage (needed
because of the complexity of the 486 instructions that intermix computations with accesses to
external memory) completes the decoding of the instructions, decodes any displacement or
immediate operands, and in parallel computes the effective address. (Again, a small number of
instructions may require 2 clock cycles in this stage). Depending on the instruction decoded, the
decoder may send it for execution either to the “ALU execution unit” (if an integer instruction)
or to the FPU (if a floating-point instruction). Instructions that reference memory (including
jump instructions) access the on-chip “unified” cache in this stage. (Along with cache lookup, the
TLB lookup proceeds in parallel). The processor updates the register file either with data from
the on-chip cache or from main memory, or with the results of the FXU or FPU execution. The
“MMU” section inside the Intel processors contains the hardware to implement both segmented
and paged virtual memory. (These terms are explained in detail in Chapter 6). Finally, the 486
processor has neither internal nor external Harvard architecture.
------------------------------------------------------------
[Figure: the 486 integrates on one chip an 8-KB “integrated (unified)” cache, fetch buffers and decoder, the Integer Unit (ALU) with its integer registers, a rudimentary FPU with its own FPU registers, and the MMU (segmentation or paging unit); instructions and data/operands move between these units over 32-bit internal paths. MMU = Memory Management Unit.]
Figure 1.10: Intel 486’s internal organization (with its two functional units) [24].
We mentioned earlier that CISC processors have much more complex instructions than
RISC processors. As an example, consider the complex format in Figure 1.11 of the Intel
instructions [5]. (Not all fields are shown.) These instructions consist of one or two
primary opcode bytes, possibly an address specifier consisting of the “mod r/m” byte and
“scaled index” byte, a displacement if required, and an immediate data field if required.
Within the primary opcode or opcodes, smaller encoding fields may be defined. These
fields vary according to the class of operation. The fields define such information as
direction d of the operations (to/from), size of the displacements w, register encoding
sreg2, sreg3, or sign extension s. The remaining fields of a long instruction specify
register and address mode, address displacement, and/or immediate data.
[Figure 1.11: format of the Intel instructions — one or two primary opcode bytes (TTTTTTTT), the “mod TTT r/m” byte, the “ss index base” scaled-index byte, a displacement of 32, 16, or 8 bits (or none), and immediate data of 32, 16, or 8 bits (or none). The “mod r/m” byte is the address mode specifier (the effective address can be a general register), with 2 bits for mod and 3 for r/m; the 3-bit sreg3 field is the segment register specifier for CS, SS, DS, ES, FS, GS.]
The complexity of the instruction format and its number of different fields make
its decoding cumbersome. Such a complex CISC instruction may specify a large number
of addressing modes, including direct, based, base plus displacement, index plus
displacement, base plus displacement plus index, etc.
------------------------------------------------------------
Example 1- 4: Address field
Consider a hypothetical 32-bit microprocessor having 32-bit instructions composed of two fields:
the first field is one byte and represents the opcode and the second field (the remaining bits)
represents either an immediate operand or the address field.
a) What is the maximum directly addressable memory capacity (in number of bytes)?
b) Discuss the impact on the system speed if the microprocessor has:
• a 32-bit external address bus and a 16-bit data bus or,
• a 16-bit address bus and a 16-bit data bus.
c) How many bits are needed for the program counter and the instruction register?
Answer:
a) The address field is 32 - 8 = 24 bits; thus the maximum directly addressable memory is 2^24 = 16 Mbytes.
b) If the address bus is 32 bits, the whole address can be transferred to memory at once and decoded
there; however, since the data bus is only 16 bits, it will require 2 bus cycles (accesses to
memory) to fetch the 32-bit instruction or operand. In the hypothetical case of a 16-bit address
bus, the processor will have to perform two transmissions in order to send the whole 32-bit
address to memory; this will require more complex memory interface control to latch the two
halves of the address before it performs the access. In addition to this two-step address issue,
since the data bus is also 16 bits, the microprocessor will need 2 bus cycles to fetch the 32-bit
instruction or operand.
c) The program counter must be at least 24 bits (if we assume here that the program counter
contains the physical address issued by the microprocessor)27.
If the instruction register is to contain the whole instruction, it will have to be 32 bits
wide.
(Sometimes there is a distinction between an instruction register and an “opcode register”; in
such a case, the opcode register here would be 8 bits long to hold the 8-bit opcodes.)
------------------------------------------------------------
27 Most likely, a 32-bit microprocessor will have a 32-bit address and a 32-bit program counter, unless on-chip segment registers are used, which may permit a less wide program counter.
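The arithmetic of part (a) can be checked with a short Python sketch; the field widths are those given in the example:

```python
# Hypothetical instruction format from the example: 32-bit instruction,
# 1-byte opcode, remaining bits form the address field.
instruction_bits = 32
opcode_bits = 8
address_bits = instruction_bits - opcode_bits  # 24 address bits

capacity_bytes = 2 ** address_bits             # directly addressable bytes
print(address_bits)               # 24
print(capacity_bytes)             # 16777216
print(capacity_bytes // 1024**2)  # 16 (Mbytes)
```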
------------------------------------------------------------
Example 1- 5: Instruction Formats of the MIPS RISC microprocessors
The MIPS R-series instructions (see Figure 1.12a) can be divided into the following groups [7]:
Load and store instructions: move data between memory and general registers. They are
all I-type instructions, since the only addressing mode supported is base register plus 16-bit
signed immediate offset.
Computational instructions: perform arithmetic, logical, shift, multiply, and divide
operations on operands in internal registers. They occur in both R-type format (the
operands and the result are stored in registers) and I-type format (one operand is a 16-bit
immediate value).
Jump and branch instructions: change the control flow of a program. Jumps are always
to a paged, absolute address formed by combining a 26-bit target address with the high-
order bits of the program counter (J-type format) or register addresses (R-type format).
Branches have 16-bit offsets relative to the program counter (I-type). “JumpAndLink”
instructions save a return address in internal register 31.
Coprocessor instructions: perform operations in the coprocessors. Coprocessor Load and
Store instructions are I-type.
Coprocessor 0 instructions: perform operations on CP0 registers to manipulate the
memory management and exception-handling facilities of the processor.
Special instructions: perform system calls and breakpoint operations. These instructions
are always R-type.
Exception instructions: cause a branch to the general exception-handling vector based
upon the result of a comparison. These instructions occur in both R-type and I-type
formats.
------------------------------------------------------------
[Figure 1.12a: MIPS R-series instruction formats.
I-type (immediate): op (bits 31-26), rs (25-21), rt (20-16), immediate (15-0)
J-type (jump): op (bits 31-26), target (25-0)
R-type (register): op (bits 31-26), rs (25-21), rt (20-16), rd (15-11), sa (10-6), funct (5-0)
• op: 6-bit operation code
• rs: 5-bit source register specifier
• rt: 5-bit target (source/destination) register or branch condition
• immediate: 16-bit immediate value, branch displacement, or address displacement
• target: 26-bit jump target address
• rd: 5-bit destination register specifier
• sa: 5-bit shift amount
• funct: 6-bit function field]
[Figure: two additional instruction formats — a memory-format instruction: OPC (bits 31-26), Ra (25-21), Rb (20-16), Mem Disp (15-0); an operate-format instruction: OPC (bits 31-26), Ra (25-21), Rb (20-16), func (15-5), Rc (4-0).]
A processor CPU carries out the instruction cycle in two phases: the “instruction fetch
and decode phase” (IF phase) and the “instruction execute phase” (IE phase). An
instruction cycle may require a number of bus cycles28, B1, B2, B3, etc. in order to fetch
and execute the instruction. A bus cycle29 (also referred to as “local bus cycle”,
“memory cycle”, or “external bus cycle”) begins whenever the CPU needs to access an
external memory location or I/O port, i.e., whenever the processor places an address on
its external address bus. Therefore, a bus cycle is the sequence of basic activities needed
to perform a memory read (or I/O input) operation, a memory write (or I/O output)
operation, or a more complex read-modify-write operation. The bus cycle, therefore,
corresponds to the time needed to complete one transfer from/to memory or I/O port.
Once the whole instruction has been fetched and decoded, its subsequent execution phase
may or may not require additional external bus cycles (depending upon whether or not its
execution needs to access memory or I/O). This subdivision of an instruction cycle into
bus cycles and clock cycles is shown in Figure 1.13.
All basic internal activities of the CPU (the “microoperations”) and all bus cycles
that the processor initiates on its external bus must be executed at well-coordinated times.
28 Actually, since most processors now contain an on-chip instruction cache (ICACHE), it is possible that an instruction needs no bus cycles at all if it is already in the on-chip ICACHE (and therefore its IF phase needs no access to main memory) and its IE phase specifies a completely internal CPU operation (like the register-to-register instructions).
29 The term “bus cycle” we use here corresponds to what some manufacturers refer to as a “machine cycle.”
This is accomplished by using, in some processors, two different clocks: an external one
(the input or system clock, denoted CLK, with clock cycle time CS) and an internal one
(the “internal processor clock” or simply the “processor clock”, denoted PCLK, with a
clock cycle time Cp).
[Figure: an instruction cycle subdivided into bus cycles B1, B2, B3, each bus cycle consisting of four input clock cycles C1, C2, C3, C4.]
Figure 1.13: An instruction cycle subdivided into bus cycles and clock cycles
The system clock, CLK, is used to time the external bus activities as well as
those of the other components in the system; this clock is also called the input clock,
because it is the driving clock applied as input directly to the microprocessor chip. It
establishes the local bus cycle time and controls (clocks) all transfers between the
processor CPU and memory or I/O ports (and that is why it is also referred to as the bus
clock). Everything that happens in the system is synchronized to this system clock’s
rising or falling edges. For example, a processor with an input clock of frequency FS =
16 MHz will have the basic events on the bus occur every CS = 62.5 ns (or even 31.25ns
if both the rising and falling edges of the clock are used).
This input clock CLK is supplied to the processor CPU from an external clock
generator/driver chip that acts as a constant frequency source. Figure 1.14 shows the
clock generators that drive various Intel and Motorola processors. (At the right-hand side
of the clock timing we give the actual names the manufacturers use.) The clock generator
itself requires an external series-resonant crystal input (or constant frequency source)
whose frequency may or may not be the same as that of the generated output clock CLK.
Sometimes the clock generator may provide a second clock output, called a peripheral
clock, that may be half the frequency of FS. The use of this second peripheral clock
simplifies computer system design because it allows the bus interface components to
operate at half the speed specified by the processor’s input clock, imposing on them less
stringent requirements30.
The internal processor clock PCLK is distributed throughout the hardware inside
the microprocessor chip and used to time all internal components and internal
microoperations. The timing of the execution of these internal microoperations is
synchronized to this internal processor clock’s rising or falling edges. The processor
clock PCLK can be either generated internally from the input clock CLK (the most
common case) or supplied externally to the processor using a different input pin from the
CLK input (e.g., the Motorola 68040 of Figure 1.14c). Some processors divide the input
frequency FS internally by 2 to generate the (internal) processor clock frequency31 Fp,
(e.g., Intel 386), others use internally the same clock frequency as that of the input clock
(e.g., Intel 486 and i860), and others multiply internally the incoming frequency by 2 to
generate their internal processor clock (like most Motorola processors).
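The three input-clock-to-processor-clock relationships just listed can be summarized in a small sketch; the chips named are the text's own examples, and the ratios are the only data assumed:

```python
# Ratio of internal processor clock Fp to input clock FS for the three
# cases in the text: divide-by-2 (Intel 386), same frequency (Intel 486,
# i860), and multiply-by-2 (clock-doubled parts, most Motorola processors).
FP_OVER_FS = {
    "i386":   0.5,
    "i486":   1.0,
    "486DX2": 2.0,
}

def processor_clock_mhz(fs_mhz, chip):
    """Internal processor clock frequency for a given input clock."""
    return fs_mhz * FP_OVER_FS[chip]

print(processor_clock_mhz(32, "i386"))    # 16.0
print(processor_clock_mhz(33, "486DX2"))  # 66.0
```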
As we said earlier, a bus cycle corresponds to the time needed to do a data
transfer between the processor and the addressed slave device. A bus cycle is always an
integral multiple of the system clock cycle CS. The maximum data transfer rate for an
external bus operation (the processor bus bandwidth) is determined not only by the
frequency of this system clock FS, but also by additional factors that include the
width of the processor’s external data bus and the number of clock cycles per bus cycle.
------------------------------------------------------------
Example 1- 6: Processor bus bandwidth
Assume a processor is driven by a 16-MHz input clock and has bus cycles composed of four input
clock cycles, C1, C2, C3 and C4. Then, if the data bus is 16 bits wide, the maximum data transfer
rate for this processor would be 2 bytes every four clock cycles, or 8 megabytes per second; if the
data bus is 32 bits, the maximum data transfer rate would double to 16 megabytes per second.
------------------------------------------------------------
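The calculation in Example 1-6 generalizes to a one-line formula, sketched here in Python:

```python
# Maximum (non-burst) transfer rate = bytes per bus cycle / bus-cycle time,
# where a bus cycle takes clocks_per_bus_cycle input clock cycles.
def max_transfer_rate_mb_s(clock_mhz, clocks_per_bus_cycle, data_bus_bits):
    bytes_per_bus_cycle = data_bus_bits // 8
    bus_cycles_per_second = clock_mhz * 1e6 / clocks_per_bus_cycle
    return bytes_per_bus_cycle * bus_cycles_per_second / 1e6  # Mbytes/sec

print(max_transfer_rate_mb_s(16, 4, 16))  # 8.0
print(max_transfer_rate_mb_s(16, 4, 32))  # 16.0
```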
[Figure 1.14a: a crystal-driven clock generator supplies the input/system clock CLK (frequency FS) to the processor, which divides it by 2 internally to obtain the processor clock (frequency Fp); a bus cycle consists of input clock cycles C1-C4 (states T1 and T2), with any wait state Tw elongating it.]
30 For example, consider the clock for the Intel 386 in Figure 1.14a: its external 82384 clock generator produces two clocks: one called “CLK2”, which corresponds to our CLK, and the other “CLK”, which has half the frequency of CLK2 and is used as a “peripheral clock.”
31 For example, an Intel processor driven by a 66-MHz system clock and having an internal processor clock of 33 MHz is usually referred to as a “33-MHz processor.”
[Figure 1.14b: a crystal-driven clock generator supplies CLK, which the processor uses directly as both system and processor clock (FS = Fp); a bus cycle consists of input clock cycles C1-C2 (states T1 and T2), with any wait state Tw.]
Figure 1.14: Input/system clock, processor (internal) clock, “states”, and bus cycles
for representative Intel and Motorola processors
------------------------------------------------------------
Example 1- 7: Comparing the bandwidths of two processors’ external buses
How would we go about comparing the external data bus bandwidths (or transfer rates) of a “clock-
doubled 66-MHz 486DX2” and a “66-MHz Pentium”? As we will see below, what is often
quoted as the “actual bandwidth” of the microprocessor is really the “theoretical
maximum data transfer rate that the microprocessor may achieve only for a few specific types of bus
cycles”. We need some detailed knowledge of the processors’ characteristics to determine the
bandwidth.
a) 66-MHz 486DX2:
When we refer to the 66-MHz 486DX2 we mean a 486 microprocessor that has an internal
“processor clock frequency” FP = 66 MHz; thus, the “processor clock cycle” is CP =1/66MHz
= 15.15ns.
Since the “DX2” suffix means a “clock-doubled” microprocessor (i.e., a microprocessor whose
internal processor clock is double the frequency of its external clock), the “system clock
frequency” is FS = 33 MHz; in other words, although the processor operates internally at twice
the speed of a normal “non-clock-doubled” 486, its external bus continues to run at only 33 MHz.
Therefore, the “system clock cycle” is CS = 30.3ns.
How long does it take to access memory and do a data transfer? We need to know how many
system clock cycles are needed by a 486 bus cycle. Since this number is two, a bus cycle equals
2CS = 60.6ns.
Since the external data bus width of the 486 is 32 bits, we have 4 bytes transferred per 60.6ns,
and thus the “maximum data-bus transfer rate” is 66 Mbytes/sec.
If we now consider a particular bus transfer called “burst transfer” (that we will discuss in
more detail in the next Chapter) with which we have a maximum of 4 bytes transferred every
“clock cycle” instead of every “bus cycle”, then the DX2 bus transfer rate increases to 132
Mbytes/sec.
b) 66-MHz Pentium:
When we refer to a 66-MHz Pentium we mean a Pentium microprocessor whose internal
and external clocks both have the same 66-MHz frequency: FS = FP = 66 MHz. Therefore, CS
= CP = 1/66MHz = 15.15ns. Internally the Pentium is clocked at the same rate (66 MHz) as the
above 486DX2, but it places much more stringent requirements on the speed of the external
components on the motherboard (the Pentium external bus runs at twice the frequency of the
486DX2’s); the motherboard is “system” implementation dependent.
The Pentium also needs 2 clock cycles to implement a bus cycle; in other words, now,
bus cycle = 2* CS = 30.3ns.
But the external data bus width of the Pentium is 64 bits; thus we have 8 bytes
transferred per 30.3ns, which yields a “maximum data-bus transfer rate” of 264 Mbytes/sec.
Again, under the assumptions of a “burst transfer” (i.e., maximum 8 bytes per clock
cycle) we get a Pentium bus transfer rate of 528 Mbytes/sec.
------------------------------------------------------------
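Example 1-7's four numbers can be reproduced with a small Python sketch; the bus clock, clocks per bus cycle, and bus width are those stated in the example, and `burst` models the one-transfer-per-clock case:

```python
# External-bus transfer rate: bytes per transfer times transfers per second.
# In a burst transfer, data moves every clock cycle instead of every bus cycle.
def bus_rate_mb_s(bus_clock_mhz, clocks_per_bus_cycle, bus_bits, burst=False):
    clocks = 1 if burst else clocks_per_bus_cycle
    return (bus_bits // 8) * bus_clock_mhz / clocks

print(bus_rate_mb_s(33, 2, 32))              # 66.0   486DX2
print(bus_rate_mb_s(33, 2, 32, burst=True))  # 132.0  486DX2, burst
print(bus_rate_mb_s(66, 2, 64))              # 264.0  Pentium
print(bus_rate_mb_s(66, 2, 64, burst=True))  # 528.0  Pentium, burst
```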
It can be seen from Figure 1.15 that the duration of what each processor calls a
state (or “local bus state”) varies among the various processors (even among those from
the same manufacturer!). For some products the “state” is smaller than the input clock
cycle time CS; for example, the Motorola 68040 “state” equals one-fourth the input clock
cycle. Asserting and sampling of external signals can be done for Motorola products
either at system clock cycle (CLK) boundaries or at half the system clock cycle
boundaries (every two T states). For other products, the “state” is equal to the input clock
cycle: for example, the Intel 486 states. Finally, there are products whose “state”
duration is longer than that of the input clock cycle CS; for example, the Intel 386 has a
“state” which is twice as long as the input clock cycle. Thus, when we compare a
number of processors, we may find that all of them have the same bus cycle, consisting
for example of four input clock cycles, but some of them may say that they have a “4-
state” bus cycle, while the others that they have a “2-state” bus cycle. For the bus timing
calculations of bus transfers one should use as reference the input clock cycle time CS
rather than the number of “states” in its bus cycle.
The duration of a wait state Tw (which is a state inserted32 to elongate the bus
cycle because a slower slave component cannot respond to the processor’s request within
the allocated time period) also varies among the different products: most often it equals
the input clock cycle, although there may also be products with different duration (e.g., the
32 The actual position in the bus cycle for inserting a wait state Tw depends on the particular processor (as shown in Figure 1.14 and explained later in more detail).
Intel 386 in Figure 1.14a has a wait state equal to two input clock cycles.) It is the
system designer’s responsibility to include the proper external interface logic to force the
processor to insert such wait states. For example, let’s assume that accessing memory
requires one wait state and accessing an I/O port requires three wait states. The design
must include the proper external logic to read and decode the address issued on the
address bus to determine whether it points to memory or I/O space and thus determine
whether one or three wait states will be required for that particular bus cycle.
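The decode logic in this example can be sketched as follows; the address-space boundary chosen for I/O here is hypothetical:

```python
# External decode logic: examine the address on the address bus and decide
# how many wait states to request — 1 for memory, 3 for an I/O port, as in
# the example above. The memory/I-O boundary used here is hypothetical.
IO_SPACE_BASE = 0xFFFF0000  # hypothetical start of the I/O address space

def wait_states(address):
    """Wait states to insert for the bus cycle addressing `address`."""
    return 3 if address >= IO_SPACE_BASE else 1

print(wait_states(0x00100000))  # 1 (memory access)
print(wait_states(0xFFFF0004))  # 3 (I/O access)
```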
Most of the basic bus cycles are considered as having fixed length. In some
processors, a bus cycle equals four input clock cycles (4CS), in others three (3CS), and in
others two input clock cycles (2 CS). Figure 1.15 shows the minimum duration of a bus
cycle for some Intel and Motorola processors. The diagram also summarizes the
relationship among the bus cycle, input clock cycle CS, internal processor clock cycle
Cp, and what various commercial processors call a “state” and a “wait state.”
------------------------------------------------------------
Example 1- 8: Bandwidth, frequency, and data bus width
Consider a 64-bit microprocessor implemented with a 32-bit external data bus and driven by a 100-
MHz input clock. Assume that this microprocessor has a bus cycle whose minimum duration is 4
input/system clock cycles.
a) What is the maximum data transfer rate that this microprocessor can accomplish?
b) In order to increase its performance, how would you compare increasing its external data bus
width to 64 bits versus doubling its input/system clock frequency to 200 MHz?
Answer:
a) System clock cycle = 1 / 100MHz = 10 ns.
Bus cycle = 4 X 10 ns = 40 ns.
Four bytes are transferred every 40 ns; thus, transfer rate = 100 Mbytes/sec.
b) Doubling the frequency will most likely mean adopting a new chip manufacturing technology (if we
assume that the operation remains the same, and each instruction still requires the same number of
clock cycles); in addition, the speed of the memory subsystem will also need to almost double so
that it will not slow down the processor. Doubling the external data bus may be easier, since this
“64-bit processor” may have already foreseen a next-version implementation with 64-bit data
buses; this means wider (maybe newer) on-chip data bus drivers and latches and modifications of
the on-chip “bus interface unit”; moreover, the “wordlength” of the memory subsystem will have
to double to be able to send and receive 64-bit quantities.
------------------------------------------------------------
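Part (a) of Example 1-8 can be reproduced in integer arithmetic, using the values from the example:

```python
# Example 1-8(a): 100-MHz input clock, 4 clocks per bus cycle, 32-bit bus.
clock_hz = 100_000_000        # 100-MHz input/system clock
clocks_per_bus_cycle = 4
data_bus_bytes = 32 // 8      # 32-bit external data bus -> 4 bytes

bus_cycle_ns = clocks_per_bus_cycle * 1_000_000_000 // clock_hz
transfer_rate_mb_s = data_bus_bytes * (clock_hz // clocks_per_bus_cycle) // 1_000_000

print(bus_cycle_ns)        # 40 (ns)
print(transfer_rate_mb_s)  # 100 (Mbytes/sec)
```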
All microprocessor CPUs carry out a bus cycle by executing a sequence of one or more
simultaneous basic operations (referred to as microoperations). All microoperations are
executed in synchronism with the processor’s internal clock, and we will assume here
that each microoperation can be executed within one processor clock cycle CP. We also
know that the activities on the external processor bus are executed in synchronism with
the external or system clock CLK.
Figure 1.15: Relationships among local (or processor) bus cycle, input (or system)
clock cycle CS, processor (internal) clock cycle CP , and “state”.
Figure 1.16a shows the “state transition diagram” a generic processor may follow
to execute a complete external bus cycle. In this case we assume a processor with a bus
cycle composed of four system clock cycles: C1, C2, C3, and C4. In order to simplify
the discussion here, we will assume that the internal and external clocks have the same
frequency and, therefore, a clock cycle C = CP = CS. During each clock cycle, the
processor executes specific microoperations depending on the type of bus cycle it starts.
33 So far we are discussing only synchronous bus transactions. Asynchronous bus operations are covered in the next Chapter.
[Figure 1.16a: state transition diagram for a generic processor’s external bus cycle of four clock cycles C1, C2, C3, C4: the address strobe is asserted during C1 (AS# ← L); the READY# input is then sampled — Ready (READY# = L) lets the bus cycle complete with C4, while Not Ready (READY# = H) keeps the processor waiting.]
*If we assume, say, a 64-bit external data bus, then we will need 8 byte-enables
Some processors do not place the whole m-bit byte-address on the address bus but
only its most significant part and accompany it by asserting proper values on the
processor’s “byte-enable” outputs. For example, consider a 64-bit processor (i.e. the
processor has a 64-bit external data bus connected to a 64-bit main memory) with 32-bit
addresses. This processor may be placing the most significant 29 bits A31-A3 of its 32-
bit address (this is also referred to as a “quadword-address”, i.e., an address whose least
significant 3 bits are zeros and thus points to a quadword location) on the 29-line address
bus and, instead of issuing the three least significant zeros, places proper values (0s or 1s)
on its 8 “byte-enable”34 output pins; in this case, for an m-bit address, we say: Am-1-A3
← part of address (since A2, A1, A0 are not issued), and BE7# - BE0# ← proper values.
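The decomposition just described can be sketched in Python for a 64-bit data bus and 32-bit addresses; the active-low convention follows the BE7#-BE0# style mentioned in the text, and the sample address is illustrative:

```python
# Split a 32-bit byte address into the quadword address driven on A31-A3
# and the eight active-low byte enables BE7#-BE0# for a transfer of `size`
# bytes (the transfer is assumed not to cross a quadword boundary).
def split_address(addr, size):
    quadword_addr = addr >> 3     # the 29 bits A31-A3
    offset = addr & 0x7           # A2, A1, A0: used internally only
    assert offset + size <= 8, "transfer crosses the quadword boundary"
    enables = 0xFF                # all byte enables negated (active-low)
    for lane in range(offset, offset + size):
        enables &= ~(1 << lane) & 0xFF  # assert (drive low) BEn# per byte lane
    return quadword_addr, enables

qa, be = split_address(0x00001234, 4)  # 4-byte transfer at byte offset 4
print(hex(qa))  # 0x246
print(hex(be))  # 0xf  (BE7#-BE4# asserted low, BE3#-BE0# negated)
```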
Anytime the processor starts a bus cycle, it notifies all other modules of the
system that a valid address is placed on the address bus by activating during C1 an
“address strobe” (sometimes called “address latch enable” or ALE) output
synchronization signal: (AS# ← L). This AS# can trigger external latching circuitry to
latch the address and keep it valid for the remaining part of the bus cycle.
The processor also issues “bus cycle identifiers” (in the form of “status” or
“function code” signals listed in Figure 1.6) to inform the rest of the computer system
modules of the type of bus cycle it has initiated. For example, the processor activates a
control signal Memory/IO# to indicate whether access is to be performed to memory
(because it executes a memory-type instruction) or to I/O (because it executes an
input/output instruction, if it has such). The processor also informs the rest of the system
whether a read (or input) or a write (or output) bus cycle is executed by asserting an
output control line. For a read cycle: Read/Write# ← H. Finally, the processor may also
inform the rest of the system whether it is to transfer data (an operand or a result) or it is
to fetch code (a program instruction). For a code fetch: Data/Code# ← L.
Some processors35 also inform the rest of the system modules of the size of the
operand to be transferred during the current bus cycle. They do that by issuing an
“operand size indicator”; for example, a processor with 32-bit operands will issue two
such signals (say, SIZ1 and SIZ0) where 00 indicates a 4-byte doubleword, 01 a byte, 10
a 2-byte word, and 11 a whole cache line. In general, for k such output pins36 we can say:
SIZk-1 - SIZ0 ← proper values.
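The two-pin encoding described above maps onto a simple lookup; the code points are exactly those stated in the text:

```python
# Operand-size indicator (SIZ1, SIZ0) encoding from the text:
# 00 = 4-byte doubleword, 01 = byte, 10 = 2-byte word, 11 = cache line.
SIZE_IN_BYTES = {
    (0, 0): 4,        # doubleword
    (0, 1): 1,        # byte
    (1, 0): 2,        # word
    (1, 1): "line",   # a whole cache line (length depends on the processor)
}
print(SIZE_IN_BYTES[(1, 0)])  # 2
```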
Other control signals are also needed to control external interface devices (“glue
logic”), like the ones shown in Figure 1.7. For example, external transceivers require an
enabling signal in the form of “data enable” (DEN#) and a second signal to indicate the
direction of the data transfer on the data bus, such as a “data transmit/receive” DT/R# (H
34 These “byte-enables” are actually more important for “write” than for “read” cycles, to indicate which byte lanes of the data bus carry valid data and which “byte-sections” of memory must be triggered to receive and store these bytes. The byte-enables may be issued directly from the processor CPU (like the BE7#-BE0# of the Intel Pentium), or must be generated by external circuitry (see Motorola 68030 in Figure 1.5d). For a “read” cycle, these “byte-enable” signals are usually of no importance, because quite often an n-bit memory will always drive all its n output data bus lines, will send to the processor n bits of data, and leave it up to the processor CPU to determine which of its n input data pins actually have the requested data bytes.
35 For example, some Motorola microprocessors.
36 This k depends on the maximum size of the operand the processor can handle; for example, a 64-bit processor may need k = 3 to be able to specify 1 byte, 2 bytes, 3 bytes, 4 bytes, or 8 bytes.
indicating that the processor issues data, and L that the processor is to receive data). These
signals may be issued directly from the processor chip itself or generated by the external
“bus controller” chip (which interprets some status signals the processor issues at the
beginning of each bus cycle).
------------------------------------------------------------
Example 1- 9: Microoperations of a CALL instruction
Consider now the execution of a CALL instruction with the following 3-byte format:
BYTE1: CALL opcode (assume stored at hexadecimal byte-location 1232)
BYTE2: upper half of the target address (first executable instruction of the called
routine)
BYTE3: lower half of the target address.
The 16-bit target address identifies the location of the first executable instruction in the called
routine. Assume that the execution of this instruction is as follows:
(SP) ← (SP) - 2
[(SP)] ← (PC)
(PC) ← BYTE2,BYTE3
where (X) denotes “the contents of”, and [(X)] denotes “the contents of memory location pointed
at by the contents of X. Assume that initially the stack pointer register SP contains the
hexadecimal value ABCE (pointing to the latest entry on top of the stack).
Then Figure 1.17 shows the proper sequence of the most important microoperations
scheduled for execution during the clock cycles of each bus cycle by the generic processor in
Figure 1.16 in order to complete the CALL’s “instruction cycle” (i.e., both its fetch and execute
phases). The assumptions made here also include: no wait states, and that each instruction is
loaded into program memory starting at an even byte address (and therefore fetching them can
always be done on a 16-bit word basis).
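The register-transfer description above can be sketched in a few lines of Python (a toy model: the opcode encoding 0xCD and the target address 2000H are made-up values for illustration, and the 16-bit return address is stored at a single dictionary key for simplicity):

```python
# Toy simulation of the CALL above (hypothetical 16-bit machine; the
# addresses 1232H and ABCE are taken from the example in the text).
mem = {0x1232: 0xCD,   # BYTE1: CALL opcode (0xCD is a made-up encoding)
       0x1233: 0x20,   # BYTE2: upper half of the target address
       0x1234: 0x00,   # BYTE3: lower half of the target address
       0x1235: 0x00}   # BYTE4: fetched but unused (word-based fetching)

pc, sp = 0x1232, 0xABCE

# Fetch phase: two 16-bit word fetches, PC advanced by 2 each time.
opcode, byte2 = mem[pc], mem[pc + 1]
pc += 2                          # (PC) = 1234H
byte3, _ = mem[pc], mem[pc + 1]
pc += 2                          # (PC) = 1236H

# Execute phase, exactly as given in the text:
sp -= 2                          # (SP) <-- (SP) - 2
mem[sp] = pc                     # [(SP)] <-- (PC): save return address
pc = (byte2 << 8) | byte3        # (PC) <-- BYTE2,BYTE3: jump to target

print(hex(sp), hex(mem[sp]), hex(pc))   # 0xabcc 0x1236 0x2000
```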
----------------------------------------------------------------
Example 1-10: Intel Pentium’s bus cycle microoperations
Figure 1.18 depicts the simplified state transition diagrams for the Intel Pentium
processor, along with the most important microoperations needed to execute a memory read and
a memory write cycle. We observe that the bus cycle of the Pentium requires two clock cycles (or
states T1 and T2) and a wait state corresponds to elongating the bus cycle by one input clock
cycle (i.e., repeating state T2). During C1 the Pentium places the 29 most significant bits (a
quadword address) of its 32-bit address on the 29-line nonmultiplexed address bus, and uses
internally bits A2, A1 and A0 to generate the eight output byte enables BE7# - BE0#. The
address status ADS# indicates that a new bus cycle is currently being driven by the
Pentium. Address parity AP is driven with even parity for all bus cycles along with each address
the processor issues. The CACHE# output signal indicates internal cacheability37 of the cycle (if a
read) or a burst37 writeback cycle (if a write). The other two signals, W/R# and D/C#,
distinguish between read and write cycles and between data and code transfers, respectively.
The data is transferred over the proper byte-lanes (D63-D56, D55-D48, D47-D40, D39-
D32, D31-D24, D23-D16, D15-D8, D7-D0) of the 64-bit data bus during C2, for either a read or
a write operation. The maximum data transfer rate for a bus operation is 64 bits for every two
system clock cycles. A bus cycle starts by the processor placing the address and issuing the ADS#
strobe, and terminates when it samples the ready signal BRDY# = L. If BRDY# is High, wait
states are inserted by repeating T2. A Low BRDY# indicates that the external system has
presented valid data on the data pins in response to a read, or has accepted data in response to
a write. On a write cycle, in addition to the data, the Pentium places eight
parity bits on lines DP7 - DP0, one parity bit per byte-lane of the data bus.
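The per-byte-lane parity just described can be sketched as follows (a minimal illustration, assuming the data parity follows the same even-parity rule stated above for the address bus):

```python
# Sketch of even parity per byte lane: one parity bit per byte of a
# 64-bit value, chosen so byte + parity bit together hold an even
# number of 1s.
def even_parity_bit(byte: int) -> int:
    """Return the bit that makes the byte's total 1-count even."""
    return bin(byte & 0xFF).count("1") & 1

def data_parity(qword: int) -> int:
    """Compute the 8 parity bits for a 64-bit value; bit 0 covers
    the lowest byte lane (D7-D0)."""
    dp = 0
    for lane in range(8):
        byte = (qword >> (8 * lane)) & 0xFF
        dp |= even_parity_bit(byte) << lane
    return dp

# Every byte of this value has an odd number of 1s, so all eight
# parity bits are set.
print(f"{data_parity(0x0123456789ABCDEF):08b}")   # 11111111
```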
----------------------------------------------------------------
37
These terms are explained in other chapters.
Bus Cycle 1: Fetch opcode byte and BYTE2 from program memory:
C1: A15-A1<--123H(001)2*=(PC), BE1#-BE0#<--00, AS#<--L,
M/IO#<--H, R/W#<--H, D/C#<--L, SIZE1-SIZE0<--10=16-bit
word**, DEN#<--L, DT/R#<--L;
C2:
C3: sample “port size indicator” input pins (assume here 16-bit
memory port), (PC)<--(PC)+2=1234H (we assume this internal
microop is executed during C3);
C4: latch the two bytes from the data bus, decode the opcode byte
(the control section now determines the length of the instruction
and its format, the need for one more access to memory to fetch
the remaining part of the instruction, and what to do internally
with the received BYTE2), negate all signals, and end current
bus cycle;
* This is the 15-bit most significant part of the address: 123 are the hexadecimal digits for
the leftmost 12 bits and 001 are the next 3 binary bits to the right.
** Because we assumed -- in this example -- that reads from memory using the PC are always
done on the 16-bit basis.
Bus Cycle 2: Fetch BYTE3 (and BYTE4) from program memory (C1-C3
are similar to those of Bus Cycle 1, now with (PC) = 1234H):
C4: latch BYTE3 and BYTE4 of the instruction and move them to
internal registers (the last byte BYTE4 will not be used here),
(SP)<--(SP)-2=ABCCH (assume this internal microop to prepare
the stack pointer for the next bus cycle is executed during C4),
negate all signals, and end current bus cycle;
Bus Cycle 3: Save “return address” on top of the stack and jump to
the called routine:
C1: A15-A1<--ABCH(110)2*=(SP)-2, BE1#-BE0#<--00, AS#<--L,
M/IO#<--H, R/W#<--L, D/C#<--H, SIZE1-SIZE0<--10=16-bit
word, DEN#<--L, DT/R#<--H;
C2: D15-D0 <--1236H;
C3: sample “port size indicator” input pins (assume here a 16-bit
memory port);
increased by increasing both the depth of the pipeline and the processor clock rate.
This leads to superpipelined implementations (discussed in Section 1.7.1), which
have reduced CPIP below 1 (i.e., they execute more than one instruction per
processor clock cycle).
2. The second approach to parallelism is based on the spatial instruction-level
parallelism in a processor that contains a number of independent functional units or
multiple copies of some of the pipeline stages. This leads to superscalar
implementations in which multiple instructions are issued to various independent
functional units, executed in parallel, and completed per processor clock cycle
(discussed in more detail later in Section 1.7.2). Superscalars also achieve a CPIP
value of less than 1 without increasing the processor’s clock rate.
3. Other approaches -- including superpipelined-superscalar processors (a
combination of both superpipelined and superscalar implementation) and VLIW
(Very Long Instruction Word) processors -- are discussed in Sections 1.7.3 and 1.7.4.
[Diagram: two states T1 and T2; the processor moves from T1 to T2, samples BRDY# in
T2, repeats T2 while BRDY# = H (wait states), and terminates the bus cycle when
BRDY# = L.]
Figure 1.18: Pentium’s simplified state transition diagram and most important
microoperations for “memory read” and “memory write” bus cycles.
The instruction pipelining technique came about from the observation that each
instruction cycle can be broken down into a number of steps, each of which takes an
equal fraction of the time needed to complete the entire instruction. For example, if each
instruction of the processor can be broken down into five steps, each step can be assigned
for execution to a different stage of a 5-stage pipeline (see Figure 1.19a). Instructions
enter the pipeline at one end, are processed through all stages, and exit at the other
end. The pipeline accepts new instructions before any previously accepted instructions
have been completely processed and exited from it. The latency of a pipeline execution
stage is the number of cycles between the time an instruction is issued for execution to
the EXE stage and the time a dependent instruction (which uses the result as an operand)
can be issued. In most cases, integer instructions have a single-cycle latency, while
floating-point add and multiply may have a 2-cycle latency. Other more complex
instructions (like integer multiply, floating-point square-root, and all divide instructions)
are computed iteratively and have longer latencies.
When a subtask result leaves one stage, the logic associated with that stage
becomes free and can accept new results from the previous stage. Thus, the rate at which
instructions are fed to the pipeline is chosen in relation to the time required to get an
input through one stage, with the main goal of keeping all portions of the pipeline fully
utilized. Once the pipeline is full the output rate will match the input rate.
All stages of the pipeline operate in parallel, each one executing a step from a
different instruction. (Storage buffers exist between stages to hold temporary results and
inputs to the next stage.) An individual instruction’s computations advance from one
stage to the next, and the instruction gets closer to completion as the end of the pipeline is
approached. In an ideal pipeline each stage takes the same amount of time to execute its
task; in order to simplify the explanation, let’s assume that this time equals one processor
clock cycle CP. Thus, if each instruction is broken down into n steps, then once the n-
stage pipeline becomes full, it effectively executes n instructions simultaneously. Each
instruction still needs the same amount of time to complete from start to finish (n basic
clock cycles), but because n instructions are being processed at a time, once the pipeline
is filled, the rate at which instructions are completed in an ideal pipeline is n times as
rapid (i.e., the instruction bandwidth has increased n times). The time per instruction on
the pipelined processor is equal to the time per instruction on the nonpipelined processor
divided by the number of pipeline stages. This processor would complete one instruction
per processor clock cycle, i.e., it will have a CPIP = 1.
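As a sanity check, the fill overhead of such an ideal pipeline can be computed directly (a sketch under the single-cycle-per-stage assumption above):

```python
# Fill overhead of an ideal n-stage pipeline: the first instruction needs
# n cycles, and each further instruction completes one cycle later.
def pipeline_cycles(n_stages: int, k_instructions: int) -> int:
    return n_stages + (k_instructions - 1)

# 5-stage pipeline, 1000 instructions: the fill overhead is amortized
# and CPI_P approaches 1.
cycles = pipeline_cycles(5, 1000)
cpi = cycles / 1000
print(cycles, cpi)   # 1004 1.004
```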
38
As we will explain in detail in Chapter 6, a translation from virtual to physical address may be required
in this step involving part of the MMU hardware. (In processors with paging, this hardware is called
the TLB or “translation look-aside buffer”.)
39
We assume here that the instruction was found in the ICACHE, i.e., we had a “cache hit.”
40
Like the IF stage, the address generation step in this stage may also involve a “TLB look-up”.
[Diagram: the five pipeline stages IF, ID, EXE, MEM, WB shown over processor clock
cycles Cp 1-5. IF reads from the ICACHE; ID uses the opcode decoder and the CPU
registers; EXE uses the ALU for effective-address (address + offset) calculation; MEM
accesses the DCACHE; WB writes back to the CPU registers. Note: stages IF and EXE
(address generation) may involve a TLB lookup.]
Figure 1.19: Five-stage scalar pipeline with one processor clock cycle per stage. (If
this pipeline operates at, say, 50 MHz, then each stage takes 20 ns to
accomplish its task, and the “pipeline basic clock cycle” = 20 ns; we
assume here that the “pipeline basic clock cycle” = processor clock
cycle).
WB (write back): This stage is used to place into a CPU register either the source
operand read from the DCACHE or the ALU result produced at the EXE stage.
(a) [Timing diagram: instructions 1-5 advance through the stages IF, ID, EXE, MEM,
and WB, one stage per processor clock cycle CP, over clock cycles 1-10; each instruction
enters the pipeline one cycle after the previous one.]
----------------------------------------------------------------
Example 1-11: CPI-frequency balance
We cannot easily judge the processor improvement achieved by considering only the
increase in the processor clock rate. Relatively deeper (and simpler) processor pipelines allow
for higher clock speeds. So what happens when we extend the pipeline by adding one additional
stage and increase the clock frequency by 50%? Would this also increase the processor’s
performance by 50%? The naïve answer is yes; the following example (from [25]) shows why
this is not correct.
Consider a program segment that takes 100 clock cycles in a pipelined processor with a
processor clock frequency of 100 MHz. Then, the baseline internal architecture of the processor
would take 1 microsecond to execute this segment. Suppose we modify the processor’s internal
pipeline by adding an extra stage for LOADs. If LOADs are 30% of all operations (and assuming
each pipeline stage takes one processor clock), this would add 30 clocks to this program
segment, which would now take 130 clocks to execute. However, since this extra pipeline stage
allows the processor clock frequency to increase to 150 MHz, the total execution time becomes
130/150 or 0.867 microseconds. This amounts to a 15 percent performance improvement
(1/0.867 ≈ 1.15) instead of the easy (and wrong) answer of a 50 percent improvement.
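The arithmetic of this example can be reproduced step by step (a minimal sketch; it assumes the 100 baseline clocks correspond to 100 operations, one per clock):

```python
# The arithmetic of Example 1-11, step by step.
baseline_clocks = 100
baseline_freq_mhz = 100.0
baseline_time_us = baseline_clocks / baseline_freq_mhz   # 1.0 us

load_fraction = 0.30                 # LOADs are 30% of all operations
extra_clocks = int(baseline_clocks * load_fraction)      # +30 clocks
new_clocks = baseline_clocks + extra_clocks              # 130 clocks
new_freq_mhz = 150.0                                     # +50% clock rate
new_time_us = new_clocks / new_freq_mhz                  # ~0.867 us

speedup = baseline_time_us / new_time_us
print(f"{new_time_us:.3f} us, speedup = {speedup:.2f}")  # 0.867 us, speedup = 1.15
```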
----------------------------------------------------------------
In this Section we present the operation of more advanced microprocessor systems using
as an example the baseline 5-stage pipelined processor model of Figure 1.21.
[Diagram: the ICACHE (instruction cache) feeds a prefetch, queue, decode and dispatch
unit; the FXU (integer instruction execution unit) and the FPU (floating-point execution
unit) exchange data with the data register file (RF); the DCACHE (data cache) connects
to the external bus lines; instructions and data/operands flow through the stages IF, ID,
EXE, MEM, and WB.]
Figure 1.21: The model of a 5-stage scalar pipelined processor. (One instruction is
executed per processor clock cycle).
external bus cycles to access main memory only when there is an ICACHE miss, i.e., the
instruction was not found in the ICACHE.
The decode stage takes its instructions from the on-chip ICACHE. Remember
that access to main memory will take place only if there is a miss on the ICACHE.
During such a long access to main memory the processor is either “stalled” waiting for
the instruction to be fetched or -- in more sophisticated processors -- will instead go
ahead and decode and execute other instructions already prefetched and queued from the
on-chip ICACHE. For example, a processor may incorporate two independent, line-size
(32-byte), undecoded instruction stream queues: one to buffer sequential instructions
and the other to buffer instructions from the branch target buffer (for example, the Pentium
processor discussed later in Example 1-12.) Another processor may have only a single
instruction stream queue, from which its dispatch logic can choose -- not necessarily in
strict FIFO order -- which, say, n of the bottom m instructions to issue for execution to its
n functional units (e.g., the PowerPC chooses which 3 of the bottom 4 instructions to
issue to its 3 functional units FXU, FPU, and branch processing unit.)
More advanced processors, in addition to the on-chip cache (the “level-1 cache”)
may also have an additional external “level-2 cache”; in that case, when there is a miss
in the on-chip cache, the level-2 cache will be first examined for the information
(instruction or data); only if there is a miss in this second-level cache too will access to
main memory be performed. First-level cache misses initiate bus transactions (to main
memory or level-2 cache) in the form of “cache line transfers” (as explained in detail in
Chapter 2).
In any case, when an access to the ICACHE is performed, instead of fetching only one
instruction, a number of instructions is prefetched by reading out a whole “cache line”
(for example, 16 or 32 bytes long). The prefetched bytes are then rotated so that they are
justified for the instruction decoder, i.e., the first byte presented is the starting byte of the
next instruction.
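The prefetch-and-rotate step can be sketched as follows (a toy byte-addressed memory with 16-byte lines; hypothetical contents, instruction boundaries ignored):

```python
# Toy sketch of line-based instruction prefetching: an ICACHE access
# reads out a whole 16-byte line, which is then rotated so that the first
# byte handed to the decoder is the start of the next instruction.
LINE_SIZE = 16
memory = bytes(range(256))            # stand-in for program memory

def fetch_line(pc: int) -> bytes:
    base = pc & ~(LINE_SIZE - 1)      # align down to the line boundary
    line = memory[base:base + LINE_SIZE]
    offset = pc - base
    return line[offset:] + line[:offset]   # rotate to the PC's byte

# PC = 0x47 lies in the line 0x40-0x4F; the decoder sees byte 0x47 first.
print(fetch_line(0x47)[:4].hex())     # 4748494a
```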
The register file in the CPU is accessible by both the integer and the floating
point units, or each unit may have its own specialized registers. The out-of-order
execution units are intelligent enough to know the original order of the instructions in the
program and re-impose program order when the results are to be committed (“retired”) to
their final destination registers.
Finally, all operands are fetched with LOAD instructions from the on-chip
DCACHE; only on a cache miss will the processor start an external bus cycle to access
main memory (or a second-level cache) for the operand. (Quite often in such cases the
processor will not only fetch the requested operand but will start a “burst bus cycle41” to
fetch the whole “cache line” that contains this operand in memory). Similarly, all
STORE instructions store their results in the on-chip DCACHE. (Whether this result is
also sent to main memory depends on how the DCACHE works, as will be explained
in more detail in Chapter 5).
Quite often, to isolate operations between main memory and the DCACHE and to
smooth the flow of data between the slower memory and the faster processor, “load
queues” and “store buffers” are used inside the processor. “Load queues” hold operands
that come from memory during read cycles, while the “store buffers” hold data that the
processor sends to memory during write cycles. For example, the 486 and the Pentium
have only store buffers, while other advanced RISC processors have both load queues
and store buffers.
The above description applies to ideal operations of pipelines in which all stages require a
single processor clock cycle and instructions are issued to the pipeline in such a way as to
keep it always filled. Unfortunately, data dependencies and control hazards (associated
with the execution of conditional-branch instructions), present various problems that do
not allow such ideal operation. We will give here only a brief introduction to the type of
problems that this internal pipelining may face.
a) Data dependencies
One data dependency problem is the read-after-write or RAW 42. This problem arises
when an instruction depends on previous ALU results; i.e., a subsequent instruction
requires data that are to be produced by the previous instruction. In this case, the
instruction should not be started before these results are available43. Another problem is
the write-after-write or WAW 44 hazard: when an instruction writes to a register (or
resource in general) after a subsequent instruction has already written to the same
41
Burst bus cycles are covered in Chapter 2.
42
This RAW hazard is also called “true data dependence” or “destination-source conflict”.
43
Processors with a single, unified, on-chip cache, avoid this read-before-write hazard by giving priority to
the WB (write back) stage over the “operand fetch” stage (since both share this single on-chip CACHE.)
44
This WAW hazard is also called “output data dependence” or “destination-destination conflict”.
register, thus leaving the register with old stale data. A third data dependency problem is
the write-after-read or WAR 45: when an instruction attempts to write a result to a
register (or a resource) before a previous instruction reads the old data. Finally,
sometimes a problem may arise when an attempt is made to modify an instruction by a
preceding store operation. This can be handled by the decoder in the control section,
which tests each instruction as soon as it is loaded to see whether it is a store-type
instruction, in which case the fetch sequence must wait until the effective address has
been prepared, to see whether it is going to modify a successive instruction.
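The three register hazards can be identified mechanically from the registers each instruction reads and writes; a toy classifier (instructions as hypothetical (destination, [sources]) tuples, not any real ISA encoding):

```python
# Toy classifier for the three register hazards between two
# instructions in program order.
def hazards(earlier, later):
    e_dst, e_srcs = earlier
    l_dst, l_srcs = later
    found = []
    if e_dst in l_srcs:
        found.append("RAW")   # later reads what earlier writes
    if e_dst == l_dst:
        found.append("WAW")   # later overwrites earlier's destination
    if l_dst in e_srcs:
        found.append("WAR")   # later writes what earlier still reads
    return found

# r3 <- r1 + r2 followed by r4 <- r3 + r1: a true (RAW) dependence on r3.
print(hazards(("r3", ["r1", "r2"]), ("r4", ["r3", "r1"])))   # ['RAW']
```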
Both software and hardware solutions have been used to handle these data dependencies.
They are listed here without going into their explanation.
Software solutions include: compiler-inserted NOOPs (no-operation
instructions), basic block scheduling, and list-scheduling. Hardware solutions include:
pipeline interlocks (always stall the pipeline until the dependency is resolved),
forwarding (always pass the ALU result directly to the functional unit that requires it
before it is written into the register), scoreboarding (check whether a decoded instruction
has a destination register used as a source register by another instruction, to ensure that a
source operand is not fetched from a register that is currently waiting for a result),
reservation stations (a distributed way of detecting data dependencies and passing the
results directly to the functional units rather than to registers), and register renaming (that
requires an increased number of registers to allow for multiple instances of registers, and
implemented through a mapping table, a reorder buffer, or future file).
45
This WAR hazard is also called “anti data dependence” or “source-destination conflict”.
instructions so that some other unrelated instruction of the program is executed in the
“load delay slot” (i.e., placed between the LOAD instruction and the immediately
following instruction that uses the loaded data), or if no such unrelated instruction can be
found, insert a NOOP instruction in the “load delay slot”. One hardware solution
followed by Intel 486 is to eliminate the load delay by rearranging the pipeline so that
memory addresses are computed in the second decode stage D2 of the pipeline before the
EXE stage [4]. Superpipelined processors may have a load delay slot of 2 or more (see
the MIPS R4000 in Appendix B), making the solution to the load delay problem more
difficult.
c) Control hazards
Software solutions include the following (listed here without further discussion):
branch spreading, scheduling the branch delay slot, branch folding, software (static)
branch prediction, trace scheduling, loop unrolling, software pipelining, register
renaming, and scheduling across branches.
Having introduced a pipeline into modern processors, how can the pipeline be modified
to further decrease the processor’s TP and thus increase its throughput? Superpipelining
accomplishes this by increasing the pipeline depth and increasing the pipeline clock rate;
in other words, superpipelining increases processor performance by reducing the
processor cycle (i.e., by reducing the Cp term in Eq. 1.7). Longer pipelines provide finer
granularity in instruction execution; for example, a bottleneck stage or a stage that
requires longer time to execute can be subdivided into two independent stages.
Increasing the pipeline depth, however, requires faster clocks and an increase in the rate
at which instructions enter and leave the pipeline.
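The payoff of superpipelining of degree n can be sketched numerically (a toy model: s major stages each split into n substages clocked n times faster; fill overhead included, hazards and stalls ignored):

```python
# Toy timing model for a superpipelined implementation of degree n:
# each of the s major stages is split into n substages, and the
# substage clock cycle is C_P / n.
def superpipe_time_ns(cp_ns: float, s_stages: int, degree: int,
                      k_instr: int) -> float:
    sub_cycle = cp_ns / degree                      # substage clock cycle
    sub_cycles = s_stages * degree + (k_instr - 1)  # fill + 1 issue/cycle
    return sub_cycles * sub_cycle

# The 5-stage, 50 MHz (C_P = 20 ns) pipeline running 1000 instructions,
# as a basic pipeline (degree 1) and superpipelined to degree 2:
base = superpipe_time_ns(20.0, 5, 1, 1000)   # 20080.0 ns (~20 ns/instr)
deep = superpipe_time_ns(20.0, 5, 2, 1000)   # 10090.0 ns (~10 ns/instr)
print(base, deep)
```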
Figure 1.23a shows again the operation of our example 5-stage pipeline of
Figure 1.20b with its basic clock cycle equaling the processor clock cycle (and CP =
20ns); we had said that this pipelined processor completes one instruction every 20ns,
i.e., it has a CPIP = 1. Figure 1.23b now depicts how the operation of an ideal 10-stage
superpipelined implementation of degree 2 would compare with it. (A major stage in the
pipelined processor is replaced by two substages, the substages are clocked at twice the
frequency of the major stage, and the processor initiates an operation at each substage on
each of the smaller clock cycles). In general, in a superpipelined implementation of
degree n, the pipeline clock needs to be n times as fast as the pipeline clock of the basic
pipeline. Thus, the pipeline clock rate of Figure 1.23b has doubled to 100 MHz (which
allows feeding instructions to the pipeline at twice the previous rate). When comparing
the pipelined implementation in Figure 1.23a with that of this superpipelined
46
High-performance processors include chips such as the Pentium II and Pentium Pro, the MIPS R4000-
and R10000-series, the PowerPC 620, the Alpha 21164, the UltraSparc-II, the PA-8000, etc.
[Diagram: block diagram of the 5-stage pipelined processor model, as in Figure 1.21.]
[Figure 1.23: a) the 5-stage scalar pipeline: internal processor clock = 50 MHz, each
stage takes 20 ns, one instruction is issued every 20 ns; ideally, CPIp = 1. b) the 10-deep
superpipelined implementation of degree 2: each major stage is split into two substages
(IF1, IF2, ID1, ID2, EXE1, EXE2, MEM1, MEM2, WB1, WB2), the internal processor
clock cycle C'P is 10 ns, and a new instruction enters the pipeline every 10 ns.]
a) Basic concepts
Superscalar processors are built on the principle that more than one instruction can be
fetched, decoded, executed, and completed in parallel. A prerequisite to the superscalar
architecture is the existence inside the processor of a number of independent “functional
units” (integer execution units, floating-points units, load/store units, graphics units, etc.).
A superscalar implementation of degree or way or issue m means that the processor can
(ideally) fetch, decode, issue, complete, and “retire” (or “graduate”) m instructions per
processor (pipeline) clock cycle. An m-way superscalar may have more than m
independent functional units (to allow the execution of more than m instructions per
clock cycle, but it still retires or graduates only m instructions per clock cycle).
In a superscalar implementation, the processor clock remains the same as that
of our earlier regular (basic) scalar implementation, but superscalar techniques increase
processor performance by reducing the average number of clock cycles per instruction
(i.e., by reducing the CPIp term in Eq. 1.7).
Figure 1.25a shows the model of an 8-stage scalar processor and Figure 1.25b
that of an 8-stage 4-way superscalar processor. In both scalar and superscalar processors
their functional units are pipelined47.
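Ideally, an m-way machine drives CPIp toward 1/m without touching the clock; a back-of-the-envelope sketch (fill overhead included, no stalls):

```python
# Ideal m-way superscalar: after the pipeline fills (depth cycles for
# the first group), one group of m instructions completes every cycle.
def ideal_cycles(depth: int, m_way: int, k_instr: int) -> int:
    groups = (k_instr + m_way - 1) // m_way   # ceil(k / m) issue groups
    return depth + groups - 1

cycles = ideal_cycles(8, 4, 1000)   # 8-stage pipeline, 4-way, 1000 instr
print(cycles, cycles / 1000)        # 257 0.257
```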
long, and presents 16 aligned bytes to the decoders every clock cycle. The decode stage
buffers multiple 16-byte fetches and rotates prefetched bytes to the starting point of the
next instruction.
Figure 1.24: Superscalar operation: internal processor clock = 50 MHz, each stage
takes 20ns, 2 instructions are dispatched simultaneously and executed in 1 processor
clock cycle; ideally, CPIP = 0.5 [19].
as that of the scalar processor. To select the ideal mix of instructions, superscalar
microprocessors follow different “instruction issue rules”. (As an example, the
“instruction pairing” and “instruction issue rules” for Pentium are given below in
Example 1-12).
An instruction is issued when it is handed over to a functional unit for execution.
More advanced processors have made the fetch/decode/RF unit more intelligent in
predicting program flow, have included some kind of “instruction pool” or one or more
“reservation stations” or “instructions queues” (where decoded instructions wait until all
their operands are ready and a functional unit is available) and then instructions are
“issued” in parallel for “out-of-order” execution to the many functional units. In an out-
of-order superscalar processor, each instruction is eligible to begin execution as soon as
its operands become available regardless of the original instruction sequence. The
hardware rearranges instructions in order to keep the various functional units busy. This
is called “non-sequential dynamic execution scheduling” or “dynamic instruction
issuing”. With such dynamic execution scheduling, the processor can operate at its
highest efficiency (functional units are kept from going idle) by reordering instructions to
suit the available functional unit resources. The instructions can be executed and
completed out-of-order and then reordered (or “retired” or “graduated”) back in their
original program order, as shown below in Figure 1.26. Such processors have replaced
the classical “execute” phase by two decoupled phases: an “issue/execute” and a “retire”
phase.
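The issue/execute-then-retire idea can be sketched in a few lines (a toy scheduler with unit latencies, unlimited functional units, and no register renaming; purely illustrative, not any real microarchitecture):

```python
# Toy model of dynamic instruction issuing: instructions wait until
# their source operands are ready, complete out of order, and are
# retired (graduated) strictly in program order.
def schedule(program):
    # program: list of (dest_reg, [source_regs]) tuples in program order
    ready = {f"r{i}" for i in range(32)} - {d for d, _ in program}
    finished = [False] * len(program)
    completion_order, retire_order = [], []
    while len(retire_order) < len(program):
        ready_now = set(ready)        # operands visible this "cycle"
        for i, (dst, srcs) in enumerate(program):
            if not finished[i] and all(s in ready_now for s in srcs):
                finished[i] = True    # issue + execute + complete
                ready.add(dst)        # result available next cycle
                completion_order.append(i)
        # retire: commit results in the original program order
        while len(retire_order) < len(program) and finished[len(retire_order)]:
            retire_order.append(len(retire_order))
    return completion_order, retire_order

# Instruction 1 depends on instruction 0; instruction 2 is independent,
# so it completes before 1 but still retires after it.
prog = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r5"])]
done, retired = schedule(prog)
print(done, retired)   # [0, 2, 1] [0, 1, 2]
```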
a) The model of an 8-stage pipelined scalar processor. [Diagram: a BIU (bus interface
unit) on the external bus lines; ICACHE access, instruction decode, and RF access stages
feed the FXU and a three-stage pipelined FPU (FPU1, FPU2, FPU3); results are written
to the data register file (RF) and the DCACHE.]
[Diagram: the ICACHE feeds three decoders; a dispatch stage resolves dependencies and
accesses the RF; two integer units (FXU 1, FXU 2) and two three-stage pipelined FPUs
write the data register file (RF) and the DCACHE, which connects to the external bus
lines.]
b) The model of an 8-stage pipelined 4-way superscalar processor. (“4-way” or “4-issue” or
“degree-4” superscalar: up to 4 instructions executed per cycle. For example, 2 integer and 2
floating-point instructions executed simultaneously in their functional units.)
An instruction is complete when its result has been computed and stored in a
temporary physical register.
[Figure 1.26: instruction flow over time: instructions are fetched and decoded in order;
issued, executed, and completed out of order; and then graduated (retired) in order.]
2. MIPS R10000: This processor is presented in more detail below in Appendix A.4.2.
3. Sun Microsystems UltraSparc-II: Figure 1.5d depicts the external view of the
UltraSparc microprocessor with its address and data bus lines. The UltraSparc is a
pure 64-bit superscalar microprocessor (with 64-bit internal registers and ALUs, 41-
bit physical addresses, a 128-bit external data bus, and an independent SBus for
I/O to slower peripherals); it is a 4-way superscalar, contains nine execution units
(two integer ALUs, five FPUs, a branch-processing unit, and a load/store unit), its
longest pipeline is 9 stages, it does not issue instructions out-of-order, it is one of the
very few processors that have a large number of registers to implement “register
windows”, and is optimized more for multimedia and graphics applications.
4. PowerPC 620: Figure 1.5d depicts the external view of the PowerPC
microprocessor with its address and data bus lines. The PowerPC is a pure 64-bit
superscalar processor (with 64-bit registers, 40-bit physical addresses to access 1
TB49 of physical main memory, and a 128-bit external data bus), it can software-
switch between 64- and 32-bit modes (and big-endian and little-endian modes), it is a
4-way superscalar, contains six functional units (three integer ALUs, an FPU, a
branch unit, and a load/store unit), it performs “dynamic instruction scheduling and
49
TB = terabyte = 2^40 bytes (1024 gigabytes).
----------------------------------------------------------------
Example 1-12: A 32-bit superscalar processor (Intel Pentium)
The Intel Pentium is a 32-bit 2-way superscalar microprocessor (with 32-bit internal registers
and 32-bit addresses) but has a 64-bit external data bus to improve the data transfer rate [24,25].
The Pentium uses a number of hardwired simple instructions (loads, stores, and simple ALU
instructions), while the more complex ones are microcoded (thus, it is a mixture of CISC and RISC
50
PA stands for “Precision Architecture”.
processor). The input/system clock cycle is the same as the processor’s internal clock cycle (CS =
CP) and a Pentium bus cycle is composed of 2 system clock cycles.
Figure 1.5b shows the external view of the Pentium chip with its 29-line address bus
A31-A3, the 8 byte-enables BE0#-BE7#, the 64-line data bus D63-D0, and an 8-line DP7-DP0 bus
that carries data parity bits (one bit per byte-lane of the data bus). Internal parity checking is
done on the Pentium chip.
When comparing Pentium’s simplified internal structure in Figure 1.27a with that of the
486 in Figure 1.10, we notice that the Pentium has an internal Harvard architecture with a
separate 8-KByte ICACHE and an 8-KByte DCACHE, both internal and external data buses are
64-bit buses, the microprocessor has a superscalar architecture with two integer execution units51
(ALUs) called “u-pipe” and “v-pipe”, and a more sophisticated pipelined FPU. (However, it
does not have a parallel load/store unit that other superscalar microprocessors have). The on-
chip MMU is completely compatible with Intel 486’s MMU. Each cache in Figure 1.27a has its
own TLB (Translation Lookaside Buffer) to translate linear logical addresses to the physical
addresses used by each on-chip cache.
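The TLB lookup described above can be sketched as a small table of recent page translations. This is a minimal illustration of the mechanism only; the page numbers and frame numbers below are invented, and the Pentium’s actual TLB organization (set associativity, entry counts) is not modeled:

```python
# Minimal sketch of a TLB: a small table mapping virtual page numbers
# to physical frame numbers. A hit translates the address immediately;
# a miss would require walking the page tables (not shown).
PAGE_SIZE = 4096  # 4-KB pages, as on the x86

def translate(tlb: dict, linear_addr: int):
    page, offset = divmod(linear_addr, PAGE_SIZE)
    if page in tlb:                      # TLB hit
        return tlb[page] * PAGE_SIZE + offset
    return None                          # TLB miss: page-table walk needed

tlb = {0x12: 0x7A}                       # page 0x12 cached at frame 0x7A
print(hex(translate(tlb, 0x12345)))      # 0x7a345
```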
Integer instructions execute in the Pentium by passing through two 5-stage integer
pipelines, the “u-pipe” and the “v-pipe”. Each integer pipeline has its own ALU, address-generation
circuitry, and DCACHE interface. The Pentium also has an 8-stage FPU pipeline, the first 5
stages of which it shares with the integer unit (Figure 1.27b). Floating-point instructions use the
fourth stage (EX) to fetch the operands (called the OF stage) and the fifth stage (WB) as the first
execution stage (called X1); or one can say that a 3-stage floating-point instruction pipeline
(stages X2, WF, and ER) is appended to the integer pipelines. The v-pipe can execute simple
integer instructions as well as the FXCH floating-point exchange instruction; the u-pipe can
execute all integer and floating-point instructions. A floating-point instruction executes in the
u-pipe but uses both integer pipelines to fetch a 64-bit operand in a single cycle. (For this
reason, except for pairing with the FXCH52 instruction, it is not possible to perform two
floating-point operations in parallel.)
51 Two integer pipelines are also found in the Sun Microsystems SuperSPARC and the Motorola 88110.
52 The FXCH instruction swaps any FPU register with the top of the stack. On the Pentium, this instruction can be issued to the v-pipe in parallel with most other floating-point instructions.
Figure 1.27: (a) Simplified internal structure of the Pentium: prefetch buffers, two decoders, a branch target buffer (BTB), an 8-KByte ICACHE and an 8-KByte DCACHE, the two integer units (ALUs) of the u-pipe and v-pipe, the pipelined FPU (add, multiply, and divide units), the register file, and the 64-bit external data bus. (b) The Pentium pipelines: integer instructions pass through 5 stages (PF, D1, D2, EX, WB) in the u-pipe and v-pipe; floating-point instructions pass through 8 stages (PF, D1, D2, OF, X1, X2 “execute stage 2”, WF “round results”, ER “error report”), the first five shared with the integer pipelines.
There are certain rules for pairing instructions in the Pentium and for its integer pipelines’
operation:
1) Instruction pairing rules:
The Pentium processor can issue one or two instructions every clock cycle. In order to issue two
instructions simultaneously, they must satisfy the following conditions:
• Both instructions in the pair must be “simple” instructions.
• The destination of the first instruction is not the source of the second53 (i.e., there
must be no read-after-write register dependence)
• The destination of the first instruction is not the destination of the second (i.e., there
must be no write-after-write register dependence)
• The first of the two instructions is not a jump instruction.
If these conditions are met, then the first instruction is issued to the u-pipe and the second to
the v-pipe. Otherwise, only the first instruction is issued to the u-pipe.
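The four pairing conditions above can be summarized in a short sketch. The instruction representation below (a record with an opcode, one source register, one destination register, and two flags) is purely illustrative, not the Pentium’s actual encoding:

```python
from collections import namedtuple

# Illustrative instruction record: opcode, source reg, destination reg,
# whether it is a "simple" instruction, and whether it is a jump.
Instr = namedtuple("Instr", "op src dst simple is_jump")

def can_pair(first: Instr, second: Instr) -> bool:
    """Apply the Pentium's four pairing conditions from the text."""
    return bool(first.simple and second.simple   # both must be "simple"
                and first.dst != second.src      # no read-after-write
                and first.dst != second.dst      # no write-after-write
                and not first.is_jump)           # first must not be a jump

i1 = Instr("ADD", "R2", "R1", simple=True, is_jump=False)
i2 = Instr("MOV", "R3", "R4", simple=True, is_jump=False)
i3 = Instr("SUB", "R1", "R5", simple=True, is_jump=False)  # reads R1

print(can_pair(i1, i2))  # True  -> issue i1 to u-pipe, i2 to v-pipe
print(can_pair(i1, i3))  # False -> RAW on R1; only i1 issues (u-pipe)
```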
2) Integer instruction issue rules:
Both the u-pipe and v-pipe instructions enter and leave the D1 and D2 stages in unison.
When an instruction in one pipe is stalled, the instruction in the other pipe is also stalled
at the same pipeline stage. Thus, both the u-pipe and v-pipe instructions enter the EX stage in
unison. Once in EX, if the u-pipe instruction is stalled, then the v-pipe instruction (if any)
is also stalled (but not the other way around): if the v-pipe instruction is stalled, the
instruction paired with it in the u-pipe is allowed to advance. No successive instructions are
allowed to enter the EX stage of either pipeline until the instructions in both pipelines have
advanced to WB.
53 Logic in D1 ensures that the source and destination registers of the instruction issued to the v-pipe differ from the destination register of the instruction issued to the u-pipe.
• A limited pairing of two FP instructions can be performed (only when the second
instruction in the FP pair is the “FP exchange” instruction FXCH).
• FP instructions that are not directly followed by an FP exchange instruction are
issued singly to the FPU (to the u-pipe).
--------------------------------------------------------------------
54 Some earlier VLIW machines include those from Multiflow (HP), Culler, and Cydrome.
55 VLIW processors can also reduce the CPIp term in Eq. 1.7 because – as in superscalars – multiple instructions are issued per clock cycle. (Sometimes, superscalar and VLIW processors are referred to as “multiple-issue processors”.)
Figure 1.28: A generic VLIW (Very Long Instruction Word) processor and instruction format (each very long instruction word packs several operations, such as an ADD on registers R1 and R2, into fixed fields of a single wide instruction).
Newer, more advanced techniques that allow extracting more parallelism from integer
code are continuously being developed and used by compilers.
Another concern with the VLIW approach is that code portability may be
compromised: VLIW code assumes a given underlying hardware configuration, which
means there is no binary compatibility between processor generations. On the other
hand, VLIW has the advantage of being able to implement old CISC instruction sets
more effectively than RISC can (allowing a company to produce a VLIW processor
that is compatible with old programs written for the company’s CISC products).
The first stage involves the specification of requirements that the computer
system must meet. These include functional requirements (such as the type of
applications the system is to execute, the programming language to be used, the type of
operating system needed, etc.), other system characteristics (such as upper bounds on cost
and lower bounds on performance, expandability objectives, etc.), and the identification
of additional constraints (such as existing application software, limitations on power,
size, and weight, and other compatibility constraints). This first stage is very difficult
and involves both tangible and intangible requirements.
Then one identifies the characteristics (such as speed and size) of the remaining
modules of the computer system in such a way as to satisfy the overall requirements and
present a well balanced design. The most important system module of course is the
processor. Assuming that the system designer has a free choice (various reasons,
including company associations and agreements, may restrict this choice), careful
consideration must be given to the type of processor to be used. In this textbook, we not
only present the architecture and operation of a number of representative RISC and CISC
processors, but also identify their particularities, characteristics, technological constraints,
and architectural advances to help the system designer perform a good trade-off analysis
among them in selecting the most appropriate processor that meets the requirements of
the system to be built. With today’s high-performance processors, it is imperative to
have a well-designed memory hierarchy (an effective cache and memory architecture) to
match the processor’s rate of execution. The faster the memory hierarchy can get
instructions and data into the processor, the better the overall performance of the system.
The faster the processor and the larger the first-level cache it has on-chip, the more
necessary it becomes to include large, second-level, external caches. Finally, the
designer can then develop the I/O subsystem for interfacing to the outside world. Major
subsystem modules and alternative interconnections have already been presented in this
chapter, and many more alternatives will be given in the rest of the chapters.
In establishing the most appropriate system architecture, one can do the following:
1. Analytical methods: Sometimes analytical approximation methods can be used to
evaluate the performance of the processor and the bandwidth of the memory hierarchy
(caches and main memory). Parameters that affect the performance of the memory
hierarchy (caches and main memory) and approximation formulas used in estimating
system performance are discussed in detail in Chapters 5 and 6. Analytical models are
difficult to derive, and they only approximate the behavior of the actual system.
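A classic example of such an analytical approximation is the average memory-access time of a cache/main-memory pair. The hit time, miss rate, and miss penalty below are illustrative values, not measurements of any particular system:

```python
# Average memory-access time (AMAT) for a one-level cache:
#   AMAT = hit_time + miss_rate * miss_penalty
def amat(hit_time_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    return hit_time_ns + miss_rate * miss_penalty_ns

# Example: 2 ns cache hit, 5% miss rate, 60 ns main-memory penalty.
print(amat(2.0, 0.05, 60.0))  # 5.0 ns on average
```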
2. Software simulation: Commercial software simulators may exist, or one can be
developed (in the form of a computer program), to evaluate the model numerically over
a time period [16]. This approach is used to verify the model and gives insight into
the behavior of the system.
The next stage involves the development of the hardware and software usually
done in parallel [20]. In this textbook we do not cover the extensive topic of software
development. Instead we concentrate on the design to construct the hardware
prototype; this involves selecting the basic components (memory chips, I/O, interface
components, cache memories, controllers, buffers, etc.) and designing the various boards
of the actual computer system. We present techniques for designing the main memory
subsystem, the cache subsystem, selecting the system bus to interconnect them together,
and we discuss approaches for handling external interrupting devices that request service
from the system. Quite often, a trade-off has to be performed in deciding whether the
functions will be implemented in software or in hardware (such as choosing between
software-based floating-point routines and an attached hardware FPU). The choice of each component depends on
the match between the design requirements for that subsystem (that were established with
the architecture during the previous design stage) and how well the components fit those
requirements.
The next stage is to test and debug the prototype and verify it completely before
committing to the final system. The prototype is exercised (by executing representative
programs – benchmarks -- from the kinds of applications the system will run) and
modified until it satisfies the given system performance requirements. The prototype
construction and verification is an iterative process.
Finally, hardware and software are integrated together, the actual computer
system is built, tested, debugged, and enters the production stage.
BIBLIOGRAPHY
[1] Hennessy, J. L., and D.A. Patterson, Computer Architecture: A Quantitative Approach, Second
edition, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1996.
[2] Jouppi, N., and D. Wall, “Available Instruction-Level Parallelism for Superscalar and
Superpipelined Machines”, Proc. Third Conf. Architectural Support for Programming Languages
and Operating Systems, ACM, Apr. 1989, pp. 272-282.
[3] Diefendorff, K., and M. Allen, “Organization of the Motorola 88110 Superscalar RISC
Microprocessor,” IEEE Micro, Apr. 1992, pp. 40-63.
[4] Intel Corp., 80386 Hardware Reference Manual, (231732-001), Santa Clara, CA, 1986.
[5] Intel Corp., i486 Microprocessor (240440-001), Santa Clara, CA, 1989.
[6] Integrated Device Technology, Inc., R3000/3001 Designer’s Guide, Santa Clara, CA, 1990.
[7] MIPS Computer Systems, Inc., MIPS R4000 Microprocessor User’s Manual (M8-00040),
Sunnyvale, CA, 1991.
[8] Motorola Inc., MC68030 Enhanced 32-bit Microprocessor User’s Manual, (MC68030UM/AD),
Austin, TX, 1987.
[9] Motorola Inc., MC68040 32-bit Microprocessor User’s Manual (MC68040UM/AD), Austin, TX,
1989.
[10] Motorola Inc., MC88100 Technical Data (BR588/D), Phoenix, AZ, 1988.
[11] Intel Corp., i860 64-bit Microprocessor Hardware Reference Manual (CG-101789), Santa Clara,
CA, 1990.
[12] Piepho, R.S., and W.S. Wu, “A Comparison of RISC Architectures”, IEEE Micro, Aug. 1989, pp.
51-62.
[13] Allison, A., “RISCs Challenge Mini, Micro Suppliers,” Mini-Micro Systems, Nov. 1986, pp. 127-
136.
[14] Patterson, D.A., “Reduced Instruction Set Computers”, Communications of the ACM, Vol. 28, No. 1, Jan.
1985, p. 189.
[15] Hennessy, J.L., “VLSI Processor Architecture,” IEEE Transactions on Computers, Vol. C-33, No.
12, Dec. 1984.
[16] Law, A.M., and W.D. Kelton, Simulation Modeling and Analysis, McGraw-Hill Book Company,
San Francisco, 1982.
[17] Barad, H., Rapid Prototyping of Massively Parallel Architectures, Tech. Report 88-10, Tulane
University, Electr. Engr. Dept., New Orleans, LA, 1988.
[18] MIPS Computer Systems, MIPS RISC Architecture, Lecture Notes, Sunnyvale, CA, Aug. 1991.
[19] MIPS Computer Systems, RISC Architectures, Lecture Notes, Sunnyvale, CA, Aug. 1991.
[20] Tabak, D., Advanced Microprocessors, McGraw-Hill Book Company, San Francisco, CA, 1991.
[21] Crawford, J.H., “The i486 CPU: Executing Instructions in One Clock Cycle,” IEEE Micro, Feb.
1990, pp. 27-36.
[22] Sterling, T., “The Scientific Workstation of the Future May be a Pile of PCs”, Communications of
the ACM, Vol. 39, No. 9, Sept. 1996, pp. 11-12.
[23] Halfhill, T.R., “Intel Launches Rocket in a Socket”, Byte, May 1993, pp. 92-108.
[24] Papworth, D.B., “Tuning the Pentium Pro Microarchitecture”, IEEE Micro, April 1996, pp. 8-15.
[25] MIPS Computer Systems, Inc., MIPS R4000 Microprocessor User’s Manual (M8-00040), Sunnyvale,
CA, 1991.
[26] Intel Corp., Pentium Processor User's Manual, Vols. 1-3, Santa Clara, CA, 1993.
[27] Alpert, D., and D. Avnon, “Architecture of the Pentium Microprocessor”, IEEE Micro, June 1993,
pp. 11-21.
[28] Pountain, D., “The Word on VLIW”, Byte, April 1996, pp. 61-64.
[29] Moreno, J.H., et al., “Architecture, compiler and simulation of a tree-based VLIW processor”, IBM
Research Report RC20495, July 9, 1996.
EXERCISES
1.1 Examine the register structures of the Intel 80386 and Motorola 68030
microprocessors and discuss their similarities and differences.
1.2. Consider a hypothetical 32-bit microprocessor having 32-bit instructions composed
of two fields: the first byte represents the opcode and the remaining the immediate
operand or the operand’s direct memory address:
(a) What is the maximum directly addressable memory capacity (in number of
bytes)?
(b) Discuss the impact on the system speed if the microprocessor has
(1) a 32-bit external address bus and 16-bit external data bus or
(2) a 16-bit external address bus and a 16-bit external data bus
(c) How many bits are needed for the program counter and the instruction
register?
1.3. Consider the Intel 8086 (Appendix A and Section 6.5.2) and the instruction ADD
DX,1234 whose execution results in adding the 16-bit contents of the specified
memory location to the contents of internal register DX and placing the sum back
into DX. (To calculate the final 20-bit physical address of the operand, the
microprocessor uses the 16-bit displacement 1234 contained in the above
instruction). Assume that the internal segment registers contain the following: (CS)
= ABCD, (DS) = CDFE, (SS) = D021, and (ES) = CFFF.
(a) Which memory location does the operand come from?
(b) If (IP) = 1046, where is the first byte of the instruction stored?
Explain.
1.4. Consider a hypothetical microprocessor generating a 16-bit address (for example,
assume that the program counter and the address registers are 16-bit wide) and
having a 16-bit external data bus:
(a) What is the maximum memory address space that the processor can access
directly if it is connected to a “16-bit memory” (i.e., its smallest addressable
location is 16 bits wide)?
(b) What is the maximum memory address space that the processor can access
directly if it is connected to an “8-bit memory” (i.e., its smallest addressable
location is 8 bits wide)?
(c) What architectural features will allow this microprocessor to access a separate
“I/O space”?
(d) If an “input” and “output” instruction can specify an 8-bit “I/O port number”,
how many “8-bit I/O ports” can the microprocessor support? How many
“16-bit I/O ports”? Explain.
1.5. Consider the Intel 8086 microprocessor and an intersegment CALL instruction
(calling a FAR procedure located in a different code segment in memory). Assume
that this particular instruction is as follows:
byte1: CALL opcode (assume stored at location 12344)
byte2: upper half of the target (jump) address
byte3: lower half of the target (jump) address
byte4: upper half of the new CS register value