Вы находитесь на странице: 1из 698

6.

823
Computer System Architecture

Lecturers:

Krste Asanovic, Srini Devadas

Asanovic/Devadas
Spring 2002
6.823

Whats 6.823 About?

Whats under the hood of your desktop?


How powerful is the engine?
How much gas does it burn?
How do you build one?

From the Beta (6.004) RISC processor to the


Pentium-4
We wont emphasize:
VLSI implementations
Parallel processing
Quantitative evaluation

Asanovic/Devadas

Spring 2002

6.823

Course Information

Asanovic/Devadas

Spring 2002

6.823

You must sign up for the course through the web


~7 Homeworks (30%)

Midterm (30%)

Final (40%)

Tutorials : One session / week


1 hour / session
All students must help grade homeworks once during
semester, signup sheets distributed during class

Course Information
(contd.)

Asanovic/Devadas

Spring 2002

6.823

Textbook: Hennessy and Patterson Computer


Architecture: A Quantitative Approach
(strongly recommended)
Prerequisite material: Patterson and Hennessy
Hardware/Software Interface book
Some material in lectures will be from other sources
Tutorials: No new material introduced, reinforcement
of lecture content using examples

Problem Set 0

Asanovic/Devadas

Spring 2002

6.823

Goal is to help you judge for yourself whether you


have prerequisites for this class
We assume that you understand digital logic, a
simple 5-stage pipeline, and simple caches
For this problem set only, work by yourself not
in groups
Due at start of class.

Asanovic/Devadas

Spring 2002

6.823

History of Calculation and Computer

Architecture

Asanovic/Devadas

Spring 2002

6.823

The Importance of Calculation


Plato in Philebus (On Pleasure) 4th century BC
On the very first level of knowledge are to be found number
and calculation

Pythagorean mysticism
Do numbers rule the world?

Scientists needed to escape the hardships of


human calculation
Chinese abacus (suan pan) 13th century AD: First instrument
to provide a simple means to carry out all the operations of
arithmetic
Abacus still in use today!

PLATO
If you cannot calculate, you cannot speculate on future pleasure and your life
will not be that of a human, but that of an oyster or jellyfish.
LUIGI MENABREA
How the prospects of long and arid calculations have demoralized great
thinkers, who seek only time to meditate but see themselves swamped by the
sheer mass of arithmetic to be done by an inadequate system! And yet, it is
the path of laborious analysis that leads to truth, but they cannot follow that
path without employing the guide of number, for which number there is no
way to life the veil which covers the mysteries of nature 1884
CHINESE ABACUS 13th Century AD Several rods, five lower balls on
each rod, 2 upper balls, divided by a slat
JAPANESE ABACUS soroban, more refined, reduced to 1 upper ball, and 4
lower balls

Asanovic/Devadas

Spring 2002

6.823

Calculating Machines
Wilhelm Shickards Calculating Clock (1623) was the
earliest calculating machine in history!
multiplication and division required several interventions by the
operator
Operated on the principle of Napiers bones

Possibility of mechanizing arithmetic was


demonstrated by Blaise Pascal in the machine
Pascaline (1642)
Series of toothed gears, each numbered 0 to 9
First calculating machine to be commercialized!

Arithmometer (1822), keyboard (1850), printer (1885),


Electro-mechanical Arithmometer (1913)

On 20th September 1623, Shickard wrote as follows to his friend Kepler: The
calculations which you do by hand, I have recently attempted to achieve
mechanically I have constructed a machine which, immediately and
automatically, calculates with given numbers, which adds, subtracts,
multiplies and divides. You will cry out with joy when you see how it carries
forward tens and hundreds, or deducts them in subtractions
Kepler sure would have appreicated such an invention to create his tables of
the movements of the planets, but unfortunately, Schickards one and only
copy of his own machine was destroyed by fire on 22 February 1624.
Napiers bones or rods (1617). Ten wooden rods, of square cross-section.
Each of the four sides of each rod corresponded to one of the digits from 0 to
9, and marked down its length in nine divisions, were the multiples of the
corresponding digit. A kind of multiplication table, where you read off the
product horizontally when the rods were placed side by side.
Pascaline was not very reliable. When one wheel completed a revolution, the
next wheel would advance a step. The automatic carrying mechanism tended
to jam when several wheels were simultaneously at 9, necessitating several
simultaneous carries (999 to 1000)
Thomas, (director of a French insurance company) Arithmometer was the first
machine to be commercialized on a large scale.

Asanovic/Devadas

Spring 2002

6.823

Programmability?
Problems with all these machines lay with their
limited capability to carry out a linked sequence
of calculations
Needed to transcribe and enter all intermediate results!

Vaucansons programmable androids (1738)


Vaucanson (1749) constructed the first automatic
loom
Accepted commands by means of a perforated metal cylinder

Jacquard (1804) perfected the programmable


loom
Activated by a sequence of punched-cards!

Basile Bouchon (1725) invented a loom that accepted commands by means


of a punched tape
Vaucanson in 1738 developed the Digesting Duck, an artificial automaton for
Louis XV.
He stretches out his neck to go and take the grain from the hand, he
swallows it, digests it, and excretes it, once digested through the normal
channels; all the gestures of a duck swallowing rapidly, speeding up the
movement in his throat to pass the food into his stomach, are copied from
nature.
Vaucanson turned his loom into a programmable but cylical automaton, one
in which the commands were inscribed by means of perforations on a
hydraulically-driven drum, and were regularly repeated.
Jacquard combined the use of a moving drum equipped with a sequence of
punched cards and the concept of a swinging arm that lifted hooks.

Charles Babbage 1791-1871

Asanovic/Devadas
Spring 2002
6.823

Lucasian Professor of Mathematics


Cambridge University, 1827-1839

Difference Engine

1823

Application?
Mathematical Tables - Astronomy
Nautical Tables Navy
Background
Any continuous function can be approximated
by a polynomial --Weierstrass
Technology
mechanical - gears, Jacquards loom,
simple calculators

First digital calculators with an automatic capability of effecting chained


sequences of operations following a program set up in advance in a control
mechanism were the difference machines.

Difference Engine

Asanovic/Devadas

Spring 2002

6.823

A machine to compute mathematical tables


Weierstrass:
- Any continuous function can be approximated
by a polynomial
- Any polynomial can be computed from
difference tables
f (n) = n2+n+41
d1(n) = f(n) - f(n-1) = 2n
d2(n) = d1(n) - d1(n-1) = 2
f(n) = f(n-1) + d1(n) = f(n-1) + (d1(n-1) + 2)
n
d2(n)
d1(n)
f(n)

41

2
43

2
2
4
47

3
2

4 ...
2
all you need is
an adder!

Asanovic/Devadas

Spring 2002

6.823

Difference Engine
1823 - Babbages paper is published
1834 - The paper is read by Scheutz brothers
in Sweden
1842 - Babbage gives up the idea of building it;
(he is onto Analytic Engine!)
1855 - Scheutz displays his machine at the
Paris World Fare
- Can compute any 6th degree polynomial
- Speed: 33 to 44 32-digit numbers per minute!
Now the machine is at the Smithsonian

Babbage was funded by the British Association for the advancement of


Science, but failed to build the machine.
He was unable to moderate his ambitions, and gradually lost interest in the
original difference engine
Scheutz brothers machine was the first working calculator in history that did
print out the results.

Asanovic/Devadas

Spring 2002

6.823

Analytic Engine
1833 - Babbages paper is published
conceived during a hiatus in the development
of the difference engine
Inspiration:Jacquards Loom
The set of cards with fixed punched holes
dictated the pattern of weave program
The same set of cards could be used with
different colored threads
numbers
1871 - Babbage dies - the machine remains unrealized.
It is not clear if the analytic engine could be
built even today using only mechanical
technology

However, near the end of his life he became depressed. If I could live just a
few more years, the Analytical Engine would exist and its example spread
across the entire planet. Then he added, even more pathetically, If any
man, who is not rebuffed by my example, one day produces a machine
containing within it all of the principles of mathematical analysis, then I have
no fear for my memory, for he alone will be able to appreciate fully the nature
of my efforts and the value of the results I have obtained

Asanovic/Devadas

Spring 2002

6.823

Analytic Engine

The first conception of a general purpose computer


1. The store in which all variables to be operated upon,
as well as intermediate results are placed.
2. The mill into which the quantities about to be
operated upon are always brought.
The program
Operation

variable1

variable2

variable3

An operation in the mill required feeding two punched


cards and producing a new punched card for the store.
An operation to alter the sequence was also provided!

One of the most striking features of Babbages Analytical Engine is the way
conditional operations were to be handled. Proposed that a lever only move
if the result of the calculation was negative, and that is should be used to
advance or roll back the cards on the Jacquard mechanism to any specified
extent.

Asanovic/Devadas

Spring 2002

6.823

The first programmer

Ada Byron aka Lady Lovelace


Babbages ideas had a lot of influence later, primarily
because of
Luigi Menabrea, who published notes of Babbages
lectures in Italy
Lady Lovelace, who translated Menabreas notes
in English and thoroughly expanded them.
... Analytic Engine weaves algebraic patterns....
Adas tutor was Babbage himself!
In the early 20th century - the focus shifted to analog
computers but

Countess Lovelace (1815-1852) was Lord Byrons only daughter. Hard to


imagine a greater contrast between the poet and his daughter, who had
applied herself to the exact and arduous study of calculating machines.
She devised a certain number of programs with the idea of one day
introducing them to the machine of her friend and master. We can say that
the Analytical Engine will weave algebraic patterns, just as Jacquard looms
weave flowers and leaves.

Asanovic/Devadas

Spring 2002

6.823

Harvard Mark I
Built in 1944, in the IBM laboratories at Endicott by

Howard Aiken Professor of Physics at Harvard

Essentially mechanical but had some electro-

magnetically controlled relays and gears

Weighed 5 tons and had 750,000 components

A synchronizing clock that beat every 0.015 seconds

Performance:
0.3 seconds for addition
6 seconds for multiplication
1 minute for a sine calculation
Broke down once a week!

Over 500 miles of electrical wiring, and 3 million solder joints!


72 registers of 23 bits each
At best took a few minutes to repair, averaged 20 minutes, sometimes
several hours.
During the final months of World War II it was used exclusively by the US
Navy to solve problems in ballistics.
Decommissioned in 1959.

Linear Equation Solver

Asanovic/Devadas

Spring 2002

6.823

John Atanasoff, Iowa State University


1930s: Atanasoff and Clifford Berry built the
Linear Equation Solver. It had 300 tubes!
Application:
Linear and Integral differential equations
Background:
Vannevar Bushs Differential Analyzer
--- an analog computer
Technology:
Tubes and Electromechanical relays
Atanasoff decided that the correct mode of
computation was by electronic digital means.

The physical and logical structures of the machine were fairly rudimentary
and it was never an analytical calculator in the true sense of the term (in that
it was not programmable). Furthermore, it never worked properly.

Asanovic/Devadas
Spring 2002
6.823

Electronic Numerical Integrator


and Computer (ENIAC)
Inspired by Atanasoff and Berry, Eckert and
Mauchly designed and built ENIAC (1943-45)
The first, completely electronic, operational,
general-purpose analytical calculator!
30 tons, 72 square meters, 200KW
Performance
Read in 120 cards per minute
Addition took 200 s, Division 6 ms
1000 times faster than Mark I

Not very reliable!

Designers were obliged to install a ventilation shaft and a cooling system in


the room. Machine had 18,000 vacuum tubes.
The slogan was: The ENIAC is able to calculate the trajectory of a largecaliber naval shell in less time than the shell takes to reach its target!
To change a program, it was necessary to change the instructions, the
connector panels, and the positions of the switches all at the same time.
The inventors of ENIAC themselves admitted that, after taking into account
human error, the machine only got the correct result 20 times out of 100!

Asanovic/Devadas

Spring 2002

6.823

Electronic Discrete Variable


Automatic Computer (EDVAC)
ENIACs programming system was external
Sequences of instructions were executed independently of the
results of the calculation
Human intervention required to take instructions out of order

Eckert, Mauchly, John von Neumann and others


designed EDVAC (1944) to solve this problem
Solution was the stored program computer

First Draft of a report on EDVAC was published in


1945, but just had von Neumanns signature!
In 1973 the court of Minneapolis attributed the
honor of inventing the computer to John
Atanasoff

Von Neumann (1903-1957) was born in Budapest. Great quantities of work


in set theory to quantum physics. Celebrated for his theory of games and its
application to economics.
Report was the formal specification of EDVAC. Before that the documentation
on ENIAC or EDVAC was non-existent. Von Neumann devised a
mathematical logical notation which expressed fundamental ideas of the
fetch-execute-decode loop.
Five major components: Arithmetic unit, control unit, the memory, input
devices and output devices.

Stored Program Computer

Asanovic/Devadas

Spring 2002

6.823

Program = A sequence of instructions


How to control instruction sequencing?
manual control

calculators

automatic control
external (paper tape)
internal
plug board
read-only memory
read-write memory

Harvard Mark I , 1944


Zuses Z1, WW2
ENIAC 1946
ENIAC 1948
EDVAC 1947 (concept )

The same storage can be used to store


program and data
EDSAC

1950

Maurice Wilkes

EDSAC: Electronic Delay Storage Automatic Calculator, translation into


machine language of commands input in symbolic form; automatic loading of
the program into core memory, etc.

Asanovic/Devadas

Spring 2002

6.823

The First True Computers


Eckert and Mauchly founded their own company
and built the BINAC 1947-49
Two processors that checked each other for reliability

Whirlwind I of MIT was another von Neumann


computer built between 1946-51 by Jay Forrester
Had a magnetic-core memory and a programming language
Used by the US Air Defense

First commercial American computer was


UNIVAC-I designed and built by Eckert and
Mauchly
Used for opinion polls in 1952 Presidential elections

BINAC was the first electronic computer built in the United States.
There were several von Neumann computer efforts, including SEAC.
UNIVAC-I. CBS television polls, accurately predicted Eisenhowers victory.

Dominant Problem: Reliability

Asanovic/Devadas

Spring 2002

6.823

Mean time between failures (MTBF)


MITs Whirlwind with an MTBF of 20 min. was perhaps
the most reliable machine !
Reasons for unreliability:
1. Vacuum Tubes
2. Storage medium
acoustic delay lines
mercury delay lines
Williams tubes
Selections
CORE

J. Forrester

1951

Asanovic/Devadas

Spring 2002

6.823

Commercial Activity -- 1948-52


IBMs SSEC
Selective Sequence Electronic Calculator
Vacuum tubes in the control unit, and electromagnetic relays everywhere else
150 word store, stored program machine
Instructions, constraints, and tables of data were
read from paper tapes.
66 Tape reading stations!
Serious inconsistencies of logic.

Exhibited by IBM at the start of 1948 in the shop windows of a busy New York
avenue.
Fascinated the public, who came in their thousands to look at the lights on the
calculator blinking away.
Was a near-computer because there a lack of synchronization in the
calculation due to its hybrid nature.

IBM Computers

Asanovic/Devadas

Spring 2002

6.823

IBM 701 -- 30 machines were sold in 1953-54


IBM 650 -- a cheaper, drum based machine,
more than 120 were sold in 1954
and there were orders for 750 more!
Users stopped building their own machines.
Why was IBM late getting into computer
technology?
They were making too much money!
Even without computers, IBM
revenues were doubling every
4 to 5 years in 40s and 50s.

Defense Calculator was another name for the IBM 701.


Commissioned during the Korean War.

Software Developments

Asanovic/Devadas

Spring 2002

6.823

up to 1955

Libraries of numerical routines


Floating point operations
Transcendental functions
Matrix manipulation, equation solvers, . . .

1955-60

High level Languages - Fortran 1956


Operating Systems Assemblers, Loaders, Linkers, Compilers
Accounting programs to keep track of
usage and charges

Machines required experienced operators


Most users could not be expected to understand
these programs, much less write them
Machines had to be sold with a lot of resident
software

Asanovic/Devadas

Spring 2002

6.823

Factors that Influence


Computer Architecture
Technology
Applications

Computer Architecture

Software
Compatibility
Software played almost no role in defining an
architecture before mid fifties.
special-purpose versus general-purpose machines

Asanovic/Devadas

Spring 2002

6.823

Compatibility

Essential for portability and competition


Its importance increases with the market size
but it is also the most regressive force
Instruction Set Architecture (ISA) compatibility
The same assembly program can run on any
upward compatible model
then IBM 360/370 ...
now Intel x86
System and application software developers expect
more than ISA compatibility
(APIs)
applications
operating system
proc + mem + I/O

Java?
Wintel

Microprocessor Economics

Asanovic/Devadas

Spring 2002

6.823

Designing a state-of-the-art microprocessor


requires a huge design team
Pentium
~300 engineers
PentiumPro ~ 500 engineers
Huge investments in fabrication lines
need to sell 2 to 4 million units to be profitable
Continuous improvements are needed to improve
yields and clock speed
price drops to one tenth in 2-3 years
Fast new processors also require new peripheral
chips (memory controller, I/O)
$$$
Cost of launching a new ISA is prohibitive and
the advantage is not so clear!

View of Computer Architecture


Language/ Compiler/
System software designer
Need mechanisms
to support important
abstractions

Determine compilation
strategy; new language
abstractions

Asanovic/Devadas

Spring 2002

6.823

Architect/Hardware
designer
Decompose each
mechanism into essential
micro-mechanisms and
determine its feasibility
and cost effectiveness
Propose mechanisms and
features for performance

Architects main concerns are cost-performance,


performance, and efficiency in supporting a broad
class of software systems.

Asanovic/Devadas

Spring 2002

6.823

Influence of Technology and

Software on Instruction Sets:

Up to the dawn of IBM 360

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Importance of Technology

Asanovic/Devadas

Spring 2002

6.823

New technologies not only provide greater


speed, size and reliability at lower cost, but
more importantly these dictate the kinds of
structures that can be considered and thus
come to shape our whole view of what a
computer is.
Bell & Newell

Technology is the dominant


factor in computer design

Asanovic/Devadas
Spring 2002
6.823

Technology
Transistors
Integrated circuits
VLSI (initially)
Laser disk, CDs

Computers

Technology
Core memories
Magnetic tapes
Disks

Computers

Technology
ROMs, RAMs
VLSI
Packaging
Low Power

Computers

But Software...

Asanovic/Devadas
Spring 2002
6.823

As people write programs and use computers,


our understanding of programming and program
behavior improves.
This has profound though slower impact on

computer architecture

Modern architects cannot avoid paying attention


to software and compilation issues.

Technology
Computers

Software

Asanovic/Devadas
Spring 2002
6.823

The Earliest Instruction Sets

Single Accumulator - A carry-over from the calculators.


LOAD
STORE

x
x

AC M[x]
M[x] (AC)

ADD
SUB

x
x

AC (AC) + M[x]

MUL
DIV

x
x

Involved a quotient register


AC 2 (AC)

SHIFT LEFT
SHIFT RIGHT
JUMP
JGE

x
x

PC x
if (AC) 0 then PC x

LOAD ADR
STORE ADR

x
x

AC Extract address field(M[x])

Typically less than 2 dozen instructions!

Programming a Single
Accumulator Machine
Ci Ai + Bi, 1 i n
LOOP LOAD
JGE
ADD
STORE
F1
LOAD
F2
ADD
F3
STORE
JUMP
DONE HLT

N
DONE
ONE
N
A
B
C
LOOP

How to modify the


addresses A, B and C ?

Asanovic/Devadas
Spring 2002
6.823

N
ONE
code

-n
1

Self-Modifying Code

Asanovic/Devadas
Spring 2002
6.823

Ci Ai + Bi, 1 i n

LOOP LOAD
JGE
ADD
STORE
F1
LOAD
F2
ADD
F3
STORE
LOAD ADR
ADD
STORE ADR
modify the
LOAD ADR
program
ADD
for the next
STORE ADR
iteration
LOAD ADR
ADD
STORE ADR
JUMP
DONE HLT

N
DONE
ONE
N
A

F1
ONE
F1
F2
ONE
F2
F3
ONE
F3
LOOP

Each iteration involves


total book-keeping
instruction
fetches
17

14

operand
fetches

10

stores

Processor State

Asanovic/Devadas

Spring 2002

6.823

The information held in the processor at the end of an


instruction to provide the processing context for the next
instruction. e.g. Program Counter, Accumulator, . . .
Programmer visible state of the processor (and memory)
plays a central role in computer organization:
Software can only manipulate programmer-visible

state and can only rely on programmer-visible state

Hardware must never let software observe state


changes other than that defined in programmers
manual (e.g., interrupts not visible to running program)
Programmers model of machine is a contract
between hardware and software

Accumulator Machines
Can describe any possible computation using
single accumulator instruction set
Hardware is very simple
Why did different instruction sets evolve?

Asanovic/Devadas

Spring 2002

6.823

Computers in mid 50s

Asanovic/Devadas
Spring 2002
6.823

Hardware was expensive

Programmers view of the machine was inseparable

from the actual hardware implementation

Example: IBM 650 - a drum machine with 44 instructions

1. 60 1234 1009

Load the contents of location 1234 into the


distribution; put it also into the upper accumulator;
set lower accumulator to zero; and then go to location
1009 for the next instruction.
Good programmers optimized the placement of
instructions on the drum to reduce latency
2. Branch on distribution digit equal to 8

Asanovic/Devadas
Spring 2002
6.823

Computers in mid 50s (cont.)

Stores were small (1000 words) and 10 to 50 times


slower than the processor
1. Instruction execution time was totally dominated
by the memory reference time.
More memory references per instruction
longer execution time per instruction
2. The ability to design complex control circuits to
execute an instruction was the central concern
as opposed to the speed of decoding or ALU.
3. No resident system software because there was no
place to keep it!

Asanovic/Devadas

Spring 2002

6.823

Processor
Bottleneck!
Memory
Some early solutions:

fast local storage in the processor, i.e., 8-16 registers


as opposed to one accumulator
indexing capability to reduce book keeping
instructions
complex instructions to reduce instruction fetches
compact instructions, i.e., implicit address bits for
operands, to reduce fetches

Index Registers

Asanovic/Devadas

Spring 2002

6.823

Tom Kilburn, Manchester University, mid 50s

One or more specialized registers to simplify address


calculation
Modify existing instructions
LOAD
ADD

x, IX
x, IX

AC M[x + (IX)]

AC (AC) + M[x + (IX)]

Add new instructions to manipulate index registers


JZi

x, IX

LOADi

x, IX

if (IX)=0 then PC x
else IX (IX) + 1
IX M[x] (truncated to fit IX)

Index registers have accumulator-like characteristics

Using Index Registers


C i Ai + B i

Asanovic/Devadas
Spring 2002
6.823

1 i n
A

LOADi
LOOP JZi
LOAD
ADD
STORE
JUMP
DONE HALT

-n, IX
DONE, IX # Jump or increment IX
A, IX
B, IX
C, IX
LASTA

LOOP

Program does not modify itself


Efficiency has improved dramatically (ops / iter)

with index regs


without index regs
instruction fetch
5 (2)
17 (14)
operand fetch
2
10 (8)
store
1
5 (4)
Costs: Instructions are 1 to 2 bits longer
Index registers with ALU-like circuitry
Complex control

Asanovic/Devadas
Spring 2002
6.823

Indexing vs. Index Registers


LOAD x, IX
Suppose instead of registers, memory locations
are used to implement index registers.

Arithmetic operations on index registers can be


performed by bringing the contents to the accumulator.
Most book keeping instructions will be avoided
but each instruction will implicitly cause several
fetches and stores.
complex control circuitry
additional memory traffic

Asanovic/Devadas
Spring 2002
6.823

Operations on Index Registers


To increment index register by k
AC (IX)
AC (AC) + k

IX (AC)

new instruction

new instruction

also the AC must be saved and restored.


It may be better to increment IX directly
INCi k, IX

IX (IX) + k

More instructions to manipulate index register


STOREi x, IX
...

M[x] (IX) (extended to fit a word)

IX begins to look like an accumulator


several index registers
several accumulators
General Purpose Registers

Support for Subroutine Calls

Main
Program

F:

call F(a1,...)
a1
a2

Subroutine F

call F(b1,...)
b1
b2

return

A special subroutine jump instruction


M:

JSR

Asanovic/Devadas
Spring 2002
6.823

F M + 1 and
jump to F+1

Indirect Addressing and


Subroutine Calls Subroutine

Asanovic/Devadas
Spring 2002
6.823

F
F+1

Caller
6 Events:
Execute M
Execute F+1
Execute S1
Execute S2
Execute S3
Execute M+3

M JSR F
arg
result
M+3

Indirect addressing
LOAD

(x)

S1 LOAD (F)
inc F by 1

fetch
arg

S2 STORE(F)
inc F by 1 store
result

means AC M[M[x]]
S3 JUMP (F)

Indirect addressing almost eliminates the need to write


self-modifying code (location F still needs to be modified)
Problems with recursive procedure calls

Recursive Procedure Calls


and Reentrant Codes

Asanovic/Devadas
Spring 2002
6.823

Indirect Addressing through a register


LOAD

R1, (R2)

Load register R1 with the contents of the


word whose address is contained in register R2
Registers

Memory
Pure Code

PC

SP

Data

Stack

Asanovic/Devadas
Spring 2002
6.823

Evolution of Addressing Modes


1. Single accumulator, absolute address
LOAD

2. Single accumulator, index registers


LOAD

x, IX

3. Indirection
LOAD

(x)

4. Multiple accumulators, index registers, indirection


or

LOAD
LOAD

R, IX, x
R, IX, (x)

the meaning?
R M[M[x] + IX]
or R M[M[x + IX]]

5. Indirect through registers


LOAD

RI, (RJ)

6. The works
LOAD

RI, RJ, (RK)

RJ = index, RK = base address

Asanovic/Devadas

Spring 2002

6.823

Variety of Instruction Formats


Two address formats: the destination is same as
one of the operand sources
(Reg Reg) to Reg
(Reg Mem) to Reg

R I RI + R J
RI RI + M[x]

...
x could be specified directly or via a register;
effective address calculation for x could include
indexing, indirection, ...
Three operand formats: One destination and up to
two operand sources per instruction
(Reg x Reg) to Reg
(Reg x Mem) to Reg

...

R I RJ + RK
RI RJ + M[x]

Many different formats are possible!

Data Formats and

Memory Addresses

Asanovic/Devadas

Spring 2002

6.823

Data formats:
Bytes, Half words, words and double words
Some issues
Byte addressing
Big Endian
0
1
2
3
vs. Little Endian
3
2
1
0
Word alignment
Suppose the memory is organized in 32-bit words.
Can a word address begin only at 0, 4, 8, .... ?
0

Some Problems

Asanovic/Devadas

Spring 2002

6.823

Should all addressing modes be provided for


every operand?
regular vs. irregular instruction formats
Separate instructions to manipulate
Accumulators
Index registers
Base registers
A large number of instructions
Instructions contained implicit memory references several contained more than one
very complex control

Compatibility Problem at IBM

Asanovic/Devadas
Spring 2002
6.823

By early 60s, IBM had 4 incompatible lines of computers!


701
650
702
1401

7094
7074
7080
7010

Each system had its own


Instruction set
I/O system and Secondary Storage:
magnetic tapes, drums and disks
assemblers, compilers, libraries,...
market niche
business, scientific, real time, ...

IBM 360

IBM 360 : Design Premises

Asanovic/Devadas

Spring 2002

6.823

Amdahl, Blaauw and Brooks, 1964

The design must lend itself to growth and


successor machines
General method for connecting I/O devices
Total performance - answers per month rather than
bits per microsecond programming aids
Machine must be capable of supervising itself
without manual intervention
Built-in hardware fault checking and locating aids
to reduce down time
Simple to assemble systems with redundant I/O
devices, memories etc. for fault tolerance
Some problems required floating point words larger
than 36 bits

IBM 360:

A General-Purpose Register Machine

Asanovic/Devadas

Spring 2002

6.823

Processor State
16 General-Purpose 32-bit Registers
- may be used as index and base registers
- Register 0 has some special properties
4 Floating Point 64-bit Registers
A Program Status Word (PSW)
PC, Condition codes, Control flags
A 32-bit machine with 24-bit addresses
No instruction contains a 24-bit address!
Data Formats
8-bit bytes, 16-bit half-words, 32-bit words,
64-bit double-words

IBM 360: Implementations


Storage
Datapath
Circuit Delay
Local Store
Control Store

Model 30
...

8K - 64 KB

8-bit

30 nsec/level

Main Store

Read only 1sec

Asanovic/Devadas
Spring 2002
6.823

Model 70
256K - 512 KB

64-bit

5 nsec/level

Transistor Registers

Conventional circuits

IBM 360 instruction set architecture completely hid the


underlying technological differences between various
models.
With minor modifications it survives today

Asanovic/Devadas

Spring 2002

6.823

IBM S/390 z900 Microprocessor


64-bit virtual addressing
first 64-bit S/390 design (original S/360 was 24-bit, and S/370 was
31-bit extension)

1.1 GHz clock rate (announced ISSCC 2001)


0.18m CMOS, 7 layers copper wiring
770MHz systems shipped in 2000

Single-issue 7-stage CISC pipeline


Redundant datapaths
every instruction performed in two parallel datapaths and results
compared

256KB L1 I-cache, 256KB L1 D-cache on-chip


20 CPUs + 32MB L2 cache per Multi-Chip Module
Water cooled to 10oC junction temp

What makes a good


instruction set?

Asanovic/Devadas
Spring 2002
6.823

Ideally, provides simple software interface yet allows

simple, fast, efficient hardware implementations

but across 25+ year time frame


Example of difficulties:
Current machines have register files with more storage
than entire main memory of early machines!
On-chip test circuitry of current machines hundreds of
times more transistors than entire early computers!

Asanovic/Devadas

Spring 2002

6.823

Full Employment for Architects


Good news: Ideal instruction set changes continually

Technology allows larger CPUs over time


Technology constraints change (e.g., power is now major constraint)
Compiler technology improves over time (e.g., register allocation)
Programming style varies over time (assembly coding, HLL, objectoriented, )
Applications and application demands vary over time (e.g.,
multimedia processing recent change in workload)

Bad news: Software compatibility imposes huge


damping coefficient on instruction set innovation
Software investment dwarfs hardware investment
Innovate at microarchitecture level, below instruction set level, this is
what most computer architects do

New instruction set can only be justified by new large


market and technological advantage
Network processors
Multimedia processors

Asanovic/Devadas
Spring 2002
6.823

Complex Instruction Set


Evolution in the Sixties:
Stack and GPR Architectures

Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology

The Sixties

Asanovic/Devadas
Spring 2002
6.823

Hardware costs started dropping


memories beyond 32K words seemed likely
separate I/O processors
large register files
Systems software development became essential
Operating Systems
I/O facilities
Separation of Programming Model from implementation
become essential
family of computers

The Burroughs B5000:

Asanovic/Devadas
Spring 2002
6.823

An ALGOL Machine, Robert Barton, 1960


Machine implementation can be completely hidden if the
programmer is provided only a high-level language
interface.
Stack machine organization because stacks are
convenient for:
1. expression evaluation;
2. subroutine calls, recursion, nested interrupts;
3. accessing variables in block-structured languages.
B6700, a later model, had many more innovative features
tagged data
virtual memory
multiple processors and memories

Asanovic/Devadas
Spring 2002
6.823

A Stack Machine

A Stack machine has a stack as a part of the processor


state
typical operations:
push
Processor
pop
+
stack
Main
*...
Store
Instructions like + implicitly
:
specify the top 2 elements
of the stack as operands.

push b

b
a

push c

c
b
a

pop

b
a

Evaluation of Expressions

Asanovic/Devadas
Spring 2002
6.823

(a + b * c) / (a + d * c - e)
/

+
a

*
c

*
d

Reverse Polish
abc* +adc*+e-/
Push a
push c multiply
Push b

c
b*c
b
a
Evaluation Stack

Evaluation of Expressions

Asanovic/Devadas
Spring 2002
6.823

(a + b * c) / (a + d * c - e)
/

+
a

*
c

*
d

Reverse Polish
abc*+adc*+e-/
add

b*c
a+b*c
a
Evaluation Stack

Hardware Organization
of the Stack

Asanovic/Devadas
Spring 2002
6.823

Stack is part of the processor state


stack must be bounded and small
number of Registers and not the size of
main memory
Conceptually stack is unbounded
a part of the stack is included in the processor
state; the rest is kept in the main memory

Stack Size and


Memory References
Program
push a
push b
push c
*
+
push a
push d
push c
*
+
push e
/

Asanovic/Devadas
Spring 2002
6.823

abc*+adc*+e-/
Stack (size = 2)
Memory Refs
a
a
a,
b
b
b,
c
c, ss(a)
a,
b*c
sf(a)
a+b*c
a+b*c,
a
a
a,
d
d, ss(a+b*c)
d,
c
c, ss(a)
a,
d*c
sf(a)
a+b*c,
a+d*c
sf(a+b*c)
a+d*c,
e
e, ss(a+b*c)
a+b*c,
a+d*c-e
sf(a+b*c)
(a+b*c)/(a+d*c-e)
4 stores, 4 fetches (implicit)

Stack Operations and


Implicit Memory References

Asanovic/Devadas
Spring 2002
6.823

When just the top 2 elements of the stack are kept in


registers and the rest is kept in the memory:
Each push operation
pop operation

1 memory reference.
1 memory reference.
No Good!

Better performance from keeping the top N elements in


registers and by making memory references only when
register stack overflows or underflows.
Issue - when to Load/Unload registers ?

Stack Size and


Expression Evaluation
abc*+adc*+e-/

a and c are
loaded twice

not the best


use of registers!

push a
push b
push c
*
+
push a
push d
push c
*
+
push e
/

R0
R0 R1
R0 R1 R2
R0 R1
R0
R0 R1
R0 R1 R2
R0 R1 R2 R3
R0 R1 R2
R0 R1
R0 R1 R2
R0 R1
R0

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

Register Usage in a GPR Machine


(a + b * c)/(a + d * c - e)
Load
Load
Load
Mul
Add
Load
Mul
Add
Load
Sub
Div

R0
R1
R2
R2
R2
R3
R3
R3
R0
R3
R2

a
c
b
R1
R0
d
R1
R0
e
R0
R3

More control over register


usage since registers can
be named explicitly
Load
Ri m
Load
Ri (Rj)
Load
Ri (Rj) (Rk)

- eliminates unnecessary
Loads and Stores
- fewer Registers
but instructions may be longer!

B5000 Procedure Calls

Asanovic/Devadas
Spring 2002
6.823

Storage for procedure calls also follows a stack


discipline
However, there is a need to access variables
beyond the current stack frame
- lexical addressing < ll , d >
- display registers to speed up
accesses to stack frames
R
Proc P
Proc Q
Proc R
Q
R
Q

automatic loading
of display registers?

Q
R
3
2

ll = 1

display
registers

stack

static dynamic
links link

Stack Machine Features


In addition to push, pop,
+ etc., the instruction
set must provide the
capability to

machinery to
carry out
+, -, etc.

refer to any element in


the data area
jump to any instruction in
the code area
move any element in
the stack frame to the top

stack

SP
DP
PC

Asanovic/Devadas
Spring 2002
6.823

a
b
c

.
.
.
data
push a
push b
push c
*
+
push e
/

code

Asanovic/Devadas
Spring 2002
6.823

Stack versus GPR Organization


Amdahl, Blaauw and Brooks, 1964

1. The performance advantage of push down stack


organization is derived from the presence of fast
registers and not the way they are used.
2.Surfacing of data in stack which are profitable is
approximately 50% because of constants and common
sub-expressions.
3. Advantage of instruction density because of implicit
addresses is equaled if short addresses to specify
registers are allowed.
4. Management of finite depth stack causes complexity.
5. Recursive subroutine advantage can be realized only
with the help of an independent stack for addressing.
6. Fitting variable length fields into fixed width word is
awkward.

Stack Machines

Asanovic/Devadas
Spring 2002
6.823

(Mostly)

Died by 1980

1. Stack programs are not smaller if short (Register)


addresses are permitted.
2. Modern compilers can manage fast register space
better than the stack discipline.
3. Lexical addressing is a useful abstract model for
compilers but hardware support for it (i.e. display)
is not necessary.
GPRs and caches are better than stack and displays
Early language-directed architectures often did not
take into account the role of compilers!
B5000, B6700, HP 3000, ICL 2900, Symbolics 3600

Stacks post-1980

Asanovic/Devadas
Spring 2002
6.823

Inmos Transputers (1985-2000)

Designed to support many parallel processes in Occam language


Fixed-height stack design simplified implementation
Stack trashed on context swap (fast context switches)
Inmos T800 was worlds fastest microprocessor in late 80s

Forth machines
Direct support for Forth execution in small embedded real-time
environments
Several manufacturers (Rockwell, Patriot Scientific)

Java Virtual Machine


Designed for software emulation not direct hardware execution
Sun PicoJava implementation + others

Intel x87 floating-point unit


Severely broken stack model for FP arithmetic
Deprecated in Pentium-4 in exchange for SSE2 FP register arch.

IBM 360:
A General-Purpose Register
Machine
Processor State

Asanovic/Devadas
Spring 2002
6.823

16 General-Purpose 32-bit Registers


- may be used as index and base registers
- Register 0 has some special properties
4 Floating Point 64-bit Registers
A Program Status Word (PSW)
PC, Condition codes, Control flags
A 32-bit machine with 24-bit addresses
No instruction contains a 24-bit address !
Data Formats
8-bit bytes, 16-bit half-words, 32-bit words,
64-bit doublewords

IBM 360: Implementations


Storage
Datapath
Circuit Delay
Local Store
Control Store

Model 30
...
8K - 64 KB
8-bit
30 nsec/level
Main Store
Read only 1sec

Asanovic/Devadas
Spring 2002
6.823

Model 70
256K - 512 KB
64-bit
5 nsec/level
Transistor Registers
Conventional circuits

IBM 360 instruction set architecture completely hid the


underlying technological differences between various
models.

Asanovic/Devadas
Spring 2002
6.823

IBM 360: Precise Interrupts


IBM 360 ISA (Instruction Set Architecture)
preserves sequential execution model
Programmers view of machine was that each
instruction either completed or signaled a fault
before next instruction began execution
Exception/interrupt behavior constant across
family of implementations

IBM 360:
Some Addressing Modes
8

RR

opcode
8

RX

R1
4

opcode R1

Asanovic/Devadas
Spring 2002
6.823

R1 R1 op R2

R2
4

12

X2

B2

D2

R1 R1 op M[X2 + B2 + D2]
a 24-bit address is formed by adding the
12-bit displacement (D) to a base register (B)
and an Index register (X), if desired
The most common formats for arithmetic & logic
instructions, as well as Load and Store instructions

IBM 360:
Branches & Condition Codes

Asanovic/Devadas
Spring 2002
6.823

Arithmetic and logic instructions set condition codes


equal to zero
greater than zero
overflow
carry...
I/O instructions also set condition codes - channel busy
All conditional branch instructions are based on testing
these condition code registers (CCs)
RX and RR formats
BC_
branch conditionally
BAL_
branch and link, i.e., R15 PC + 1
for subroutine calls
CCs must be part of the PSW

IBM 360:
Character String Operations

Asanovic/Devadas
Spring 2002
6.823

opcode

length

B1

12

D1

B2

12

D2

SS format: store to store instructions


M[B1 + D1] M[B1 + D1] op M[B2 + D2]
iterate length times
most operations on decimal and character strings
use this format
MVC move characters
MP
multiply two packed decimal strings
CLC compare two character strings
...
a lot of memory operations per instruction
complicates exception & interrupt handling

Microarchitecture

Asanovic/Devadas
Spring 2002
6.823

Implementation of an ISA

status
lines

Controller

control
points

Data
path

Structure: How components are connected.


Static
Sequencing: How data moves between components
Dynamic

The DLX ISA


Processor State

32 32-bit GPRs, R0 always contains a 0


32 single precision FPRs, may also be viewed as
16 double precision FPRs
FP status register, used for FP compares & exceptions
PC, the program counter
some other special registers

Data types

8-bit byte, 2-byte half word


32-bit word for integers
32-bit word for single precision floating point
64-bit word for double precision floating point

Load/Store style instruction set

data addressing modes- immediate & indexed


branch addressing modes- PC relative & register indirect
Byte addressable memory- big-endian mode

All instructions are 32 bits


(See Chapter 2, H&P for full description)

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

A Bus-based Datapath for DLX


Opcode
ldIR

busy

zero?
OpSel

ldA

32(PC)
31(Link)
rf3
rf2
rf1

ldB

2 /

IR
ExtSel Imm
/
2
Ext
enImm

rf3
rf2
rf1

ALU
control

addr

32 GPRs
+ PC ...
ALU

32-bit Reg

enALU

data

RegSel

ldMA

MA
addr

RegWrt
enReg

Memory

MemWrt

data

enMem

Bus 32

Microinstruction: register to register transfer (17 control signals)


MA PC
means
RegSel = PC; enReg=yes; ldMA= yes
B Reg[rf2] means
RegSel = rf2; enReg=yes; ldB = yes

Asanovic/Devadas
Spring 2002
6.823

Memory Module
addr

busy

RAM

din

we

Write/Read
Enable

dout

bus
We will assume that Memory operates asynchronously
and is slow as compared to Reg-to-Reg transfers

DLX ALU Instructions


6
5
opcode rf1

5
rf2

5
rf3

Asanovic/Devadas
Spring 2002
6.823

11
function

Register-Register form:
Reg[rf3] function(Reg[rf1], Reg[rf2])

6
5
opcode rf1

5
rf2

16
immediate

Register-Immediate form:
Reg[rf2] function(Reg[rf1], SignExt(immediate))

Asanovic/Devadas
Spring 2002
6.823

Instruction Execution
Execution of a DLX instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)
and the computation of the address of the
next instruction

Microcontrol Unit

Asanovic/Devadas
Spring 2002
6.823

Maurice Wilkes, 1954

Embed the control logic state table in a memory array


op
conditional
code flip-flop

Next state

address
Matrix A

Matrix B

Decoder

Control lines to
ALU, MUXs, Registers

Microprogram Fragments
instr fetch: MA PC
IR Memory
A PC
PC A + 4
dispatch on OPcode

Asanovic/Devadas
Spring 2002
6.823

can be
treated as
a macro

ALU:

A Reg[rf1]
B Reg[rf2]
Reg[rf3] func(A,B)
do instruction fetch

ALUi:

A Reg[rf1]
B Imm
sign extension ...
Reg[rf2] Opcode(A,B)
do instruction fetch

Asanovic/Devadas
Spring 2002
6.823

DLX Load/Store Instructions


6
5
opcode rf1
base

5
rf2

16
displacement

Load/store byte, halfword, word to/from GPR:


LB, LBU, SB, LH, LHU, SH, LW, SW

byte and half-word can be sign or zero extended

Load/store single and double FP to/from FPR:


LF, LD, SF, SD

Byte addressable machine


Memory access must be data aligned
A single addressing mode
(base) + displacement
Big-endian byte ordering
31

DLX Control Instructions

Asanovic/Devadas
Spring 2002
6.823

Conditional branch on GPR


6
5
opcode rf1

16
offset from PC+4

BEQZ, BNEZ

Unconditional register-indirect jumps


6
5
opcode rf1

16
JR, JALR

Unconditional PC-relative jumps


6
opcode

26
offset from PC+4

J, JAL

PC-offset are specified in bytes


jump-&-link (JAL) stores PC+4 into the link register (R31)
(Real DLX has delayed branches ignored this lecture)

Microprogram Fragments
LW:

J:

beqz:
bz-taken:

(cont.)
A Reg[rf1]
B Imm
MA A + B
Reg[rf2] Memory
do instruction fetch
A PC
B Imm
PC A + B
do instruction fetch
A Reg[rf1]
If zero?(A) then go to bz-taken
do instruction fetch
A PC
B Imm
PC A + B
do instruction fetch

Asanovic/Devadas
Spring 2002
6.823

DLX Microcontroller:
Opcode
zero?
Busy (memory)

Asanovic/Devadas
Spring 2002
6.823

first attempt

PC (state)

latching the inputs


may cause a
one-cycle delay

s
addr

ROM Size ?
How big is s?

Program ROM
2(opcode+status+s) words
word = (control+s) bits
data

next state
17 Control Signals

Asanovic/Devadas

Spring 2002

6.823

Microprogramming

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Asanovic/Devadas

Spring 2002

6.823

Instruction Set Architecture

(ISA) versus Implementation

ISA is the hardware/software interface


Defines set of programmer visible state
Defines instruction format (bit encoding) and instruction
semantics
Examples: DLX, x86, IBM 360, JVM

Many possible implementations of one ISA


360 implementations: model 30 (c. 1964), z900 (c. 2001)
x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486,
Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon,
Transmeta Crusoe, SoftPC
DLX implementations: microcoded, pipelined, superscalar
JVM: HotSpot, PicoJava, ARM Jazelle, ...

ISA to Microarchitecture

Mapping

Asanovic/Devadas

Spring 2002

6.823

ISA often designed for particular microarchitectural


style, e.g.,

CISC ISAs designed for microcoded implementation


RISC ISAs designed for hardwired pipelined implementation
VLIW ISAs designed for fixed latency in-order pipelines
JVM ISA designed for software interpreter

But ISA can be implemented in any microarchitectural


style
Pentium-4: hardwired pipelined CISC (x86) machine (with some
microcode support)
This lecture: a microcoded RISC (DLX) machine
Intel will probably eventually have a dynamically scheduled out-oforder VLIW (IA-64) processor
PicoJava: A hardware JVM processor

Asanovic/Devadas

Spring 2002

6.823

Microcoded Microarchitecture
Microcode instructions fixed
in ROM inside microcontroller

busy?
zero?
opcode

controller

Datapath
Data

Addr

Memory
Memory (RAM) holds user program
written using macrocode
instructions (e.g., DLX, x86, etc.)

enMem
MemWrt

Asanovic/Devadas

Spring 2002

6.823

A Bus-based Datapath for DLX


Opcode
ldIR

busy

zero?
OpSel

ldA

32(PC)
31(Link)
rf3
rf2
rf1

ldB

2 /

IR
ExtSel Imm
/
2
Ext
enImm

rf3
rf2
rf1

ALU
control

addr

32 GPRs
+ PC ...
ALU

32-bit Reg

enALU

data

RegSel

ldMA

MA
addr

RegWrt
enReg

Memory

MemWrt

data

enMem

Bus 32

Microinstruction: register to register transfer (17 control signals)


MA PC
means
RegSel = PC; enReg=yes; ldMA= yes
B Reg[rf2] means
RegSel = rf2; enReg=yes; ldB = yes

Asanovic/Devadas

Spring 2002

6.823

Instruction Execution
Execution of a DLX instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)
and the computation of the address of the
next instruction

Microprogram Fragments
instr fetch: MA PC
IR Memory
A PC
PC A + 4
dispatch on OPcode

Asanovic/Devadas

Spring 2002

6.823

can be
treated as
a macro

ALU:

A Reg[rf1]
B Reg[rf2]
Reg[rf3] func(A,B)
do instruction fetch

ALUi:

A Reg[rf1]
B Imm
sign extention ...

Reg[rf2] Opcode(A,B)
do instruction fetch

Microprogram Fragments
LW:

J:

beqz:
bz-taken:

(cont.)
A Reg[rf1]
B Imm
MA A + B
Reg[rf2] Memory
do instruction fetch
A PC
B Imm
PC A + B
do instruction fetch
A Reg[rf1]
If zero?(A) then go to bz-taken
do instruction fetch
A PC
B Imm
PC A + B
do instruction fetch

Asanovic/Devadas
Spring 2002
6.823

DLX Microcontroller:
first attempt

Opcode
zero?
Busy (memory)

PC (state)

latching the inputs


may cause a
one-cycle delay

s
addr

ROM Size ?
How big is s?

Asanovic/Devadas
Spring 2002
6.823

Program ROM
2(opcode+status+s) words
word = (control+s) bits
data

next state
17 Control Signals

Microprogram in the ROM

Asanovic/Devadas

Spring 2002

6.823

worksheet

State Op

zero?

busy

Control points

next-state

fetch0
fetch1
fetch1
fetch2
fetch3

*
*
*
*
*

*
*
*
*
*

*
yes
no
*
*

MA PC
....
IR Memory
A PC
PC A + 4

ALU0
ALU1
ALU2

*
*
*

*
*
*

*
*
*

A Reg[rf1]

ALU1
B Reg[rf2]

ALU2
Reg[rf3] func(A,B)
fetch0

fetch1
fetch1
fetch2
fetch3
?

Microprogram in the ROM


State Op
fetch0
fetch1
fetch1
fetch2
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
...
ALU0
ALU1
ALU2

zero?

busy

Control points

Asanovic/Devadas
Spring 2002
6.823

next-state

*
*
*
*
ALU
ALUi
LW
SW
J
JAL
JR
JALR
beqz

*
*
*
*
*
*
*
*
*
*
*
*
*

*
yes
no
*
*
*
*
*
*
*
*
*
*

MA PC
....
IR Memory
A PC
PC A + 4
PC A + 4
PC A + 4
PC A + 4
PC A + 4
PC A + 4
PC A + 4
PC A + 4
PC A + 4

*
*
*

*
*
*

A Reg[rf1]

ALU1
B Reg[rf2]

ALU2
Reg[rf3] func(A,B)
fetch0

fetch1
fetch1
fetch2
fetch3
ALU0
ALUi0
LW0
SW0
J0
JAL0
JR0
JALR0
beqz0

Microprogram in the ROM

Asanovic/Devadas
Spring 2002
6.823

Cont.

State Op
ALUi0
ALUi1
ALUi1
ALUi2
...

J0
J1
J2
...

beqz0
beqz1
beqz1
beqz2
beqz3
...

zero?

busy

Control points

next-state

sExt

uExt

*
*
*
*

*
*
*
*

A Reg[rf1]

B sExt16(Imm)

B uExt16(Imm)

Reg[rf3] Op(A,B)

ALUi1
ALUi2
ALUi2
fetch0

*
*
*

A PC

B sExt26(Imm)

PC A+B

J1
J2
fetch0

*
yes
no
*
*

*
*
*
*
*

A Reg[rf1]

A PC
....
B sExt16(Imm)
PC A+B

beqz1

beqz
2
fetch0
beqz3
fetch0

Asanovic/Devadas

Spring 2002

6.823

Size of Control Store


w
/

status &
opcode

PC
addr

size = 2(w+s) x (c + s)

Control
ROM

data
Control signals / c

/ s

next
PC

DLX
w = 6+2
c = 17
s = ?

no. of steps per opcode = 4 to 6 + fetch-sequence

no. of states (4 steps per op-group ) x op-groups

+ common sequences
= 4 x 8 + 10 states = 42 states
s=6
Control ROM = 2(8+6) x 23 bits 48 Kbytes

Asanovic/Devadas

Spring 2002

6.823

Reducing Size of Control Store


Control store has to be fast expensive
Reduce the ROM height (= address bits)
reduce inputs by extra external logic
each input bit doubles the size of the
control store
reduce states by grouping opcodes
find common sequences of actions
condense input status bits
combine all exceptions into one, i.e.,
exception/no-exception
Reduce the ROM width
restrict the next-state encoding

Next, Dispatch on opcode, Wait for memory, ...

encode control signals (vertical microcode)

Asanovic/Devadas
Spring 2002
6.823

DLX Controller V2
Opcode

absolute

ext

op-group
Reduced ROM
height by
encoding inputs

PC

PC (state)
address

Control ROM
data

17 Control Signals

PC+1
+1

PCSrc

jump
logic

zero
busy

Reduce ROM
width by
encoding
next-state

JumpType
(next, spin, fetch,
dispatch, feqz, fnez )

Jump Logic

Asanovic/Devadas

Spring 2002

6.823

PCSrc = Case JumpTypes


next

PC+1

spin

if (busy) then PC else PC+1

fetch

absolute

dispatch

op-group

feqz

if (zero) then absolute else PC+1

fnez

if (zero) then PC+1 else absolute

Instruction Fetch & ALU:


DLX-Controller-2

State

Control points

next-state

fetch0
fetch1
fetch2
fetch3
...
ALU0
ALU1
ALU2

MA PC
IR Memory
A PC
PC A + 4

next
spin
next
dispatch

A Reg[rf1]
B Reg[rf2]
Reg[rf3] func(A,B)

next
next
fetch

ALUi0
ALUi1
ALUi2

A Reg[rf1]
B sExt16(Imm)
Reg[rf3] Op(A,B)

next
next
fetch

Asanovic/Devadas
Spring 2002
6.823

Load & Store:

DLX-Controller-2

State

Control points

next-state

LW0
LW1
LW2
LW3
LW4

A Reg[rf1]
B sExt16(Imm)
MA A+B
Reg[rf2] Memory

next
next
next
spin
fetch

SW0
SW1
SW2
SW3
SW4

A Reg[rf1]
B sExt16(Imm)
MA A+B
Memory Reg[rf2]

next
next
next
spin
fetch

Asanovic/Devadas

Spring 2002

6.823

Branches:

DLX-Controller-2

State

Control points

next-state

BEQZ0
BEQZ1

BEQZ2
BEQZ3
BEQZ4

A Reg[rf1]

next
fnez
next
next
fetch

BNEZ0
BNEZ1

BNEZ2
BNEZ3
BNEZ4

A Reg[rf1]

A PC

B sExt16(Imm)

PC A+B

A PC

B sExt16(Imm)

PC A+B

next
feqz
next
next
fetch

Asanovic/Devadas

Spring 2002

6.823

Jumps:

DLX-Controller-2

State

Control points

next-state

J0
J1
J2

A PC
B sExt26(Imm)
PC A+B

next
next
fetch

JR0

PC Reg[rf1]

fetch

JAL0
JAL1
JAL2
JAL3

A PC
Reg[31] A
B sExt26(Imm)
PC A+B

next
next
next
fetch

JALR0
JALR1
JALR2

A PC
Reg[31] A
PC Reg[rf1]

next
next
fetch

Asanovic/Devadas

Spring 2002

6.823

Asanovic/Devadas

Spring 2002

6.823

Microprogramming in IBM 360


M30
Datapath
width (bits)
inst width
(bits)
code size
(K insts)
store
technology
store
cycle (ns)
memory
cycle (ns)
Rental fee
($K/month)

M40

M50

M65

16

32

64

50

52

85

87

2.75

2.75

CCROS

TCROS

BCROS

BCROS

750

625

500

200

1500

2500

2000

750

15

35

Only fastest models (75 and 95) were hardwired

Horizontal vs Vertical Code

Asanovic/Devadas
Spring 2002
6.823

Bits per Instruction


# Instructions

Horizontal code has longer instructions


Can specify multiple parallel operations per instruction
Needs fewer steps to complete each macroinstruction
Sparser encoding more bits

Vertical code has more, narrower instructions

In limit, only single datapath operation per instruction


code branches require separate instruction
More steps to complete each macroinstruction
More compact less bits

Nanocoding
Tries to combine best of horizontal and vertical code

Nanocoding
Exploits recurring
control signal patterns

in code, e.g.,

ALU0 A Reg[rf1]

...

ALUi0 A Reg[rf1]

...

PC (state)

Asanovic/Devadas

Spring 2002

6.823

code
next-state

address

code ROM
nanoaddress

nanoinstruction ROM
data
17 Control Signals

MC68000 had 17-bit code containing either 10-bit


jump or 9-bit nanoinstruction pointer
Nanoinstructions were 68 bits wide, decoded to give
196 control signals

Asanovic/Devadas

Spring 2002

6.823

Implementing Complex Instructions


Opcode

ldIR

busy

zero?

ALUop ldA

32(PC)
31(Link)
rf3
rf2
rf1

ldB

RegSel
IR
ExSel
enImm

Imm
Ext

32 GPRs
+ PC ...
/

ALU

32-bit Reg

enALU

data

MA
addr

addr

ldMA

RegWrt
enReg

Memory

MemWrt
enMem

data

Bus

rf3 M[(rf1)] op (rf2)


M[(rf3)] (rf1) op (rf2)
M[(rf3)] M[(rf1)] op M[(rf2)]

Reg-Memory-src ALU op
Reg-Memory-dst ALU op
Mem-Mem ALU op

Mem-Mem ALU Instructions:

Asanovic/Devadas
Spring 2002
6.823

DLX-Controller-2

Mem-Mem ALU op
ALUMM0
ALUMM1
ALUMM2
ALUMM3
ALUMM4
ALUMM5
ALUMM6

M[(rf3)] M[(rf1)] op M[(rf2)]

MA Reg[rf1]
A Memory
MA Reg[rf2]
B Memory
MA Reg[rf3]
Memory func(A,B)

next
spin
next
spin
next
spin
fetch

Complex instructions usually do not require datapath


modifications in a microprogrammed implementation
-- only extra space for the control program
Implementing these instructions using a hardwired controller is
difficult without datapath modifications

Microcode Emulation

Asanovic/Devadas

Spring 2002

6.823

IBM initially miscalculated importance of


software compatibility when introducing 360
series
Honeywell started effort to steal IBM 1401
customers by offering translation software
(Liberator) for Honeywell H200 series machine
IBM retaliates with optional additional microcode
for 360 series that can emulate IBM 1401 ISA, later
extended for IBM 7000 series
one popular program on 1401 was a 650 simulator, so some
customers ran many 650 programs on emulated 1401s (650>1401->360)

Microprogramming in the
Seventies

Asanovic/Devadas
Spring 2002
6.823

Thrived because:
Significantly faster ROMs than DRAMs were available
For complex instruction sets, datapath and controller
were cheaper and simpler
New instructions , e.g., floating point, could be
supported without datapath modifications
Fixing bugs in the controller was easier
ISA compatibility across various models could be
achieved easily and cheaply
Except for cheapest and fastest machines, all computers
were microprogrammed

Asanovic/Devadas

Spring 2002

6.823

Writable Control Store (WCS)


Implement control store with SRAM not ROM

MOS SRAM memories now almost as fast as control store (core


memories/DRAMs were 10x slower)
Bug-free microprograms difficult to write

User-WCS provided as option on several minicomputers


Allowed users to change microcode for each process

User-WCS failed
Little or no programming tools support
Hard to fit software into small space
Microcode control tailored to original ISA, less useful for others
Large WCS part of processor state - expensive context switches
Protection difficult if user can change microcode
Virtual memory required restartable microcode

Performance Issues

Asanovic/Devadas

Spring 2002

6.823

Microprogrammed control
multiple cycles per instruction
Cycle time ?
tC > max(treg-reg, tALU, tROM, tRAM)
Given complex control, tALU & tRAM can be broken
into multiple cycles. However, tROM cannot be
broken down. Hence
tC > max(treg-reg, tROM)
Suppose 10 * tROM < tRAM
good performance, relative to the single-cycle
hardwired implementation, can be achieved
even with a CPI of 10

VLSI & Microprogramming

Asanovic/Devadas

Spring 2002

6.823

By late seventies
technology assumption about ROM & RAM
speed became invalid
micromachines became more complicated
to overcome slower ROM, micromachines
were pipelined
complex instruction sets led to the need for
subroutine and call stacks in code.
need for fixing bugs in control programs
was in conflict with read-only nature of ROM
WCS (B1700, QMachine, Intel432, )
introduction of caches and buffers, especially for
instructions, made multiple-cycle execution of
reg-reg instructions unattractive

Modern Usage

Asanovic/Devadas
Spring 2002
6.823

Microprogramming is far from extinct


Played a crucial role in micros of the Eighties,
Motorola 68K series
Intel 386 and 486
Microcode is present in most modern CISC micros in an
assisting role (e.g. AMD Athlon, Intel Pentium-4)
Most instructions are executed directly, i.e., with
hard-wired control
Infrequently-used and/or complicated instructions
invoke the microcode engine
Patchable microcode common for post-fabrication bug

fixes, e.g. Intel Pentiums load code patches at bootup

Asanovic/Devadas

Spring 2002

6.823

Simple Instruction

Pipelining

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Processor Performance
Equation

Asanovic/Devadas
Spring 2002
6.823

Time = Instructions * Cycles


* Time

Program
Program
Instruction Cycle

Instructions per program depends on source code,


compiler technology, and ISA
Microcoded DLX from last lecture had cycles per
instruction (CPI) of around 7 minimum
Time per cycle for microcoded DLX fixed by
microcode cycle time
mostly ROM access + next PC select logic

Pipelined DLX
To pipeline DLX:
First build unpipelined DLX with CPI=1
Next, add pipeline registers to reduce
cycle time while maintaining CPI=1

Asanovic/Devadas

Spring 2002

6.823

A Simple Memory Model

Asanovic/Devadas

Spring 2002

6.823

WriteEnable
Clock
Address
WriteData

MAGIC
RAM

ReadData

Reads and writes are always completed in one cycle


a Read can be done any time (i.e. combinational)
a Write is performed at the rising clock edge
if it is enabled
the write address, data, and enable
must be stable at the clock edge

Asanovic/Devadas

Spring 2002

6.823

Datapath for ALU Instructions


RegWrite
0x4

clk

Add

PC
clk

addr

we
rs1
rs2
rd1
ws
wd rd2

inst<25:21>
inst<20:16>
inst

Inst.
Memory

GPRs

inst<15:11>
inst<15:0>

6
0

5
rf1

5
rf2

opcode

rf1

rf2

RegDst
rf2 / rf3

5
rf3

Imm
Ext

inst<31:26> <5:0>

OpCode

ALU

ALU
Control

ExtSel

5
0

OpSel

6
func

immediate

BSrc
Reg / Imm

rf3 (rf1) func (rf2)


rf2 (rf1) op immediate

Datapath for Memory


Instructions

Asanovic/Devadas
Spring 2002
6.823

Should program and data memory be separate?


Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
at some level the two memories have
to be the same
Princeton style: the same (von Neumanns influence)
- A Load or Store instruction requires
accessing the memory more than once
during its execution

Load/Store Instructions:

Asanovic/Devadas
Spring 2002
6.823

Harvard-Style Datapath
RegWrite
0x4

we
rs1
rs2
rd1
ws
wd rd2

base
addr

inst

Inst.
Memory

clk

WBSrc
ALU / Mem

clk
Add

PC

MemWrite

clk

ALU

GPRs

disp

we
addr
z

rdata

Data
Memory

Imm
Ext

wdata
ALU
Control

OpCode RegDst

6
opcode
31

26 25

5
rf1

ExtSel

5
rf2
21 20

16 15

OpSel

BSrc

16
displacement

addressing mode
(rf1) + displacement
0

rf1 is the base register

rf2 is the destination of a Load or the source for a Store

Memory Hierarchy c.2002

Asanovic/Devadas
Spring 2002
6.823

Desktop & Small Server

On-chip Caches
I$
Proc

D$

L2

0.5-2ns <10ns
2-3 clk 5~15 clk
8~64KB 0.25-2MB

Off-chip
L3 Cache
SRAM/
eDRAM
< 25ns

15~50 clk

1~8MB

Hard Disk
Interleaved
Banks of DRAM
~150ns
~10ms seek time

100~300 clk

~107 clk

64M~1GB

20~100GB

Our memory model is a good approximation of


the hierarchical memory system when we hit in
the on-chip cache

Conditional Branches
PCSrc ( ~j / j )

RegWrite

MemWrite

0x4
Add
Add
clk

PC

addr
inst

clk

Inst.
Memory

we
rs1
rs2
rd1
ws
wd rd2
GPRs

clk
ALU

Imm
Ext
ALU
Control

OpCode RegDst

ExtSel

OpSel BSrc

zero?

we
addr
rdata
Data
Memory
wdata

Asanovic/Devadas
Spring 2002
6.823

WBSrc

Register-Indirect Jumps
PCSrc ( ~j / j RInd / j PCR )

0x4

RegWrite

MemWrite

Add
Add
clk

PC
clk

addr

inst

Inst.
Memory

we
rs1
rs2
rd1
ws
wd rd2

clk
we
addr

ALU
z

GPRs
Imm
Ext

wdata
ALU
Control

Jump & Link?


OpCode RegDst

ExtSel

rdata

Data
Memory

OpSel BSrc

zero?

Asanovic/Devadas
Spring 2002
6.823

WBSrc

Jump & Link


RegWrite

PCSrc

Asanovic/Devadas

Spring 2002

6.823

MemWrite

0x4
Add
Add

clk

PC
clk

addr

inst

Inst.
Memory

31

we
rs1
rs2
rd1
ws
wd rd2

clk
ALU

GPRs

we
addr

Imm
Ext

wdata
ALU
Control

OpCode

RegDst ExtSel
rf3 / rf2 / R31

rdata

Data
Memory

OpSel BSrc

zero?

WBSrc
ALU / Mem / PC

PC-Relative Jumps
RegWrite

PCSrc

Asanovic/Devadas
Spring 2002
6.823

MemWrite

0x4
Add
Add
clk

PC
clk

addr

inst

31

Inst.
Memory

we
rs1
rs2
rd1
ws
wd rd2

clk

ALU

GPRs

we
addr
z

Imm
Ext

wdata
ALU
Control

No new
datapath
required
OpCode RegDst

rdata

Data
Memory

ExtSel OpSel
BSrc
Ext16 / Ext26

zero?

WBSrc

Hardwired Control is pure

Combinational Logic:

Unpipelined DLX

op code
zero?

combinational
logic

ExtSel
BSrc
OpSel
MemWrite
WBSrc
RegDst
RegWrite
PCSrc

Asanovic/Devadas

Spring 2002

6.823

ALU Control & Immediate


Extension

Asanovic/Devadas
Spring 2002
6.823

Inst<5:0> (Func)
Inst<31:26> (Opcode)

ALUop
+
0?

OpSel
( Func, Op, +, 0? )

Decode Map
ExtSel
( sExt16, uExt16,

sExt26, High16)

Hardwired Control
PCSrc
PCR / RInd / ~j

0x4

worksheet

RegWrite

MemWrite

Add

Add

0x4

Add
clk
inst<25:21>
inst<20:16>

PC

addr

clk

Inst.
Memory

31

inst

inst<15:11>

inst<25:0>
inst<31:26><5:0>

we
rs1
rs2
rd1
ws
wd rd2

clk
we
addr

ALU
z

GPRs
Imm
Ext

rdata

Data
Memory
wdata

ALU
Control

BSrc
Reg / Imm

OpCode
RegDst
rf2 / rf3 /
R31

Asanovic/Devadas
Spring 2002
6.823

ExtSel
sExt16/uExt16/
sExt26/High16

OpSel
Func/
Op/+ / 0?

zero?

WBSrc
ALU / Mem
/ PC

Hardwired Control Table


BSrc = Reg / Imm
PCSrc1 = j / ~j
Opcode

WBSrc = ALU / Mem / PC


PCSrc2 = PCR / RInd

Ext
Sel

BSrc Op
Sel

Mem
Write

Reg
Write

Asanovic/Devadas
Spring 2002
6.823

RegDst = rf2 / rf3 / R31


WB
Src

Reg
Dest

PC
Src

ALU

Reg

Func

no

yes

ALU

rf3

~j

ALUu

Reg

Func

no

yes

ALU

rf3

~j

ALUi

sExt16

Imm

Op

no

yes

ALU

rf2

~j

ALUiu

uExt16

Imm

Op

no

yes

ALU

rf2

~j

LW

sExt16

Imm

no

yes

Mem

rf2

~j

SW

sExt16

Imm

yes

no

~j

BEQZz=0

sExt16

0?

no

no

PCR

BEQZz=1

sExt16

0?

no

no

~j

sExt26

no

no

PCR

JAL

sExt26

no

yes

PC

R31

PCR

JR

no

no

RInd

JALR

no

yes

PC

R31

RInd

Asanovic/Devadas

Spring 2002

6.823

Hardwired Unpipelined Machine


Simple

One instruction per cycle

Why wasnt this a popular machine style?

Unpipelined DLX

Asanovic/Devadas

Spring 2002

6.823

Clock period must be sufficiently long for all of the


following steps to be completed:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB

At the rising edge of the following clock, the


PC, the register file and the memory are updated

Pipelined DLX Datapath

Asanovic/Devadas
Spring 2002
6.823

0x4
Add

PC

addr

rdata

Inst.
Memory

fetch
phase

IR

we
rs1
rs2
rd1
ws
wd rd2
GPRs

ALU

rdata

Data
Memory

Imm
Ext

decode & Reg-fetch


phase

we
addr

wdata

execute
phase

memory
phase

write
-back
phase

Clock period can be reduced by dividing the execution


of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} = tDM (probably)
However, CPI will increase unless instructions
are pipelined

An Ideal Pipeline
stage
1

stage
2

stage
3

Asanovic/Devadas
Spring 2002
6.823

stage
4

All objects go through the same stages


No sharing of resources between any two stages
Propagation delay through all pipeline stages is equal

The scheduling of an object entering the pipeline is


not affected by the objects in other stages
These conditions generally hold for industrial
assembly lines. An instruction pipeline, however,
cannot satisfy the last condition. Why?

Pipelining History

Asanovic/Devadas

Spring 2002

6.823

Some very early machines had limited pipelined


execution (e.g., Zuse Z4, WWII)
Usually overlap fetch of next instruction with current execution

IBM Stretch first major supercomputer


incorporating extensive pipelining, result
bypassing, and branch prediction
project started in 1954, delivered in 1961
didnt meet initial performance goal of 100x faster with 10x
faster circuits
up to 11 macroinstructions in pipeline at same time
microcode engine highly pipelined also (up to 6
microinstructions in pipeline at same time)
Stretch was origin of 8-bit byte and lower case characters,
carried on into IBM 360

Asanovic/Devadas

Spring 2002

6.823

How to divide the datapath

into stages

Suppose memory is significantly slower than other


stages. In particular, suppose
tIM = tDM = 10 units
tALU = 5 units
tRF = tRW = 1 unit
Since the slowest stage determines the clock, it may
be possible to combine some stages without any loss
of performance

Minimizing Critical Path

Asanovic/Devadas

Spring 2002

6.823

0 x4
Add

PC

addr

rdata

Inst.
Memory

fetch
phase

IR

we
rs1
rs2
rd1
ws
wd rd2
GPRs

ALU

Imm
Ext

decode & Reg-fetch & execute


phase

we
addr
rdata

Data
Memory
wdata

memory
phase

write
-back
phase

tC > max {tIM, tRF + tALU, tDM, tRW}


Write-back stage takes much less time than other stages.
Suppose we combined it with the memory phase

increase the critical path by 10%

Maximum Speedup by
Pipelining

Asanovic/Devadas
Spring 2002
6.823

For the 4-stage pipeline, given


tIM = tDM = 10 units, tALU = 5 units, tRF = tRW= 1 unit
tC could be reduced from 27 units to 10 units
speedup = 2.7
However, if tIM = tDM = tALU = tRF = tRW = 5 units

The same 4-stage pipeline can reduce tC from 25 units to

10 units

speedup = 2.5
But, since tIM = tDM = tALU = tRF = tRW, it is possible to
achieve higher speedup with more stages in the pipeline.
A 5-stage pipeline can reduce tC from 25 units to
5 units
speedup = 5

Technology Assumptions

Asanovic/Devadas

Spring 2002

6.823

We will assume
A small amount of very fast memory (caches)
backed up by a large, slower memory
Fast ALU (at least for integers)
Multiported Register files (slower!).
It makes the following timing assumption valid
tIM tRF tALU tDM tRW
A 5-stage pipelined Harvard-style architecture will
be the focus of our detailed design

5-Stage Pipelined Execution

Asanovic/Devadas
Spring 2002
6.823

0x4
Add

PC

we
addr

we
rs1
rs2
rd1
ws
wd rd2
GPRs

IR

rdata

Memory
wdata

fetch
phase
(IF)

ALU

rdata

Memory

Imm
Ext

decode & Reg-fetch


phase
(ID)

time
instruction1
instruction2
instruction3
instruction4
instruction5

t0
IF1

t1 t2
ID1 EX1
IF2 ID2
IF3

we
addr

wdata

execute
phase
(EX)
t3
MA1
EX2
ID3
IF4

memory
phase
(MA)

t4 t5 t6 t7 . . . .

WB1

MA2 WB2

EX3 MA3 WB3

ID4 EX4 MA4 WB4

IF5 ID5 EX5 MA5 WB5

write
-back
phase
(WB)

5-Stage Pipelined Execution


Resource Usage Diagram

Asanovic/Devadas
Spring 2002
6.823

0x4
Add
we
addr

IR

rdata

Memory
wdata

fetch
phase
(IF)

Resources

PC

we
rs1
rs2
rd1
ws
wd rd2
GPRs

ALU

rdata

Memory

Imm
Ext

wdata

decode & Reg-fetch


phase
(ID)
time
IF
ID
EX
MA
WB

t0
I1

t1
I2
I1

we
addr

t2
I3
I2
I1

execute
phase
(EX)
t3
I4
I3
I2
I1

t4
I5

I4
I3
I2
I1

write
-back
phase
(WB)

memory
phase
(MA)

t5

t6

t7

....

I5

I4
I3
I2

I5

I4
I3

I5

I4

I5

Pipelined Execution:
ALU Instructions
not quite correct!

0x4
Add

PC

addr
inst IR

Inst
Memory

Asanovic/Devadas
Spring 2002
6.823

31

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU

we
addr
rdata

Data
Memory

Imm
Ext

wdata
MD1

MD2

Pipelined Execution:
Need for Several IRs
0x4

IR

Add

PC

addr
inst IR

Inst
Memory

Asanovic/Devadas
Spring 2002
6.823

IR

IR
31

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU

we
addr
rdata

Data
Memory

Imm
Ext

wdata
MD1

MD2

Asanovic/Devadas

Spring 2002

6.823

IRs and Control points


0x4

IR

Add

PC

addr
inst IR

Inst
Memory

IR

IR
31

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU

we
addr
rdata

Data
Memory

Imm
Ext

wdata
MD1

MD2

Are control points connected properly?


- Load/Store instructions
- ALU instructions

Pipelined DLX Datapath

Asanovic/Devadas

Spring 2002

6.823

without jumps

IR

IR

IR
31

0x4
Add
RegDst
RegWrite

PC

addr
inst IR

Inst
Memory

we
rs1
rs2
rd1
ws
wd rd2

OpSel

MemWrite
WBSrc

A
ALU

GPRs

rdata

Data
Memory

Imm
Ext

wdata
MD1

ExtSel

we
addr

BSrc

MD2

Asanovic/Devadas

Spring 2002

6.823

How Instructions can Interact

with each other in a Pipeline

An instruction in the pipeline may need a resource


being used by another instruction in the pipeline
structural hazard
An instruction may produce data that is needed by
a later instruction
data hazard
In the extreme case, an instruction may determine
the next instruction to be executed
control hazard (branches, interrupts,...)

Asanovic/Devadas

Spring 2002

6.823

Feedback to Resolve Hazards


FB1

FB3

FB2

stage
1

stage
2

FB4

stage
3

stage
4

Controlling pipeline in this manner works provided

the instruction at stage i+1 can complete without

any interference from instructions in stages 1 to i

(otherwise deadlocks may occur)


Feedback to previous stages is used to stall or kill
instructions

Asanovic/Devadas
Spring 2002
6.823

Pipeline Hazards

Krste Asanovic
Laboratory for Computer Science
M.I.T.

Pipelined DLX Datapath

Asanovic/Devadas
Spring 2002
6.823

without interlocks and jumps

IR

IR

IR
31

0x4
Add
RegDst
RegWrite

PC

addr
inst IR

Inst
Memory

we
rs1
rs2
rd1
ws
wd rd2

OpSel

MemWrite
WBSrc

A
ALU

GPRs

rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1
ExtSel

we
addr

BSrc

MD2

Asanovic/Devadas
Spring 2002
6.823

Data Hazards

...
r1 r0 + 10
r4 r1 + 17
...
r4 r1
0x4

IR

Add

addr
inst IR

Inst
Memory

IR

IR
31

D
PC

r1

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU

we
addr

rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

MD2

Oops!

Resolving Data Hazards

Asanovic/Devadas
Spring 2002
6.823

1. Interlocks
Freeze earlier pipeline stages until data becomes
available

2. Bypasses
If data is available somewhere in the datapath
provide a bypass to get it to the right stage

Interlocks to resolve Data


Hazards

Asanovic/Devadas
Spring 2002
6.823

Stall Condition

0x4

nop
Add

PC

addr
inst IR

Inst
D
Memory

IR

IR

IR
31

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU
B

rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

...
r1 r0 + 10
r4 r1 + 17
...

we
addr

MD2

Stalled Stages and


Pipeline Bubbles
time
t0
t1
IF1 ID1
IF2

(I1) r1 r0 + 10
(I2) r4 r1 + 17
(I3)
(I4)
(I5)

Resource
Usage

IF
ID
EX
MA
WB

time
t0
t1
I1
I2
I1

t2
t3
t4
t5
EX1 MA1 WB1
ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages

t6

t2
I3
I2
I1

t6
I4
I3
I2
nop
nop

t3
I3
I2
nop
I1

t4
I3
I2
nop
nop
I1

t5
I3
I2
nop
nop
nop

t7

Asanovic/Devadas
Spring 2002
6.823

....

EX2 MA2 WB2


ID3 EX3 MA3 WB3
IF4 ID4 EX4 MA4 WB4
IF5 ID5 EX5 MA5 WB5

nop

t7
I5
I4
I3
I2
nop

....
I5
I4
I3
I2

I5
I4
I3

pipeline bubble

I5
I4

I5

Interlock Control Logic

Asanovic/Devadas
Spring 2002
6.823

worksheet

ws

stall
Cstall

rf1 ?
rf2

0x4

nop
Add

PC

addr
inst IR

Inst
D
Memory

IR

IR

IR
ws

we
rs1
rs2
rd1
ws
wd rd2

GPRs

we
A
ALU

31

Cdest

we
addr
rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

MD2

Compare the source registers of the instruction in


the decode stage with the destination register of the
uncommitted instructions.

Interlock Control Logic

Asanovic/Devadas
Spring 2002
6.823

ignoring jumps & branches


wsW
weW
Cstall wsM
weM
rf1
wsE
rf2 weE

stall

re1

we

re2
Cre

0x4

nop

Add

PC

addr
inst IR

Inst
D
Memory

ws

ws we
Cdest

IR

Cdest

IR

IR
ws

we
rs1
rs2
rd1
ws
wd rd2

GPRs

we
A
ALU

31

Cdest

we
addr
rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

MD2

Should we always stall if the rs field matches some rd?


not every instruction writes registers we
not every instruction reads registers re

Asanovic/Devadas
Spring 2002
6.823

Source & Destination Registers


R-type:

op

rf1

rf2

rf3

func

I-type:

op

rf1

rf2

immediate16

J-type:

op

immediate26
source(s)
rf1, rf2
rf1
rf1
rf1, rf2

rf3 rf1 func rf2


rf2 rf1 op imm
rf2 M [rf1 + imm]
M [rf1 + imm] rf2
cond rf1
true: PC PC + 4 + imm
false: PC PC + 4
J
PC PC + 4 + imm
JAL
r31 PC, PC PC + 4 + imm
JR
PC rf1
JALR r31 PC, PC rf1

ALU
ALUi
LW
SW
B__Z

destination
rf3
rf2
rf2

rf1
rf1
S
rf1
rf1

31
31

Deriving the Stall Signal


Cdest
ws = Case opcode
ALU
rf3
ALUi, LW
rf2
JAL, JALR
31

Cre

re1 = Case opcode


ALU, ALUi,
on
off
re2 = Case opcode

we = Case opcode
ALU, ALUi, LW,
JAL, JALR
on
...
off

Asanovic/Devadas
Spring 2002
6.823

on
off

stall =

Stall if the source registers of the instruction in the decode stage


matches the destination register of the uncommitted instructions.

The Stall Signal


Cdest
ws = Case opcode
ALU
rf3
ALUi, LW
rf2
JAL, JALR
31

Cre

we = Case opcode
ALU, ALUi, LW,
JAL, JALR
(ws 0)
...
off
Cstall

stall = ( (rf1D =wsE).weE +


(rf1D =wsM).weM +
(rf1D =wsW).weW ) . re1D
((rf2D =wsE).weE +
(rf2D =wsM).weM +
(rf2D =wsW).weW ) . re2D

Asanovic/Devadas
Spring 2002
6.823

re1 = Case opcode


ALU, ALUi, LW,
SW, BZ,
JR, JALR
on
J, JAL
off
re2 = Case opcode
ALU, SW
on
...
off

This is not
the full story !

Hazards due to Loads &


Stores

Asanovic/Devadas
Spring 2002
6.823

Stall Condition

0x4

nop
Add

PC

addr
inst IR

Inst
D
Memory

IR

IR

IR
31

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU
B

rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

...
M[r1+7] r2
r4 M[r3+5]
...

we
addr

MD2

Is there any possible data hazard


in this instruction sequence?

Hazards due to Loads & Stores

Asanovic/Devadas
Spring 2002
6.823

depends on the memory system


?

0x4

nop
Add

PC

addr
inst IR

Inst
D
Memory

IR

IR

IR
31

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU
B

rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

M[r1+7] r2
r4 M[r3+5]
...

we
addr

MD2

r1+7 = r3+5 data hazard


However, the hazard is avoided because
our memory system completes writes in
one cycle !

Asanovic/Devadas
Spring 2002
6.823

VAX: A Complex Instruction Set


Addressing Mode

Syntax

Length in bytes

Literal
Immediate
Register
Register deferred
Byte/word/long
displacement
Byte/word/long
displacement
deferred
Scaled (Indexed)
Autoincrement
Autodecrement
Autoincrement
deferred

#value
#value
Rn
(Rn)
disp(Rn)

1 (6-bit signed value)


1 + |immediate|
1
1
1 + |displacement|

@disp(Rn)

1 + |displacement|

base mode (Rx)


(Rn)+
-(Rn)
@(Rn)+

1 + |base addressing mode|


1
1
1

ADDL3 R1, 737 (R2), #456

1
+ 1 + ( 2 + 1 ) + (1+ 4) = 10 bytes !

R1 <-- M[(R2)+737] + 456

VAX Extremes

Asanovic/Devadas
Spring 2002
6.823

Long instructions, e.g.,


emodh #1.2334, 70000(r1)[r2],
#2.3456, 80000(r1)[r2],
90000(r1)[r2]
54 byte encoding (rumors of even longer instruction
sequences...)

Memory referencing behavior


In worst case, a single instruction could require at least 41
different memory pages to be resident in memory to complete
execution
Doesnt include string instructions which were designed to be
interruptable

Reduced Instruction Set


Computers

Asanovic/Devadas
Spring 2002
6.823

(Cocke, IBM; Patterson, UC Berkeley; Hennessy, Stanford)

Compilers have difficulty using complex instructions


VAX: 60% of microcode for 20% of instructions, only responsible for 0.2%
execution time
IBM experiment retargets 370 compiler to use simple subset of ISA
=> Compiler generated faster code!

Simple instruction sets dont need microcode


Use fast memory near processor as cache, not microcode storage

Design ISA for simple pipelined implementation

Fixed length, fixed format instructions


Load/store architecture with up to one memory access/instruction
Few addressing modes, synthesize others with code sequence
Register-register ALU operations
Delayed branch

MIPS R2000

Asanovic/Devadas
Spring 2002
6.823

(One of first commercial RISCs, 1986)


Load/Store architecture
32x32-bit GPR (R0 is wired), HI & LO SPR (for multiply/divide)
74 instructions
Fixed instruction size (32 bits), only 3 formats
PC-relative branches, register indirect jumps
Only base+displacement addressing mode
No condition bits, compares write GPRs, branches test GPRs
Delayed loads and branches

Five-stage instruction pipeline


Fetch, Decode, Execute, Memory, Write Back
CPI of 1 for register-to-register ALU instructions
8 MHz clock
Tightly-coupled off-chip FP accelerator (R2010)

RISC/CISC Comparisons

Asanovic/Devadas
Spring 2002
6.823

(late 80s/early 90s)

Time = Instructions * Cycles


* Time
Program
Program
Instruction Cycle
R2000 vs VAX 8700 [Bhandarkar and Clark, 91]
R2000 has ~2.7x advantage with equivalent technology

Intel 80486 vs Intel i860 (both 1989)


Same company, same CAD tools, same process
i860 2-4x faster - even more on some floating-point tasks

DEC nVAX vs Alpha 21064 (both 1992)


Same company, same CAD tools, same process
Alpha 2-4x faster

Complications due to Jumps


PCSrc2
( PCR / RInd)

PCSrc1
( j / ~j )

Add

I1
I2
I3
I4

PC

addr

104

Inst
Memory

096
100
104
304

stall

Add

for
register
indirect
jumps

0x4

inst

ADD
J +200
ADD
SUB

Asanovic/Devadas
Spring 2002
6.823

nop

Jump?

IR

IR

I1

IR
I2

kill

A jump instruction kills (not stalls)


the following instruction
How?

Asanovic/Devadas
Spring 2002
6.823

Pipelining Jumps

PCSrc2
( PCR / RInd)

PCSrc1
( j / ~j )

stall

304
Add

nop

0x4
Add

Jump?

IR

IR

I1

IRSrcD

I1
I2
I3
I4

PC

addr

104

Inst
Memory

096
100
104
304

inst

nop

ADD
J +200
ADD
SUB

IR
I2

Killing the fetched instruction:


Insert a mux before IR

kill

IRSrcD = Case opcodeD


J, JAL
nop
...
IM

Any interaction between stall and jump?

Asanovic/Devadas
Spring 2002
6.823

Jump Pipeline Diagrams

I1 096
I2 100
I3 104
I4 304

Resource
Usage

ADD
J +200
ADD kill
SUB

IF
ID
EX
MA
WB

time
t0
t1
IF1 ID1
IF2

time
t0
t1
I1
I2
I1

t2
I3
I2
I1

t2
EX1
ID2
IF3

t3
I4
nop
I2
I1

t3
MA1
EX2
nop
IF4

t4
WB1
MA2
nop
ID4

t4

t5

t5

t6

t7

....

WB2
nop nop
EX4 MA4 WB4

t6

t7

....

I4
nop I4
I2
nop I4
I1
I2
nop I4
nop

pipeline bubble

Pipelining Conditional Branches


PCSrc2
( PCR / RInd)

PCSrc1
( j / ~j )

stall

Add

nop

0x4
Add

BEQZ?

IR

IR

I1

zero?

IRSrcD

I1
I2
I3
I4

PC

addr

104

Inst
Memory

096
100
104
304

inst

nop

ADD
BEQZ r1, +200
ADD
ADD

Asanovic/Devadas
Spring 2002
6.823

IR
I2

A
ALU

Branch condition is not known until


the execute stage
what action should be taken in the
decode stage ?

Conditional Branches:
PCSrc2
PCSrc1
( jD/ jE / ~j ) ( PCR / RInd)

Asanovic/Devadas
Spring 2002
6.823

solution 1

Add

stall

Add

nop

0x4
Add

BEQZ?
E

IR

IR

I2

I1

IRSrcD
?

I1
I2
I3
I4

096
100
104
304

PC

addr

108

Inst
Memory

inst

nop

ADD
BEQZ r1, +200
ADD
ADD

zero?

PC

IR
I3

A
ALU

If the branch is taken


- kill the two following instructions
- the instruction at the decode
stage is not valid
stall signal is not valid

New Stall Signal

Asanovic/Devadas
Spring 2002
6.823

stall = (((rf1D =wsE).weE + (rf1D =wsM).weM + (rf1D =wsW).weW).re1D


+ ((rf2D =wsE).weE + (rf2D =wsM).weM + (rf2D =wsW).weW).re2D)
. !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z)

Control Equations for PC Muxes

Asanovic/Devadas
Spring 2002
6.823

Solution 1

PCSrc1 = Case opcodeE


BEQZ.z, BNEZ.!z jE
...
Case opcodeD
J, JAL, JR, JALR jD
...
~j
PCSrc2 = Case opcodeD
J, JAL
PCR
JR, JALR
RegI
IRSrcD = Case opcodeE
BEQZ.z, BNEZ.!z nop
...

(Case opcodeD
J, JAL, JR, JALR nop
...
IM)
IRSrcE = Case opcodeE
BEQZ.z, BNEZ.!z nop
...
stall.nop + !stall.IRD

Give priority
to the older
instruction,
i.e., execute
stage instruction
over decode
stage instruction

Conditional Branches:

solution 2
Test
for zero at the decode stage
PCSrc2

PCSrc1
( j / ~j )

Asanovic/Devadas
Spring 2002
6.823

stall

( PCR / RInd)
304

304

Add

108

BEQZ?

nop

0x4
Add

IR

IR

I1
IRSrcD

I1
I2
I3
I4

PC

addr

104

Inst
Memory

096
100
104
304

inst

nop

ADD
BEQZ r1, +200
ADD
ADD

zero?
or
IR
I2

A
ALU

Need to kill only one instruction !


Wouldnt work if DLX had general
branch conditions (i.e., r1>r2)?

Conditional Branches:
Delayed Branches

PCSrc2
( PCR / RInd)

PCSrc1
( j / ~j )

solution 3

Asanovic/Devadas
Spring 2002
6.823

stall

304
304

Add

108

BEQZ?

nop

0x4
Add

IR

IR

I1

zero?

I1
I2
I3
I4

096
100
104
304

PC

addr
inst

IR

104

Inst
Memory

I2

A
ALU

ADD
Change the semantics of branches and jumps
BEQZ r1, +200
Instruction after branch always executed,
ADD (delay slot)
regardless if branch taken or not taken.
ADD

Need not kill any instructions !

Annulling Branches

Asanovic/Devadas
Spring 2002
6.823

Plain delayed branches only allow compiler to fill delay slot with
instructions that will always be executed:
ADD R1, R2, R3
BEQZ R3, target
BEQZ R3, target
NOP (delay slot)
ADD R1, R2, R3 (delay slot)
Annulling branches kill delay slot if branch is not taken. Compiler
can therefore fill delay slot with copy of branch target:
ADD R1, R2, R3
ADD R1, R2, R3
BEQZ R1, target
BEQZL R1, target+4 (annulling)
NOP (delay slot)
ADDI R2, R2, #1 (delay slot)
...
...
target:
target:
ADDI R2, R2, #1
ADDI R2, R2, #1
SUB R5, R6, R8
SUB R5, R6, R8
Can also have variant of annulling branch that kills delay slot if
branch is taken

Branch Delay Slots

Asanovic/Devadas
Spring 2002
6.823

First introduced in pipelined microcode engines.


Adopted for early RISC machines with single-issue pipelines, made
visible to user-level software.
Advantages
simplifies control logic for simple pipeline
compiler helps reduce branch hazard penalties
~70% of single delay slots usefully filled

Disadvantages
complicates ISA specification and programming
adds extra next-PC state to programming model
complicates control logic for more aggressive implementations
e.g., out-of-order superscalar designs

Out of favor for new (post-1990) general-purpose ISAs


Later lectures will cover advanced techniques for control hazards:
dynamic (run-time) schemes, e.g., branch prediction
static (compile-time) schemes, e.g., predicated execution

Pipelining Delayed Jumps &


Links

Asanovic/Devadas
Spring 2002
6.823

RA

304

304

108
Add

0x4

RA

holding the return address


for linking

Add

IR

PC

104

I1
I2
I3
I4

addr

inst IR

Inst
Memory I2

096
100
104
304

we
rs1
rs2
rd1
ws
wd rd2

GPRs

31

I1
A
ALU

we
addr
rdata

Data
Memory

Imm
Ext

ADD
JALR +200
ADD (delay slot)
ADD

IR

IR

wdata
MD1

MD2

Asanovic/Devadas
Spring 2002
6.823

Bypassing
time
(I1) r1 r0 + 10
(I2) r4 r1 + 17
(I3)
(I4)
(I5)

t0
IF1

t1
ID1
IF2

t2
t3
t4
t5
EX1 MA1 WB1
ID2 ID2 ID2 ID2
IF3 IF3 IF3 IF3
stalled stages

t6

t7

....

EX2 MA2 WB2


ID3 EX3 MA3
IF4 ID4 EX4
IF5 ID5

Each stall or kill introduces a bubble in the pipeline


CPI > 1
A new datapath, i.e., a bypass, can get the data from
the output of the ALU to its input
time
(I1) r1 r0 + 10
(I2) r4 r1 + 17
(I3)
(I4)
(I5)

t0
IF1

t1
ID1
IF2

t2
EX1
ID2
IF3

t3
MA1
EX2
ID3
IF4

t4
WB1
MA2
EX3
ID4
IF5

t5

t6

t7

....

WB2
MA3 WB3
EX4 MA4 WB4
ID5 EX5 MA5 WB5

Asanovic/Devadas
Spring 2002
6.823

Adding Bypasses
Stall

r1 ...

r4 r1...
0x4

nop
Add

IR

IR

IR
31

ASrc

PC

addr
inst IR

Inst
D
Memory

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU
B

(I1)
(I2)

we
addr
rdata

Data
Memory

Imm
Ext

wdata

wdata

MD1

...
r1 r0 + 10
r4 r1 + 17
...

MD2

Of course you can add many


more bypasses!

The Bypass Signal

Asanovic/Devadas
Spring 2002
6.823

deriving it from the stall signal


Cdest
ws = Case opcode
ALU
rf3
ALUi, LW
rf2
JAL, JALR
R31
we = Case opcode
ALU, ALUi, LW (ws R0)
JAL, JALR
on
...
off
Cstall

Cre

stall = ( (rf1D =wsE).weE +


(rf1D =wsM).weM +
(rf1D =wsW).weW ) . re1D
((rf2D =wsE).weE +
(rf2D =wsM).weM +
(rf2D =wsW).weW ) . re2D

Cbypass ASrc = (rf1D =wsE).weE. re1D

re1 = Case opcode


ALU, ALUi, LW,
SW, BZ,
JR, JALR
on
J, JAL
off
re2 = Case opcode
ALU, SW
on
...
off

Is this correct ?

Usefulness of a Bypass

Asanovic/Devadas
Spring 2002
6.823

Stall

0x4

nop
Add

I1

IR

IR

IR
31

ASrc

PC

addr
inst IR

Inst
I
Memory 2

we
rs1
rs2
rd1
ws
wd rd2

GPRs

A
ALU

rdata

Data
Memory

Imm
Ext

MD1
Consider
...
...
r1 M[r0 + 10]
r1 r0 + 10
(I1)
r4 r1 + 17
(I2)
r4 r1 + 17
Where can this bypass help?

we
addr

wdata

wdata

MD2

...
JAL 500
r4 r31 + 17

Bypass and Stall Signals


Stall
0x4

re1

we-stall
we-bypass

re2

nop
Add

PC

addr

ASrc

inst IR

Inst
Memory

we
rs1
rs2
rd1
ws
wd rd2

GPRs

we

ws

IR

IR

ws

Cdest

Asanovic/Devadas
Spring 2002
6.823

we

ws

Cdest
IR

W
bypass
A
ALU

we
addr
rdata

Data
Memory

Imm
Ext

wdata
MD1

MD2

weE has to be split into two components


we-bypassE
= ((opcodeE=ALU) + (opcodeE=ALUiE)) . (wsE R0 )
we-stallE
= (opcodeE= LWE).(wsE R0)
+ (opcodeE=JALE) + (opcodeE=JALRE)
ASrc = (rf1D =wsE).we-bypassE . re1D
Stall = ((rf1D =wsE).we-stallE ) + (rf1D=wsM).weM + (rf1D=wsW).weW). re1D
+ ((rf2D = wsE).weE + (rf2D = wsM).weM + (rf2D = wsW).weW). re2D

Fully Bypassed Datapath


Stall

RA

Asanovic/Devadas
Spring 2002
6.823

RA
weW

Add

0x4
Add

Cdest

ASrc nop

PC

wsW

addr

inst IR

Inst
Memory

we
rs1
rs2
rd1
ws
wd rd2

Imm
Ext

Is there still
a need for the
Stall signal ?

A
ALU

GPRs

IR

IR

IR

we
addr
rdata

Data
Memory

wdata

BRSrc

MD1

MD2

Stall = (rf1D=wsE). (opcodeE=LWE).(wsER0 ).re1D


+ (rf2D=wsE). (opcodeE=LWE).(wsER0 ).re2D

Why an Instruction may not


be dispatched every cycle

Asanovic/Devadas
Spring 2002
6.823

(CPI>1)

Full bypassing may be too expensive to implement


typically all frequently used paths provided
some infrequently used bypass paths may increase cycle
time and/or cost more than their benefit in reducing CPI

Loads have two cycle latency


Instruction after load cannot use load result
MIPS-I ISA defined load delay slots, a software-visible
pipeline hazard (compiler schedules independent instruction
or inserts NOP to avoid hazard). Removed in MIPS-II.

Conditional branches may cause bubbles


kill following instruction(s) if no delay slots
(Machines with software-visible delay slots may execute
significant number of NOP instructions inserted by compiler code
scheduling)

Multilevel Memories
(Improving performance using a
little cash)

Page 1

CPU-Memory Bottleneck

CPU

Memory

Performance of high-speed computers is usually


limited by memory bandwidth & latency
Latency (time for a single access)
Memory access time >> Processor cycle time
Bandwidth (number of accesses per unit time)
if fraction m of instructions access memory,
1+m memory references / instruction
CPI = 1 requires 1+m memory refs / cycle
2

Page 2

Processor-DRAM Gap (latency)


Proc
60%/yr.
CPU

Moores Law
Processor-Memory
Performance Gap:
(grows 50% / year)

100
10

DRAM
7%/yr.

1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000

DRAM

1980
1981

Performance

1000

Time

[Courtesy of David Patterson, UC


Berkeley; Used with Permission]

Four-issue superscalar executes 400 instructions during cache miss!

Multilevel Memory
Strategy: Hide latency using small, fast
memories called caches.
Caches are a mechanism to hide memory
latency based on the empirical observation
that the stream of memory references made
by a processor exhibits locality

loop: ADD r2, r1, r1


SUBI r3, r3, #1
BNEZ r3, loop

PC
96
100
104
108

112

What is the pattern


of instruction
memory addresses?

Page 4

Typical Memory Reference Patterns

Address

linear sequence
n loop iterations

Instruction
fetches

Stack
accesses
Data
accesses
Time
5

Page 5

Locality Properties of Patterns

Two locality properties of memory references:


If a location is referenced it is likely to be
referenced again in the near future.
This is called temporal locality.
If a location is referenced it is likely that
locations near it will be referenced in the near
future.
This is called spatial locality.

Page 6

Caches

Caches exploit both properties of patterns.


Exploit temporal locality by remembering the
contents of recently accessed locations.
Exploit spatial locality by remembering blocks
of contents of recently accessed locations.

Page 7

Memory Hierarchy
A
CPU

Small,
Fast
Memory
(RF, SRAM)

Big, Slow
Memory
(DRAM)

holds frequently used data

size:
Register << SRAM << DRAM
latency:
Register << SRAM << DRAM
bandwidth: on-chip >> off-chip why?
On a data access:
hit (data fast memory)
miss (data fast memory)

why?
why?

low latency access


long latency access (DRAM)

Fast mem effective only if bandwidth reqrmnt at B << A

Due to cost

Due to size of DRAM

Due to cost and wire delays (wires on-chip cost much less, and are faster)

Page 8

Management of Memory Hierarchy


Software managed, e.g., registers
part of the software-visible processor state
software in complete control of storage allocation
but hardware might do things behind softwares back,
e.g., register renaming

Hardware managed, e.g., caches


not part of the software-visible processor state
hardware automatically decides what is kept in
fast memory
but software may provide hints, e.g., dont cache or
prefetch
9

Page 9

A Typical Memory Hierarchy c.2000

Split instruction & data primary


caches (on-chip SRAM)

CPU
RF
Multiported
register file
(part of CPU)

L1
Instruction
Cache

Multiple interleaved
memory banks
(DRAM)
Memory

Unified L2
Cache

L1 Data
Cache

Memory
Memory
Memory

Large unified secondary cache


(on-chip or off-chip SRAM)

10

Page 10

10

Inside a Cache
Address

Processor

Address

CACHE
Data

Data

Main
Memory

copy of main copy of main


memory
memory
location 100 location 101

Address
Tag

100

Data Data
Byte Byte

304

Data
Byte

Line

6848

Data Block
11

Page 11

11

Cache Algorithm (Read)


Look at Processor Address, search cache
tags to find match. Then either
Found in cache
a.k.a. HIT

Return copy
of data from
cache

Not in cache
a.k.a. MISS

Read block of data from


Main Memory
Wait

Which line
do we replace?

Return data to processor


and update cache
12

Page 12

12

Placement Policy
Block Number

1111111111 222222222233
0123456789 0123456789 012345678901

Memory

Set Number

01234567

Cache

block 12
can be placed

Fully
Associative

(2-way) Set
Associative

Direct
Mapped

anywhere

anywhere in
set 0
(12 mod 4)

only into
block 4
(12 mod 8)

13

Page 13

13

Direct-Mapped Cache
Tag

Index

t
V

Tag

Block
Offset

Data Block

2k
lines
t
=

HIT

Data Word or Byte


14

Page 14

14

Fully Associative Cache


V Tag

Data Block
t

Tag

Block
Offset

HIT

Data
Word
or Byte

15

Page 15

15

Direct Map Address Selection


higher-order vs. lower-order address bits
Block Address
indexm
-

tag

offsetb

tag

=
Selection based on lower-order
address bits or higher-order
address bits?

status

data block

valid?

hit?

word
16

Page 16

16

2-Way Set-Associative Cache


Tag
t

Index

k
V Tag Data Block

Block
Offset

V Tag Data Block

t
=

Data
Word
or Byte
HIT

17

Page 17

17

Highly-Associative Caches
For high associativity, use content-addressable
memory (CAM) for tags
Overhead?
tagt

seti

offsetb

Set i
Set 1
Set 0

Tag
Tag

=?
=?

Data Block
Data Block

Tag

=?

Data Block

Only one set enabled


Only hit data accessed

Hit?

Data

18

Advantage is low power because only hit data is accessed.

Page 18

18

Average Read Latency using a Cache


is HIT RATIO: Fraction of references in cache
1 - is MISS RATIO: Remaining references
Average access time for serial search:
Addr

Addr

CACHE

Processor
Data

Data

Main
Memory

tc + (1 - ) tm

Average access time for parallel search:


Addr

CACHE

Processor
Data

Data

Main
Memory

tc + (1 - ) tm

tc is smallest for which type of cache?

19

Page 19

19

Write Policy
Cache hit:
write through: write both cache & memory
- generally higher traffic but simplifies cache coherence

write back: write cache only (memory is written


only when the entry is evicted)
- a dirty bit per block can further reduce the traffic

Cache miss:
no write allocate: only write to main memory
write allocate: (aka fetch on write)
fetch block into cache
Common combinations:
write through and no write allocate
write back with write allocate

20

Page 20

20

Write Performance
Tag

Block
Offset

Index

t
V

Tag

Data
2k
lines

t
=

HIT

WE

Data Word or Byte


21

Page 21

21

Replacement Policy
In an associative cache, which block from a set should
be evicted when the set becomes full?
Random
Least Recently Used (LRU)
LRU cache state must be updated on every access
true implementation only feasible for small sets (2-way easy)
pseudo-LRU binary tree often used for 4-8 way

First In, First Out (FIFO) a.k.a. Round-Robin


used in highly associative caches

This is a second-order effect. Why?


22

Page 22

22

Causes for Cache Misses


Compulsory: first-reference to a block a.k.a. cold
start misses
- misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed
by the program
- misses that would occur even under perfect
placement & replacement policy
Conflict:

misses that occur because of collisions


due to block-placement strategy
- misses that would not occur with full associativity

Determining the type of a miss requires running


program traces on a cache simulator

23

Page 23

23

Improving Cache Performance

Average memory access time =


Hit time + Miss rate x Miss penalty
To improve performance:
reduce the miss rate (e.g., larger cache)
reduce the miss penalty (e.g., L2 cache)
reduce the hit time
Aside: What is the simplest design strategy?

24

Design the largest primary cache without slowing down the clock
Or adding pipeline stages.

Page 24

24

Block Size and Spatial Locality


Block is unit of transfer between the cache and memory
Tag
Split CPU
address

Word0

Word1

Word2

Word3

block address

2b

4 word block,
b=2

offsetb

b bits
32-b bits
= block size a.k.a line size (in bytes)

Larger block size has distinct hardware advantages


less tag overhead
exploit fast burst transfers from DRAM
exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size?


25

Larger block size will reduce compulsory misses (first miss to a block).
Larger blocks may increase conflict misses since the number of blocks
is smaller.

Page 25

25

Victim Caches (HP 7200)

CPU
RF

Unified L2
Cache

L1 Data
Cache

Replaced data
from L1
Hit data
from VC
(miss
in L1)

Victim
FA Cache
4 blocks

where ?
Replaced data
From VC

Works very well with a direct-mapped cache


26

Page 26

26

Pseudo-Associative Caches (MIPS R1000)


Look at Processor Address, look in indexed
set for data. Then either
HIT

MISS

Return copy
of data from
cache

Invert MSB of index field


Look in pseudo set
MISS

SLOW HIT
Read block of data from
next level of cache

27

Page 27

27

Reducing (Read) Miss Penalty

CPU

Unified
L2 Cache

L1 Data
Cache

Write
buffer

RF

Replaced dirty data from L1 in writeback cache


OR
All writes in writethru cache

Write buffer may hold updated value of location


needed by a read miss
On a read miss, simple scheme is to wait for the write
buffer to go empty
Check write buffer on read miss, if no conflicts, allow
read miss to continue (else, return value in write buffer)
Doesnt help much. Why?

28

Deisgners of the MIPS M/1000 estimated that waiting for a four-word buffer
to empty
increased the read miss penalty by a factor of 1.5.

Page 28

28

Block-level Optimizations
Tags are too large, i.e., too much overhead
Simple solution: Larger blocks, but miss
penalty could be large.

Sub-block placement
A valid bit added to units smaller than the full
block, called sub
- blocks
Only read a sub
- block on a miss
If a tag matches, is the word in the cache?

100
300
204

1
1
0

1
1
1

1
0
0

1
0
1
29

Page 29

29

Write Alternatives
- Writes take two cycles in memory stage, one cycle
for tag check plus one cycle for data write if hit
- Design data RAM that can perform read and write
in one cycle, restore old value after tag miss
- Hold write data for store in single buffer ahead of
cache, write cache data during next stores tag
check
-

Need to bypass from write buffer if read matches write buffer


tag

30

Page 30

30

Pipelining Cache Writes (Alpha 21064)


Tag

Block
Offset

Index

t
V

Tag

Data
2k
lines

t
=

HIT

WE

Bypass

Data Word or Byte


31

Page 31

31

Effect of Cache Parameters on Perf.


Larger cache size
+ reduces capacity and conflict misses
- hit time may increase
Larger block size
+ spatial locality reduces compulsory misses and
capacity reload misses
- fewer blocks may increase conflict miss rate
- larger blocks may increase miss penalty
Higher associativity
+ reduces conflict misses (up to around 4-8 way)
- may increase access time
See page 427 of text for a nice summary

32

Page 32

32

Cache (Memory) Performance


Optimization

Page 1

Improving Cache Performance

Average memory access time =


Hit time + Miss rate x Miss penalty
To improve performance:
reduce the miss rate (e.g., larger cache)
reduce the miss penalty (e.g., L2 cache)
reduce the hit time
The simplest design strategy is to design the
largest primary cache without slowing down the
clock or adding pipeline stages
2

Design the largest primary cache without slowing down the clock
Or adding pipeline stages.

Page 2

RAM-tag 2-way Set-Associative Cache


Tag
t

Index
k
Tag

Data Word

Tag

Data Word

MUX

Set

Data Word

Page 3

CAM-tag 2-way Set-Associative Cache


Tag
t

Index
k
Word

Word

Tag

Demux

Demux

Tag

Data Word
4

Page 4

Causes for Cache Misses


Compulsory: first-reference to a block a.k.a. cold
start misses
- misses that would occur even with infinite cache
Capacity: cache is too small to hold all data needed
by the program
- misses that would occur even under perfect
placement & replacement policy
Conflict:

misses that occur because of collisions


due to block-placement strategy
- misses that would not occur with full associativity

Page 5

Block-level Optimizations
Tags are too large, i.e., too much overhead
Simple solution: Larger blocks, but miss
penalty could be large.

Sub-block placement
A valid bit added to units smaller than the full
block, called sub
- blocks
Only read a sub
- block on a miss
If a tag matches, is the word in the cache?

100
300
204

1
1
0

1
1
1

1
0
0

1
0
1
6

Main reason for subblock placement is to reduce tag overhead.

Page 6

Write Alternatives
- Writes take two cycles in memory stage, one cycle
for tag check plus one cycle for data write if hit
- Design data RAM that can perform read and write
in one cycle, restore old value after tag miss
- Hold write data for store in single buffer ahead of
cache, write cache data during next stores tag
check
-

Need to bypass from write buffer if read matches write buffer


tag

Page 7

Pipelining Cache Writes (Alpha 21064)


Tag

Block
Offset

Index

t
V

Tag

Data
2k
lines

t
=

WE

Bypass

Data Word or Byte

HIT

Writes usually take longer than reads because the tags have to

Be checked before writing the data.

First, tags and data are split, so they can be addressed independently.

Page 8

Prefetching
Speculate on future instruction and data
accesses and fetch them into cache(s)
Instruction accesses easier to predict than data
accesses

Varieties of prefetching
Hardware prefetching
Software prefetching
Mixed schemes

What types of misses does prefetching affect?


9

Reduces compulsory misses, can increase conflict and


capacity misses.

Page 9

Issues in Prefetching
Usefulness should produce hits
Timeliness not late and not too early
Cache and bandwidth pollution

CPU
RF

L1
Instruction
Unified L2
Cache
L1 Data
Prefetched data

10

Page 10

10

Hardware Prefetching of Instructions


Instruction prefetch in Alpha AXP 21064
Fetch two blocks on a miss; the requested
block and the next consecutive block
Requested block placed in cache, and next
block in instruction stream buffer

Req
block

CPU
RF

Stream
Buffer

Prefetched
instruction block

(4 blocks)

L1
Instruction

Req
block

Unified L2
Cache

11

Need to check the stream buffer if the requested block is in there.


Never more than one 32-byte block in the stream buffer.

Page 11

11

Hardware Data Prefetching


One Block Lookahead (OBL) scheme
Initiate prefetch for block b + 1 when block b is
accessed
Why is this different from doubling block size?

Prefetch-on-miss:
Prefetch b + 1 upon miss on b

Tagged prefetch:
Tag bit for each memory block
Tag bit = 0 signifies that a block is demand
fetched or if a prefetched block is referenced for
the first time
Prefetch for b + 1 initiated only if tag bit = 0 on b
12

HP PA 7200 uses OBL prefetching


Tag prefetching is twice as effective as prefetch-on-miss in
reducing miss rates.

Page 12

12

Comparison of Variant OBL Schemes


Prefetch-on-miss accessing contiguous blocks
Demand-fetched

Demand-fetched

Demand-fetched

Prefetched

Prefetched

Prefetched
Demand-fetched
Prefetched

Tagged prefetch accessing contiguous blocks


0 Demand-fetched

0 Demand-fetched

0 Demand-fetched

1 Prefetched

0 Prefetched

0 Prefetched

1 Prefetched

0 Prefetched

1 Prefetched

0
13

Page 13

13

Software Prefetching

for(i=0; i < N; i++) {


fetch( &a[i + 1] );
fetch( &b[i + 1] );
SUM = SUM + a[i] * b[i];
}

What property do we require of the cache for


prefetching to work ?

14

Cache should be non-blocking or lockup-free.

By that we mean that the processor can proceed while the prefetched

Data is being fetched; and the caches continue to supply instructions

And data while waiting for the prefetched data to return.

Page 14
14

Software Prefetching Issues


Timing is the biggest issue, not predictability
If you prefetch very close to when the data is
required, you might be too late
Prefetch too early, cause pollution
Estimate how long it will take for the data to
come into L1, so we can set P appropriately
Why is this hard to do?
for(i=0; i < N; i++) {
fetch( &a[i + P] );
fetch( &b[i + P] );
SUM = SUM + a[i] * b[i];
}

Dont ignore cost of prefetch instructions

15

Page 15

15

Compiler Optimizations
Restructuring code affects the data block
access sequence
Group data accesses together to improve
spatial locality
Re
- order data accesses to improve temporal
locality

Prevent data from entering the cache


Useful for variables that are only accessed
once

Kill data that will never be used


Streaming data exploits spatial locality but not
temporal locality
16

Miss-rate reduction without any hardware changes.


Hardware designers favorite solution.

Page 16

16

Loop Interchange
for(j=0; j < N; j++) {
for(i=0; i < M; i++) {
x[i][j] = 2 * x[i][j];
}
}

for(i=0; i < M; i++) {


for(j=0; j < N; j++) {
x[i][j] = 2 * x[i][j];
}
}

What type of locality does this improve?


17

Page 17

17

Loop Fusion
for(i=0; i < N; i++)
for(j=0; j < M; j++)
a[i][j] = b[i][j] * c[i][j];
for(i=0; i < N; i++)
for(j=0; j < M; j++)
d[i][j] = a[i][j] * c[i][j];

for(i=0; i < M;
for(j=0; j <
a[i][j] =
d[i][j] =
}

i++)
N; j++) {
b[i][j] * c[i][j];
a[i][j] * c[i][j];

What type of locality does this improve?


18

Page 18

18

Blocking
for(i=0; i < N; i++)
for(j=0; j < N; j++) {
r = 0;
for(k=0; k < N; k++)
r = r + y[i][k] * z[k][j];
x[i][j] = r;
}

Not touched

Old access

New access

19

Page 19

19

Blocking
for(jj=0; jj < N; jj=jj+B)
for(kk=0; kk < N; kk=kk+B)
for(i=0; i < N; i++)
for(j=jj; j < min(jj+B,N); j++) {
r = 0;
for(k=kk; k < min(kk+B,N); k++)
r = r + y[i][k] * z[k][j];
x[i][j] = x[i][j] + r;
}

What type of locality does this improve?

20

Y benefits from spatial locality


z benefits from temporal locality

Page 20

20

CPU-Cache Interaction (5-stage pipeline)


0x4

Add

M
A

nop
PC

addr

inst

IR
D

Decode,
Register
Fetch

ALU

Primary
Data rdata
Cache
wdata

hit?

PCen

we
addr

Primary
Instruction
Cache

MD1

hit?

MD2

Stall entire
CPU on data
cache miss
To Memory Control

Cache Refill Data from Lower Levels


of Memory Hierarchy

Instruction miss ?

21

Page 21

21

Memory (DRAM) Performance

1 Gb
DRAM

50-100 ns access time


Needs refreshing

Upon a cache miss


4 clocks to send the address
24 clocks for the access time per word
4 clocks to send a word of data

Latency worsens with increasing block size

22

Need 128 or 116 clocks, 128 for a dumb memory.

Page 22

22

Memory organizations
CPU

CPU

CPU

Cache
Bus

Cache
Bus

Bus

Cache

Memory

bank 0

bank 1

Memory
Wide memory

Interleaved memory

1 word wide

23

Alpha AXP 21064 256 bits wide memory and cache.

Page 23

23

Interleaved Memory
Banks are often 1 word wide
Send an address to all the banks
How long to get 4 words back?
CPU

Bus

Cache

Bank i contains all


words whose address
modulo N is i

B0

B1

B2

B3
24

4 + 24 + 4* 4 clocks = 44 clocks from interleaved memory.

Page 24

24

Independent Memory
Send an address to all the banks
How long to get 4 words back?
CPU

Bus

Bus

Bus

Non
Blocking
Cache

Bus

Bank i contains all


words whose address
modulo N is i

B0

B1

B2

B3

25

4 + 24 + 4 = 32 clocks from main memory for 4 words.

Page 25

25

Memory Bank Conflicts


Consider a 128-bank memory in the NEC SX/3
where each bank can service independent
requests
int x[256][512];
for(j=0; j < 512; j++)
for(i=0; i < 256; i++)
x[i][j] = 2 * x[i][j];
Consider column k x[i][k], 0 <= i < 256
Address of elements in column is i*512 + k
Where do these addresses go?
26

Page 26

26

Bank Assignment
Bank number = Address MOD NumBanks
Address within bank =
Address
NumBanks
.

Address
within
Bank

0
1
2
3

B0

0
1
2
3

B1

0
1
2
3

B2

0
1
2
3

B3

Should randomize address to bank mapping.


Could use odd/prime number of banks. Problem?
Bank number for i*512 + k should depend on i
Can pick more significant bits in address
27

Page 27

27

Virtual Memory Basics

Page 1

Memory Management
The Fifties:
- Absolute Addresses
- Dynamic address translation
The Sixties:
- Paged memory systems and TLBs
- Atlas Demand paging
Modern Virtual Memory Systems

Page 2

Types of Names for Memory Locations

machine
language
address

ISA

Address
Mapping physical
address

virtual
address

Physical
Memory
(DRAM)

Machine language address

as specified in machine code

Virtual address

ISA specifies translation of machine code address into virtual


address of program variable

Physical address
operating system specifies mapping of virtual address into
name for a physical memory location

Translation of machine code address into virtual address


may involve a segment register.
Physical memory location => actual address signals going to DRAM chips.

Page 3

Absolute Addresses
EDSAC, early 50s
effective address = physical memory address
Only one program ran at a time, with unrestricted access
to entire machine (RAM + I/O devices)
Addresses in a program depended upon where
the program was to be loaded in memory
But it was more convenient for programmers to
write location-independent subroutines
How could location independence be achieved?
4

Led to the development of loaders and linkers to statically relocate and link
Programs.

Page 4

Dynamic Address Translation

Higher throughput if CPU and I/O of 2 or more


programs were overlapped. How?
multiprogramming

prog1

Location independent programs:


Programming and storage management ease
need for a base register
Protection:
Independent programs should not affect
each other inadvertently
need for a bound register

prog2

Physical Memory

Motivation:
In the early machines, I/O operations were slow
and each word transferred involved the CPU

Page 5

Bound
Register

Load X

Program
Address
Space

Effective Addr
Register

Bound
Violation
current
segment

+
Physical
Address

Base
Register

Main Memory

Simple Base and Bound Translation

Base and bounds registers only visible/accessible when


processor running in kernel (a.k.a supervisor) mode
6

Page 6

Load X

Program
Address
Space

Data Bound
Register
Effective Addr
Register
Data Base
Register

Bound
Violation

data
segment

Program Bound
Register
Program
Counter
Program Base
Register

Bound
Violation

Main Memory

Separate Areas for Program and Data

program
segment

What is an advantage of this separation?


Used today on Cray vector supercomputers

Permits sharing of program segments.

Page 7

Memory Fragmentation

OS
Space
user 1
16 K
user 2
24 K

Users 4 & 5
arrive

OS
Space
user 1
16 K
user 2
24 K

user 3

32 K

user 4
free
user 3

free

24 K

user 5

free

24 K

Users 2 & 5
leave

OS
Space
user 1
16 K
free

16 K
32 K

user 4
free
user 3

24 K

free

8 K

24 K
16 K
8 K

32K
24 K

As users come and go, the storage is fragmented.


Therefore, at some stage programs have to be moved
around to compact the storage.

Called Burping the memory.

Page 8

Paged Memory Systems


Processor generated address can be interpreted
as a pair <page number,offset>
page number

offset

A page table contains the physical address of


the base of each page
0
1
2
3
Address Space
of User-1

1
0

0
1
2
3
Page Table
of User-1

What requirement does fixed-length pages plus


indirection through page table relax?

3
2
9

Relaxes the contiguous allocation requirement.

Page 9

User 1

VA1
Page Table

User 2

Physical
Memory

Private Address Space per User

OS
pages

VA1
Page Table

User 3

VA1

Page Table

FREE

Each user has a page table


Page table contains an entry for each user page
10

OS ensures that the page tables are disjoint.

Page 10

10

Where Should Page Tables Reside?


Space required by the page tables is proportional to
the address space, number of users, ...
Space requirement is large
too expensive to keep in registers

Special registers just for the current user:


- What disadvantages does this have?
may not be feasible for large page tables

Main memory:
- needs one reference to retrieve the page base
address and another to access the data word
doubles number of memory references!
11

Affects context-switching overhead, and needs new management


instructions.

Page 11

11

Page Tables in Physical Memory


Page Table, User 1

Page Table, User 2

VA1
User 1

VA1
User 2

12

Page 12

12

A Problem in Early Sixties


There were many applications whose data could not
fit in the main memory, e.g., Payroll
Paged memory system reduced fragmentation but
still required the whole program to be resident in
the main memory
Programmers moved the data back and forth from the
secondary store by overlaying it repeatedly on the
primary store
tricky programming!

13

Page 13

13

Manual Overlays
Assuming an instruction can address
all the storage on the drum
Method 1 - programmer keeps track of
addresses in the main memory and
initiates an I/O transfer when required
Method 2 - automatic initiation of I/O
transfers by software address
translation
Brookers interpretive coding, 1960

40k bits
main

central
store

640k bits
drum

Ferranti Mercury
1956

Method 1 problem ?
Method 2 problem ?
14

British Firm Ferranti, did Mercury and then Atlas

Method 1 too difficult for users

Method 2 too slow.

Page 14

14

Demand Paging in Atlas (1962)

A page from secondary


storage is brought into the
primary storage whenever
it is (implicitly) demanded
by the processor.
Tom Kilburn

Primary
32 Pages
512 words/page

Central
Memory

Secondary
(Drum)
32x6 pages

Primary memory as a cache


for secondary memory

User sees 32 x 6 x 512 words


of storage

15

Single-level Store

Page 15

15

Hardware Organization of Atlas

Effective
Address

Initial
Address
Decode

PARs

48-bit words
512-word pages

Fixed (ROM)
16 pages
0.4 ~1 sec

system code
(not swapped)

Subsidiary
2 pages
1.4 sec

system data
(not swapped)

Main
32 pages
1.4 sec

1 PAR per page frame

Drum (4)
192 pages

Tape
8 decks
88 sec/word

31

(Page Address Register)

<effective PN , status>

Compare the effective page address against all 32 PARs


match
normal access
no match
page fault
the state of the partially executed
instruction was saved

16

Atlas Autocode example here.

Page 16

16

Atlas Demand Paging Scheme


On a page fault:
input transfer into a free page is initiated
the Page Address Register (PAR) is updated
if no free page is left, a page is selected
to be replaced (based on usage)
the replaced page is written on the drum
- to minimize drum latency effect, the first
empty page on the drum was selected
the page table is updated to point to the new
location of the page on the drum

17

This was called the Supervisor program, which clearly


foreshadowed the operating system.

Page 17

17

Caching vs. Demand Paging


secondary
memory
CPU

cache

primary
memory

CPU

Caching
cache entry
cache block (~32 bytes)
cache miss (1% to 20%)
cache hit (~1 cycle)
cache miss (~10 cycles)
a miss is handled
in hardware

primary
memory

Demand paging
page-frame
page (~4K bytes)
page miss (<0.001%)
page hit (~100 cycles)
page miss(~5M cycles)
a miss is handled
mostly in software

18

Page 18

18

Modern Virtual Memory Systems


illusion of a large, private, uniform store

Protection & Privacy


- several users, each with their private

OS

address space and one or more shared


address spaces
page table name space

Demand Paging
- ability to run a program larger than
than the primary memory

useri

Primary
Memory

Swapping
Store

What is another big benefit ?


The price is address translation on
each memory reference

VA

mapping
TLB

PA
19

Portability on machines with different memory configurations.

Page 19

19

Address Translation and Protection


Virtual Address

Virtual Page No. (VPN)

offset

Kernel/User Mode
Read/Write

Protection
Check

Address
Translation

Exception?
Physical Address

Physical Page No. (PPN)

offset

Every instruction and data access needs address


translation and protection checks
A good VM design needs to be fast (~ one cycle) and
space efficient

20

Page 20

20

Linear Page Table


Virtual address

VPN

Data Pages

Offset

Page Table
Page Table Entry (PTE) contains:
PPN (physical page number) of
memory-resident page,
DPN (disk page number) of
page swapped to disk, or
non-existent page
Status bits for protection and
usage
OS changes page table
base register to point to
base of page table for
active user process

PPN
PPN
DPN
PPN

Data word
Offset

DPN
PPN
PPN
DPN
DPN

VPN
DPN
PPN
PPN

PT Base Register

21

Page 21

21

Size of Linear Page Table


With 32-bit addresses, 4-KB pages, and 4-byte PTEs:
220 PTEs, i.e, 4 MB page table per user

4 GB of swap needed to back up full virtual address


space

Larger pages?
more internal fragmentation (dont use all memory in page)
larger page fault penalty (more time to read from disk)
What about 64-bit virtual address space???
Even 1MB pages would require 244 8-byte PTEs (35 TB!)

What is the saving grace ?


22

Virtual address space is large but only a small fraction of the


pages are populated. So we can use a sparse representation
of the table.

Page 22

22

Hierarchical Page Table


Virtual Address
31

22 21

p1
10-bit L1 index

12 11

p2

offset
offset

10-bit L2 index

Root of the Current


Page Table

p2
p1

(Processor
Register)

Level-1
Page Table

page in primary memory


page in secondary memory

Level-2
Page Tables
Data Pages23

PTE of a nonexistent page

Page 23

23

Translation Lookaside Buffers


Address translation is very expensive!
In a two-level page table, each reference becomes
the best case is
the worst case is

?
?

Solution: Cache translations in TLB


TLB hit Single Cycle Translation
TLB miss Page Table Walk to refill
virtual address
VRWD

tag

PPN

VPN

offset

(VPN = virtual page number)


(PPN = physical page number)

hit?

physical address

PPN

offset

24

3 memory references
2 page faults (disk accesses) + ..

Page 24

24

TLB Designs
Typically 32-128 entries
Usually fully associative
Each entry maps a large page, hence less spatial locality
across pages more likely that two entries conflict
Sometimes larger TLBs are 4-8 way set-associative
Random or FIFO replacement policy
Typically only one page mapping per entry
No process information in TLB

TLB Reach: Size of largest virtual address space that can be


simultaneously mapped by TLB
Example: 64 TLB entries, 4KB pages, one page per entry
TLB Reach = ______________________________?
25

Page 25

25

Handling A TLB Miss


Software (MIPS, Alpha)
TLB miss causes an exception and the operating
system walks the page tables and reloads TLB
privileged untranslated addressing mode used
for walk
Hardware (SPARC v8, x86, PowerPC)
A memory management unit (MMU) walks the
page tables and reloads the TLB
If a missing (data or PT) page is encountered
during the TLB reloading, MMU gives up and
signals a Page-Fault exception for the original
instruction

26

Page 26

26

Hierarchical Page Table Walk: SPARC v8


Virtual Address Index 1
31
Context
Table
Register

Context Table

Context
Register

root ptr

Index 2
23

Index 3
17

Offset
11

L1 Table
L2 Table
PTP

L3 Table
PTP

PTE

31

Physical Address

11

PPN

Offset

MMU does this table walk in hardware on a TLB miss

27

Page 27

27

Address Translation: putting it all together


Virtual Address
hardware
hardware or software
software

TLB
Lookup
miss

hit

Page Table
Walk
memory

Protection
Check

the page is

Page Fault
(OS loads page)
Where?

memory

denied

Protection
Fault

Update TLB

SEGFAULT

permitted

Physical
Address
(to cache)
28

Need to restart instruction.


Soft and hard page faults.

Page 28

28

Virtual Memory:
Part Deux

Page 1

Address Translation: putting it all together


Virtual Address
Restart instruction

hardware
hardware or software
software

TLB
Lookup

miss

hit

Page Table
Walk
memory

Protection
Check

the page is

Page Fault
(OS loads page)

memory

denied

Protection
Fault

Update TLB

SEGFAULT

permitted

Physical
Address
(to cache)
2

Need to restart instruction.

Page 2

Address Translation in CPU Pipeline

PC

Inst.
TLB

Inst.
Cache

Decode

TLB miss? Page Fault?


Protection violation?

Data
TLB

Data
Cache

TLB miss? Page fault?


Protection violation?

Need restartable exception on page fault/protection


violation for software handler
Either hardware or software TLB refill on TLB miss
Coping with the additional latency of a TLB:

slow down the clock


pipeline the TLB and cache access
virtual address caches
parallel TLB/cache access

Page 3

Virtual Address Caches


Instead of
CPU

VA

PA
TLB

Physical
Cache

Primary
Memory

place the cache before the TLB


VA
CPU

Virtual
Cache

TLB

PA

Primary
Memory

(StrongARM)

+ one-step process in case of a hit


- cache needs to be flushed on a context switch
unless address space identifiers (ASIDs) included in tags
- aliasing problems due to the sharing of pages
4

Page 4

Aliasing in Virtual-Address Caches


VA1

Page Table
Data Pages
PA

Tag

Data

VA1

1st Copy of Data at PA

VA2

2nd Copy of Data at PA

VA2

Virtual cache can have two


copies of same physical data.
Writes to one copy not visible to
reads of other!

Two virtual pages share


one physical page

General Solution: Disallow aliases to coexist in cache


Software (i.e., OS) solution for direct-mapped cache:
VAs of shared pages must agree in cache index
bits; this ensures all VAs accessing same PA will
conflict in direct-mapped cache (early SPARCs)

Two processes sharing the same file,

Map the same memory segment to different

Parts of their address space.

Page 5

Concurrent Access to TLB & Cache


VA

VPN

TLB
PA

PPN

b
k

Page Offset

Tag

=
hit?

Virtual
Index
Direct-map Cache
2L blocks
b
2 -byte block

Physical Tag

Data

Index L is available without consulting the TLB


cache and TLB accesses can begin simultaneously
Tag comparison is made after both accesses are completed
Cases: L + b = k, L + b < k, L + b > k

Page 6

Virtual-Index Physical-Tag Caches:


Associative Organization
VA

VPN

L = k-b

TLB
PA

PPN

Virtual
Index

2a

b
Direct-map
2L blocks

Direct-map
2L blocks
Phy.
Tag

Page Offset

Tag

hit?

After the PPN is known, 2a physical tags are compared

Data

Is this scheme realistic?


7

Consider 4-Kbyte pages and


caches with 32-byte blocks
32-Kbyte cache 2a = 8
4-Mbyte cache
=1024

2a
No !

Page 7
7

Concurrent Access to TLB & Large L1


Virtual Index

VA

VPN

Page Offset

TLB
PA

PPN

Page Offset

L1 PA cache
Direct-map

VA1 PPN

Data

VA2 PPN

Data

Tag

hit?

Can VA1 and VA2 both map to PA ?


8

If they differ in the lower a bits alone, and share a physical page.

Page 8

Second Level Cache


L1
Instruction
Cache
CPU
RF

Memory
Unified L2
Cache

L1 Data
Cache

Memory
Memory
Memory

Usually a common L2 cache backs up both Instruction


and Data L1 caches
L2 is inclusive of both Instruction and Data caches
9

Page 9

Anti-Aliasing Using L2: (MIPS R10000)


Virtual Index

VA

VPN

TLB
PA

PPN

Page Offset

into L2 tag

Page Offset

L1 PA cache
Direct-map
VA1 PPN

Data

VA2 PPN

Data

b
PPN

Tag

Suppose VA1 and VA2 both map to PA


and VA1 is already in L1, L2 (VA1 VA2)
After VA2 is resolved to PA, a collision
will be detected in L2.
VA1 will be purged from L1 and L2, and
VA2 will be loaded no aliasing !

PA

a1

hit?

Data

Direct-Mapped L2
10

Page 10

10

Virtually-Addressed L1:
Anti-Aliasing using L2
VA

VPN

Page Offset b

TLB
PA

PPN
Tag

Page Offset
Physical
Index & Tag

Virtual
Index & Tag
VA1

Data

VA2

Data

L1 VA Cache
Virtual
Tag
PA VA1

Data

L2 PA Cache

Physically-addressed L2 can also be used L2 contains L1


to avoid aliases in virtually-addressed L1
11

Page 11

11

Page Fault Handler


When the referenced page is not in DRAM:
The missing page is located (or created)
It is brought in from disk, and page table is updated
(A different job may start running on the CPU while the first
job waits for the requested page to be read from disk)

If no free pages are left, a page is swapped out


Pseudo-LRU replacement policy

Since it takes a long time to transfer a page (secs),


page faults are handled completely in software by the
operating system
Untranslated addressing mode is essential to allow kernel to
access page tables

12

Deleted pseudo-LRU policy here. Pseudo-LRU means

You cant afford to do the real thing.

Essentially has 1 bit to indicate if a page has been recently used

Or not. A round-robin policy is used to find the first page

with a value of 0.

Page 12
12

Swapping a Page of a Page Table


A PTE in primary memory contains
primary or secondary memory addresses
A PTE in secondary memory contains
only secondary memory addresses
a page of a PT can be swapped out only
if none its PTEs point to pages in the
primary memory
Why?__________________________________
13

What is the worst thing you can do with respect to storing page tables?
Storing page table on disk for whose entries point to phys. Mem.

Page 13

13

Atlas Revisited
One PAR for each physical page
PARs contain the VPNs of the
pages resident in primary
memory

PARs

PPN

VPN

Advantage: The size is


proportional to the size of the
primary memory
What is the disadvantage ?
14

Good old Atlas which had all the ideas

(Was an improvement on most of its successors.)

Page 14

14

Hashed Page Table:

Approximating Associative Addressing


VPN

Virtual Address
Page Table

PID

hash

Offset

PA of PTE

Base of Table
VPN PID PPN

Hashed Page Table is typically 2 to 3


times larger than the number of PPNs
to reduce collision probability

VPN PID DPN


VPN PID

It can also contain DPNs for some


non-resident pages (not common)
If a translation cannot be resolved in
this table then the software consults a
data structure that has an entry for
every existing page

Primary
Memory
15

Page 15

15

Global System Address Space


User

map
Level A

User

Global
System
Address
Space

map

Physical
Memory

Level B

map

Level A maps users address spaces into the global


space providing privacy, protection, sharing etc.
Level B provides demand-paging for the large global
system address space
Level A and Level B translations may be kept in
16
separate TLBs

Page 16

16

Hashed Page Table Walk:

PowerPC Two-level, Segmented Addressing


Seg ID

64-bit user VA
0

hashS
PA of Seg Table
per process

63

PA

Global Seg ID
0

Page
51

hashP

Offset
51

Hashed Segment Table

80-bit System VA

PA of Page Table
system-wide

Page
35

Offset
67

79

Hashed Page Table

PA
0

40-bit PA

27

PPN

39

Offset

17

Page 17

17

Power PC: Hashed Page Table


VPN

80-bit VA

Page Table
hash

Offset

PA of Slot

VPN
VPN

PPN

Base of Table
Each hash table slot has 8 PTE's <VPN,PPN>
that are searched sequentially
If the first hash slot fails, an alternate hash
function is used to look in another slot
All these steps are done in hardware!
Hashed Table is typically 2 to 3 times larger
than the number of physical pages
The full backup Page Table is a software
data structure

Primary
Memory
18

Page 18

18

Interrupts:

altering the normal flow of control

program

Ii-1

HI1

Ii

HI2

Ii+1

HIn

interrupt
handler

An external or internal event that needs to be processed


by another (system) program. The event is usually
unexpected or rare from programs point of view.
19

Page 19

19

Causes of Interrupts
An interrupt is an event that requests the attention of
the processor
Asynchronous: an external event
input/output device service-request
timer expiration
power disruptions, hardware failure
Synchronous: an internal event (a.k.a exceptions)
undefined opcode, privileged instruction
arithmetic overflow, FPU exception
misaligned memory access
virtual memory exceptions: page faults,
TLB misses, protection violations
traps: system calls, e.g., jumps into kernel

20

Page 20

20

Asynchronous Interrupts:
invoking the interrupt handler

An I/O device requests attention by asserting one of


the prioritized interrupt request lines
When the processor decides to process the interrupt
it stops the current program at instruction Ii ,
completing all the instructions up to Ii-1
(precise interrupt)
it saves the PC of instruction Ii in a special
register (EPC)
disables interrupts and transfers control to a
designated interrupt handler running in kernel
mode

21

Page 21

21

Interrupt Handler
To allow nested interrupts, EPC is saved before
enabling interrupts
need an instruction to move EPC into GPRs
need a way to mask further interrupts at least
until EPC can be saved
A status register indicates the cause of the interrupt it must be visible to an interrupt handler
The return from an interrupt handler is a simple
indirect jump but usually involves
enabling interrupts
restoring the processor to the user mode
restoring hardware status and control state
a special return-from-exception instruction (RFE)

22

Page 22

22

Synchronous Interrupts
A synchronous interrupt (exception) is caused by
a particular instruction
In general, the instruction cannot be completed
and needs to be restarted after the exception has
been handled
requires undoing the effect of one or more
partially executed instructions
In case of a trap (system call), the instruction is
considered to have been completed
a special jump instruction involving a
change to privileged kernel mode
23

Page 23

23

Exception Handling (5-Stage Pipeline)

PC

Inst.
Mem

Decode

Illegal
Opcode

Overflow

PC Address
Exceptions

Data
Mem

Data Address
Exceptions

Asynchronous
Interrupts

How to handle multiple simultaneous exceptions in


different pipeline stages?
How and where to handle external asynchronous
interrupts?

24

Page 24

24

Exception Handling (5-Stage Pipeline)


Commit
Point

PC

Select
Handler
PC

Inst.
Mem

Decode

Illegal
Opcode
PC Address
Exceptions

Kill F
Stage

Overflow

Exc
D
PC
D

Exc
E

Kill D
Stage

PC
E

Data
Mem

Data Address Kill


Exceptions Writeback

Exc
M

Kill E
Stage

PC
M

Cause

EPC
Asynchronous
Interrupts
25

Page 25

25

Exception Handling (5-Stage Pipeline)


Hold exception flags in pipeline until commit point
(M stage)
Exceptions in earlier pipe stages override later
exceptions for a given instruction
Inject external interrupts at commit point (override
others)
If exception at commit: update Cause and EPC
registers, kill all stages, inject handler PC into fetch
stage
26

Page 26

26

Virtual Memory Use Today


Desktops/servers have full demand-paged virtual memory
Portability between machines with different memory sizes
Protection between multiple users or multiple tasks
Share small physical memory among active tasks
Simplifies implementation of some OS features
Vector supercomputers have translation and protection but not demand-paging
(Crays: base&bound, Japanese: pages)
Dont waste expensive CPU time thrashing to disk (make jobs fit in memory)
Mostly run in batch mode (run set of jobs that fits in memory)
More difficult to implement restartable vector instructions
Most embedded processors and DSPs provide physical addressing only
Cant afford area/speed/power budget for virtual memory support
Often there is no secondary storage to swap to!
Programs custom written for particular memory configuration in product
Difficult to implement restartable instructions for exposed architectures27

Page 27

27

Asanovic/Devadas

Spring 2002

6.823

Complex Pipelining

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Asanovic/Devadas

Spring 2002

6.823

Complex Pipelining: Motivation


Pipelining becomes complex when we want high
performance in the presence of
Memory systems with variable access time

Long latency or partially pipelined floating-point


units
Multiple function and memory units

Realistic Memory Systems

Asanovic/Devadas

Spring 2002

6.823

Latency of access to the main memory is usually


much greater than one cycle and often unpredictable
Solving this problem is a central issue in computer
architecture
Common solutions
separate instruction and data memory ports
and buffers
no self-modifying code
caches
single cycle except in case of a miss stall
interleaved memory
multiple memory accesses stride conflicts
split-phase memory operations
out-of-order responses

Floating Point Unit

Asanovic/Devadas

Spring 2002

6.823

Much more hardware than an integer unit


Single-cycle floating point units are a bad idea - why?
it is common to have several floating point units
it is common to have different types of FPU's
Fadd, Fmul, Fdiv, ...
an FPU may be pipelined, partially pipelined or not
pipelined
To operate several FPUs concurrently the register
file needs to have more read and write ports

Function Unit Characteristics


fully

pipelined
partially

pipelined

Asanovic/Devadas

Spring 2002

6.823

busy 1cyc 1cyc 1cyc accept

busy

2 cyc

2 cyc

accept

Function units have internal pipeline registers


operands are latched when an instruction
enters a function unit
inputs to a function unit (e.g., register file)
can change during a long latency operation

Asanovic/Devadas

Spring 2002

6.823

Multiple Function Units


ALU
IF

ID

Issue
GPRs
FPRs

Mem
WB

Fadd
Fmul

Fdiv

Floating Point ISA

Asanovic/Devadas

Spring 2002

6.823

Interaction between the Floating point datapath


and the Integer datapath is determined largely
by the ISA
DLX ISA
separate register files for FP and Integer
instructions
separate load/store for FPRs and GPRs
but both use GPRs for address calculation
separate conditions for branches
FP branches are defined in terms of
condition codes
the only interaction is via a set of move
instructions (some ISAs dont even permit this)

New Possibilities of Hazards

Asanovic/Devadas

Spring 2002

6.823

Structural conflicts at the write-back stage due to


variable latencies of different function units
Structural conflicts at the execution stage if some
FPU or memory unit is not pipelined and takes more
than one cycle
Out-of-order write hazards due to variable latencies
of different function units
appropriate pipeline interlocks can resolve
all these hazards

Asanovic/Devadas

Spring 2002

6.823

Write-Back Stage Arbitration


Is the latency of each function unit fixed and known?
Yes - structural hazards can be avoided by
delaying the instruction from entering the
execution phase
No - WB stage has to be arbitrated dynamically
WB stage may exert back pressure on
a function unit
Issue may not dispatch an instruction
into that unit until the unit is free
We will assume that the latencies are not known
(This is the more general case)

Asanovic/Devadas

Spring 2002

6.823

WB and Back Pressure


WB

Issue
1cyc

GPR
1cyc 1cyc 1cyc

Interconnect

FPR
2 cyc

2 cyc

Conflict?

CDC 6600 Seymour Cray, 1963

Asanovic/Devadas
Spring 2002
6.823

A fast pipelined machine with 60-bit words


(128 Kword main memory capacity, 32 banks)
Ten functional units (parallel, unpipelined)
Floating Point adder, 2 multipliers, divider
Integer adder, 2 incrementers, ...
Hardwired control (no microcoding)
Dynamic scheduling of instructions using a
scoreboard
Ten Peripheral Processors for Input/Output
- a fast multi-threaded 12-bit integer ALU
Very fast clock, 10 MHz (FP add in 4 clocks)
>400,000 transistors, 750 sq. ft., 5 tons, 150 kW,
novel freon-based technology for cooling
Fastest machine in world for 5 years (until 7600)
over 100 sold ($7-10M each)

IBM Memo on CDC6600

Asanovic/Devadas

Spring 2002

6.823

Thomas Watson Jr., IBM CEO, August 1963:


Last week, Control Data ... announced the
6600 system. I understand that in the
laboratory developing the system there are
only 34 people including the janitor. Of these,
14 are engineers and 4 are programmers...

Contrasting this modest effort with our vast

development activities, I fail to understand why


we have lost our industry leadership position
by letting someone else offer the world's most
powerful computer.
To which Cray replied: It seems like Mr. Watson
has answered his own question.

Asanovic/Devadas

Spring 2002

6.823

CDC 6600: Datapath


Operand Regs
8 x 60-bit
operand

10 Functional
Units

result
Central
Memory

Address Regs
8 x 18-bit
load
addr
store
addr

Index Regs
8 x 18-bit

IR
Inst. Stack
8 x 60-bit

CDC 6600:

A Load/Store Architecture

Asanovic/Devadas

Spring 2002

6.823

Separate instructions to manipulate three types of reg.


8 60-bit data registers (X)
8 18-bit address registers (A)
8 18-bit index registers (B)
All arithmetic and logic instructions are reg-to-reg
6

opcode

Ri (Rj) op (Rk)

Only Load and Store instructions refer to memory!


6

18

opcode

disp

Ri M[(Rj) + disp]

Touching address registers 1 to 5 initiates a load

6 to 7 initiates a store
- very useful for vector operations

CDC6600: Vector Addition


B0 - n

loop: JZE B0, exit


A0 B0 + a0
A1 B0 + b0
X6 X0 + X1
A6 B0 + c0
B0 B0 + 1
jump loop
exit:

load X0

load X1

store X6

Asanovic/Devadas

Spring 2002

6.823

Asanovic/Devadas

Spring 2002

6.823

Dependence Analysis

Types of Data Hazards

Asanovic/Devadas
Spring 2002
6.823

Consider executing a sequence of


rk (ri) op (rj)
type of instructions
Data-dependence
r3 (r1) op (r2)
r5 (r3) op (r4)

Read-after-Write
(RAW) hazard

Anti-dependence
r3 (r1) op (r2)
r1 (r4) op (r5)

Write-after-Read
(WAR) hazard

Output-dependence
r3 (r1) op (r2)
r3 (r6) op (r7)

Write-after-Write
(WAW) hazard

Detecting Data Hazards

Asanovic/Devadas
Spring 2002
6.823

Range and Domain of instruction i


R(i) = Registers (or other storage) modified
by instruction i
D(i) = Registers (or other storage) read
by instruction i
Suppose instruction j follows instruction i in the
program order. Executing instruction j before the
effect of instruction i has taken place can cause a
RAW hazard if

R(i) D(j)

WAR hazard if

D(i) R(j)

WAW hazard if

R(i) R(j)

Register vs. Memory

Data Dependence

Asanovic/Devadas

Spring 2002

6.823

Data hazards due to register operands can be


determined at the decode stage
but
data hazards due to memory operands can be
determined only after computing the effective address
store
load

M[(r1) + disp1] (r2)


r3 M[(r4) + disp2]

Does (r1 + disp1) = (r4 + disp2) ?

Data Hazards: An Example


worksheet

I1

DIVD

f6,

f6,

I2

LD

f2,

45(r3)

I3

MULTD

f0,

f2,

f4

I4

DIVD

f8,

f6,

f2

I5

SUBD

f10,

f0,

f6

I6

ADDD

f6,

f8,

f2

RAW Hazards

WAR Hazards

WAW Hazards

f4

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

Instruction Scheduling
I1

DIVD

f6,

f6,

f4

I2

LD

f2,

45(r3)

I3

MULTD

f0,

f2,

f4

I4

DIVD

f8,

f6,

f2

I1

I5

SUBD

f10,

f0,

f6

I6

ADDD

f6,

f8,

f2

I2
I3
I4

Valid orderings:
in-order

I1

I2

I3

I4

I5

I6

out-of-order

I2

I1

I3

I4

I5

I6

out-of-order

I1

I2

I3

I5

I4

I6

I5
I6

Asanovic/Devadas
Spring 2002
6.823

Out-of-order Completion
In-order Issue

f4

Latency
4

I1

DIVD

f6,

f6,

I2

LD

f2,

45(r3)

I3

MULTD

f0,

f2,

f4

I4

DIVD

f8,

f6,

f2

I5

SUBD

f10,

f0,

f6

I6

ADDD

f6,

f8,

f2

in-order comp

1 2

1 2 3 4

out-of-order comp

1 2 2 3 1 4 3 5 5 4 6 6

3 5 4 6 5 6

Asanovic/Devadas

Spring 2002

6.823

Scoreboard:

A Data Structure to Detect Hazards

When is it Safe to Issue an

Instruction?

Asanovic/Devadas
Spring 2002
6.823

For each instruction at the Issue stage the following


checks need to be made
Is the required function unit available?
Is the input data available? RAW?
Is it safe to write the destination? WAR?
WAW?
Is there a structural conflict at the WB stage?
Assume there is only one instruction in the Issue stage
and if it cannot be issued then the Instruction Fetch and
Decode stall

Asanovic/Devadas

Spring 2002

6.823

A Data Structure for Correct Issues


Keeps track of the status of Functional Units
Name
Busy
Op
Dest Src1 Src2
Int
Mem
Add1
Add2
Add3
Mult1
Mult2
Div

The instruction i at the Issue stage consults this table


FU available?
RAW?
WAR?
WAW?

check the busy column

search the dest column for is sources

search the source columns for is destination

search the dest column for is destination

An entry is added to the table if no hazard is detected;


An entry is removed from the table after Write-Back

Simplifying the Data Structure

Assuming In-order Issue

Asanovic/Devadas

Spring 2002

6.823

Suppose the instruction is not dispatched by the Issue

stage if a RAW hazard exists or the required FU is busy

Can the dispatched instruction cause a


WAR hazard ?
WAW hazard ?

Simplifying the Data Structure

Assuming In-order Issue

Asanovic/Devadas

Spring 2002

6.823

Suppose the instruction is not dispatched by the Issue

stage if a RAW hazard exists or the required FU is busy

Can the dispatched instruction cause a


WAR hazard ?
NO: Operands read at issue
WAW hazard ?
YES: Out-of-order completion

Asanovic/Devadas

Spring 2002

6.823

Simplifying the Data Structure ...


No WAR hazard no need to keep src1 and src2
The Issue stage does not dispatch an instruction in
case of a WAW hazard
a register name can occur at most once in
the dest column

WP[reg#] : a bit-vector to record the registers for which


writes are pending
These bits are set to true by the Issue stage and set to
false by the WB stage
Each pipeline stage in the FU's must carry the
dest field and a flag to indicate if it is valid
the (we, ws) pair

Scoreboard for In-order Issues

Asanovic/Devadas

Spring 2002

6.823

Busy[unit#] : a bit-vector to indicate units availability.


(unit = Int, Add, Mult, Div)
These bits are hardwired to FU's.

WP[reg#] : a bit-vector to record the registers for which


writes are pending
Issue checks the instruction (opcode dest src1 src2)
against the scoreboard (Busy & WP) to dispatch
FU available?
RAW?
WAR?
WAW?

not Busy[FU#]

WP[src1] or WP[src2]

cannot arise

WP[dest]

Scoreboard Dynamics
Functional Unit Status

Registers Reserved
WB
for Writes

Int(1) Add(1) Mult(3) Div(4)


t0 I1
f6
t1 I2 f2
f6
t2
f6
f2
t3 I3
f0
f6
t4
f0
f6
t5 I4
f0 f8
t6
f8
f0
t7 I5
f10
f8
t8
f8 f10
t9
f8
t10 I6
f6
t11
f6
I1
I2
I3
I4
I5
I6

DIVD
LD
MULTD
DIVD
SUBD
ADDD

f6,
f2,
f0,
f8,
f10,
f6,

f6,
45(r3)
f2,
f6,
f0,
f8,

Asanovic/Devadas

Spring 2002

6.823

f4
f4
f2
f6
f2

f6
f6, f2
f6, f2
f6, f0
f6, f0
f0, f8
f0, f8
f8, f10
f8, f10
f8
f6
f6

I2
I1
I3
I5
I4
I6

Asanovic/Devadas
Spring 2002
6.823

Out-of-Order Execution &


Register Renaming

Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology

Scoreboard for In-order Issue

Asanovic/Devadas
Spring 2002
6.823

Busy[unit#] : a bit-vector to indicate units availability.


(unit = Int, Add, Mult, Div)
These bits are hardwired to FU's.
WP[reg#] : a bit-vector to record the registers for which
writes are pending
Issue checks the instruction (opcode dest src1 src2)
against the scoreboard (Busy & WP) to dispatch
FU available?
RAW?
WAR?
WAW?

not Busy[FU#]
WP[src1] or WP[src2]
cannot arise
WP[dest]

Out-of-Order Dispatch
ALU
IF

ID

Issue

Asanovic/Devadas
Spring 2002
6.823

Mem
WB

Fadd
Fmul

Issue stage buffer holds multiple instructions waiting


to issue.
Decode adds next instruction to buffer if there is
space and the instruction does not cause a WAR
or WAW hazard.
Any instruction in buffer whose RAW hazards are
satisfied can be dispatched (for now, at most one
dispatch per cycle). On a write back (WB), new
instructions may get enabled.

Asanovic/Devadas
Spring 2002
6.823

Out-of-Order Issue: an example


1

LD

F2,

34(R2)

latency
1

LD

F4,

45(R3)

long

MULTD

F6,

F4,

F2

SUBD

F8,

F2,

F2

DIVD

F4,

F2,

F8

ADDD

F10,

F6,

F4

5
6

In-order:
1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6
Out-of-order did not allow any significant improvement !

Asanovic/Devadas
Spring 2002
6.823

How many Instructions can be


in the pipeline
Which features of an ISA limit the number of
instructions in the pipeline?

Which features of a program limit the number of


instructions in the pipeline?

Overcoming the Lack of


Register Names

Asanovic/Devadas
Spring 2002
6.823

Number of registers in an ISA limits the number


of partially executed instructions in complex pipelines
Floating Point pipelines often cannot be kept filled with
small number of registers.
IBM 360 had only 4 Floating Point Registers
Can a microarchitecture use more registers than
specified by the ISA without loss of ISA compatibility ?
Robert Tomasulo of IBM suggested an
ingenious solution in 1967 based on on-the-fly
register renaming

Instruction-Level Parallelism
with Renaming
1

LD

F2,

34(R2)

LD

F4,

45(R3)

latency
1

Asanovic/Devadas
Spring 2002
6.823

long

MULTD

F6,

F4,

F2

SUBD

F8,

F2,

F2

DIVD

F4,

F2,

F8

ADDD

F10,

F6,

F4

X
5
6

In-order:
1 (2,1) . . . . . . 2 3 4 4 (5,3) . . . 5 6 6
Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6
Any antidependence can be eliminated by renaming
renaming additional storage
Can it be done in hardware? yes!

Register Renaming
ALU
IF

ID

ROB

Asanovic/Devadas
Spring 2002
6.823

Mem
WB

Fadd
Fmul

Decode does register renaming and adds instructions


to the issue stage reorder buffer (ROB).
renaming makes WAR or WAW hazards
impossible
Any instruction in ROB whose RAW hazards have
been satisfied can be dispatched.
Out-of order or dataflow execution

Renaming & Out-of-order Issue

Asanovic/Devadas
Spring 2002
6.823

An example

Renaming table
p

data

Reorder buffer
Ins# use exec op p1

F1
F2
F3
F4
F5
F6
F7
F8

src1 p2 src2

t1
t2
.
.
.

data / ti
1
2
3
4
5
6

LD
LD
MULTD
SUBD
DIVD
ADDD

F2,
F4,
F6,
F8,
F4,
F10,

34(R2)
45(R3)
F4,
F2,
F2,
F6,

F2
F2
F8
F4

When are names in sources


replaced by data?
When can a name be reused?

Asanovic/Devadas
Spring 2002
6.823

Data-Driven Execution
Renaming
table &
reg file
Reorder
buffer
Replacing the
tag by its value
is an expensive
operation

Ins# use exec op

Load
Unit

FU

p1

src1 p2 src2

FU

t1
t2
.
.
tn

Store
Unit

< t, result >

Instruction template (i.e., tag t) is allocated by the


Decode stage, which also stores the tag in the reg file
When an instruction completes, its tag is deallocated

Asanovic/Devadas
Spring 2002
6.823

Simplifying Allocation/Deallocation
Reorder buffer
Ins# use exec op p1

ptr2
next to
deallocate
prt1
next
available

src1 p2

src2
t1
t2
.
.
.

tn

Instruction buffer is managed circularly


When an instruction completes its use bit is
marked free
ptr2 is incremented only if the use bit is marked free

IBM 360/91 Floating Point Unit

Asanovic/Devadas
Spring 2002
6.823

R. M. Tomasulo, 1967

1
2
3
4
5
6

data

distribute
instruction
templates
by
functional
units

instructions

load
buffers
(from
memory)

1 p data
2
3

1
2
3
4

...

data

Floating
Point
Reg

data

1 p data
2

Adder

data

Mult

< t, result >


store
buffers
(to memory)

data

Common bus ensures that data is


made available immediately to all
the instructions waiting for it

Asanovic/Devadas
Spring 2002
6.823

Effectiveness?
Renaming and Out-of-order execution was first
implemented in 1969 in IBM 360/91 but did not show
up in the subsequent models until mid-Nineties.
Why ?
Reasons
1. Exceptions not precise!
2. Effective on a very small class of programs
One more problem needed to be solved

Precise Interrupts

Asanovic/Devadas
Spring 2002
6.823

It must appear as if an interrupt is taken between two


instructions (say Ii and Ii+1)
the effect of all instructions up to and including Ii
is totally complete
no effect of any instruction after Ii has taken place
The interrupt handler either aborts the program or
restarts it at Ii+1 .

Asanovic/Devadas
Spring 2002
6.823

Effect on Interrupts
Out-of-order Completion

I1
I2
I3
I4
I5
I6

DIVD
LD
MULTD
DIVD
SUBD
ADDD

f6,
f2,
f0,
f8,
f10,
f6,

f6,
45(r3)
f2,
f6,
f0,
f8,

f4
f4
f2
f6
f2

out-of-order comp 1 2 2 3 1 4 3 5 5 4 6 6
restore f2
restore f10
Consider interrupts
Precise interrupts are difficult to implement at high speed
- want to start execution of later instructions before
exception checks finished on earlier instructions

Exception Handling

(In-Order Five-Stage
PC

Select
Handler
PC

Inst.
Mem

Decode

Illegal
Opcode
PC Address
Exceptions

Kill F
Stage

Commit
Pipeline) Point

Overflow

Data
Mem

Asanovic/Devadas
Spring 2002
6.823

Data Address Kill


Exceptions Writeback

Exc
D

Exc
E

Exc
M

Cause

PC
D

PC
E

PC
M

EPC

Kill D
Stage

Kill E
Stage

Asynchronous
Interrupts

Hold exception flags in pipeline until commit point (M stage)


Exceptions in earlier pipe stages override later exceptions
Inject external interrupts at commit point (override others)
If exception at commit: update Cause and EPC registers, kill
all stages, inject handler PC into fetch stage

Phases of Instruction Execution

Asanovic/Devadas
Spring 2002
6.823

PC
I-cache
Fetch
Buffer
Issue
Buffer
Func.
Units
Result
Buffer
Arch.
State

Fetch: Instruction bits retrieved from


cache.
Decode: Instructions placed in appropriate
issue stage buffer (sometimes called
issue or dispatch)
Execute: Instructions and operands sent to
execution units (sometimes called issue or
dispatch). When execution completes, all
results and exception flags are available.
Commit: Instruction irrevocably updates
architectural state (sometimes called
graduation or completion).

In-Order Commit for


Precise Exceptions

Asanovic/Devadas
Spring 2002
6.823

Instructions fetched and decoded into instruction reorder buffer in-order


Execution is out-of-order ( out-of-order completion)
Commit (write-back to architectural state, regfile+memory) is in-order

Temporary storage needed to hold results before commit


(shadow registers and store buffers)
Out-of-order

In-order

Fetch

Reorder Buffer

Decode

Inject handler PC

Commit

Kill

Kill
Kill

In-order

Execute

Exception?

Asanovic/Devadas
Spring 2002
6.823

Extensions for Precise Exceptions


Instruction reorder buffer
Inst# use exec op p1

src1 p2 src2

pd dest

data cause

ptr2
next to
commit
ptr1
next
available

add <pd, dest, data, cause> fields in the instruction template


commit instructions to reg file and memory in program
order buffers can be maintained circularly
on exception, clear reorder buffer by resetting ptr1=ptr2
(stores must wait for commit before updating memory)

Rollback and Renaming

Asanovic/Devadas
Spring 2002
6.823

Register File
(now holds only
committed state)

Reorder
buffer

Ins# use exec op

Load
Unit

FU

p1

src1 p2

FU

src2

FU

pd dest

Store
Unit

data

Commit

< t, result >

Register file does not contain renaming tags any more.


How does the decode stage find the tag of a source
register?

t1
t2
.
.
tn

Asanovic/Devadas
Spring 2002
6.823

Renaming Table
Rename
Table

Reorder
buffer

r1
r2

ti

vi

Ins# use exec op

Load
Unit

FU

Register
File
p1

src1 p2

FU

src2

FU

pd dest

Store
Unit

data

t1
t2
.
.
tn

Commit

< t, result >

Renaming table is like a cache to speed up register name


look up (rename tag + valid bit per entry). It needs to be
cleared after each exception taken . When else are valid
bits cleared? ______________________________________

Asanovic/Devadas
Spring 2002
6.823

Effect of Control Transfer on


Pipelined Execution
Control transfer instructions require insertion
of bubbles in the pipeline.
The number of bubbles depends upon the number
of cycles it takes
to determine the next instruction address, and
to fetch the next instruction

Branch Penalty
Next fetch
started

PC
I-cache
Fetch
Buffer

Fetch

Decode

Issue
Buffer
Func.
Units

Branch executed

Result
Buffer
Arch.
State

Execute

Commit

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

Branch Penalties in Modern Pipelines


UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750MHz, 2000)
Fetch

A PC Generation/Mux
P Instruction Fetch Stage 1
F

Decode

B Branch Address Calc/Begin Decode


I Complete Decode
J

Execute

Instruction Fetch Stage 2

Steer Instructions to Functional units

R Register File Read


E Integer Execute
Remainder of execute pipeline (+another 6 stages)

Branch penalty: Cycles?________ Instructions?_________

Average Run-Length between


Branches

Asanovic/Devadas
Spring 2002
6.823

Average dynamic instruction mix from SPEC92:


SPECint92 SPECfp92
ALU
39 %
13 %
FPU Add
20 %
FPU Mult
13 %
load
26 %
23 %
store
9%
9%
branch
16 %
8%
other
10 %
12 %
SPECint92:
SPECfp92:

compress, eqntott, espresso, gcc , li


doduc, ear, hydro2d, mdijdp2, su2cor

What is the average run length between branches?

Asanovic/Devadas
Spring 2002
6.823

Reducing Control Transfer Penalties


Software solution
loop unrolling
Increases the run length
instruction scheduling
Compute the branch condition as early
as possible
(limited)
Hardware solution
delay slots
replaces pipeline bubbles with useful work
(requires software cooperation)
branch prediction & speculative execution
of instructions beyond the branch

Branch Prediction

Asanovic/Devadas
Spring 2002
6.823

Motivation: branch penalties limit performance of deeply


pipelined processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures: branch history tables, branch
target buffers, etc.
Mispredict recovery mechanisms:
In-order machines: kill instructions following
branch in pipeline
Out-of-order machines: shadow registers and
memory buffers for each speculated branch

DLX Branches and Jumps


Instruction
BEQZ/BNEZ
J
JR

Taken known?
After Reg. Fetch
Always Taken
Always Taken

Asanovic/Devadas
Spring 2002
6.823

Target known?
After Inst. Fetch
After Inst. Fetch
After Reg. Fetch

Must know (or guess) both target address and


whether taken to execute branch/jump.

Asanovic/Devadas
Spring 2002
6.823

Branch Prediction &


Speculative Execution

Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology

Average Run-Length between


Branches

Asanovic/Devadas
Spring 2002
6.823

Average dynamic instruction mix from SPEC92:


SPECint92 SPECfp92
ALU
39 %
13 %
FPU Add
20 %
FPU Mult
13 %
load
26 %
23 %
store
9%
9%
branch
16 %
8%
other
10 %
12 %
SPECint92:
SPECfp92:

compress, eqntott, espresso, gcc , li


doduc, ear, hydro2d, mdijdp2, su2cor

What is the average run length between branches?

Asanovic/Devadas
Spring 2002
6.823

Reducing Control Transfer Penalties


Software solution
loop unrolling
Increases the run length
instruction scheduling
Compute the branch condition as early
as possible
(limited)
Hardware solution
delay slots
replaces pipeline bubbles with useful work
(requires software cooperation)
branch prediction & speculative execution
of instructions beyond the branch

Asanovic/Devadas
Spring 2002
6.823

Effect of Control Transfer on


Pipelined Execution
Control transfer instructions require insertion
of bubbles in the pipeline.
The number of bubbles depends upon the number
of cycles it takes
to determine the next instruction address, and
to fetch the next instruction

DLX Branches and Jumps

Asanovic/Devadas
Spring 2002
6.823

Must know (or guess) both target address and


whether taken to execute branch/jump.
Instruction

Taken known?

Target known?

BEQZ/BNEZ
J
JR

After Reg. Fetch


Always Taken
Always Taken

After Inst. Fetch


After Inst. Fetch
After Reg. Fetch

Branch Penalty
Next fetch
started

PC
I-cache
Fetch
Buffer

Fetch

Decode

Issue
Buffer
Func.
Units

Branch executed

Result
Buffer
Arch.
State

Execute

Commit

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

Branch Penalties in Modern Pipelines


UltraSPARC-III instruction fetch pipeline stages
(in-order issue, 4-way superscalar, 750MHz, 2000)
A PC Generation/Mux
P Instruction Fetch Stage 1
Branch
Target
Address
Known
Branch
Direction &
Jump
Register
Target Known

Instruction Fetch Stage 2

B Branch Address Calc/Begin Decode


I Complete Decode
J

Steer Instructions to Functional units

R Register File Read


E Integer Execute
Remainder of execute pipeline (+ another 6 stages)

Branch Prediction

Asanovic/Devadas
Spring 2002
6.823

Motivation: branch penalties limit performance of deeply


pipelined processors
Modern branch predictors have high accuracy
(>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures: branch history tables, branch
target buffers etc.
Mispredict recovery mechanisms:
In-order machines: kill instructions following
branch in pipeline
Out-of-order machines: shadow registers and
memory buffers for each speculated branch

Static Branch Prediction

Asanovic/Devadas
Spring 2002
6.823

(Encode prediction as part of branch instruction)


Probability a
backward
branch is taken
90%
(~60-70% overall):

JZ

forward
50%

JZ

Can predict all taken,


or backwards taken/forward not-taken (use offset sign bit)
ISA can attach additional semantics to branches about
preferred direction, e.g., Motorola MC88110
bne0 (preferred taken)
beq0 (not taken)
ISA can allow arbitrary choice of statically predicted
direction (HP PA-RISC, Intel IA-64)

Asanovic/Devadas
Spring 2002
6.823

Dynamic Branch Prediction


learning based on past behavior

Temporal correlation
The way a branch resolves may be a good predictor
of the way it will resolve at the next execution
Spatial correlation
Several branches may resolve in a highly correlated
manner (a preferred path of execution)

Branch Prediction Bits

Asanovic/Devadas
Spring 2002
6.823

Assume 2 BP bits per instruction


Change the prediction after two consecutive mistakes!

taken
taken

take
right

take
wrong
taken

taken
take
right

taken

taken
taken

take
wrong

taken

BP state:
(predict take/take) x (last prediction right/wrong)

Branch History Table


Fetch PC

00
k

I-Cache
Instruction
Opcode

Asanovic/Devadas
Spring 2002
6.823

BHT Index

2k-entry
BHT,
2 bits/entry

offset
+

Branch?

Target PC

Taken/Taken?

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

Exploiting Spatial Correlation

Asanovic/Devadas
Spring 2002
6.823

Yeh and Patt, 1992

if (x[i] < 7) then


y += 1;
if (x[i] < 5) then
c -= 4;
If first condition false, second condition also false
History bit: H records the direction of the last branch
executed by the processor
Two sets of BHT bits (BHT0 & BHT1) per branch
instruction
H = 0 (not taken)
H = 1 (taken)

consult BHT0
consult BHT1

Two-Level Branch Predictor

Asanovic/Devadas
Spring 2002
6.823

Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~95% correct)
00
Fetch PC

2-bit global branch


history shift register
Shift in
Taken/Taken
results of each
branch
Taken/Taken?

Limitations of BHTs

Asanovic/Devadas
Spring 2002
6.823

Cannot redirect fetch stream until after branch instruction


is fetched and decoded, and target address determined
Correctly
predicted taken
branch penalty
Jump Register
penalty

A PC Generation/Mux
P Instruction Fetch Stage 1
F

Instruction Fetch Stage 2

B Branch Address Calc/Begin Decode


I Complete Decode
J

Steer Instructions to Functional units

R Register File Read


E Integer Execute
Remainder of execute
pipeline (+ another 6 stages)

(UltraSPARC-III fetch pipeline)

Branch Target Buffer (BTB)


I-Cache

Asanovic/Devadas
Spring 2002
6.823

2k-entry direct-mapped BTB


(can also be associative)

PC

Entry PC

Valid

predicted
target PC

=
match
valid target
Keep both the branch PC and target PC in the BTB
PC+4 is fetched if match fails
Only taken branches and jumps held in BTB
Next PC determined before branch fetched and decoded

Combining BTB and BHT

Asanovic/Devadas
Spring 2002
6.823

BTB entries are considerably more expensive than


BHT, but can redirect fetches at earlier stage in
pipeline and can accelerate indirect branches (JR)
BHT can hold many more entries and is more accurate

BTB

A PC Generation/Mux
P Instruction Fetch Stage 1
F

BHT in later
pipeline stage
corrects when
BTB misses a
predicted
taken branch

BHT

Instruction Fetch Stage 2

B Branch Address Calc/Begin Decode


I Complete Decode
J

Steer Instructions to Functional units

R Register File Read


E Integer Execute

BTB/BHT only updated after branch resolves in E stage

Uses of Jump Register (JR)

Asanovic/Devadas
Spring 2002
6.823

Switch statements (jump to address of matching case)

Dynamic function call (jump to run-time function


address)

Subroutine returns (jump to return address)

How well does BTB work for each of these cases?

BTB Performance

Asanovic/Devadas
Spring 2002
6.823

Switch statements (jump to address of matching case)


BTB works well if same case used repeatedly
Dynamic function call (jump to run-time function
address)
BTB works well if same function usually called,
(e.g., in C++ programming, when objects have
same type in virtual function call)
Subroutine returns (jump to return address)
BTB works well if usually return to the same place
Often one function called from many different
call sites!

Subroutine Return Stack

Asanovic/Devadas
Spring 2002
6.823

Small structure to accelerate JR for subroutine returns,


typically much more accurate than BTBs.
fa() { fb(); }
fb() { fc(); }
fc() { fd(); }
Pop return address
when subroutine return
decoded

Push call address when


function call executed

&fd()
&fc()
&fb()

k entries
(typically k=8-16)

Speculating Both Directions

Asanovic/Devadas
Spring 2002
6.823

An alternative to branch prediction is to execute both


directions of a branch speculatively
resource requirement is proportional to the
number of concurrent speculative executions
only half the resources engage in useful work
when both directions of a branch are executed
speculatively
branch prediction takes less resources
than speculative execution of both paths
With accurate branch prediction, it is more cost effective
to dedicate all resources to the predicted direction

Mispredict Recovery

Asanovic/Devadas
Spring 2002
6.823

In-order execution machines:


Assume no instruction issued after branch can
write-back before branch resolves
Kill all instructions in pipeline behind mispredicted
branch

Out-of-order execution?
Multiple instructions following branch in program
order can complete before branch resolves

In-Order Commit for


Precise Exceptions

Asanovic/Devadas
Spring 2002
6.823

Instructions fetched and decoded into instruction reorder buffer in-order


Execution is out-of-order ( out-of-order completion)
Commit (write-back to architectural state, regfile+memory) is in-order

Temporary storage in ROB holds results before commit

Out-of-order

In-order

Fetch

Reorder Buffer

Decode

Inject handler PC

Commit

Kill

Kill
Kill

In-order

Execute

Exception!

ROB for Precise Exceptions


ptr2
next to
commit

Inst# use exec op p1

src1 p2 src2

pd dest

Asanovic/Devadas
Spring 2002
6.823

data cause

ptr1
next
available

add <pd, dest, data, cause> fields in the instruction template


commit instructions to reg file and memory in program
order buffers can be maintained circularly
on exception, clear reorder buffer by resetting ptr1=ptr2
(stores must wait for commit before updating memory)

Branch Misprediction Recovery


ptr2
next to
commit

Inst# use exec op p1

rollback
next
available
ptr1
next
available

src1 p2 src2

pd dest

Asanovic/Devadas
Spring 2002
6.823

data cause

BEQZ
Speculative Instructions

On mispredict
Roll back next available pointer to just after branch
Reset use bits
Flush mis-speculated instructions from pipelines
Restart fetch on correct branch path

Asanovic/Devadas
Spring 2002
6.823

Branch Misprediction in Pipeline

restart fetch on correct path


Branch
Prediction

kill
kill

Branch
Resolution

kill
PC

Fetch

Decode

Reorder Buffer

Commit

Complete

Execute
Can have multiple unresolved branches in ROB
Can resolve branches out-of-order

Killing Speculative Instructions

Asanovic/Devadas
Spring 2002
6.823

Each instruction tagged with single speculative bit,


carried throughout all pipelines
Decode stage stalls if second branch encountered
before first speculative branch resolves
When speculative branch resolves:
Prediction incorrect, kill all instructions tagged as
speculative
Prediction correct, clear speculative bit on all
instructions
To allow speculation past multiple branches, add
multiple bits per instruction indicating on which
outstanding branches it is speculative
speculation bits reclaimed when
corresponding branch resolves and is oldest
speculative branch in machine (manage
allocation of speculation tag bits circularly)

Recovering Renaming Table


Rename
Table

Reorder
buffer

r1
r2

ti

ti

ti

ti

v
vi i

v
vi i

Ins# use exec op

Load
Unit

FU

Asanovic/Devadas
Spring 2002
6.823

Register
Rename
Snapshots File
p1

src1 p2

FU

src2

FU

pd dest

Store
Unit

data

Commit

< t, result >

Take snapshot of register rename table at each predicted


branch, recover earlier snapshot if branch mispredicted

t1
t2
.
.
tn

Asanovic/Devadas
Spring 2002
6.823

Advanced Superscalar
Architectures

Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology

Physical Register Renaming

Asanovic/Devadas
Spring 2002
6.823

(single physical register file: MIPS R10K, Alpha 21264, Pentium-4)


During decode, instructions allocated new physical destination register
Source operands renamed to physical register with newest value
Execution unit only sees physical register numbers

ld r1, (r3)
add r3, r1, #4
sub r6, r7, r9
add r3, r3, r6
ld r6, (r1)
add r6, r6, r3
st r6, (r1)
ld r6, (r11)

Rename

ld P1, (Px)
add P2, P1,
sub P3, Py,
add P4, P2,
ld P5, (P1)
add P6, P5,
st P6, (P1)
ld P7, (Pw)

#4
Pz
P3
P4

Asanovic/Devadas
Spring 2002
6.823

Physical Register File


r1
r2

ti
tj

Rename
Table

t1
t2
.
tn

Snapshots for
mispredict recovery

Load
Unit

FU

FU

Reg
File

FU

Store
Unit

(ROB not shown)


< t, result >
One regfile for both committed and speculative values (no data in ROB)
During decode, instruction result allocated new physical register, source
regs translated to physical regs through rename table
Instruction reads data from regfile at start of execute (not in decode)
Write-back updates reg. busy bits on instructions in ROB (assoc. search)
Snapshots of rename table taken at every branch to recover mispredicts
On exception, renaming undone in reverse order of issue (MIPS R10000)

Speculative and Out-of-Order


Execution

Asanovic/Devadas
Spring 2002
6.823

Update predictors

kill
Branch
Prediction

PC

Fetch

kill

Branch
Resolution

kill

Decode &
Rename

kill

Out-of-Order

Reorder Buffer

In-Order
Commit

In-Order
Physical Reg. File
Branch
ALU MEM
Unit
Execute

Store
Buffer

D$

Lifetime of Physical Registers

Asanovic/Devadas
Spring 2002
6.823

Physical regfile holds committed and speculative values


Physical registers decoupled from ROB entries
(no data in ROB)
ld r1, (r3)
add r3, r1, #4
sub r6, r7, r9
add r3, r3, r6
ld r6, (r1)
add r6, r6, r3
st r6, (r1)
ld r6, (r11)

Rename

ld P1, (Px)
add P2, P1,
sub P3, Py,
add P4, P2,
ld P5, (P1)
add P6, P5,
st P6, (P1)
ld P7, (Pw)

#4
Pz
P3
P4

When can we reuse a physical register?


When next write of same architectural register commits

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

P8
P7
P5
P6

P0
P1
P2
P3
P4
P5
P6
P7
P8

<R6>
<R7>
<R3>
<R1>

p
p
p
p

p2

Rd

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

Pn

ROB
use ex

Physical Regs

Asanovic/Devadas
Spring 2002
6.823

op

p1

PR1

PR2

LPRd

PRd

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

Physical Regs
P0
P1
P2
P3
P4
P5
P6
P7
P8

P8 P0
P7
P5
P6

p
p
p
p

p2

Rd

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

Pn

ROB
use ex

<R6>
<R7>
<R3>
<R1>

Asanovic/Devadas
Spring 2002
6.823

op

ld

p1

PR1

P7

PR2

r1

LPRd

P8

PRd

P0

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

P8 P0
P7 P1
P5
P6

x
x

P0
P1
P2
P3
P4
P5
P6
P7
P8

<R6>
<R7>
<R3>
<R1>

p
p
p
p

p2

Rd

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

Pn

ROB
use ex

Physical Regs

Asanovic/Devadas
Spring 2002
6.823

op

p1

ld
p
add

PR1

P7
P0

PR2

r1
r3

LPRd

P8
P7

PRd

P0
P1

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

P8 P0
P7 P1
P5 P3
P6

x
x
x

P0
P1
P2
P3
P4
P5
P6
P7
P8

<R6>
<R7>
<R3>
<R1>

p
p
p
p

p2

PR2

Rd

P5

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

Pn

ROB
use ex

Physical Regs

Asanovic/Devadas
Spring 2002
6.823

op

p1

ld
p
add
sub p

PR1

P7
P0
P6

r1
r3
r6

LPRd

P8
P7
P5

PRd

P0
P1
P3

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

P8 P0
P7 P1 P2
P5 P3
P6

x
x
x
x

P0
P1
P2
P3
P4
P5
P6
P7
P8

<R6>
<R7>
<R3>
<R1>

p
p
p
p

p2

PR2

Rd

P5
P3

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

Pn

ROB
use ex

Physical Regs

Asanovic/Devadas
Spring 2002
6.823

op

p1

ld
p
add
sub p
add

PR1

P7
P0
P6
P1

r1
r3
r6
r3

LPRd

P8
P7
P5
P1

PRd

P0
P1
P3
P2

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

P8 P0
P7 P1 P2
P5 P3 P4
P6

x
x
x
x
x

P0
P1
P2
P3
P4
P5
P6
P7
P8

<R6>
<R7>
<R3>
<R1>

p
p
p
p

p2

PR2

Rd

P5
P3

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

Pn

ROB
use ex

Physical Regs

Asanovic/Devadas
Spring 2002
6.823

op

p1

ld
p
add
sub p
add
ld

PR1

P7
P0
P6
P1
P0

r1
r3
r6
r3
r6

LPRd

P8
P7
P5
P1
P3

PRd

P0
P1
P3
P2
P4

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

Physical Regs
P0
P1
P2
P3
P4
P5
P6
P7
P8

P8 P0
P7 P1 P2
P5 P3 P4
P6

x
x
x
x
x

<R6>
<R7>
<R3>
<R1>

p
p
p
p

p2

PR2

Rd

P5
P3

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

P8

Pn

ROB
use ex

<R1>

Asanovic/Devadas
Spring 2002
6.823

op

ld
add
sub
add
ld

p1

p
p
p
p

PR1

P7
P0
P6
P1
P0

r1
r3
r6
r3
r6

LPRd

P8
P7
P5
P1
P3

PRd

P0
P1
P3
P2
P4

Execute &
Commit

(LPRd requires third read port on Rename Table for each instruction)

Physical Register Management


Rename
Table
R0
R1
R2
R3
R4
R5
R6
R7

Physical Regs
P0
P1
P2
P3
P4
P5
P6
P7
P8

P8 P0
P7 P1 P2
P5 P3 P4
P6

x
x
x
x
x

x
x

p
p

<R6>
<R7>
<R3>

p
p
p

p2

PR2

Rd

P5
P3

Free List
P0
P1
P3
P2
P4

ld r1, 0(r3)
add r3, r1, #4
sub r6, r7, r6
add r3, r3, r6
ld r6, 0(r1)

P8
P7

Pn

ROB
use ex

<R1>
<R3>

Asanovic/Devadas
Spring 2002
6.823

op

ld
add
sub
add
ld

p1

p
p
p
p
p

PR1

P7
P0
P6
P1
P0

r1
r3
r6
r3
r6

LPRd

P8
P7
P5
P1
P3

PRd

P0
P1
P3
P2
P4

Execute &
Commit

(LPRd requires third read port on Rename Table for each instruction)

Reorder Buffer Holds


Active Instruction Window
(Older instructions)
ld r1, (r3)
add r3, r1, r2
sub r6, r7, r9
add r3, r3, r6
ld r6, (r1)
add r6, r6, r3
st r6, (r1)
ld r6, (r1)
(Newer instructions)
Cycle t

Commit

Execute

Fetch

ld r1, (r3)
add r3, r1,
sub r6, r7,
add r3, r3,
ld r6, (r1)
add r6, r6,
st r6, (r1)
ld r6, (r1)

Cycle t + 1

Asanovic/Devadas
Spring 2002
6.823

r2
r9
r6
r3

Asanovic/Devadas
Spring 2002
6.823

Superscalar Register Renaming

During decode, instructions allocated new physical destination register


Source operands renamed to physical register with newest value
Execution unit only sees physical register numbers

Update
Mapping

Write
Ports

Inst 1 Op Dest Src1 Src2

Read Addresses

Rename Table
Read Data

Op PDest PSrc1 PSrc2

Op Dest Src1 Src2 Inst 2

Register
Free List

Op PDest PSrc1 PSrc2

Does this work?

Asanovic/Devadas
Spring 2002
6.823

Superscalar Register Renaming

Update
Mapping

Write
Ports

Inst 1 Op Dest Src1 Src2

Read Addresses

Rename Table
Read Data

Op PDest PSrc1 PSrc2

Inst 2

Op Dest Src1 Src2

=?

=?

Register
Free List

Op PDest PSrc1 PSrc2

Must check for RAW hazards between instructions issuing in


same cycle. Can be done in parallel with rename lookup.
(MIPS R10K renames 4 serially-RAW-dependent insts/cycle)

Asanovic/Devadas
Spring 2002
6.823

Memory Dependencies

st r1, (r2)
ld r3, (r4)
When can we execute the load?

Speculative Loads / Stores

Asanovic/Devadas
Spring 2002
6.823

Just like register updates, stores should not modify


the memory until after the instruction is committed
store buffer entry must carry a speculation bit and
the tag of the corresponding store instruction
If the instruction is committed, the speculation bit of
the corresponding store buffer entry is cleared, and
store is written to cache
If the instruction is killed, the corresponding store
buffer entry is freed
Loads work normally -- older store buffer entries
needs to be searched before accessing the memory
or the cache

Asanovic/Devadas
Spring 2002
6.823

Load Path
Load Address
Speculative
Store Buffer
VS
Tag
VS
Tag
VS
Tag
VS
Tag
VS
Tag
VS
Tag

L1 Data Cache
Data
Data
Data
Data
Data
Data

Tags

Data

Store Commit Path


Load Data

Hit in speculative store buffer has priority over hit in data


cache
Hit to newer store has priority over hits to older stores in
speculative store buffer

Datapath: Branch Prediction


and Speculative Execution

Asanovic/Devadas
Spring 2002
6.823

Update predictors
Branch
Prediction

kill
kill

Branch
Resolution

kill
PC

Fetch

Decode &
Rename

kill

Reorder Buffer

Commit

Reg. File
Branch
ALU MEM
Unit
Execute

Store
Buffer

D$

In-Order Memory Queue

Asanovic/Devadas
Spring 2002
6.823

Execute all loads and stores in program order


=> Load and store cannot leave ROB for execution
until all previous loads and stores have
completed execution
Can still execute loads and stores speculatively,
and out-of-order with respect to other instructions
Stores held in store buffer until commit

Conservative Out-of-Order
Load Execution

Asanovic/Devadas
Spring 2002
6.823

st r1, (r2)
ld r3, (r4)
Split execution of store instruction into two
phases: address calculation and data write
Can execute load before store, if addresses
known and r4 != r2
Each load address compared with addresses of
all previous uncommitted stores (can use partial
conservative check i.e., bottom 12 bits of address)
Dont execute load if any previous store address
not known
(MIPS R10K, 16 entry address queue)

Address Speculation

Asanovic/Devadas
Spring 2002
6.823

st r1, (r2)
ld r3, (r4)

Guess that r4 != r2
Execute load before store address known
Need to hold all completed but uncommitted
load/store addresses in program order
If subsequently find r4==r2, squash load and all
following instructions
=> Large penalty for inaccurate address speculation

Asanovic/Devadas
Spring 2002
6.823

Memory Dependence Prediction


(Alpha 21264)

st r1, (r2)
ld r3, (r4)
Guess that r4 != r2 and execute load before store
If later find r4==r2, squash load and all following
instructions, but mark load instruction as store-wait
Subsequent executions of the same load instruction
will wait for all previous stores to complete
Periodically clear store-wait bits

Asanovic/Devadas
Spring 2002
6.823

Improving Instruction Fetch


Performance of speculative out-of-order
machines often limited by instruction fetch
bandwidth
speculative execution can fetch 2-3x more
instructions than are committed
mispredict penalties dominated by time to refill
instruction window
taken branches are particularly troublesome

Increasing Taken Branch Bandwidth

Asanovic/Devadas
Spring 2002
6.823

(Alpha 21264 I-Cache)

PC
Generation

Branch Prediction
Instruction Decode
Validity Checks
Tag Tag

PC

Line
Way
Cached
Way Way
Predict Predict Instructions
0
1
4 insts
fast fetch path
=?

=?

Hit/Miss/Way
Fold 2-way tags and BTB into predicted next block
Take tag checks, inst. decode, branch predict out of loop
Raw RAM speed on critical loop (1 cycle at ~1 GHz)
2-bit hysteresis counter per block prevents overtraining

Tournament Branch Predictor

Asanovic/Devadas
Spring 2002
6.823

(Alpha 21264)

Local
history
table
(1,024x10b)

Local
prediction
(1,024x3b)

Global Prediction
(4,096x2b)

Choice
Prediction
(4,096x2b)

PC
Prediction

Global History (12b)

Choice predictor learns whether best to use local or


global branch history in predicting next branch
Global history is speculatively updated but restored on
mispredict
Claim 90-100% success on range of applications

Taken Branch Limit

Asanovic/Devadas
Spring 2002
6.823

Integer codes have a taken branch every 6-9


instructions
To avoid fetch bottleneck, must execute multiple taken
branches per cycle when increasing performance
This implies:
predicting multiple branches per cycle
fetching multiple non-contiguous blocks per cycle

Branch Address Cache

Asanovic/Devadas
Spring 2002
6.823

(Yeh, Marr, Patt)

Entry PC

Valid

predicted
target #1

len

predicted
target #2

PC

=
match valid

target#1 len#1 target#2

Extend BTB to return multiple branch predictions


per cycle

Fetching Multiple Basic Blocks


Requires either
multiported cache: expensive
interleaving: bank conflicts will occur
Merging multiple blocks to feed to decoders adds
latency increasing mispredict penalty and
reducing branch throughput

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

Trace Cache
Key Idea: Pack multiple non-contiguous basic
blocks into one contiguous trace cache line
BR

BR

BR

BR

BR

BR

Single fetch brings in multiple basic blocks

Trace cache indexed by start address and next n branch


predictions

Used in Intel Pentium-4 processor to hold decoded uops

MIPS R10000 (1995)

Asanovic/Devadas
Spring 2002
6.823

0.35m CMOS, 4 metal layers


Four instructions per cycle
Out-of-order execution
Register renaming
Speculative execution past 4
branches
On-chip 32KB/32KB split I/D
cache, 2-way set-associative
Off-chip L2 cache
Non-blocking caches

Compare with simple 5-stage


pipeline (R5K series)
~1.6x performance
SPECint95
~5x CPU logic area
~10x design effort

Asanovic/Devadas

Spring 2002

6.823

VLIW/EPIC: Statically Scheduled ILP

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Littles Law

Asanovic/Devadas
Spring 2002
6.823

Parallelism = Throughput * Latency


Throughput per Cycle

One Operation
Latency in Cycles

Asanovic/Devadas
Spring 2002
6.823

Example Pipelined ILP Machine


Max Throughput, Six Instructions per Cycle

One Pipeline Stage


Two Integer Units,
Latency Single
Cycle Latency
in Cycles
Two Load/Store Units,
Three Cycle Latency

Two Floating-Point Units,


Four Cycle Latency

How much instruction-level parallelism (ILP) required


to keep machine pipelines busy?

Asanovic/Devadas
Spring 2002
6.823

Superscalar Control Logic Scaling


Issue Width N
Issue Group
Previously
Issued
Instructions

Lifetime

Number of interlock checks and/or tag compares grows as


N*(N*L) where L is lifetime of instructions in machine
Each of N instructions issued or completed must check against N*L
instructions in flight

For in-order machines, lifetime L is related to pipeline latencies


For out-of-order machines, L also includes time spent in
instruction buffers (instruction window or ROB)
As N increases, need larger instruction window to find enough
parallelism to keep machine busy => greater lifetime L
=> Out-of-order control logic grows faster than N^2 (~N^3)

Asanovic/Devadas
Spring 2002
6.823

Out-of-Order Control Complexity:


MIPS R10000

Control

Logic

Sequential ISA Bottleneck


Sequential
source code

Superscalar compiler

a = foo(b);
for (i=0, i<
Find independent
operations

Schedule
operations

Superscalar processor

Check instruction
dependencies

Schedule
execution

Asanovic/Devadas
Spring 2002
6.823

Sequential
machine code

Asanovic/Devadas
Spring 2002
6.823

VLIW: Very Long Instruction Word


Int Op 1 Int Op 2 Mem Op 1 Mem Op 2 FP Op 1

FP Op 2

Two Integer Units,


Single Cycle Latency
Two Load/Store Units,
Three Cycle Latency
Two Floating-Point Units,
Four Cycle Latency

Compiler schedules parallel execution


Multiple parallel operations packed into one long
instruction word
Compiler must avoid data hazards (no interlocks)

Asanovic/Devadas

Spring 2002

6.823

Early VLIW Machines

FPS AP120B (1976)


scientific attached array processor
first commercial wide instruction machine
hand-coded vector math libraries using software pipelining and loop
unrolling

Multiflow Trace (1987)


commercialization of ideas from Fishers Yale group including trace
scheduling
available in configurations with 7, 14, or 28 operations/instruction
28 operations packed into a 1024-bit instruction word

Cydrome Cydra-5 (1987)


7 operations encoded in 256-bit instruction word
rotating register file

Asanovic/Devadas
Spring 2002
6.823

Loop Execution
for (i=0; i<N; i++)
Int1

B[i] = A[i] + C;
Compile

loop:

loop: ld f1, 0(r1)


add r1, 8
fadd f2, f0, f1
sd f2, 0(r2)
add r2, 8
bne r1, r3, loop

Schedule

Int 2

M1

M2

FP+

FPx

Asanovic/Devadas
Spring 2002
6.823

Loop Execution
for (i=0; i<N; i++)
Int1

B[i] = A[i] + C;
Compile

loop:

Int 2

add r1

M1

M2

ld

fadd

loop: ld f1, 0(r1)


add r1, 8

Schedule

fadd f2, f0, f1


sd f2, 0(r2)

FP+

add r2 bne

sd

add r2, 8
bne r1, r3, loop

How many FP ops/cycle?

1 fadd / 8 cycles = 0.125

FPx

Loop Unrolling

Asanovic/Devadas
Spring 2002
6.823

for (i=0; i<N; i++)


B[i] = A[i] + C;

Unroll inner loop to perform


4 iterations at once
for (i=0; i<N; i+=4)
{
B[i]

= A[i] + C;

B[i+1] = A[i+1] + C;
B[i+2] = A[i+2] + C;
B[i+3] = A[i+3] + C;
}

Need to handle values of N that are not multiples of


unrolling factor with final cleanup loop

Scheduling Loop Unrolled Code

Asanovic/Devadas
Spring 2002
6.823

Unroll 4 ways
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
sd f8, 24(r2)
add r2, 32
bne r1, r3, loop

Int1

loop:

Schedule

Int 2

M1

M2

FP+

FPx

Scheduling Loop Unrolled Code

Asanovic/Devadas
Spring 2002
6.823

Unroll 4 ways
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
sd f8, 24(r2)
add r2, 32
bne r1, r3, loop

Int1

loop:

add r1

Int 2

M1

ld f1
ld f2
ld f3
ld f4

Schedule

How many FLOPS/cycle?

sd f5
sd f6
sd f7
add r2 bne sd f8

M2

FP+

fadd f5
fadd f6
fadd f7
fadd f8

FPx

Asanovic/Devadas
Spring 2002
6.823

Software Pipelining
Unroll 4 ways first
loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
add r2, 32
sd f8, -8(r2)
bne r1, r3, loop

Int1

Int 2

M1

M2

FP+

FPx

Asanovic/Devadas
Spring 2002
6.823

Software Pipelining
Int1

Unroll 4 ways first


loop: ld f1, 0(r1)
ld f2, 8(r1)
ld f3, 16(r1)
ld f4, 24(r1)
add r1, 32
fadd f5, f0, f1
fadd f6, f0, f2
fadd f7, f0, f3
fadd f8, f0, f4
sd f5, 0(r2)
sd f6, 8(r2)
sd f7, 16(r2)
add r2, 32
sd f8, -8(r2)
bne r1, r3, loop

Int 2

M1

ld f1
ld f2
ld f3
add r1
ld f4
prolog
ld f1
ld f2
ld f3
add r1
ld f4
ld f1
loop:
iterate
ld f2
add r2 ld f3
add r1bne ld f4
epilog

How many FLOPS/cycle?

add r2
bne

M2

sd f5
sd f6
sd f7
sd f8
sd f5
sd f6
sd f7
sd f8
sd f5

FP+

fadd f5
fadd f6
fadd f7
fadd f8
fadd f5
fadd f6
fadd f7
fadd f8
fadd f5
fadd f6
fadd f7
fadd f8

FPx

Asanovic/Devadas
Spring 2002
6.823

Software Pipelining
vs. Loop Unrolling
Startup overhead

Loop Unrolled

Wind-down overhead

performance

Loop Iteration

time

Software Pipelined
performance

Loop Iteration

time

Software pipelining pays startup/wind-down


costs only once per loop, not once per iteration

Asanovic/Devadas
Spring 2002
6.823

What if there are no loops?

Basic block

Branches limit basic block size


in control-flow intensive
irregular code
Difficult to find ILP in individual
basic blocks

Asanovic/Devadas
Spring 2002
6.823

Trace Scheduling [ Fisher,Ellis]

Asanovic/Devadas
Spring 2002
6.823

Trace Scheduling [ Fisher,Ellis]

Pick string of basic blocks, a trace, that


represents most frequent branch path
Use profiling feedback or compiler
heuristics to find common branch paths
Schedule whole trace at once
Add fixup code to cope with branches
jumping out of trace

Asanovic/Devadas

Spring 2002

6.823

Problems with Classic VLIW


Object-code compatibility
have to recompile all code for every machine, even for two
machines in same generation

Object code size


instruction padding wastes instruction memory/cache
loop unrolling/software pipelining replicates code

Scheduling variable latency memory operations


caches and/or memory bank conflicts impose statically
unpredictable variability

Scheduling around statically unpredictable


branches
optimal schedule varies with branch path

Asanovic/Devadas
Spring 2002
6.823

VLIW Instruction Encoding


Various schemes to reduce effect of unused fields
Compressed format in memory, expand on I-cache refill
Cydra-5 MultiOp instructions: execute VLIW as sequential operations
Mark parallel groups (used in TMS320C6x DSPs, Intel IA-64)

Group 1

Group 2

Group 3

Rotating Register Files

Asanovic/Devadas
Spring 2002
6.823

Problem: Scheduled loops require lots of registers, lots of


duplicated code in prolog, epilog
ld r1, ()
add r2, r1, #1

ld r1, ()

st r2, ()

add r2, r1, #1

ld r1, ()

st r2, ()

add r2, r1, #1


st r2, ()

Prolog
Loop
Epilog

ld r1, ()
ld r1, ()

add r2, r1, #1

ld r1, ()

add r2, r1, #1

st r2, ()

add r2, r1, #1

st r2, ()
st r2, ()

Solution: Allocate new set of registers for each loop iteration

Asanovic/Devadas
Spring 2002
6.823

Rotating Register File


RRB=3
R1

P7
P6
P5
P4
P3
P2
P1
P0

Rotating Register Base (RRB) register points to base of


current register set. Value added on to logical register
specifier to give physical register number. Usually, split into
rotating and non-rotating registers.

Prolog
Loop
Epilog

ld r1, ()

dec RRB

ld r1, ()

add r3, r2, #1

dec RRB

ld r1, ()

add r3, r2, #1

st r4, ()

bloop

add r2, r1, #1

st r4, ()

dec RRB

st r4, ()

dec RRB

Loop closing
branch
decrements
RRB

Asanovic/Devadas
Spring 2002
6.823

Rotating Register File


(Previous Loop Example)

Three cycle load latency


encoded as difference of
3 in register specifier
number (f4 - f1 = 3)

Four cycle fadd latency


encoded as difference of
4 in register specifier
number (f9 f5 = 4)

ld f1, ()

fadd f5, f4, ...

sd f9, ()

bloop

ld P9, ()

fadd P13, P12,

sd P17, () bloop

RRB=8

ld P8, ()

fadd P12, P11,

sd P16, () bloop

RRB=7

ld P7, ()

fadd P11, P10,

sd P15, () bloop

RRB=6

ld P6, ()

fadd P10, P9,

sd P14, () bloop

RRB=5

ld P5, ()

fadd P9, P8,

sd P13, () bloop

RRB=4

ld P4, ()

fadd P8, P7,

sd P12, () bloop

RRB=3

ld P3, ()

fadd P7, P6,

sd P11, () bloop

RRB=2

ld P2, ()

fadd P6, P5,

sd P10, () bloop

RRB=1

Predicate Software Pipeline


Stages

Asanovic/Devadas
Spring 2002
6.823

Single VLIW Instruction


(p1) ld r1

(p2) add r3

(p3) st r4 (p1) bloop

Dynamic Execution
(p1) ld r1

(p1) bloop

(p1) ld r1

(p2) add r3

(p1) bloop

(p1) ld r1

(p2) add r3

(p3) st r4 (p1) bloop

(p1) ld r1

(p2) add r3

(p3) st r4 (p1) bloop

(p1) ld r1

(p2) add r3

(p3) st r4 (p1) bloop

(p2) add r3

(p3) st r4 (p1) bloop


(p3) st r4 (p1) bloop

Software pipeline stages turned on by rotating predicate registers


Much denser encoding of loops

Asanovic/Devadas

Spring 2002

6.823

Cydra-5:

Memory Latency Register (MLR)

Problem: Loads have variable latency

Solution: Let software choose desired memory latency

Compiler tries to schedule code for maximum load-use distance


Software sets MLR to latency that matches code schedule
Hardware ensures that loads take exactly MLR cycles to return
values into processor pipeline
Hardware buffers loads that return early
Hardware stalls processor if loads return late

Asanovic/Devadas

Spring 2002

6.823

Intel EPIC IA-64

EPIC is the style of architecture (cf. CISC, RISC)


Explicitly Parallel Instruction Computing

IA-64 is Intels chosen ISA (cf. x86, MIPS)


IA-64 = Intel Architecture 64-bit
An object-code compatible VLIW

Itanium (aka Merced) is first implementation (cf. 8086)


First customer shipment should be in 1997 , 1998, 1999, 2000, 2001
McKinley will be second implementation due 2002

IA-64 Instruction Format


Instruction 2

Instruction 1

Instruction 0

Asanovic/Devadas
Spring 2002
6.823

Template

128-bit instruction bundle

Template bits describe grouping of these


instructions with others in adjacent bundles
Each group contains instructions that can execute
in parallel
bundle j-1 bundle j

group i-1

bundle j+1 bundle j+2

group i

group i+1

group i+2

Asanovic/Devadas

Spring 2002

6.823

IA-64 Registers

128 General Purpose 64-bit Integer Registers


128 General Purpose 64/80-bit Floating Point Registers
64 1-bit Predicate Registers
GPRs rotate to reduce code size for software pipelined
loops

IA-64 Predicated Execution

Asanovic/Devadas
Spring 2002
6.823

Problem: Mispredicted branches limit ILP


Solution: Eliminate some hard to predict branches with predicated
execution
Almost all IA-64 instructions can be executed conditionally under predicate
Instruction becomes NOP if predicate register false
b0: Inst 1
Inst 2
br a==b, b2
b1: Inst 3
Inst 4
br b3
b2: Inst 5
Inst 6

if

else
Predication
then

Inst 1
Inst 2
p1,p2 <- cmp(a==b)
(p1) Inst 3 || (p2) Inst 5
(p1) Inst 4 || (p2) Inst 6
Inst 7
Inst 8

One basic block


b3: Inst 7
Inst 8

Four basic blocks

Mahlke et al, ISCA95: On average


>50% branches removed

IA-64 Speculative Execution

Asanovic/Devadas
Spring 2002
6.823

Problem: Branches restrict compiler code motion


Solution: Speculative operations that dont cause exceptions
Inst 1
Inst 2
br a==b, b2

Load r1
Use r1
Inst 3
Cant move load above branch
because might cause spurious
exception

Load.s r1
Inst 1
Inst 2
br a==b, b2

Chk.s r1
Use r1
Inst 3

Speculative load
never causes
exception, but sets
poison bit on
destination register

Check for exception in


original home block
jumps to fixup code if
exception detected

Particularly useful for scheduling long latency loads early

IA-64 Data Speculation

Asanovic/Devadas
Spring 2002
6.823

Problem: Possible memory hazards limit code scheduling


Solution: Hardware to check pointer hazards

Inst 1
Inst 2
Store
Load r1
Use r1
Inst 3
Cant move load above store
because store might be to same
address

Load.a r1
Inst 1
Inst 2
Store
Load.c
Use r1
Inst 3

Data speculative load


adds address to
address check table
Store invalidates any
matching loads in
address check table
Check if load invalid (or
missing), jump to fixup
code if so

Requires associative hardware in address check table

Asanovic/Devadas
Spring 2002
6.823

ILP Datapath Hardware Scaling


Register File
Multiple
Functional
Units

Replicating functional units and


cache/memory banks is straightforward and
scales linearly
Register file ports and bypass logic for N
functional units scale quadratically (N*N)
Memory interconnection among N functional
units and memory banks also scales
quadratically

Memory Interconnect
Multiple
Cache/Memory
Banks

(For large N, could try O(N logN) interconnect


schemes)

Technology scaling: Wires are getting even


slower relative to gate delays
Complex interconnect adds latency as well
as area
=> Need greater parallelism to hide latencies

Clustered VLIW
Cluster
Interconnect
Local
Regfile

Local
Regfile

Cluster

Memory Interconnect

Multiple
Cache/Memory
Banks

Asanovic/Devadas
Spring 2002
6.823

Divide machine into clusters of


local register files and local
functional units
Lower bandwidth/higher latency
interconnect between clusters
Software responsible for mapping
computations to minimize
communication overhead

Asanovic/Devadas

Spring 2002

6.823

Limits of Static Scheduling

Four major weaknesses of static scheduling:

Unpredictable branches
Variable memory latency (unpredictable cache misses)
Code size explosion
Compiler complexity

Asanovic/Devadas
Spring 2002
6.823

Vector Computers

Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology

Supercomputers

Asanovic/Devadas
Spring 2002
6.823

Definition of a supercomputer:
Fastest machine in world at given task
Any machine costing $30M+
A device to turn a compute-bound problem into an
I/O bound problem
Any machine designed by Seymour Cray
CDC6600 (Cray, 1964) regarded as first supercomputer

Supercomputer Applications

Asanovic/Devadas
Spring 2002
6.823

Typical application areas


Military research (nuclear weapons, cryptography)
Scientific research
Weather forecasting
Oil exploration
Industrial design (car crash simulation)
All involve huge computations on large data sets

In 70s-80s, Supercomputer Vector Machine

Vector Supercomputers
(Epitomized by Cray-1, 1976)
Scalar Unit + Vector Extensions

Load/Store Architecture
Vector Registers
Vector Instructions
Hardwired Control
Highly Pipelined Functional Units
Interleaved Memory System
No Data Caches
No Virtual Memory

Asanovic/Devadas
Spring 2002
6.823

Cray-1 (1976)

Asanovic/Devadas
Spring 2002
6.823

Asanovic/Devadas
Spring 2002
6.823

Cray-1 (1976)
64 Element
Vector Registers

Single Port
Memory
16 banks of
64-bit words
+
8-bit SECDED

( (Ah) + j k m )
(A0)

80MW/sec data
load/store

Tjk

( (Ah) + j k m )
(A0)

320MW/sec
instruction
buffer refill

64
T Regs

Si

64
T Regs

Ai
Bjk

S0
S1
S2
S3
S4
S5
S6
S7
A0
A1
A2
A3
A4
A5
A6
A7

NIP

64-bitx16

4 Instruction Buffers

memory bank cycle 50 ns

V0
V1
V2
V3
V4
V5
V6
V7

Vi

V. Mask

Vj

V. Length

Vk

FP Add
Sj

FP Mul

Sk

FP Recip

Si

Int Add
Int Logic
Int Shift
Pop Cnt
Aj
Ak
Ai

Addr Add
Addr Mul
CIP

LIP

processor cycle 12.5 ns (80MHz)

Vector Programming Model


Scalar Registers

Vector Registers

r15

v15

r0

v0

[0]

[1]

[2]

[VLRMAX-1]

Vector Length Register

Vector Arithmetic
Instructions
VADD v3, v1, v2

VLR

v1
v2
+

[0]

[1]

v3

Vector Load and


Store Instructions
VLD v1, r1, r2
Base, r1

Asanovic/Devadas
Spring 2002
6.823

Stride, r2

v1

[VLR-1]
Vector Register

Memory

Asanovic/Devadas
Spring 2002
6.823

Vector Code Example


# C code
for (i=0; i<64; i++)
C[i] = A[i] + B[i];

# Scalar Code
li r4, #64
loop:
ld f1, 0(r1)
ld f2, 0(r2)
fadd f3, f1, f2
st f3, 0(r3)
add r1, r1, #1
add r2, r2, #1
add r3, r3, #1
sub r4, #1
bnez r4, loop

# Vector Code
li vlr, #64
lv v1, r1, #1
lv v2, r2, #1
faddv v3, v1, v2
sv v3, r3, #1

Asanovic/Devadas
Spring 2002
6.823

Vector Instruction Set Advantages


Compact
one short instruction encodes N operations

Expressive, tells hardware that these N operations:

are independent
use the same functional unit
access disjoint registers
access registers in the same pattern as previous instructions
access a contiguous block of memory (unit-stride load/store)
access memory in a known pattern (strided load/store)

Scalable
can run same object code on more parallel pipelines or lanes

Asanovic/Devadas
Spring 2002
6.823

Vector Arithmetic Execution


Use deep pipeline (=> fast clock)
to execute element operations
Simplifies control of deep pipeline
because elements in vector are
independent (=> no hazards!)

V
1

V
2

V
3

Six stage multiply pipeline

V3 <- v1 * v2

Asanovic/Devadas
Spring 2002
6.823

Vector Memory System

Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency


Bank busy time: Cycles between accesses to same bank

Base Stride

Vector Registers
Address
Generator

0 1 2 3 4 5 6 7 8 9 A B C D E F
Memory Banks

Asanovic/Devadas
Spring 2002
6.823

Vector Instruction Execution


VADD C,A,B
Execution using
one pipelined
functional unit

A[6]
A[5]
A[4]
A[3]

B[6]
B[5]
B[4]
B[3]

Execution using
four pipelined
functional units

A[24]
A[20]
A[16]
A[12]

B[24]
B[20]
B[16]
B[12]

A[25]
A[21]
A[17]
A[13]

B[25]
B[21]
B[17]
B[13]

A[26]
A[22]
A[18]
A[14]

B[26]
B[22]
B[18]
B[14]

A[27]
A[23]
A[19]
A[15]

B[27]
B[23]
B[19]
B[15]

C[2]

C[8]

C[9]

C[10]

C[11]

C[1]

C[4]

C[5]

C[6]

C[7]

C[0]

C[0]

C[1]

C[2]

C[3]

Asanovic/Devadas
Spring 2002
6.823

Vector Unit Structure


Functional Unit

Vector
Registers

Elements
0, 4, 8,

Elements
1, 5, 9,

Elements
2, 6, 10,

Lane
Memory Subsystem

Elements
3, 7, 11,

Asanovic/Devadas
Spring 2002
6.823

T0 Vector Microprocessor (1995)

Vector register
elements striped
over lanes

Lane
[24][25]
[16][17]
[8] [9]
[0] [1]

[26] [27] [28]


[18] [19] [20]
[10] [11] [12]
[2] [3] [4]

[29]
[21]
[13]
[5]

[30]
[22]
[14]
[6]

[31]
[23]
[15]
[7]

Vector Memory-Memory versus


Vector Register Machines

Asanovic/Devadas
Spring 2002
6.823

Vector memory-memory instructions hold all vector operands


in main memory
The first vector machines, CDC Star-100 (73) and TI ASC (71),
were memory-memory machines
Cray-1 (76) was first vector register machine
Vector Memory-Memory Code
Example Source Code
for (i=0; i<N; i++)
{
C[i] = A[i] + B[i];
D[i] = A[i] - B[i];
}

VADD C, A, B
VSUB D, A, B

Vector Register Code


LV V1, A
LV V2, B
VADD V3, V1, V2
SV V3, C
VSUB V4, V1, V2
SV V4, D

Vector Memory-Memory vs.


Vector Register Machines

Asanovic/Devadas
Spring 2002
6.823

Vector memory-memory architectures (VMMA) require


greater main memory bandwidth, why?
All operands must be read in and out of memory

VMMAs make if difficult to overlap execution of


multiple vector operations, why?
Must check dependencies on memory addresses

VMMAs incur greater startup latency


Scalar code was faster on CDC Star-100 for vectors < 100 elements
For Cray-1, vector/scalar breakeven point was around 2 elements

Apart from CDC follow-ons (Cyber-205, ETA-10) all


major vector machines since Cray-1 have had vector
register architectures
(we ignore vector memory-memory from now on)

Automatic Code Vectorization

Asanovic/Devadas
Spring 2002
6.823

for (i=0; i < N; i++)


C[i] = A[i] + B[i];

Vectorized Code

Scalar Sequential Code


load
load

Iter. 1
add

store
load
load

Iter. 2
add

store

load
load

Time

load

Iter
1

load

add

add

store

store
Iter
2

Vector Instruction

Vectorization is a massive compile-time


reordering of operation sequencing
requires extensive loop dependence
analysis

Vector Stripmining

Asanovic/Devadas
Spring 2002
6.823

Problem: Vector registers have finite length


Solution: Break loops into pieces that fit into vector
registers, Stripmining
and r1, N, 63 #
move vlr, r1
#
loop:
lv v1, rA
Remainder
add rA, rA, r1 #
lv v2, rB
add rB, rB, r1
64 elements vadd v3, v1, v2
sv v3, rC
add rC, rC, r1
sub N, N, r1
#
move vlr, 64
#
move r1, 64
bgtz N, loop
#

for (i=0; i<N; i++)


C[i] = A[i]+B[i];
A

C
+

N mod 64
Do remainder

Bump pointer

Subtract elements
Reset full length
Any more to do?

Vector Instruction Parallelism

Asanovic/Devadas
Spring 2002
6.823

Can overlap execution of multiple vector instructions


example machine has 32 elements per vector register and 8 lanes

Load Unit
load

Multiply Unit

Add Unit

mul
add

time
load
mul

add

Instruction
issue

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining

Asanovic/Devadas
Spring 2002
6.823

Vector version of register bypassing


introduced with Cray-1

lv

V
1

v1

V
2

V
3

V
4

vmul v3,v1,v2
vadd v5, v3, v4
Chain
Load
Unit
Memory

Chain

Mult.

Add

V
5

Vector Chaining Advantage

Asanovic/Devadas
Spring 2002
6.823

Without chaining, must wait for last element of result to


be written before starting dependent instruction
Load
Mul
Time

Add

With chaining, can start dependent instruction as soon


as first result appears
Load
Mul
Add

Asanovic/Devadas
Spring 2002
6.823

Vector Startup
Two components of vector startup penalty
functional unit latency (time through pipeline)
dead time or recovery time (time before another vector
instruction can start down pipeline)
Functional Unit Latency
R

First Vector Instruction

Dead Time

Dead Time

Second Vector Instruction


W

Asanovic/Devadas
Spring 2002
6.823

No Dead Time => Shorter Vectors

No dead time
4 cycles dead time

64 cycles active
Cray C90, Two lanes
4 cycle dead time
Maximum efficiency 94%
with 128 element vectors

T0, Eight lanes


No dead time
100% efficiency with 8 element
vectors

Vector Scatter/Gather

Asanovic/Devadas
Spring 2002
6.823

Want to vectorize loops with indirect accesses:


for (i=0; i<N; i++)
A[i] = B[i] + C[D[i]]

Indexed load instruction (Gather)


lv vD, rD
lvx vC, rC, vD
lv vB, rB
vadd vA, vB, vC
sv vA, rA

#
#
#
#
#

Load indices in D vector


Load indirect from rC base
Load B vector
Do add
Store result

Vector Scatter/Gather

Asanovic/Devadas
Spring 2002
6.823

Scatter example:
for (i=0; i<N; i++)
A[B[i]]++;
Is following a correct translation?
lv vB, rB
lvx vA, rA, vB
vadd vA, vA, 1
svx vA, rA, vB

#
#
#
#

Load indices in B vector


Gather initial A values
Increment
Scatter incremented values

Vector Conditional Execution

Asanovic/Devadas
Spring 2002
6.823

Problem: Want to vectorize loops with conditional code:


for (i=0; i<N; i++)
if (A[i]>0) then
A[i] = B[i];
else
A[i] = C[i];

Solution: Add vector mask (or flag) registers


vector version of predicate registers, 1 bit per element

and maskable vector instructions


vector operation becomes NOP at elements where mask bit is clear

Code example:
lv vA, rA
# Load A vector
mgtz m0, vA
# Set bits in mask register m0 where A>0
lv.m vA, rB, m0 # Load B vector into A under mask
fnot m1, m0
# Invert mask register
lv.m vA, rC, m1 # Load C vector into A under mask
sv vA, rA
# Store A back to memory (no mask)

Masked Vector Instructions


Simple Implementation
execute all N operations, turn off
result writeback according to mask
M[7]=1
M[6]=0
M[5]=1
M[4]=1
M[3]=0

A[7]
A[6]
A[5]
A[4]
A[3]

B[7]
B[6]
B[5]
B[4]
B[3]

M[2]=0

C[2]

M[1]=1

C[1]

Asanovic/Devadas
Spring 2002
6.823

Density-Time Implementation
scan mask vector and only execute
elements with non-zero masks
M[7]=1
M[6]=0
M[5]=1
M[4]=1
M[3]=0
M[2]=0
M[1]=1
M[0]=0

A[7]

B[7]

C[5]
C[4]

C[1]
Write data port

M[0]=0
Write Enable

C[0]
Write data port

Compress/Expand Operations

Asanovic/Devadas
Spring 2002
6.823

Compress packs non-masked elements into


contiguous vector
population count of mask vector gives packed vector length

Expand performs inverse operation


M[7]=1
M[6]=0
M[5]=1
M[4]=1
M[3]=0
M[2]=0
M[1]=1
M[0]=0

A[7]
A[6]
A[5]
A[4]
A[3]
A[2]
A[1]
A[0]

A[7]
A[5]
A[4]
A[1]
A[7]
A[5]
A[4]
A[1]

Compress

A[7]
B[6]
A[5]
A[4]
B[3]
B[2]
A[1]
B[0]

M[7]=1
M[6]=0
M[5]=1
M[4]=1
M[3]=0
M[2]=0
M[1]=1
M[0]=0

Expand

Used for density-time conditionals and also for general


selection operations

Vector Reductions

Asanovic/Devadas
Spring 2002
6.823

Problem: Loop-carried dependence on reduction variables


sum = 0;
for (i=0; i<N; i++)
sum += A[i]; # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary


tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0
#
for(i=0; i<N; i+=VL)
#
sum[0:VL-1] += A[i:i+VL-1]; #
# Now have VL partial sums in one
do {
VL = VL/2;
sum[0:VL-1] += sum[VL:2*VL-1]
} while (VL>1)

Vector of VL partial sums


Stripmine VL-sized chunks
Vector sum
vector register
# Halve vector length
# Halve no. of partials

A Modern Vector Super: NEC SX-5 (1998)

Asanovic/Devadas
Spring 2002
6.823

CMOS Technology
250MHz clock (312 MHz in 2001)
CPU fits on one multi-chip module
SDRAM main memory (up to 128GB)

Scalar unit
4-way superscalar with out-of-order and speculative execution
64KB I-cache and 64KB data cache

Vector unit

8 foreground VRegs + 64 background VRegs (256 elements/VReg)


1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit, 1 mask unit
16 lanes, 8 GFLOPS peak (32 FLOPS/cycle)
1 load & store unit (32x8 byte accesses/cycle)
64 GB/s memory bandwidth per processor

SMP structure
16 CPUs connected to memory through crossbar
1 TB/s shared memory bandwidth

Multimedia Extensions

Asanovic/Devadas
Spring 2002
6.823

Very short vectors added to existing ISAs for micros


Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
Newer designs have 128-bit registers (Altivec, SSE2)
Limited instruction set:
no vector length control
no strided load/store or scatter/gather
unit-stride loads must be aligned to 64/128-bit boundary

Limited vector register length:


requires superscalar dispatch to keep multiply/add/load units busy
loop unrolling to hide latencies increases register pressure

Trend towards fuller vector support in microprocessors

Symmetric Multiprocessors:
Synchronization and Sequential
Consistency

Page 1

Symmetric Multiprocessors
Processor

Processor
CPU-Memory bus
bridge
I/O bus

Memory

I/O controller I/O controller I/O controller

Graphics
output

symmetric

Networks

All memory is equally far

away from all processors


Any processor can do any I/O
(set up a DMA transfer)

Page 2

Synchronization
The need for synchronization arises whenever
there are parallel processes in a system
fork
(even in a uniprocessor system)
Forks and Joins: In parallel programming
a parallel process may want to wait until
several events have occurred
Producer-Consumer: A consumer process
must wait until the producer process has
produced data
Exclusive use of a resource: Operating
system has to ensure that only one
process uses a resource at a given time

P2

P1
join

producer
consumer
3

Page 3

A Producer-Consumer Example
Producer

tail

head

Consumer
Rtail

Rtail

Rhead

Consumer:
Rhead M[head]
3
spin: Rtail M[tail]
if <Rhead> == <Rtail>
4
R M[<Rhead>]
Rhead <Rhead> + 1
M[head] <Rhead>
process(R)

Producer posting Item x:


Rtail M[tail]
M[<Rtail>] x
1
Rtail <Rtail> + 1
2
M[tail] <Rtail>

The program is written assuming


instructions are executed in order.

Possible problems?

What is the problem?

Suppose the tail pointer gets updated before the item x is stored?

Suppose R is loaded before x has been stored?

Page 4

A Producer-Consumer Example
Consumer:
Rhead M[head]
3
spin: Rtail M[tail]
if <Rhead> == <Rtail>
4
R M[<Rhead>]
Rhead <Rhead> + 1
M[head] <Rhead >
process(R)

Producer posting Item x:


Rtail M[tail]
M[<Rtail>] x
1
Rtail <Rtail> + 1
2
M[tail] <Rtail >

Programmer assumes that if 3 happens after 2, then 4 happens


after 1.
Problems are:
Sequence 2, 3, 4, 1
Sequence 4, 1, 2, 3

Programmer assumes that if 3 happens after 2, then 4 happens after 1.

Page 5

Sequential Consistency: A Memory Model


P

A system is sequentially consistent if the result of


any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual processor
appear in the order specified by the program
Leslie Lamport
Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs

Page 6

Sequential Consistency
Concurrent sequential tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)
T1:

T2:
Store(X, 1) (X = 1)
Store(Y, 11) (Y = 11)

Load(R1, Y)
Store(B, R1) (B = Y)
Load(R2, X)
Store(A, R2) (A = X)

what are the legitimate answers for A and B ?


(A, B) { (1, 11), (0, 10), (1, 10), (0, 11) } ?
7

(0, 11) is not legit.

Page 7

Sequential Consistency
Sequential consistency imposes additional memory
ordering constraints in addition to those imposed by
uniprocessor program dependencies
What are these in our example ?
Does (can) a system with caches, write buffers, or
out-of-order execution capability provide a
sequentially consistent view of the memory ?
More on this later

Page 8

Multiple Consumer Example


Producer

tail

Consumer1 Rtail

head

Rhead

Rtail

Consumer2

Producer posting Item x:


Rtail M[tail]
M[<Rtail>] x
Rtail <Rtail> + 1
M[tail] <Rtail >

What is wrong with this

Rtail

Rhead

Consumer:
Rhead M[head]
spin: Rtail M[tail]
if <Rhead> == <Rtail>
R M[<Rhead>]
Rhead <Rhead> + 1
M[head] <Rhead >
code? process(R)

Page 9

Multiple Consumer Example


Producer

tail

head

Consumer1

Consumer2

Consumer:
Rhead M[head]
spin: Rtail M[tail]
if <Rhead> == <Rtail>
R M[<Rhead>]
Rhead <Rhead> + 1
Critical Section:
M[head] <Rhead >
Needs to be executed atomically
process(R)
by one consumer locks

Producer posting Item x:


Rtail M[tail]
M[<Rtail>] x
Rtail <Rtail> + 1
M[tail] <Rtail >

10

Page 10

10

Locks or Semaphores:
E. W. Dijkstra, 1965

A semaphore is a non-negative integer, with the


following operations:
P(s): if s > 0 decrement s by 1 otherwise wait
V(s): increment s by 1 and wake up one of
the waiting processes
Ps and Vs must be executed atomically, i.e., without
interruptions or
interleaved accesses to s by other processors
Process i
P(s)
What does initial value of s
<critical section> determine?
V(s)
11

The maximum number of processes in the critical section.


A sempahore is a visual system for sending information based on 2 flags
held
In each hand.

Page 11
11

Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented
using ordinary Load and Store instructions in the
Sequential Consistency memory model. However,
protocols for mutual exclusion are difficult to design...
Simpler solution:
atomic read-modify-write instructions
Examples: (a is a memory address, R is a register)
Test&Set(a, R):
R M[a];
if <R>==0 then
M[a] 1;

Fetch&Add(a, RV, R):


R M[a];
M[a] <R> + <RV>;

Swap(a, R):
Rt M[a];
M[a] <R>;
R <Rt>;
12

Page 12

12

Multiple Consumers Example:


using the Test & Set Instruction

P:

Test&Set(mutex, Rtemp)
if (<Rtemp> != 0) goto P
Rhead M[head]
spin: Rtail M[tail]
if <Rhead> == <Rtail> goto spin
R M[<Rhead>]
Rhead <Rhead> + 1
M[head] <Rhead >
Store(mutex, 0)
V:
process(R)

Critical
Section

Other atomic read-modify-write instructions (Swap,


Fetch&Add, etc.) can also implement Ps and Vs
What is the problem with this code?

13

What if the process stops or is swapped out while in the critical section?

Page 13

13

Nonblocking Synchronization
Compare&Swap(a, Rt, Rs): implicit arg - status
if (<Rt> == M[a])
then
M[a] <Rs>;
Rt <Rs>;
status success;
else
status fail;
try: Rhead M[head]
spin: Rtail M[tail]
if <Rhead> == <Rtail> goto spin
R M[<Rhead>]
Rnewhead <Rhead> + 1
Compare&Swap(head, Rhead, Rnewhead)
if (status == fail) goto try
process(R)

14

Page 14

14

Load-reserve & Store-conditional


Non-blocking Synchronization

Special register(s) to hold reservation flag and


address, and the outcome of store-conditional
Load-reserve(R, a):
<flag, adr> <1, a>;
R M[a];

try:
spin:

Store-conditional(a, R):
if <flag, adr> == <1, a>
then cancel other procs
reservation on a;
M[a] <R>;
status succeed;
else status fail;

Load-reserve(Rhead, head)
Rtail M[tail]
if <Rhead > == <Rtail> goto spin
R M[<Rhead>]
Rhead <Rhead> + 1
Store-conditional(head, Rhead)
if (status == fail) goto try
process(R)

15

Page 15

15

Mutual Exclusion Using Load/Store


A protocol based on two shared variables c1 and
c2. Initially, both c1 and c2 are 0 (not busy)
Process 1
...
c1 = 1;
L: if c2 == 1 then go to L
< critical section >
c1 = 0;

Process 2
...
c2 = 1;
L: if c1 == 1 then go to L
< critical section >
c2 = 0;

What is wrong?
16

Page 16

16

Mutual Exclusion: second attempt


To avoid deadlock, let process give up reservation
(i.e., Process 1 sets c1 to 0) while waiting.
Process 1
...
L: c1 = 1;
if c2 == 1 then
{ c1 = 0; goto L }
< critical section >
c1 = 0

Process 2
...
L: c2 = 1;
if c1 == 1 then
{ c2 = 0; goto L }
< critical section >
c2 = 0

Deadlock is not possible.


What could go wrong?
17

This is the most promising solution, but alas, we still have a problem with
bounded waiting . Suppose Process j continually reenters its entry protocol
after leaving its exit protocol, while Process i is waiting. It is possible That
Process j will repeatedly reach the while test when Process i has
temporarily cleared its flag. We cannot place a bound on how many times
this could happen.

Page 17
17

A Protocol for Mutual Exclusion


T. Dekker, 1966

A protocol based on 3 shared variables c1, c2 and


turn. Initially, both c1 and c2 are 0 (not busy)
Process 1
...
c1 = 1;
turn = 1;
L: if c2 == 1 && turn == 1
then goto L
< critical section >
c1 = 0;

Process 2
...
c2 = 1;
turn = 2;
L: if c1 == 1 && turn == 2
then goto L
< critical section>
c2 = 0;

turn = i ensures that only process i can wait


variables c1 and c2 ensure mutual exclusion

18

Take a number approach used in bakeries.

Never seen one in bakeries, but the RMV uses one.

Page 18

18

N-process Mutual Exclusion


Lamports Bakery Algorithm

Process i

Initially num[j] = 0, for all j

Entry Code

choosing[i] = 1;
num[i] = max(num[0], , num[N-1]) + 1;
choosing[i] = 0;
for(j = 0; j < N; j++) {
while( choosing[j] );
while( num[j] &&
( ( num[j] < num[i] ) ||
( num[j] == num[i] && j < i ) ) );
}
Exit Code

num[i] = 0;

19

Wait if the process is currently choosing

Wait if the process has a number and comes ahead of us.

Page 19

19

Implementation Issues
P

Implementation of SC is complicated by two issues


Out-of-order execution capability
Load(a); Load(b)
yes
Load(a); Store(b)
yes if a b
Store(a); Load(b)
yes if a b
Store(a); Store(b)
yes if a b
Caches
Caches can prevent the effect of a store from
being seen by other processors

20

Page 20

20

Memory Fences:

Instructions to serialize memory accesses

Processors with relaxed or weak memory models


(i.e., permit Loads and Stores to different addresses
to be reordered) need memory fence instructions to
force serialization of memory accesses
Processors with relaxed memory models:
Sparc V8 (TSO, PSO): Membar
PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations, however,
one pays for serialization only when it is required

21

Page 21

21

Using Memory Fences


Producer

tail

Producer posting Item x:


Rtail M[tail]
M[<Rtail>] x
membarSS
Rtail = <Rtail> + 1
M[tail] <Rtail >
What does this ensure?

head

Consumer

Consumer:
Rhead M[head]
spin: Rtail M[tail]
if <Rhead> == <Rtail>
membarLL
R M[<Rhead>]
Rhead <Rhead> + 1
M[head] <Rhead >
What does
process(R)
this ensure?

22

Ensures that tail pointer is not updated before


X has been stored.
Ensures that R is not loaded before x has been stored.

Page 22

22

Cache Coherence

Page 1

Memory Consistency in SMPs


CPU-1

CPU-2

cache-1

100

100

cache-2

CPU-Memory bus

100

memory

Suppose CPU-1 updates A to 200.


write-back: memory and cache-2 have stale values
write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?
2

Page 2

Write-back Caches & SC

T1 is executed

prog T1
ST X, 1
ST Y,11

cache-1 writes back Y

T2 executed
cache-1 writes back X
cache-2 writes
back X & Y

cache-1
X= 1
Y=11

memory
X= 0
Y =10
X=
Y=

cache-2
Y=
Y=
X=
X=

X= 1
Y=11

X= 0
Y =11
X=
Y=

Y=
Y=
X=
X=

X= 1
Y=11

X= 0
Y =11
X=
Y=

Y = 11
Y= 11
X= 0
X= 0

X= 1
Y=11

X= 1
Y =11
X=
Y=

Y = 11
Y= 11
X= 0
X= 0

X= 1
Y=11

X= 1
Y =11
X= 0
Y=11

Y=
Y=
X=
X=

prog T2
LD Y, R1
ST Y, R1
LD X, R2
ST X,R2

Page 3

Write-through Caches & SC

prog T1
ST X, 1
ST Y,11

T1 executed
T2 executed

cache-1
X= 0
Y=10

memory
X= 0
Y =10
X=
Y=

cache-2
Y=
Y=
X= 0
X=

X= 1
Y=11

X= 1
Y =11
X=
Y=

Y=
Y=
X= 0
X=

X= 1
Y=11

X= 1
Y =11
X= 0
Y= 11

Y = 11
Y= 11
X= 0
X= 0

prog T2
LD Y, R1
ST Y, R1
LD X, R2
ST X,R2

Write-through caches dont preserve


sequential consistency either

Page 4

Maintaining Sequential Consistency

SC sufficient for correct producer-consumer


and mutual exclusion code (e.g., Dekker)
Multiple copies of a location in various caches
can cause SC to break down.
Hardware support is required such that
only one processor at a time has write
permission for a location
no processor can load a stale copy of
the location after a write
cache coherence protocols

Page 5

A System with Multiple Caches


P
L1
P
L1

P
L1

P
L1
L2

P
L1

P
L1

L2

Interconnect
M

Modern systems often have hierarchical caches


Each cache has exactly one parent but can have
zero or more children
Only a parent and its children can communicate directly
Inclusion property is maintained between a parent
and its children, i.e.,

a Li+1
a Li
6

Page 6

Cache Coherence Protocols for SC


write request:
the address is invalidated (updated) in all other
caches before (after) the write is performed
read request:
if a dirty copy is found in some cache, a write-back
is performed before the memory is read

We will focus on Invalidation protocols


as opposed to Update protocols
7

Update protocols, or write broadcast. Latency between writing a word in one


processor
and reading it in another is usually smaller in a write update scheme.
But since bandwidth is more precious, most multiprocessors use a write
invalidate scheme.

Page 7
7

Warmup: Parallel I/O


Memory Physical
Bus
Memory

Address (A)

Proc.

Data (D)

Cache

R/W

Page transfers
occur while
Processor runs
A

Either Cache or DMA can


be the Bus Master and
effect transfers

D
R/W

DMA

DISK

DMA stands for Direct Memory Access


8

Page 8

Problems with Parallel I/O


Cached portions
of page

Physical
Memory

Memory
Bus

Proc.

Cache
DMA transfers

DMA
DISK
Memory
Disk

Disk: Physical memory may be


stale if Cache is
.

Memory: Cache may have data


corresponding to

.
9

Page 9

Snoopy Cache [ Goodman 1983 ]


Idea: Have cache watch (or snoop upon) DMA
transfers, and then do the right thing
Snoopy cache tags are dual-ported
Used to drive Memory Bus
when Cache is Bus Master
A

Proc.

R/W
D

Tags and
State
Data
(lines)

A
R/W

Snoopy read port


attached to Memory
Bus

Cache
10

A snoopy cache works in analogy to your snoopy next door neighbor, who is
always watching to see what you're doing, and interfering with your life. In
the case of the snoopy cache, the caches are all watching the bus for
transactions that affect blocks that are in the cache at the moment. The
analogy breaks down here; the snoopy cache only does something if your
actions actually affect it, while the snoopy neighbor is always interested in
what you're up to.

Page 10
10

Snoopy Cache Actions


Observed Bus
Cycle

Cache State

Cache Action

Address not cached

Read Cycle
Memory

Disk

Cached, unmodified

Disk

Memory

Cached, modified

Address not cached

Write Cycle

Cached, unmodified
Cached, modified

11

Page 11

11

Shared Memory Multiprocessor


Memory
Bus

M1

Snoopy
Cache

M2

Snoopy
Cache

M3

Snoopy
Cache

Physical
Memory

DMA

DISKS

Use snoopy mechanism to keep all


processors view of memory coherent
12

Page 12

12

Cache State Transition Diagram


M: Modified Exclusive
E: Exclusive, unmodified
S: Shared
I: Invalid

Each cache line tag


Address tag
state
bits
M1 write
or read

M1 write

Write miss

Other processor
read/ Write back line
Read
miss
Read by any
processor

M1 read

M1

o
tt
en
t
in

e
rit

Other processor
intent to write

Other processor
intent to write

I
13

Page 13

13

2 Processor Example
Block b

M1

Block b

M2

M1 write
or read

M1 write

M2 read,
Write back line
Read
miss

M2 write
or read

M1

to

M2 write

M2

nt

en
int

Write miss

ite
wr

M2 intent to write

M1 read,
Write back line
Read
miss

e
int

wr
t to

M2 intent to write

M2 read
Write miss

ite

M1 intent to write

M1 read

M1 intent to write

14

Page 14

14

Observation
M1 write
or read

M1 write

Read by any
processor

M1 read
Write miss

Other processor
read/ Write back line
Read
miss

M1

to
nt
e
t
in

e
rit

Other processor
intent to write

Other processor
intent to write

If a line is in M or E, then no other cache has


a valid copy of the line!
Memory stays coherent, multiple differing copies
cannot exist
15

Page 15

15

Cache Coherence State Encoding


block Address
tag

indexm

offset

tag

V M

data block

Valid and dirty bits can be used


to encode S, I, and (E, M) states
V=0, D=x Invalid
V=1, D=0 Shared (not dirty)
V=1, D=1 Exclusive (dirty)

Hit?

word

16

What does it mean to merge E, M states?

Page 16

16

2-Level Caches
CPU

CPU

CPU

CPU

L1 $

L1 $

L1 $

L1 $

L2 $

L2 $

L2 $

L2 $

Snooper

Snooper

Snooper

Snooper

Processors often have two-level caches


Small L1 on chip, large L2 off chip
Inclusion property: entries in L1 must be in L2
invalidation in L2 invalidation in L1
Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?

17

Interlocks are required when both CPU-L1 and L2-Bus interactions involve
the same address.

Page 17

17

Intervention
CPU-1
A

CPU-2
cache-1

200

cache-2

CPU-Memory bus
A

100

Memory (stale)

When a read-miss for A occurs in cache-2,


a read request for A is placed on the bus
Cache-1 needs to supply & change its state to shared
The memory may respond to the request also!

does memory know it has stale data?


Cache-1 needs to intervene through memory
controller to supply correct data to cache-2

18

Page 18

18

False Sharing
state blk addr data0 data1

...

dataN

A cache block contains more than one word


Cache-coherence is done at the block-level and not
word-level
Suppose M1 writes wordi and M2 writes wordk and
both words have the same block address.
What can happen?

19

The block may be invalidated many times unnecessarily because


the addresses share a common block.

Page 19

19

Synchronization and Caches:


Performance Issues

Processor 1
R1
L: swap(mutex, R);
if <R> then goto L;
<critical section>
M[mutex] 0;

Processor 2
R1
L: swap(mutex, R);
if <R> then goto L;
<critical section>
M[mutex] 0;

cache

mutex=1

Processor 3
R1
L: swap(mutex, R);
if <R> then goto L;
<critical section>
M[mutex] 0;

cache

CPU-Memory Bus

Cache-coherence protocols will cause mutex to ping-pong


between P1s and P2s caches.
Ping-ponging can be reduced by first reading the mutex
location (non-atomically) and executing a swap only if
it is found to be zero.

20

Page 20

20

Performance Related to Bus occupancy


In general, a read-modify-write instruction requires
two memory (bus) operations without intervening
memory operations by other processors
In a multiprocessor setting, bus needs to be locked
for the entire duration of the atomic read and write
operation
expensive for simple buses
very expensive for split-transaction buses
modern processors use
load-reserve
store-conditional

21

Split transaction bus has a read-request transaction followed by a

Memory-reply transaction that contains the data.

Split transactions make the bus available for other masters

While the memory reads the words of the requested address.

It also normally means that the CPU must arbitrate for the bus

To request the data and memory must arbitrate for the bus to

Return the data. Each transaction must be tagged. Split

Transaction buses have higher bandwidth and higher latency.

Page 21
21

Load-reserve & Store-conditional


Special register(s) to hold reservation flag and
address, and the outcome of store-conditional
Load-reserve(R, a):
<flag, adr> <1, a>;
R M[a];

Store-conditional(a, R):
if <flag, adr> == <1, a>
then cancel other procs
reservation on a;
M[a] <R>;
status succeed;
else status fail;

If the snooper sees a store transaction to the address


in the reserve register, the reserve bit is set to 0
Several processors may reserve a simultaneously
These instructions are like ordinary loads and stores
with respect to the bus traffic
22

Page 22

22

Performance:

Load-reserve & Store-conditional

The total number of memory (bus) transactions is


not necessarily reduced, but splitting an atomic
instruction into load-reserve & store-conditional:
increases bus utilization (and reduces
processor stall time), especially in splittransaction buses
reduces cache ping-pong effect because
processors trying to acquire a semaphore do
not have to perform a store each time
23

Page 23

23

Out-of-Order Loads/Stores & CC


snooper
(Wb-req, Inv-req, Inv-rep)
load/store
buffers
CPU

Cache
(I/S/E)

Blocking caches

pushout (Wb-rep)

Memory

(S-rep, E-rep)
(S-req, E-req)

One request at a time + CC SC

Non-blocking caches
Multiple requests (different addresses) concurrently + CC
Relaxed memory models

CC ensures that all processors observe the same order


of loads and stores to an address

24

Page 24

24

Relaxed Memory Models

Episode III in our multiprocessing miniseries.

Relaxed memory models.

What I really wanted here was an elephant with sunglasses relaxing

On a beach, but I couldnt find one.

Page 1
1

Sequential Consistency: A Memory Model


P

Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs
SC is easy to understand but architects and
compiler writers want to violate it for performance
2

Mark Hill written a paper which essentially says why break your back for
20%. Actually people are out there breaking their backs for 1% in
architecture these days.

Page 2

Optimizations & Memory Models


Pushout
buffers
load/store
buffers

Memory
Data
Cache

CPU

ProcessorMemory
Interface

Load queue

Architectural optimizations that are correct for


uniprocessors often result in a new memory
model for multiprocessors
3

This means that we are relaxing the ordering or relaxing atomicity.

Page 3

Relaxed Models
What orderings among reads and writes
performed by a single processor are
preserved by the model?
R R, R W, W W, W R
dependence if they are to the same address

If there is a dependence, then program


semantics demand that operations be ordered
If there is no dependence, the memory
consistency model determines what orders
must be preserved
Relaxed model may allow an operation
executed later to complete first
4

Page 4

Memory Fences & Weak Memory Models


Processors with relaxed or weak memory models
need memory fence instructions to force serialization
of memory accesses
In SC there is an implicit fence at each memory
operation
Processors with relaxed memory models:
Sparc V8 (TSO, PSO): Membar
PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations, however,
one pays for serialization only when it is required
5

Page 5

Using Memory Fences


Producer

tail

head

Consumer

Producer posting Item x:


Rtail M[tail]
M[<Rtail>] x
membarSS
Rtail <Rtail> + 1
M[tail] <Rtail>
Ensures that tail pointer
is not updated before x
has been stored.

Consumer:
Rhead M[head]
spin: Rtail M[tail]
if <Rhead> == <Rtail>
membarLL
R M[<Rhead>]
Rhead <Rhead> + 1
Ensures that R is M[head] <Rhead>
not loaded before process(R)
x has been stored.

Ensures that tail pointer is not updated before


X has been stored.
Ensures that R is not loaded before x has been stored.

Page 6

Data-Race Free Programs


(a.k.a. Properly Synchronized Programs)

Process 1
...
Acquire(mutex);
< critical section >
Release(mutex);

Process 2
...
Acquire(mutex);
< critical section >
Release(mutex);

Synchronization variables (e.g., mutex) are separate


from data variables
Accesses to writable shared data variables are
protected in critical regions
no data races except for locks
(Formal definition is elusive)
In general, it cannot be proven if a program is datarace free.

Nondeterminator.

Page 7

Fences in Data-Race Free Programs


Process 1
...
Acquire(mutex);
membar;
< critical section >
membar;
Release(mutex);

Process 2
...
Acquire(mutex);
membar;
< critical section >
membar;
Release(mutex);

Relaxed memory models allow reordering of


instructions by the compiler or the processor as long
as the reordering is not done across a fence
What about speculation and prefetching?
8

Processor should not speculate or prefetch across fences.

Page 8

Total Store Order (TSO)


IBM370, DECVAX

Eliminates the order W(a) R(b) a b


Advantage?
=A

=A
B=

B=

acquire (S)

acquire (S)

SC

C=
=D

TSO

C=
=D
release (S)

release (S)
E=

E=

F=

F=

Allows the buffering of writes with bypassing by reads, which occurs


whenever the
processor allows a read to proceed before it guarantees that
an earlier write by the processor has been seen by all the other processors.

Page 9
9

TSO vs. SC
Initially x = old, y = old
Processor P1

Processor P2

x = new;
y_copy = y;

y = new;
x_copy = x;

Under SC what values can x_copy and y_copy get ?


Under TSO what values can x_copy and y_copy get ?

10

TSO both can get old values.

SC at least one has to get the value of new.

Page 10

10

Partial Store Ordering (PSO)


SPARC

Also eliminates the order W(a) W(b) a b


Advantage?
=A

=A
B=

B=

acquire (S)

acquire (S)

TSO

C=

PSO

C=

=D

=D

release (S)

release (S)

E=

E=

F=

F=

11

Allows pipelining or overlapping of write operations, rather than


Forcing one operation to complete before another.

Page 11

11

Weak Ordering
POWERPC

Also eliminates the orders R(a) R(b) a b


and R(a) W(b) a b
Need non-blocking reads to exploit relaxation
=A

=A
B=

B=

acquire (S)

acquire (S)

PSO

C=

WO

C=

=D

=D

release (S)

release (S)

E=

E=

F=

F=

12

Non-blocking reads, doesnt help too much.

Page 12

12

Release Consistency
Alpha, MIPS

Read/write that precedes acquire need not


complete before acquire, and read/write that
follows a release need not wait for the release
=A

=A
B=

B=

acquire (S)

acquire (S)

SC

C=
=D

RC

C=
=D
release (S)

release (S)
E=

E=

F=

F=

13

Weakest of the memory models used these days.

Page 13

13

Release Consistency Example


Initially data = old
Processor P1

Processor P2

data = new;

while(flag != SET) { }

flag = SET;

data_copy = data;

How do we ensure that data_copy is always set


to new ?
14

Membar Store Store

Membar Load Load

Page 14

14

Weaker Memory Models


Alpha, Sparc
PowerPC, ...

Store is globally
performed

TSO, PSO,
RMO, ...

Writebuffers

RMO=WO?

SMP, DSM

Hard to understand and remember


Unstable - Modle de lanne
15

Weak ordering. Interaction with cache coherence. Weak atomicity.

Page 15

15

Why SC is not the right model for compilers

Intermediate representation is intrinsically a partial


order (data-flow graph)
expose scope for instruction reordering
to the underlying architecture
Load/Store atomicity forces compilers to overspecify requirements for completion of operations
expose cache coherence actions

16

Page 16

16

The CRF Model

X. Shen, Arvind, L. Rudolph (1999)

proc

proc

sache

sache

proc

...

sache

shared memory
Exposes
data caching via semantic caches

Store(a,v)
Load(a)

StoreL(a,v); Commit(a)
Reconcile(a); LoadL(a)

instruction reordering (controllable via Fence)


17

Page 17

17

CRF: Load Local & Store Local

proc

proc

...

LoadL(a)
Cell(a,v,-)

StoreL(a,v)
Cell(a,v,D)

shared memory
LoadL reads from the sache if the address is cached
StoreL writes into the sache and sets the state to Dirty
18

Page 18

18

CRF: Commit and Reconcile

proc
Commit(a)
C
Cell(a,-,D)?

proc
Reconcile(a)

...

Cell(a,-,C)?

shared memory
Commit completes if the address is not cached in the
Dirty state
Reconcile completes if the address is not cached in
Clean

19

Page 19

19

CRF: Background Operations

proc

proc

...
Cell(a,5,C)

proc
C

Cell(b,8,D)

Cache
Writeback
Cell(a,5)
Cell(b,8)

...
Cell(c,7,C)

Purge
Cell(c,7)

Cache (retrieve) a copy of an uncached address


from memory
Writeback a Dirty copy to memory and set its
state Clean
Purge a Clean copy

20

Page 20

20

CRF: Fences
Instructions can be reordered except for
Data dependence
StoreL(a,v); Commit(a);
Reconcile(a); LoadL(a);
Reconcile(a1);
LoadL(a1);

Fencerr (a1, a2)

Reconcile(a2);
LoadL(a2);

Fencewr; Fencerw; Fenceww;


21

Page 21

21

Producer-Consumer Synchronization

reader Reconcile(a);
LoadL(a);

StoreL(a,v); writer
Commit(a);
writeback

cache

memory
Break down the synchronization equally between
the producer and consumer
Semantically, memory behaves as the rendezvous
between processors
no operation involves more than one sache
22

Page 22

22

CRF: A Universal Memory Model

SC Program
Load
Store

CRF Program
Translation
Scheme

LoadL
StoreL
Commit
Reconcile

CRF
Protocol

A CRF protocol is automatically a protocol for


any memory model whose programs can be
translated into CRF programs
23

Page 23

23

Translating SC into CRF


Initially a = 0, flag = 0
Processor 1

Processor 2

Store(a,10);
Store(flag,1);

L: r1 = Load(flag);
Jz(r1,L);
r2 = Load(a);

24

Page 24

24

Translating SC into CRF

Processor 1
Store(a,10);
Fenceww(a, flag);
Store(flag,1);

Processor 2
L: r1 = Load(flag);
Jz(r1,L);
Fencerr(flag, a);
r2 = Load(a);

Weak ordering
25

Page 25

25

Translating SC into CRF

Processor 1

Processor 2

StoreL(a,10);
Commit(a);
Fenceww(a, flag);
StoreL(flag,1);
Commit(flag);

L: Reconcile(flag);
r1 = LoadL(flag);
Jz(r1,L);
Fencerr(flag, a);
Reconcile(a);
r2 = LoadL(a);

26

Page 26

26

Asanovic/Devadas
Spring 2002
6.823

Microprocessor Evolution:
4004 to Pentium Pro

Krste Asanovic
Laboratory for Computer Science
Massachusetts Institute of Technology

First Microprocessor
Intel 4004, 1971

Asanovic/Devadas
Spring 2002
6.823

4-bit accumulator
architecture
8m pMOS
2,300 transistors
3 x 4 mm2
750kHz clock
8-16 cycles/inst.

Asanovic/Devadas
Spring 2002
6.823

Microprocessors in the Seventies


Initial target was embedded control
First micro, 4-bit 4004 from Intel, designed for a desktop
printing calculator

Constrained by what could fit on single chip


Single accumulator architectures

8-bit micros used in hobbyist personal computers


Micral, Altair, TRS-80, Apple-II

Little impact on conventional computer market until


VISICALC spreadsheet for Apple-II (6502, 1MHz)
First killer business application for personal computers

Asanovic/Devadas
Spring 2002
6.823

DRAM in the Seventies


Dramatic progress in MOSFET memory technology
1970, Intel introduces first DRAM (1Kbit 1103)
1979, Fujitsu introduces 64Kbit DRAM
=> By mid-Seventies, obvious that PCs would soon
have > 64KBytes physical memory

Microprocessor Evolution

Asanovic/Devadas
Spring 2002
6.823

Rapid progress in size and speed through 70s


Fueled by advances in MOSFET technology and expanding markets

Intel i432

Most ambitious seventies micro; started in 1975 - released 1981


32-bit capability-based object-oriented architecture
Instructions variable number of bits long
Severe performance, complexity, and usability problems

Intel 8086 (1978, 8MHz, 29,000 transistors)


Stopgap 16-bit processor, architected in 10 weeks
Extended accumulator architecture, assembly-compatible with 8080
20-bit addressing through segmented addressing scheme

Motorola 68000 (1979, 8MHz, 68,000 transistors)


Heavily microcoded (and nanocoded)
32-bit general purpose register architecture (24 address pins)
8 address registers, 8 data registers

Intel 8086

Asanovic/Devadas
Spring 2002
6.823

Class
Data:

Register
AX,BX
CX
DX

Purpose
general purpose
string and loop ops only
mult/div and I/O only

Address:

SP
BP
SI,DI

stack pointer
base pointer (can also use BX)
index registers

Segment:

CS
SS
DS
ES

code segment
stack segment
data segment
extra segment

Control:

IP
FLAGS

instruction pointer (lower 16 bit of PC)


C, Z, N, B, P, V and 3 control bits

Typical format R R op M[X], many addressing modes


Not a GPR organization!

IBM PC, 1981

Asanovic/Devadas
Spring 2002
6.823

Hardware
Team from IBM building PC prototypes in 1979
Motorola 68000 chosen initially, but 68000 was late
IBM builds stopgap prototypes using 8088 boards from
Display Writer word processor
8088 is 8-bit bus version of 8086 => allows cheaper system
Estimated sales of 250,000
100,000,000s sold

Software
Microsoft negotiates to provide OS for IBM. Later buys and
modifies QDOS from Seattle Computer Products.

Open System

Standard processor, Intel 8088


Standard interfaces
Standard OS, MS-DOS
IBM permits cloning and third-party software

The Eighties:
Microprocessor Revolution

Asanovic/Devadas
Spring 2002
6.823

Personal computer market emerges


Huge business and consumer market for spreadsheets, word
processing and games
Based on inexpensive 8-bit and 16-bit micros: Zilog Z80, Mostek
6502, Intel 8088/86,

Minicomputers replaced by workstations


Distributed network computing and high-performance graphics
for scientific and engineering applications (Sun, Apollo, HP,)
Based on powerful 32-bit microprocessors with virtual memory,
caches, pipelined execution, hardware floating-point

Massively Parallel Processors (MPPs) appear


Use many cheap micros to approach supercomputer
performance (Sequent, Intel, Parsytec)

The Nineties

Asanovic/Devadas
Spring 2002
6.823

Distinction between workstation and PC disappears


Parallel microprocessor-based SMPs take over lowend server and supercomputer market
MPPs have limited success in supercomputing market
High-end mainframes and vector supercomputers
survive killer micro onslaught
64-bit addressing becomes essential at high-end
In 2001, 4GB DRAM costs <$5,000

CISC ISA (x86) thrives!

Asanovic/Devadas
Spring 2002
6.823

Reduced ISA Diversity in Nineties


Few major companies in general-purpose market

Intel x86 (CISC)


IBM 390 (CISC)
Sun SPARC, SGI MIPS, HP PA-RISC (all RISCs)
IBM/Apple/Motorola introduce PowerPC (another RISC)
Digital introduces Alpha (another RISC)

Software costs make ISA change prohibitively expensive


64-bit addressing extensions added to RISC instruction sets
Short vector multimedia extensions added to all ISAs, but without compiler
support

=> Focus on microarchitecture (superscalar, out-of-order)

CISC x86 thrives!


RISCs (SPARC, MIPS, Alpha, PowerPC) fail to make significant inroads into
desktop market, but important in server and technical computing markets

RISC advantage shrinks with superscalar out-of-order


execution

Intel Pentium Pro, (1995)

Asanovic/Devadas
Spring 2002
6.823

During decode, translate complex x86


instructions into RISC-like micro-operations
(uops)
e.g., R R op Mem translates into
load T, Mem
# Load from Mem into temp reg
R R op T
# Operate using value in temp

Execute uops using speculative out-of-order


superscalar engine with register renaming
Pentium Pro family architecture (P6 family) used
on Pentium-II and Pentium-III processors

Intel Pentium Pro (1995)


External Bus
L2 Cache

Memory Reorder
Buffer

Bus Interface

Data Cache

Branch
Target
Buffer
MicroInstruction
Sequencer

Register
Alias Table
Internal RISC-like micro-ops

Memory
Interface Unit
Reservation Station

Instruction Decoder

x86 CISC
macro
instructions

Instruction Cache
and Fetch Unit

Address
Generation Unit
Integer Unit
Floating-Point
Unit

Reorder Buffer
and Retirement
Register File

Asanovic/Devadas
Spring 2002
6.823

P6 Instruction Fetch & Decode


8KB I-cache, 4-way s.a.,

I-TLB

32-byte lines

32+4 entry

virtual index, physical tag

fully assoc.

16-byte aligned fetch of 16 bytes

Fetch
Buffer
(holds x86
insts.)

Simple
Decoder

1 uop

Simple
Decoder

1 uop

Complex
Decoder

PC from
branch
predictor
I-TLB has 32 entries for
4KB pages plus 4 entries
for 4MB pages

uop Buffer
(6 entries)
1-4 uops

uop Sequencer
(microcode)

Asanovic/Devadas
Spring 2002
6.823

P6 uops

Asanovic/Devadas
Spring 2002
6.823

Each uop has fixed format of around 118 bits


opcode, two sources, and destination
sources and destination fields are 32-bits wide to hold
immediate or operand

Simple decoders can only handle simple x86


instructions that map to one uop
Complex decoder can handle x86 translations of
up to 4 uops
Complicated x86 instructions handled by
microcode engine that generates uop sequence
Intel data shows average of 1.2-1.7 uops per x86
instruction on SPEC95 benchmarks, 1.4-2.0 on MS
Office applications

Asanovic/Devadas
Spring 2002
6.823

P6 Reorder Buffer and Renaming


uop Buffer
(6 entries)

3 uops
/cycle

Reorder Buffer (ROB)


Allocate
ROB, RAT,
RS entries

Data

Status

40 entries
in ROB

Register Alias Table (RAT)


EAX
EBX
ECX
EDX
ESI
EDI
ESP
EBP

Retirement
Register File
(RRF)

Values move from ROB to architectural register file


(RRF) when committed

P6 Reservation Stations and


Execution Units

Asanovic/Devadas
Spring 2002
6.823

ROB

Renamed
uops
(3/cycle)

(40 entries)

Reservation Station (20 entries)

dispatch
up to 5
uops/cycle

Store
Data

stores only
leave MOB
when uop
commits

Store
Addr.

Load
Addr.

Int.
ALU

Int.
ALU

FP
ALU

Memory Reorder Buffer


(MOB)
1 store

D-TLB

1 load
8KB D-cache, 4-way s.a., 32-byte lines,
divided into 4 interleaved banks

D-TLB has 64 entries for 4KB pages fully assoc.,


plus 8 entries for 4MB pages, 4-way s.a.

Load
data

Asanovic/Devadas
Spring 2002
6.823

P6 Retirement
After uop writes back to ROB with no outstanding
exceptions or mispredicts, becomes eligible for
retirement
Data written to RRF from ROB
ROB entry freed, RAT updated
uops retired in order, up to 3 per cycle
Have to check and report exceptions at valid x86
instruction fault points
complex instructions (e.g., string move) may generate
thousands of uops

Asanovic/Devadas
Spring 2002
6.823

P6 Pipeline
RS
x86uop
Write RS Exec.
Decode Rename Read
Retire

BTB I-Cache
Access Access

Fetch buffer

uop buffer
Reservation Station

10 11 12
ROB

Addr.
Calc.

Load pipeline

D-cache

L2 Access

Retire

L1 L2 L3 L4 L5 L6 11 12
MOB

ROB
Bypass L2
access if L1 hit

P6 Branch Penalties
RS
x86uop
Write RS Exec.
Decode Rename Read
Retire

BTB I-Cache
Access Access

Fetch buffer

Asanovic/Devadas
Spring 2002
6.823

uop buffer

10 11 12
ROB

Reservation Station

Branch mispredict penalty

P6 Branch Target Buffer (BTB)

Asanovic/Devadas
Spring 2002
6.823

512 entries, 4-way set-associative


Holds branch target, plus two-level BHT for
taken/not-taken
Unconditional jumps not held in BTB
One cycle bubble on correctly predicted taken
branches (no penalty if correctly predicted nottaken)

Two-Level Branch Predictor

Asanovic/Devadas
Spring 2002
6.823

Pentium Pro uses the result from the last two branches
to select one of the four sets of BHT bits (~90-95% correct)
00
Fetch PC

2-bit global branch


history shift register
Shift in
Taken/Taken
results of each
branch
Taken/Taken?

P6 Static Branch Prediction

Asanovic/Devadas
Spring 2002
6.823

If a branch misses in BTB, then static prediction


performed
Backwards branch predicted taken, forwards
branch predicted not-taken

Asanovic/Devadas
Spring 2002
6.823

P6 Branch Penalties

RS
x86uop
Write RS Exec.
Decode Rename Read
Retire

BTB I-Cache
Access Access

BTB
predicted
taken
penalty

Fetch buffer

uop buffer

10 11 12
ROB

Reservation
Station

Decode and predict


branch that missed in BTB
(backwards taken,
forwards not-taken)

Branch resolved

Asanovic/Devadas
Spring 2002
6.823

P6 System
PCI Bus

DRAM

Memory
controller

AGP
Bus

AGP
Graphics
Card

Glueless SMP to 4 procs., split-transaction


Frontside bus
CPU

CPU

CPU

CPU

L1 I$ L1 D$

L1 I$ L1 D$

L1 I$ L1 D$

L1 I$ L1 D$

L2 $

L2 $

L2 $

Backside bus
L2 $

Pentium-III Die Photo


Programmable
Interrupt Control

Asanovic/Devadas
Spring 2002
6.823

External and Backside


Packed FP Datapaths
Bus Logic
Integer Datapaths
Page Miss Handler
Floating-Point
Datapaths
Memory Order
Buffer
Memory Interface
Unit (convert floats

Clock

to/from memory
format)

16KB
4-way
s.a. D$

MMX Datapaths
Register Alias Table
Allocate entries
(ROB, MOB, RS)

Reservation
Station
Branch
Address Calc
Reorder Buffer
(40-entry physical
regfile + architect.
regfile)

256KB
8-way s.a.
Instruction Fetch Unit:
16KB 4-way s.a. I-cache

Instruction Decoders:
3 x86 insts/cycle

Microinstruction
Sequencer

Pentium Pro vs MIPS R10000

Asanovic/Devadas
Spring 2002
6.823

External Bus
Interface

Execution
Units

D-cache

I-cache

Estimates of 30% hit for CISC versus RISC


compare with original RISC Advantage of 2.6
RISC Advantage decreased because size of out-of-order core
largely independent of original ISA

Asanovic/Devadas

Spring 2002

6.823

Advanced CISC Implementations:

Pentium 4

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Intel Pentium Pro (1995)

Asanovic/Devadas

Spring 2002

6.823

During decode, translate complex x86


instructions into RISC-like micro-operations
(uops)
e.g., R R op Mem translates into
load T, Mem
# Load from Mem into temp reg
R R op T
# Operate using value in temp

Execute uops using speculative out-of-order


superscalar engine with register renaming
Pentium Pro family architecture (P6 family) used
on Pentium-II and Pentium-III processors

Intel Pentium 4 (2000)

Asanovic/Devadas

Spring 2002

6.823

Deeper pipelines than P6 family


about half as many levels of logic per pipeline stage as P6

Trace cache holds decoded uops


only has a single x86->uop decoder

Decreased latency in same process technology


aggressive circuit design
new microarchitectural tricks

This lecture contains figures and data taken from: The microarchitecture
of the Pentium 4 processor, Intel Technology Journal, Q1, 2001

Pentium 4 Block Diagram

Asanovic/Devadas
Spring 2002
6.823

P-III vs. P-4 Pipelines

Asanovic/Devadas
Spring 2002
6.823

In same process technology, ~1.5x clock frequency


Performance Equation:
Time = Instructions * Cycles * Time
Program
Program
Instruction Cycle

Asanovic/Devadas

Spring 2002

6.823

Apple Marketing for G4


Shorter data pipeline

The performance advantage of the PowerPC G4


starts with its data pipeline. The term processor
pipeline refers to the number of processing steps,
or stages, it takes to accomplish a task. The fewer
the steps, the shorter and more efficient the
pipeline. Thanks to its efficient 7-stage design
(versus 20 stages for the Pentium 4 processor)
the G4 processor can accomplish a task with 13
fewer steps than the PC. You do the math.

Asanovic/Devadas

Spring 2002

6.823

Relative Frequency of Intel Designs

Over time, fewer logic levels per pipeline stage and


more advanced circuit design
Higher frequency in same process technology

Deep Pipeline Design

Asanovic/Devadas
Spring 2002
6.823

Greater potential throughput but:


Clock uncertainty and latch delays eat into cycle time
budget
doubling pipeline depth gives less than twice frequency
improvement

Clock load and power increases


more latches running at higher frequencies

More complicated microarchitecture needed to cover


long branch mispredict penalties and cache miss
penalties
from Littles Law, need more instructions in flight to cover longer
latencies larger reorder buffers

P-4 has three major clock domains


Double pumped ALU (3 GHz), small critical area at highest speed
Main CPU pipeline (1.5 GHz)
Trace cache (0.75 GHz), save power

Pentium 4 Trace Cache

Asanovic/Devadas
Spring 2002
6.823

Holds decoded uops in predicted program flow


order, 6 uops per line
Code in memory

cmp
br T1
...
T1: sub
br T2
...
T2: mov
sub
br T3
...
T3: add
sub
mov
br T4
...
T4:

Code packed in trace cache


(6 uops/line)
cmp
br T2

br T1
mov

sub
sub

br T3
mov

add
br T4

sub
T4:...

Trace cache fetches one 6 uop line


every 2 CPU clock cycles (runs at
main CPU rate)

Trace Cache Advantages

Asanovic/Devadas

Spring 2002

6.823

Removes x86 decode from branch mispredict


penalty
Parallel x86 decoder took 2.5 cycles in P6, would be 5 cycles in P-4
design

Allows higher fetch bandwidth fetch for correctly


predicted taken branches
P6 had one cycle bubble for correctly predicted taken branches
P-4 can fetch a branch and its target in same cycle

Saves energy
x86 decoder only powered up on trace cache refill

Pentium 4 Front End


L2 Cache
x86 instructions,
8 Bytes/cycle

Inst. Prefetch
& TLB

Fetch Buffer

Asanovic/Devadas
Spring 2002
6.823

Front End
BTB
(4K Entries)

Single x86 instruction/cycle

x86 Decoder
4 uops/cycle

Trace Cache Fill Buffer


6 uops/line

Trace Cache (12K uops)

Translation from x86


instructions to internal
uops only happens on
trace cache miss, one x86
instruction per cycle.
Translations are cached in
trace cache.

Asanovic/Devadas
Spring 2002
6.823

P-4 Trace Cache Fetch


1
TC Next IP
(BTB)
2
3
TC Fetch
4
5
Drive
6
Alloc
7
Rename
8
9
Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2
17
Execute
18
Flags
19 Branch Check
20
Drive

Trace Cache
(12K uops, 2K lines of 6 uops)
Trace IP

Microcode
ROM

Trace BTB
(512 entries)
16-entry
subroutine return
address stack

6 uops every two


CPU cycles

uop buffer
3 uops/cycle

P-III vs. P-4 Renaming


1
TC Next IP
(BTB)
2
3
TC Fetch
4
5
Drive
6
Alloc
7
Rename
8
9
Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2
17
Execute
18
Flags
19 Branch Check
20
Drive

Asanovic/Devadas
Spring 2002
6.823

P-4 physical register file separated from ROB status.


ROB entries allocated sequentially as in P6 family.
One of 128 physical registers allocated from free list.
No data movement on retire, only Retirement RAT
updated.

Asanovic/Devadas
Spring 2002
6.823

P-4 Op Queues and Schedulers


1
TC Next IP
(BTB)
2
3
TC Fetch
4
5
Drive
6
Alloc
7
Rename
8
9
Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2
17
Execute
18
Flags
19 Branch Check
20
Drive

Allocated/Renamed uops
3 uops/cycle
Memory uop
Queue

Memory
Scheduler

uop queues are


in-order within
each queue

Fast
Fast
Scheduler
Scheduler
(x2)

Arithmetic
uop Queue

General
Scheduler

Simple FP
Scheduler

Ready uops compete for dispatch ports


(Fast schedulers can each dispatch 2 ALU
operations per cycle)

P-4 Execution Ports

Asanovic/Devadas
Spring 2002
6.823

Schedulers compete for access to execution ports


Loads and stores have dedicated ports
ALUs can execute two operations per cycle
Peak bandwidth of 6 uops per cycle
load, store, plus four double-pumped ALU operations

P-4 Fast ALUs and Bypass Path

Asanovic/Devadas
Spring 2002
6.823

Register
File and
Bypass
Network

L1 Data
Cache
Fast ALUs and bypass network runs at double speed
All non-essential circuit paths handled out of loop to reduce circuit
loading (shifts, mults/divs, branches, flag/ops)
Other bypassing takes multiple clock cycles

P-4 Staggered ALU Design

Asanovic/Devadas
Spring 2002
6.823

Staggers 32-bit add and flag


compare into three cycle
phases
low 16 bits
high 16 bits
flag checks

Bypass 16 bits around every


cycle
back-back dependent 32-bit adds
at 3GHz in 0.18m

L1 Data Cache access


starts with bottom 16 bits as
index, top 16 bits used as
tag check later

Asanovic/Devadas
Spring 2002
6.823

P-4 Load Schedule Speculation


1
TC Next IP
2
3
TC Fetch
4
5
Drive
6
Alloc
7
Rename
8
9
Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2
17 Load Execute 1
18 Load Execute 2
19 Branch Check
20
Drive

Long delay from


schedulers to load
hit/miss

P-4 guesses that load will hit in L1 and


schedules dependent operations to use
value
If load misses, only dependent
operations are replayed

P-4 Branch Penalty


1
TC Next IP
2
3
TC Fetch
4
5
Drive
6
Alloc
7
Rename
8
9
Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2
17
Execute
18
Flags
19 Branch Check
20
Drive

Asanovic/Devadas
Spring 2002
6.823

20 cycle branch
mispredict penalty

P-4 uses new trade secret branch


prediction algorithm
Intel claims 1/3 fewer mispredicts than
P6 algorithm

P-4 Microarchitecture

Asanovic/Devadas
Spring 2002
6.823

Pentium-III Die Photo


Programmable
Interrupt Control

Asanovic/Devadas
Spring 2002
6.823

External and Backside


Packed FP Datapaths
Bus Logic
Integer Datapaths
Page Miss Handler
Floating-Point
Datapaths
Memory Order
Buffer
Memory Interface
Unit (convert floats

Clock

to/from memory
format)

16KB
4-way
s.a. D$

MMX Datapaths
Register Alias Table
Allocate entries
(ROB, MOB, RS)

Reservation
Station
Branch
Address Calc
Reorder Buffer
(40-entry physical
regfile + architect.
regfile)

256KB
8-way s.a.
Instruction Fetch Unit:
16KB 4-way s.a. I-cache

Instruction Decoders:
3 x86 insts/cycle

Microinstruction
Sequencer

Scaling of Wire Delay

Asanovic/Devadas
Spring 2002
6.823

Over time, transistors are getting relatively faster


than long wires
wire resistance growing dramatically with shrinking width and
height
capacitance roughly fixed for constant length wire
RC delays of fixed length wire rising

Chips are getting bigger


P-4 >2x size of P-III

Clock frequency rising faster than transistor speed


deeper pipelines, fewer logic gates per cycle
more advanced circuit designs (each gate goes faster)

Takes multiple cycles for signal to cross chip

Visible Wire Delay in P-4 Design

1
TC Next IP
2
3
TC Fetch
4
5
Drive
6
Alloc
7
Rename
8
9
Queue
10 Schedule 1
11 Schedule 2
12 Schedule 3
13 Dispatch 1
14 Dispatch 2
15 Register File 1
16 Register File 2
17
Execute
18
Flags
19 Branch Check
20
Drive

Asanovic/Devadas
Spring 2002
6.823

Pipeline stages dedicated to just


driving signals across chip!

Asanovic/Devadas

Spring 2002

6.823

Instruction Set Translation


Convert a target ISA into a host machines ISA
Pentium Pro (P6 family)
translation in hardware after instruction fetch

Pentium-4 family
translation in hardware at level 1 instruction cache refill

Transmeta Crusoe
translation in software using Code Morphing

Asanovic/Devadas

Spring 2002

6.823

Virtual Machines and Dynamic

Translation:

Implementing ISAs in Software

Krste Asanovic

Laboratory for Computer Science

Massachusetts Institute of Technology

Software Applications

Asanovic/Devadas

Spring 2002

6.823

How is a software application encoded?


What are you getting when you buy a software application?
What machines will it work on?
Who do you blame if it doesnt work, i.e., what contract(s) were
violated?

ISA + Environment =

Virtual Machine

Asanovic/Devadas
Spring 2002
6.823

ISA alone not sufficient to write useful programs, need I/O

Direct access to memory mapped I/O via load/store


instructions problematic
time-shared systems
portability

Operating system responsible for I/O


sharing devices and managing security
hiding different types of hardware (e.g., EIDE vs. SCSI disks)

ISA communicates with operating system through


some standard mechanism, i.e., syscall instructions
example convention to open file:

addi r1, r0, 27


# 27 is code for file open

addu r2, r0, rfname


# r2 points to filename string

syscall
# cause trap into OS

# On return from syscall, r1 holds file descriptor

Application Binary Interface


(ABI)

Asanovic/Devadas

Spring 2002

6.823

Programs are usually distributed in a binary format


that encode the program text (instructions) and
initial values of some data segments
Virtual machine specifications include
what state is available at process creation
which instructions are available (the ISA)
what system calls are possible (I/O, or the environment)

The ABI is a specification of the binary format used


to encode programs for a virtual machine
Operating system implements the virtual machine
at process startup, OS reads the binary program, creates an
environment for it, then begins to execute the code, handling traps
for I/O calls, emulation, etc.

OS Can Support Multiple VMs

Asanovic/Devadas

Spring 2002

6.823

Virtual machine features change over time with


new versions of operating system
new ISA instructions added
new types of I/O are added (e.g., asynchronous file I/O)

Common to provide backwards compatibility so


old binaries run on new OS
SunOS 5 (System V Release 4 Unix, Solaris) can run binaries
compiled for SunOS4 (BSD-style Unix)
Windows 98 runs MS-DOS programs

If ABI needs instructions not supported by native


hardware, OS can provide in software

Supporting Multiple OSs on


Same Hardware

Asanovic/Devadas
Spring 2002
6.823

Can virtualize the environment that an operating


system sees, an OS-level VM
Hypervisor layer implements sharing of real hardware
resources by multiple OS VMs that each think they
have a complete copy of the machine
Popular in early days to allow mainframe to be shared by multiple
groups developing OS code
Used in modern mainframes to allow multiple versions of OS to be
running simultaneously OS upgrades with no downtime!
Example for PCs: VMware allows Windows OS to run on top of Linux
(or vice-versa)

Requires trap on access to privileged state that OS


needs
easier if OS interface to hardware well defined

ISA Implementations Partly in


Software

Asanovic/Devadas

Spring 2002

6.823

Often good idea to implement part of ISA in software:


Expensive but rarely used instructions can cause trap
to OS emulation routine:
e.g., decimal arithmetic instructions in MicroVax
implementation of VAX ISA

Infrequent but difficult operand values can cause trap


e.g., IEEE floating-point denormals cause traps in almost all
floating-point unit implementations

Old machine can trap unused opcodes, allows


binaries for new ISA to run on old hardware
e.g., Sun SPARC v8 added integer multiply instructions, older
v7 CPUs trap and emulate

Supporting Non-Native ISAs

Asanovic/Devadas
Spring 2002
6.823

Run programs for one ISA on hardware with different ISA


Techniques:
Emulation
OS software interprets instructions at run-time
E.g., OS for PowerPC Macs had emulator for 68000 code

Binary Translation
convert at install and/or load time
IBM AS/400 to modified PowerPC cores
DEC tools for VAX->MIPS->Alpha

Dynamic Translation (or Dynamic Compilation)


compile non-native ISA to native ISA at run time
Suns HotSpot Java JIT (just-in-time) compiler
Transmeta Crusoe, x86->VLIW code morphing

Run-time Hardware Emulation


Hardware supports two ISAs!
IBM 360 had IBM 1401 emulator in microcode
Intel Itanium converts x86 to native VLIW (two software-visible ISAs)

Emulation

Asanovic/Devadas
Spring 2002
6.823

Software instruction set interpreter fetches and


decodes one instruction at a time in emulated VM
Emulator Stack
Guest
Stack

Executable
on Disk
Guest
ISA
Data
Guest
ISA
Code

Guest
ISA
Data

Load into
emulator
memory

Guest
ISA
Code
Emulator Data

Emulator Code

Memory image of
guest VM lives in
host emulator
data memory

fetch-decode loop
while(!stop)
{
inst = Code[PC];
PC += 4;
execute(inst);
}

Emulation
Easy to code, small code footprint
Slow, approximately 100x slower than native
execution for RISC ISA hosted on RISC ISA
Problem is time taken to decode instructions

fetch instruction from memory


switch tables to decode opcodes
extract register specifiers using bit shifts
access register file data structure
execute operation
return to main fetch loop

Asanovic/Devadas
Spring 2002
6.823

Binary Translation

Asanovic/Devadas
Spring 2002
6.823

Each guest ISA instruction translates into some


set of host (or native) ISA instructions
Instead of dynamically fetching and decoding
instructions at run-time, translate entire binary
program and save result as new native ISA
executable
Removes interpretive fetch-decode overhead
Can do compiler optimizations on translated code
to improve performance
register allocation for values flowing between guest ISA
instructions
native instruction scheduling to improve performance
remove unreachable code
inline assembly procedures

Binary Translation, Take 1

Asanovic/Devadas
Spring 2002
6.823

Executable
on Disk
Executable
on Disk
Guest
ISA
Data
Guest
ISA
Code

Data
unchanged

Translate to
native ISA code

Guest
ISA
Data
Native
Data
Native
ISA
Code

Native translation
might need extra data
workspace

Binary Translation Problems

Asanovic/Devadas

Spring 2002

6.823

Branch and Jump targets


guest code:
j L1
...
L1: lw r1, (r4)
jr (r1)
native code
j
translation

native jump at end of


block jumps to native
translation of lw

lw
translation
jr
translation

Where should the jump register go?

PC Mapping Table

Asanovic/Devadas
Spring 2002
6.823

Table gives translated PC for each hosted PC


Indirect jumps translated into code that looks in
table to find where to jump to
can optimize well-behaved guest code for subroutine
call/return by using native PC in return links

If can branch to any guest PC, then need one


table entry for every instruction in hosted
program big table
If can branch to any PC, then either
limit inter-instruction optimizations
large code explosion to hold optimizations for each possible
entry into sequential code sequence

Only minority of guest instructions are indirect


jump targets, want to find these
highly structured VM design
run-time feedback of where targets were

Binary Translation Problems

Asanovic/Devadas

Spring 2002

6.823

Self-modifying code!
sw r1, (r2)

# r2 points into code space

Rare in most code, but has to be handled if allowed


by guest ISA
Usually handled by including interpreter and
marking modified code pages as interpret only
Have to invalidate all native branches into modified
code pages

Binary Translation, Take 2

Asanovic/Devadas
Spring 2002
6.823

Executable
on Disk
Guest
ISA Data

Executable
on Disk
Guest
ISA
Data
Guest
ISA
Code

Keep copy
of code and
data in
native data
segment

Guest
ISA Code
PC
Mapping
Table

Translate to
native ISA code

Native
ISA Code
Native
Emulator

Mapping table used


for indirect jumps and
to jump from emulator
back into native
translations
Translation has to
check for modified
code pages then jump
to emulator
Emulator used for
run-time modified
code, checks for
jumps back into
native code using PC
mapping table

IBM System/38 and AS/400

Asanovic/Devadas

Spring 2002

6.823

System/38 announced 1978, AS/400 is follow-on line


High-level instruction set interface designed for binary
translation
Memory-memory style instruction set, never directly
executed by hardware
User Applications
Languages,
Database,
Utilities

Control
Program
Facility

High-Level
Architecture Interface
Vertical Microcode
Used 48-bit CISC
engine in earlier
machines

Horizontal Microcode
Hardware Machine

Replaced by modified
PowerPC cores in
newer AS/400 machines

Dynamic Translation

Asanovic/Devadas

Spring 2002

6.823

Translate code sequences as needed at run-time,


but cache results
Can optimize code sequences based on dynamic
information (e.g., branch targets encountered)
Tradeoff between optimizer run-time and time
saved by optimizations in translated code
Technique used in Java JIT (Just-In-Time)
compilers
Also, Transmeta Crusoe for x86 emulation

Transmeta Crusoe

Asanovic/Devadas

Spring 2002

6.823

Converts x86 ISA into internal native VLIW format


using software at run-time Code Morphing
Optimizes across x86 instruction boundaries to
improve performance
Translations cached to avoid translator overhead
on repeated execution
Completely invisible to operating system looks
like x86 hardware processor

Transmeta VLIW Engine

Asanovic/Devadas

Spring 2002

6.823

Two VLIW formats, 64-bit and 128-bit, contains 2


or 4 RISC-like operations
VLIW engine optimized for x86 code emulation
evaluates condition codes the same way as x86
has 80-bit floating-point unit
partial register writes (update 8 bits in 32 bit register)

Support for fast instruction writes


run-time code generation important

Two different VLIW implementations, low-end


TM3120, high-end TM5400
native ISA differences invisible to user, hidden by translation
system

Asanovic/Devadas
Spring 2002
6.823

Crusoe System

Crusoe
Boot
Flash
ROM

Inst. Cache

VLIW Processor

Portion of system DRAM is


used by Code Morph
software and is invisible to
x86 machine

Data Cache

Compressed
compiler held in
boot ROM

Crusoe CPU

Code Morph
Compiler Code
(VLIW)
Translation
Cache (VLIW)
Workspace
Code Morph DRAM
System DRAM

x86 DRAM

x86 BIOS
Flash

Transmeta Translation

Asanovic/Devadas
Spring 2002
6.823

x86 code:
addl %eax, (%esp) # load data from stack, add to eax

addl %ebx, (%esp) # load data from stack, add to ebx

movl %esi, (%ebp) # load esi from memory

subl %ecx, 5

# sub 5 from ecx

first step, translate into RISC ops:


ld %r30, [%esp]

# load from stack into temp

add.c %eax, %eax, %r30 # add to %eax, set cond.codes

ld %r31, [%esp]

add.c %ebx, %ebx, %r31

ld %esi, [%ebp]

sub.c %ecx, %ecx, 5

Compiler Optimizations

Asanovic/Devadas
Spring 2002
6.823

RISC ops:
ld %r30, [%esp]

# load from stack into temp

add.c %eax, %eax, %r30 # add to %eax, set cond.codes

ld %r31, [%esp]

add.c %ebx, %ebx, %r31

ld %esi, [%ebp]
sub.c %ecx, %ecx, 5

Optimize:
ld %r30, [%esp]

# load from stack only once

add %eax, %eax, %r30


add %ebx, %ebx, %r30

# reuse data loaded earlier

ld %esi, [%ebp]
sub.c %ecx, %ecx, 5

# only this cond. code needed

Scheduling

Asanovic/Devadas
Spring 2002
6.823

Optimized RISC ops:


ld %r30, [%esp]

# load from stack only once

add %eax, %eax, %r30


add %ebx, %ebx, %r30

# reuse data loaded earlier

ld %esi, [%ebp]
sub.c %ecx, %ecx, 5

# only this cond. code needed

Schedule into VLIW code:


ld %r30, [%esp]; sub.c %ecx, %ecx, 5

ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30

Translation Overhead

Asanovic/Devadas

Spring 2002

6.823

Highly optimizing compiler takes considerable


time to run, adds run-time overhead
Only worth doing for frequently executed code
Translation adds instrumentation into translations
that counts how often code executed, and which
way branches usually go
As count for a block increases, higher
optimization levels are invoked on that code

Exceptions

Asanovic/Devadas
Spring 2002
6.823

Original x86 code:


addl %eax, (%esp) # load data from stack, add to eax

addl %ebx, (%esp) # load data from stack, add to ebx

movl %esi, (%ebp) # load esi from memory

subl %ecx, 5

# sub 5 from ecx

Scheduled VLIW code:


ld %r30, [%esp]; sub.c %ecx, %ecx, 5

ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30

x86 instructions executed out-of-order with respect to


original program flow
Need to restore state for precise traps

Shadow Registers and Store


Buffer

Asanovic/Devadas

Spring 2002

6.823

All registers have working copy and shadow copy


Stores held in software controlled store buffer,
loads can snoop
At end of translation block, commit changes by
copying values from working regs to shadow
regs, and by releasing stores in store buffer
On exception, re-execute x86 code using
interpreter

Handling Self-Modifying Code

Asanovic/Devadas

Spring 2002

6.823

When a translation is made, mark the associated


x86 code page as being translated in page table
Store to translated code page causes trap, and
associated translations are invalidated

Multithreaded Processors

Page 1

Pipeline Hazards
LW r1, 0(r2)
LW r5, 12(r1)
ADDI r5, r5, #12
SW 12(r1), r5
Each instruction may depend on the next
Without bypassing, need interlocks
t0 t1 t2 t3 t4 t5 t6 t7 t8

LW r1, 0(r2)
LW r5, 12(r1)
ADDI r5, r5, #12
SW 12(r1), r5

t9 t10 t11 t12 t13 t14

F D X MW
F D D D D X MW
F F F F D D D D X MW
F F F F D D D D

Bypassing cannot completely eliminate interlocks


or delay slots

Page 2

Multithreading
How can we guarantee no dependencies between
instructions in a pipeline?
One way is to interleave execution of instructions
from different program threads on same pipeline

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe


t0 t1 t2 t3 t4 t5 t6 t7

F D X MW
T1: LW r1, 0(r2)
F D X M
T2: ADD r7, r1, r4
F D X
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
F D
T1: LW r5, 12(r1)
F

t8

W
MW
X MW
D X MW

t9

Last instruction
in a thread
always completes
writeback before
next instruction
in same thread
reads regfile
3

Page 3

CDC 6600 Peripheral Processors


(Cray, 1965)

First multithreaded hardware


10 virtual I/O processors
fixed interleave on simple pipeline
pipeline has 100ns cycle time
each processor executes one instruction every 1000ns
accumulator-based instruction set to reduce processor state

Page 4

Simple Multithreaded Pipeline

PC
PC
PC 1
PC 1
1
1

I$

IR

GPR1
GPR1
Y

D$

+1
2

Thread
select

Have to carry thread select down pipeline to ensure


correct state bits read/written at each pipe stage
5

Page 5

Multithreading Costs
Appears to software (including OS) as multiple
slower CPUs
Each thread requires its own user state
GPRs
PC

Also, needs own OS control state


virtual memory page table base register
exception handling registers

Other costs?
6

Page 6

Thread Scheduling Policies


Fixed interleave (CDC 6600 PPUs, 1965)
each of N threads executes one instruction every N
cycles
if thread not ready to go in its slot, insert pipeline bubble

Software-controlled interleave (TI ASC PPUs, 1971)


OS allocates S pipeline slots amongst N threads
hardware performs fixed interleave over S slots,
executing whichever thread is in that slot

Hardware-controlled thread scheduling (HEP, 1982)


hardware keeps track of which threads are ready to go
picks next thread to execute based on hardware priority
scheme
7

Software-controlled interleave, S > N. Wheel with the threads in each slot. If


thread is ready to go
It goes, else NOP, I.e., pipeline bubble.

Page 7

What Grain Multithreading?


So far assumed fine-grained multithreading
CPU switches every cycle to a different thread
When does this make sense?

Coarse-grained multithreading
CPU switches every few cycles to a different
thread
When does this make sense?

Page 8

Multithreading Design Choices

CPU
RF

L1 Inst.
Cache

L1 Data
Cache

Memory
Unified
L2
Cache

Memory
Memory
Memory

Context switch to another thread every cycle,


or on hazard or L1 miss or L2 miss or network
request
Per-thread state and context-switch overhead
Interactions between threads in memory
hierarchy

Managing interactions between threads.

Page 9

Denelcor HEP
(Burton Smith, 1982)

First commercial machine to use hardware threading in main


CPU
120 threads per processor
10 MHz clock rate
Up to 8 processors
precursor to Tera MTA (Multithreaded Architecture)

10

Page 10

10

Tera MTA Overview

Up to 256 processors
Up to 128 active threads per processor
Processors and memory modules populate a sparse
3D torus interconnection fabric
Flat, shared main memory
No data cache
Sustains one main memory access per cycle per processor

50W/processor @ 260MHz
11

SGI bought Cray, and Tera was a spin-off.


1997. Integer sort press release.

Page 11

11

MTA Instruction Format


Three operations packed into 64-bit instruction word
(short VLIW)
One memory operation, one arithmetic operation,
plus one arithmetic or branch operation
Memory operations incur ~150 cycles of latency
Explicit 3-bit lookahead field in instruction gives
number of subsequent instructions (0-7) that are
independent of this one
c.f. Instruction grouping in VLIW
allows fewer threads to fill machine pipeline
used for variable
- sized branch delay slots

Thread creation and termination instructions


12

Page 12

12

MTA Multithreading
Each processor supports 128 active hardware threads
128 SSWs, 1024 target registers, 4096 general-purpose
registers

Every cycle, one instruction from one active thread is


launched into pipeline
Instruction pipeline is 21 cycles long
At best, a single thread can issue one instruction
every 21 cycles
Clock rate is 260MHz, effective single thread issue rate is
260/21 = 12.4MHz
13

32 64-bit general-purpose registers (R0-R31)

unified integer/floating-point register set


R0 hard-wired to zero
8 64-bit branch target registers (T0-T7)

load branch target address before branch instruction


T0 contains address of user exception handler
1 64-bit stream status word (SSW)

includes 32-bit program counter


four condition code registers
floating-point rounding mode

Page 13
13

MTA Pipeline
Issue Pool

Inst Fetch

Write Pool

Memory Pool

Retry Pool

Interconnection Network
Memory pipeline
14

Memory unit is busy, or sync operation failed retry.


Just goes around the memory pipeline.

Page 14

14

Coarse-Grain Multithreading
Tera MTA designed for supercomputing
applications with large data sets and low locality
No data cache
Many parallel threads needed to hide large memory
latency

Other applications are more cache friendly


Few pipeline bubbles when cache getting hits
Just add a few threads to hide occasional cache
miss latencies
Swap threads on cache misses

15

Tera not very successful, 2 machines sold.


Changed their name back to Cray!

Page 15

15

MIT Alewife

Modified SPARC chips


register windows hold
different thread contexts

Up to four threads per node


Thread switch on local cache
miss

16

Page 16

16

IBM PowerPC RS64-III (Pulsar)


Commercial coarse-grain multithreading CPU
Based on PowerPC with quad-issue in-order fivestage pipeline
Each physical CPU supports two virtual CPUs
On L2 cache miss, pipeline is flushed and
execution switches to second thread
short pipeline minimizes flush penalty (4 cycles), small
compared to memory access latency
flush pipeline to simplify exception handling

17

Page 17

17

Speculative, Out-of-Order Superscalar


Processor
Out-of-Order
PC

Fetch

Decode &
Rename

Reorder Buffer

In-Order
Commit

In-Order
Physical Reg. File
Branch
ALU MEM
Unit
Execute

Store
Buffer

D$

18

Page 18

18

Superscalar Machine Efficiency


Issue width
Instruction
issue
Completely idle cycle
(vertical waste)
Time

Partially filled cycle,


i.e., IPC < 4
(horizontal waste)

Why horizontal waste?


Why vertical waste?

19

Page 19

19

Vertical Multithreading
Issue width
Instruction
issue
Second thread interleaved
cycle-by-cycle
Time
Partially filled cycle,
i.e., IPC < 4
(horizontal waste)

Cycle-by-cycle interleaving of second thread


removes vertical waste

20

Page 20

20

Ideal Multithreading for Superscalar


Issue width

Time

Interleave multiple threads to multiple issue slots


with no restrictions

21

Page 21

21

Simultaneous Multithreading
Add multiple contexts and fetch engines to
wide out-of-order superscalar processor
[Tullsen, Eggers, Levy, UW, 1995]

OOO instruction window already has most of


the circuitry required to schedule from
multiple threads
Any single thread can utilize whole machine

22

Page 22

22

Comparison of Issue Capabilities


Courtesy of Susan Eggers; Used with Permission

Illustrates SMT thread issue & execution & how differs


SS: only single thread; long latency instructions w. lots of instructions
dependent
FGMT: limited by amount of ILP in each thread, just as on the SS
MP: each processor issues instructions from its own thread
Example of one thread stalls or has little ILP
Performance

Page 23

From Superscalar to SMT

SMT is an out-of-order superscalar extended with


hardware to support multiple executing threads
Fetch
Unit

IQ
Renaming
Registers

Functional
Units

ICOUNT

Threads

24

no special HW for scheduling instructions from different threads onto


FUs
can use same ooo mechanism as superscalar for instruction issue:
RENAMING HW eliminates false dependences both within a thread (just
like a conventional SS) & between threads
MAP thread-specific architectural registers in all threads onto a

pool of physical registers

instructions are issued when operands available without regard

to thread

(scheduler not look at thread IDs)

thereafter called by their physical name

Page 24
24

From Superscalar to SMT

Extra pipeline stages for accessing thread-shared


register files
Fetch
Unit

IQ
Renaming
Registers

Functional
Units

ICOUNT

Threads

25

8*32 for the architecture state + 96 additional registers for register


renaming

Page 25

25

From Superscalar to SMT


Fetch from the two highest throughput threads.
Why?
Fetch
Unit

IQ
Renaming
Registers

Functional
Units

ICOUNT

Threads

26

Fetch unit that can keep up with the simultaneous multithreaded


execution engine
have the fewest instructions waiting to be executed
making the best progress through the machine
40% increase in IPC over RR

Page 26
26

From Superscalar to SMT

Small items
per-thread program counters
per-thread return stacks
per-thread bookkeeping for instruction retirement,
trap & instruction dispatch queue flush
thread identifiers, e.g., with BTB & TLB entries

27

none of small stuff endangers critical path


most mechanisms already exist; now duplicated for each thread or
implemented to apply only to 1 thread at a time
carry thread ID for retirement, trap, queue flush, not used for
scheduling
HW structure that points to all Is for each thread
need flush mechanism for branch misprediction
Fairly straightforward extension to OOO SS; this + n-fold performance
boost was responsible for the technology transfer to chip
manufacturers

Page 27
27

Simultaneous Multithreaded Processor


Rename Table 1
Rename Table 1
Rename Table 1

PC
PC
PC
PC
Fetch
1

Decode &
Rename

OOO execution unit


does not see thread
identifiers, only
physical register
specifiers

Reorder Buffer

Commit
Commit
Commit
Commit

Physical Reg. File


Branch
ALU MEM
Unit
Execute

Store
Buffer

D$

28

Page 28

28

SMT Design Issues


Which thread to fetch from next?
Dont want to clog instruction window with
thread with many stalls try to fetch from
thread that has fewest insts in window

Locks
Virtual CPU spinning on lock executes many
instructions but gets nowhere add ISA
support to lower priority of thread spinning on
lock

29

Page 29

29

Intel Pentium-4 Xeon Processor


Hyperthreading == SMT
Dual physical processors, each 2-way SMT
Logical processors share nearly all resources
of the physical processor
Caches, execution units, branch predictors

Die area overhead of hyperthreading ~ 5%


When one logical processor is stalled, the
other can make progress
No logical processor can use all entries in
queues when two threads are active

A processor running only one active software


thread to run at the same speed with or
without hyperthreading
30

Load-store buffer in L1 cache doesnt behave like that, and hence 15%
slowdown.

Page 30

30

Parallel Processors

Page 1

Parallel Processing: The Holy Grail

Use multiple processors to improve runtime of a


single task
technology limits speed of uniprocessor
economic advantages to using replicated processing units

Preferably programmed using a portable high-level


language
When are two heads better than one?

Jigsaw puzzle analogy

Page 2

Motivating Applications
Weather forecasting
Climate modeling
Material science
Drug design
Computational genomics
And many more
3

Number crunching apps


Claim that weather forecasts are accurate over 5 days, and would like to
improve that.
Nonlinear phenomena.

Page 3
3

Flynns Classification (1966)


Broad classification of parallel computing systems
based on number of instruction and data streams
SISD: Single Instruction, Single Data
conventional uniprocessor

SIMD: Single Instruction, Multiple Data


distributed memory SIMD (MPP, DAP, CM-1&2, Maspar)
shared memory SIMD (STARAN, vector computers)

MIMD: Multiple Instruction, Multiple Data


message passing machines (Transputers, nCube, CM-5)
non-cache-coherent SMPs (BBN Butterfly, T3D)
cache-coherent SMPs (Sequent, Sun Starfire, SGI Origin)

MISD: Multiple Instruction, Single Data


no commercial examples

MISD, check that a number is prime, do a divide on the same number with
Processor number I.

Page 4

Potential of the four classes


+
A
B

SISD

+, *
A
B

MISD

+
A, B
C, D

SIMD

+, *
A, B
C, D

MIMD

A+B

A+B
A*B
A+B
C+D
A+B
C*D
5

Page 5

SIMD Architecture
Central controller broadcasts instructions to
multiple processing elements (PEs)
Inter-PE Connection Network

Array
Controller
PE

PE

PE

PE

PE

PE

PE

PE

M
e
m

M
e
m

M
e
m

M
e
m

M
e
m

M
e
m

M
e
m

M
e
m

Control
Data

Only requires one controller for whole array


Only requires storage for one copy of program
All computations fully synchronized

Page 6

SIMD Machines
Illiac IV (1972)
64 64-bit PEs, 16KB/PE, 2D network
ICL DAP (Distributed Array Processor) (1980)
4K bit-serial PEs, 512B/PE, 2D network
Thinking Machines Connection Machine CM-1 (1985)
64K bit-serial PEs, 512B/PE, 2D + hypercube router
CM-2: 2048B/PE, plus 2,048 32-bit floating-point units
Maspar MP-1 (1989)
16K 4-bit processors, 16-64KB/PE, 2D-mesh + Xbar
MP-2: 16K 32-bit processors, 64KB/PE

Also shared memory SIMD vector supercomputers


TI ASC (71), CDC Star- 100 (73), Cray
- 1(76)
7

Illiac IV: $31M instead of $8M paid by DARPA, Illiaction


Glypnir language: Norse mythological rope that controls a vicious animal

Page 7

SIMD Today
Distributed memory SIMD failed as large-scale
general-purpose computer platform
Why?

Vector supercomputers (shared memory SIMD) still


successful in high-end supercomputing
Reasonable efficiency on short vector lengths (10-100)
Single memory space

Distributed memory SIMD popular for special purpose


accelerators, e.g., image and graphics processing
Renewed interest for Processor-in-Memory (PIM)
Memory bottlenecks put simple logic close to memory
Viewed as enhanced memory for conventional system
Need merged DRAM + logic fabrication process
Commercial examples, e.g., graphics in Sony Playstation-2

Required huge quantities of data parallelism (> 10000 elements)


Required programmer-controlled distributed data layout.

Page 8

MIMD Machines
Message passing, distributed memory

Thinking Machines CM-5


Intel Paragon
Meiko CS-2
many cluster systems (e.g., IBM SP-2, Linux Beowulfs)

Shared memory
no hardware cache coherence

IBM RP3
BBN Butterfly
Cray T3D/T3E
Parallel vector supercomputers (Cray T90, NEC SX-5)

hardware cache coherence


many small-scale SMPs (e.g. Quad Pentium Xeon systems)
large scale bus/crossbar-based SMPs (Sun Starfire)
large scale directory-based SMPs (SGI Origin)

Thinking machines, interconnect network a FAT-TREE.


Rp3: research parallel processor prototype
BBN: Bolt Beranek and Newman, A spiffy interconnect will not by itself
Allow the construction of a supercomputer out of common household
appliances.
Problem was the commodity processors did not support multiple outstanding
Memory requests. So the remote memory access killed off performance,
unless
Excellent data layout was achieved. So it lost to message-passing networks.

Page 9
9

Summing Numbers
On a sequential computer
Sum = a[0]
for(i = 0; i < m; i++)
Sum = Sum + a[i]
(m) complexity
Have N processors adding up m/N numbers
Shared memory
Global-sum = 0
for each processor {
local-sum = 0
Calculate local-sum of m/N numbers
Lock
Global-sum = Global-sum + local-sum
Unlock
}
Complexity ?

10

Theta(m/N) + Theta(N)

Theta(N) comes from summing N numbers in series.

Page 10

10

Summing Numbers Distributed Memory


Assume a square mesh of processors
Each processor computes the local sum of its
m/N numbers
Each processor passes its local sum to
another processor in a coordinated way
The global sum in finally in processor P11
P11

P12

P13

P11

P12

P13

P11

P12

P13
j

11

Page 11

11

Complexity
P11

P12

P13

P11

P12

P13

P11

P12

P13

How many additions?


How many communications?
Total time complexity?
12

Sqrt(N) 1 additions, sqrt(N) 1 communications in each direction


Theta(m/N) + Theta(sqrt(N)) + C, C is cost of communication

Page 12

12

Message Passing MPPs


(Massively Parallel Processors)

Initial Research Projects


Caltech Cosmic Cube (early 1980s) using custom Mosaic
processors

Commercial Microprocessors including MPP Support


Transputer (1985)
nCube-1(1986) /nCube-2 (1990)

Standard Microprocessors + Network Interfaces


Intel Paragon (i860)
TMC CM-5 (SPARC)
IBM SP-2 (RS/6000)

Interconnect Network

NI

NI

NI

NI

NI

NI

NI

NI

MPP Vector Supers


Fujitsu VPP series
Designs scale to 100s-10,000s
of nodes

Mem Mem Mem Mem Mem Mem Mem Mem 13

Read up on Cosmic Cube, transputer, CM5 other machines.

Page 13

13

Message Passing MPP Problems

All data layout must be handled by software


cannot retrieve remote data except with message
request/reply

Message passing has high software overhead


early machines had to invoke OS on each message
(100s
- 1ms/message)
even user level access to network interface has
dozens of cycles overhead (NI might be on I/O bus)
Cost of sending messages?
Cost of receiving messages?
14

Sending can be cheap (like stores)

Receiving is expensive, need to poll or interrupt.

Page 14

14

Shared Memory Machines


Two main categories with respect to caches
non cache coherent
hardware cache coherent

Will work with any data placement (but might be slow)


can choose to optimize only critical portions of code

Load and store instructions used to communicate


data between processes
no OS involvement
low software overhead

Usually some special synchronization primitives


fetch&op
load linked/store conditional

In large scale systems, logically shared memory is


implemented as physically distributed modules
15

Page 15

15

Shared Memory Machines

Exclusive Read, Exclusive Write (EREW)


Concurrent Read, Exclusive Write (CREW)
Exclusive Read, Concurrent Write (ERCW)
Concurrent Read, Concurrent Write (CRCW)

Need to deterministically specify the contents


of a memory location after concurrent write
Assign priorities to processors and store value
from processor with highest priority
Allow concurrent writes if the values being
written are the same
Max, min, average, or sum of values is stored
(numeric applications)
16

Page 16

16

Searching
N processors search a list S = {L1, L2, , Lm}
for the index of an item x
Assume x can appear many times and any
index will do

Step 1: for i = 1 to N in parallel


read x
Step 2: Sequentially search through sub-list
Si with m/N numbers
Return Ki = -1 if x not in Si, else index
Step 3: for i = 1 to N in parallel
if (Ki > 0) k = Ki
17

Page 17

17

Complexity
What is the complexity for the different shared
memory schemes?
Step 1
EREW: O(N)

Step 2
O(m/N)

Step 3
Total
O(N) O(N) + O(m/N)

CREW:
ERCW:
CRCW:
18

Page 18

18

Cray T3E
Up to 2,048 675MHz Alpha 21164
processors connected in 3D torus

Each node has 256MB-2GB local DRAM memory


Load and stores access global memory over network
Only local memory cached by on-chip caches
Alpha microprocessor surrounded by custom shell circuitry to
make it into effective MPP node. Shell provides:
external copy of on-chip cache tags to check against remote
writes to local memory, generates on-chip invalidates on match
address management to allow all of external physical memory
to be addressed
atomic memory operations (fetch&op)
support for hardware barriers/eureka to synchronize parallel
tasks
19

Read up on no hardware coherence for the Cray T3E, eureka?

Page 19

19

Bus-Based Cache-Coherent SMPs


P

$
Bus

Central
Memory

Small scale (<= 4 processors) bus-based SMPs by far the


most common parallel processing platform today
Bus provides broadcast and serialization point for simple
snooping cache coherence protocol
Modern microprocessors integrate support for this protocol
20

Page 20

20

Sun Starfire (UE10000)


Up to 64-way SMP using bus-based snooping protocol
P

Board Interconnect

Board Interconnect

4 processors + memory
module per system
board
Uses 4 interleaved address
busses to scale snooping
protocol

16x16 Data Crossbar

Memory
Module

Memory
Module

Separate data
transfer over
high bandwidth
crossbar

21

Interleaved multiple address busses. Hardest part of scaling up an SMP


system is its coherence bandwidth.
Read Starmicro paper.

Page 21

21

SGI Origin 2000


Large scale distributed directory SMP
Scales from 2 processor workstation
to 512 processor supercomputer

Node contains:
Two MIPS R10000 processors plus caches
Memory module including directory
Connection to global network
Connection to I/O

Scalable hypercube switching network


supports up to 64 two-processor nodes (128
processors total)
(Some installations up to 512 processors)

Read up on directories, and compare to snoopy protocols.


SGI origin paper.

Page 22

22

Diseconomies of Scale
Few customers require the largest machines
much smaller volumes sold
have to amortize development costs over smaller number of
machines

Different hardware required to support largest


machines

dedicated interprocessor networks for message passing MPPs


T3E shell circuitry
large backplane for Starfire
directory storage and routers in SGI Origin

Large machines cost more per processor than small


machines!
23

Page 23

23

Trends in High-End Server CPUs

To other
processing
nodes
Processor
Chip

Net I/F

CPU
1

Coherence
Engine

CPU
N
Shared L2

Mem. Cntl.

DRAM controllers on chip (UltraSPARC-III, Alpha 21364)


reduce main memory latency
increase memory bandwidth
On-chip network routers (Alpha 21364, Power-4)
Cheaper/faster connectivity for cache coherence traffic
Multiple processors on one chip
Chip multiprocessor systems (IBM Power-4)
Simultaneous multithreading (Pentium-4 Xeon)

Local
Off-chip
DRAM

24

INTEGRATION in VLSI. DRAM controllers on chip to reduce main memory


latency (250ns to 170 ns in Ultrasparc III) and
Increase memory bandwidth 12 GB/s over 8 Rambus channels in 21364.

Page 24

24

Clusters and Networks of Workstations


Connect multiple complete machines together using
standard fast interconnects
Little or no hardware development cost
Each node can boot separately and operate
independently
Interconnect can be attached at I/O bus (most common)
or on memory bus (higher speed but more difficult)

Clustering initially used to provide fault tolerance


Clusters of SMPs (CluMPs)
Connect multiple n-way SMPs using a cachecoherent memory bus, fast message passing
network or non cache-coherent interconnect
Build message passing MPP by connecting
multiple workstations using fast interconnect
connected to I/O Bus. Main advantage?
25

HP Exemplar cache coherent, SGI Power Challenge Array, Vector


supercomputers NEC SX-5
Ucberkeley NOW, 100 Sun workstations connected by a Myrinet network.
What is Myrinet?
Sold commercially by IBM (SP-2), and Cray selling networks of Alphas.

Page 25
25

The Future?

26

Page 26

26