
CMPE 200

Lecture 12. Multi-Processors
(P&H 5.10, 6.5, H&P 5.1-5.5)

Hyeran Jeon

SJSU SAN JOSÉ STATE UNIVERSITY

Recap of previous lecture
• An application, at compile time, doesn't know the size of the underlying main memory
• Main memory may be too small to hold all of the code/data needed to run an application
  – Some parts of the code are loaded into main memory
  – Other parts of the code may remain on disk

[Figure: the 4 GB virtual memory space seen by an application (0x00000000-0xFFFFFFFF), mapped onto 256 MB of DRAM (0x00000000-0x0FFFFFFF)]
Pages vs. Segments
• The unit of memory to fetch from storage
  – similar to the concept of "block" in cache
• A page has a fixed size
  – Most systems use paging
• A segment has a flexible size
  – Some systems use paged segments
    • Code and data are stored in segment units, but each segment is divided into pages

[Figure: a memory space holding code/data in segments, each segment divided into pages]

                          Page                      Segment
  Addressing fields       One (offset)              Two (segment base and offset)
  Programmer visible?     Invisible                 May be visible
  Replacing a block       Trivial                   Hard
  Memory use efficiency   Internal fragmentation    External fragmentation
  Efficient disk traffic  Yes                       Not always
Translation Using a Page Table
• A page table in main memory
  – accessed to get the physical address of a given virtual address
• Page table register
  – contains the base address of the page table
• Table entry address =
  Page table register value + Virtual page number x Page table entry size in bytes
• Metadata per entry
  – RWX : Read/Write/eXecute
  – V : Valid (if valid, the page is in memory; otherwise, page fault exception!)
  – M : Modified
  – R : Referenced (used for page replacement)

[Figure: a 32-bit virtual address split into a virtual page number (bits 31-12) and page offset (bits 11-0); the virtual page number, scaled by the entry size and added to the page table register, selects a page table entry holding the metadata bits and a physical page number (bits 29-12), which is concatenated with the page offset to form a 30-bit physical address]
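The entry-address formula and the concatenation step can be sketched as follows. This is a minimal single-level walk with made-up values; the 4-byte entry size, base address, and mappings are assumptions for illustration.

```python
# Sketch of single-level page table translation:
# entry address = page table register + VPN * entry size; the PPN from the
# entry is concatenated with the 12-bit page offset.

PAGE_OFFSET_BITS = 12
ENTRY_SIZE = 4  # bytes per page table entry (assumed)

def translate(vaddr, page_table_base, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    entry_addr = page_table_base + vpn * ENTRY_SIZE  # where the PTE lives
    ppn, valid = page_table[vpn]                     # PTE contents: (PPN, V bit)
    if not valid:
        raise Exception("page fault")                # V == 0: page not in memory
    return entry_addr, (ppn << PAGE_OFFSET_BITS) | offset

# Hypothetical mapping: VPN 0x12 -> PPN 0x345, table base at 0x100000
table = {0x12: (0x345, True)}
entry_addr, paddr = translate(0x00012ABC, 0x100000, table)
# entry_addr = 0x100000 + 0x12 * 4 = 0x100048; paddr = 0x345ABC
```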
Translation Using a Hierarchical Page Table
• A multi-level page table can effectively reduce the page table size
• Each level's page table maintains the base addresses of the next-level page tables
• For each entry in the level-N page table, there can be up to one level-N+1 page table
  – If a memory region is never accessed, the corresponding page table doesn't need to be allocated
• A lower-level page table is allocated only when the corresponding entry in the upper-level page table is accessed

[Figure: a 32-bit virtual address split into three index fields V1 (bits 31-24), V2 (bits 23-18), and V3 (bits 17-12) plus a page offset (bits 11-0); each field, multiplied by the table entry size, indexes one step of the walk (page table register -> level 1 table -> level 2 table -> level 3 table), and the final entry supplies the physical page number (bits 29-12) that is concatenated with the page offset]
Example
• We use a three-level page table and the virtual address fields are like below:

  [41-32: V1 | 31-22: V2 | 21-12: V3 | 11-0: Page offset]

• What is the address coverage of an entry in each level's page table?
  – third-level: 2^12 bytes (one page)
  – second-level: 2^22 bytes = 2^10 entries per third-level page table x third-level entry coverage (2^12)
  – first-level: 2^32 bytes = 2^10 entries per second-level page table x second-level entry coverage (2^22)
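The coverage arithmetic for a 10/10/10/12 split can be checked mechanically: each entry one level up covers 2^width entries of the level below it.

```python
# Coverage of one entry at each level for a 10/10/10/12 virtual address split.
OFFSET_BITS = 12

coverages = {}
cov = 1 << OFFSET_BITS                 # a level-3 entry maps one 2^12-byte page
for level, width in [(3, 10), (2, 10), (1, 10)]:
    coverages[level] = cov
    cov <<= width                      # an entry one level up covers 2^width of these

# coverages: level 3 -> 2^12 bytes, level 2 -> 2^22 bytes, level 1 -> 2^32 bytes
```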
Translation Lookaside Buffer (TLB)
• Page tables are stored in main memory
• At least two memory accesses per memory access by a program
  – One for obtaining the physical address and a second for getting the data

• How can we improve the performance?

• Let's exploit the locality here too!
  – When a translation for a virtual page number is used, it will probably be needed again in the near future
  – The references to the words on that page have both temporal and spatial locality

→ Cache the physical addresses of the frequently accessed pages (Translation Lookaside Buffer)
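The idea above, caching recent translations so the page table walk is skipped on a hit, can be sketched as follows. This is an unbounded dictionary standing in for a real TLB, which is a small set-associative hardware structure; the page table contents are made up.

```python
# Minimal TLB sketch in front of a (single-level, assumed) page table.

PAGE_OFFSET_BITS = 12

class TLB:
    def __init__(self, page_table):
        self.entries = {}              # VPN -> PPN: the cached translations
        self.page_table = page_table
        self.hits = self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_OFFSET_BITS
        offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
        if vpn in self.entries:        # TLB hit: no page table walk needed
            self.hits += 1
        else:                          # TLB miss: walk the table, then cache it
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]
        return (self.entries[vpn] << PAGE_OFFSET_BITS) | offset

tlb = TLB({0x12: 0x345})               # hypothetical mapping
tlb.translate(0x12000)                 # miss: first touch of the page
tlb.translate(0x12ABC)                 # hit: same page, translation cached
# temporal/spatial locality on the page turns the second walk into a hit
```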
TLB and Cache Interaction
• If the cache tag uses the physical address
  – Need to translate before the cache lookup

[Figure: the virtual page number (bits 31-12) is translated into a physical page number (bits 29-12); the resulting 30-bit physical address is then split into cache tag (bits 29-8), index (bits 7-5), and block offset (bits 4-0) for the cache lookup; a tag match means a hit and the data is returned]
Multiprocessor
• ILP is an optimization within a processor (or core)
• Parallelism across processor units → Multiprocessor!
  – Multiple CPUs, or multiple cores within a CPU
• Two representative types:
  – Shared-memory (SMP/CMP): shared data management is a key issue
  – Message-passing: explicit exchange of data among processors that use their own memory

[Figure: a shared-memory multiprocessor (SMP) with several CPUs sharing one memory, and a quad-core chip-multiprocessor (CMP) enabled by Moore's Law]
Cache Coherence
• A multiprocessor is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order

• Write propagation
  – Writes are visible to other processors
• Write serialization
  – All writes to the same location are seen in the same order by all processors

• Simple solution: inform all copies on every update!
FSM for Cache Blocks
• Each cache block's status can be represented by a finite state machine

• e.g., a simple protocol for write-through caches

• Valid (V): This copy is up-to-date
• Invalid (I): This copy is stale because a remote copy (in another processor's cache) was modified
• One "Valid" bit per cache block
• Transition labels read "action of this cache / message sent to bus" or "message received from bus / --"

[FSM: in V, PrRd/--, PrWr/BusWr, and BusRd/-- are self-loops; BusWr/-- moves V to I (a remote copy was modified). In I, BusRd/--, BusWr/--, and PrWr/BusWr are self-loops; PrRd/BusRd moves I to V (this processor reads an up-to-date copy).]
Write-back Caches: MSI-Invalidate Protocol
• A write-back cache updates memory only when the block is replaced
  – When a processor updates a block, the other processors should know their copies are stale → Invalidate
• Block States:
  – Invalid (I)
    • This copy is stale
  – Shared (S)
    • One or more caches are sharing the same clean copy
    • Memory is clean
  – Modified (M)
    • There is one cache that has the most up-to-date copy of the block
    • Other caches holding the same block have old values (in the Invalid state)
    • Memory is stale

[FSM: M self-loops on PrRd/-- and PrWr/--; BusRd/Flush moves M to S; BusRdX/Flush moves M to I.
 S self-loops on PrRd/-- and BusRd/--; PrWr/BusUpgr moves S to M; BusRdX/-- and BusUpgr/-- move S to I.
 I self-loops on BusRd/--, BusUpgr/--, and BusRdX/--; PrRd/BusRd moves I to S; PrWr/BusRdX moves I to M.]
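The MSI transitions can be written as two lookup tables, one for processor actions and one for snooped bus messages; this is a sketch of the protocol for a single block, not a full bus model.

```python
# MSI state transitions for one block in one cache.

PROC = {  # (state, processor action) -> (next state, bus message issued)
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusUpgr"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}
SNOOP = {  # (state, snooped bus message) -> (next state, data response)
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRdX"): ("I", None),
    ("S", "BusUpgr"): ("I", None),
}

def proc(state, action):   return PROC[(state, action)]
def snoop(state, message): return SNOOP.get((state, message), (state, None))

# A reads (I->S), A writes (S->M, broadcasts BusUpgr), B's copy is invalidated.
a, msg = proc("I", "PrRd")     # a == "S"
a, msg = proc(a, "PrWr")       # a == "M", msg == "BusUpgr"
b, _ = snoop("S", msg)         # b == "I": B's stale copy is invalidated
```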
The Problem with MSI-Invalidate
• Typically many blocks are private (not shared with any other caches)
• On a read, the block immediately goes to the S state even though it may be the only cached copy (i.e., no other processor will cache it)

• Why is this a problem?
  – Whenever a cache that has the block wants to write to it, the cache must broadcast "invalidate" even though it is the only cache holding a copy of the block.
  – How can we reduce this unnecessary broadcasting overhead?
    • In the next class...
Solution: MESI
• Add a new state indicating that this is the only cached copy of the block and it is clean
  – Exclusive (E) state
    • There is only one cache that holds the block
    • Memory is clean

• A block is placed into the E state if no other cache holds the same block
  – A "Shared" signal on the bus detects whether the copy is unique; snooping caches assert the signal if they also have a copy
  – On a read miss, go to E or S depending on the value returned on the shared line
  – A silent transition from E to M is possible on a write

• Mark S. Papamarcos and Janak H. Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA'84
• Employed in Intel, PowerPC, and MIPS processors
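The two MESI-specific decisions, choosing E vs. S on a read miss and writing silently from E, can be sketched as follows; this models only the requesting cache's side.

```python
# MESI sketch: on a read miss the requester goes to E if the bus "Shared"
# line stays low, else to S; an E -> M write is silent (no bus message),
# avoiding MSI's unnecessary BusUpgr broadcast for private blocks.

def read_miss(shared_line_asserted):
    """State after PrRd/BusRd, chosen by the snooped Shared signal."""
    return "S" if shared_line_asserted else "E"

def write(state):
    """Processor write: returns (next state, bus message or None)."""
    if state == "E":
        return "M", None          # silent transition: we hold the only copy
    if state == "S":
        return "M", "BusUpgr"     # others must invalidate their copies
    return state, None            # already M: write hit, nothing to do

s, msg = write(read_miss(shared_line_asserted=False))
# private data: read miss -> E, then write -> M with no bus traffic at all
```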
MESI
• Red-colored transitions: transitions unique to MESI compared to MSI

[FSM: M self-loops on PrRd/-- and PrWr/--; BusRd/Flush moves M to S; BusRdX/Flush moves M to I.
 E self-loops on PrRd/--; PrWr/-- moves E to M silently (silent transition on write); BusRd/-- moves E to S; BusRdX/-- moves E to I.
 S self-loops on PrRd/-- and BusRd/--; PrWr/BusUpgr moves S to M; BusRdX/-- and BusUpgr/-- move S to I.
 I self-loops on BusRd/--, BusUpgr/--, and BusRdX/--; PrWr/BusRdX moves I to M; on PrRd/BusRd, go to E when the "Shared" signal is zero (no other cache holds the block), otherwise go to S.]
Example 1
Example: Data block X is only in memory. Processor A reads block X. Then, Processor B reads block X.

[Processor A] Read on X
  Action: PrRd (Shared signal = 0) / BusRd
  Transition: I (initial) → E

[Processor B] Read on X
  Action(B): PrRd (Shared signal = 1) / BusRd
  Transition(B): I (initial) → S
  Action(A): BusRd/--
  Transition(A): E → S

[Figure: both processors' MESI FSMs with their caches connected by a bus to memory holding X; after both reads, each cache holds a copy of X in state S]
Example 2
Example: Data block X is only in memory. Processor A reads then writes block X. Then, Processor B reads block X.

[Processor A] Read on X
  Action: PrRd (Shared signal = 0) / BusRd
  Transition: I (initial) → E

[Processor A] Write on X
  Action: PrWr/-- (silent transition)
  Transition: E → M

[Processor B] Read on X
  Action(B): PrRd (Shared signal = 1) / BusRd
  Transition(B): I (initial) → S
  Action(A): BusRd/Flush
  Transition(A): M → S

[Figure: both processors' MESI FSMs; after the sequence, A and B each hold X in state S, and memory was updated by A's Flush]
Snoopy Invalidation Tradeoffs
• Cache-to-cache vs. memory-to-cache transfer
  – On a BusRd, should the data come from another cache or from memory?
  – Another cache
    • might be faster if memory is slow or highly contended
    • but what if several caches share the same block? Who will provide the block?
  – Memory
    • would be simpler because there is no need to identify who (among several sharing caches) will provide the block
    • requires a writeback on the M → S transition

• Writeback on M → S
  – Is this necessary? What if the block is updated multiple times by several caches for a while? Do we need to update memory on every update of the block?
Solution: MOESI
• Add a new state indicating the owner of a block
  – Owner (O) state
    • There can be only one cache in the O state for each block
      – All others must hold the data in the S state
    • Provides the up-to-date copy of the block to requesting caches (cache-to-cache transfer)
    • Memory writeback only when the block is evicted
• The cache in state E or M is implicitly the owner of the block
  – On BusRd, the cache in state E or M moves to the O state
• The S state in MOESI is "Shared and potentially dirty"
  – The block may stay dirty until the one owned copy is evicted
• AMD Opteron uses MOESI

[Figure: the five MOESI states classified along three axes — Valid (M, O, E, S) vs. Invalid (I); Modified/dirty (M, O) vs. Clean (E, S); Shared (O, S) vs. Not shared (M, E)]
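The owner's responsibilities can be sketched as two small rules: answer snooped reads with a cache-to-cache Flush while keeping ownership, and write memory back only when a dirty owned copy is evicted. This follows the state list above, not any particular vendor's implementation.

```python
# MOESI ownership sketch for one block in one cache.

def snoop_bus_rd(state):
    """Snooped BusRd: returns (next state, data supplier or None)."""
    if state in ("M", "E", "O"):      # E/M are implicit owners; O stays owner
        return "O", "Flush"           # cache-to-cache transfer, no memory update
    return state, None                # S or I: someone else responds

def evict(state):
    """Eviction: only a dirty copy (M or O) must be written back to memory."""
    return state in ("M", "O")

st, supplier = snoop_bus_rd("M")      # st == "O": owner keeps serving readers
# memory is touched only when the O (or M) copy is finally evicted
```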
MOESI
• Red-colored transitions: transitions unique to MOESI compared to MESI
• Flush here does not update memory unless the block is evicted → most block communication is cache-to-cache

[FSM: M moves to O on BusRd/Flush; any further request on the block is responded to by this cache. O self-loops on PrRd/-- and BusRd/Flush, moves to M on PrWr/BusUpgr, and moves to I on BusRdX/Flush (this cache loses ownership).
 E self-loops on PrRd/--; PrWr/-- moves E to M silently; BusRd/Flush moves E to O; BusRdX/Flush moves E to I.
 S self-loops on PrRd/-- and BusRd/--; PrWr/BusUpgr moves S to M; BusRdX/-- and BusUpgr/-- move S to I.
 I self-loops on BusRd/--, BusUpgr/--, and BusRdX/--; PrWr/BusRdX moves I to M; on PrRd/BusRd, go to E or S depending on the Shared signal.]
Multi-level Cache Hierarchy
• Processors typically have a multi-level cache hierarchy (i.e., L1, L2, ...)
• When a coherence request arrives, the caches at all levels should be checked → long latency

[Figure: three processors, each with private L1 and L2 caches, connected through an interconnection network to memory]
Multi-level Cache Hierarchy
• How can we reduce the latency of checking all levels of caches?
• Inclusive cache
  – Data in an upper-level cache is also maintained in the lower-level caches
  – Snoop only the lowermost-level cache
  – The lowermost cache needs to know when the upper-level caches have write hits
    • Use write-through to keep consistency among the caches
    • Or, use write-back but maintain a flag bit that indicates the dirtiness of the data
4C’s
• 3C’s
– Compulsory miss
– Capacity miss
– Conflict miss
• 4C’s
– 3C’s + Coherence miss
• Coherence miss: Cache misses caused by cache
coherency

SJSU SAN JOSÉ STATE


28 UNIVERSITY
Coherence Miss Example
Example (MSI protocol): Data block X is only in memory. Processor A reads block X. Then, Processor B writes block X. Lastly, Processor A reads block X.

[Processor A] Read on X
  Action: PrRd/BusRd
  Transition: I (initial) → S

[Processor B] Write on X
  Action(B): PrWr/BusRdX
  Transition(B): I (initial) → M
  Action(A): BusRdX/--
  Transition(A): S → I

[Processor A] Read on X — A had X in its cache but encounters a miss, as its copy of X is outdated
  Action(A): PrRd/BusRd
  Transition(A): I → S
  Action(B): BusRd/Flush
  Transition(B): M → S

[Figure: both processors' MSI FSMs; A's second read misses solely because of B's intervening write — a coherence miss]
Types of Coherence Misses
• Two types
  – True sharing: a cache miss caused by an update to a word in a cache block that your processor actually uses

  – False sharing: a cache miss caused by an update to a word in a cache block that your processor does not actually use
True sharing vs. False sharing
• Suppose that cache block X consists of two words
• Assume that we are using the MSI protocol

  Time  Event                   Block X in A (state)   Block X in B (state)   Result
        Initial                 Word1  Word2  (S)      Word1  Word2  (S)
    1   A writes Word1 (hit)    Word1' Word2  (M)      Word1  Word2  (I)
    2   B reads Word1 (miss)    Word1' Word2  (S)      Word1' Word2  (S)      miss due to true sharing
    3   A writes Word2 (hit)    Word1' Word2' (M)      Word1' Word2  (I)
    4   B reads Word1 (miss)    Word1' Word2' (S)      Word1' Word2' (S)      miss due to false sharing (B never reads Word2 but misses due to the updated Word2)
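The timeline above can be reproduced with a per-block MSI state in two caches: because both words live in the same block, A's write to Word2 still invalidates B's copy even though B only ever reads Word1.

```python
# Block-granularity MSI states for caches A and B sharing one two-word block.

def write(states, who, other):
    states[who] = "M"
    states[other] = "I"                  # invalidate the other cache's copy

def read(states, who, other):
    miss = states[who] == "I"            # invalid copy -> coherence miss
    if miss:
        states[who] = "S"
        if states[other] == "M":
            states[other] = "S"          # owner flushes; both end up shared
    return miss

st = {"A": "S", "B": "S"}
write(st, "A", "B")          # time 1: A writes Word1
m1 = read(st, "B", "A")      # time 2: B reads Word1 -> miss (true sharing)
write(st, "A", "B")          # time 3: A writes Word2, a word B never uses
m2 = read(st, "B", "A")      # time 4: B reads Word1 -> miss (false sharing)
```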
Final Exam
• Covers from Lecture 1 to 12
  – From the performance part of Lecture 1

• Format
  – Similar to Midterms 1 and 2
  – Calculator/pen/eraser are allowed

• Review
  – Lecture slides
  – Homework 1 to 6 solutions
  – Quiz solutions
  – Midterm solutions
Lecture 1
• Peak performance metric
  – MIPS
• Speedup Calculation
  – Speedup of X over Y = Execution time of Y / Execution time of X
  – Average speedup
• CPU Time
  = CPU Clock Cycles x Clock Cycle Time
  = Instruction Count x CPI x Clock Cycle Time
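A worked instance of the CPU time formula above, with made-up numbers:

```python
# CPU time = instruction count x CPI x clock cycle time (illustrative values).

insts = 2_000_000          # instructions executed (assumed)
cpi = 1.5                  # average cycles per instruction (assumed)
clock_hz = 1_000_000_000   # 1 GHz clock
cycle_time = 1 / clock_hz  # seconds per cycle

cpu_time = insts * cpi * cycle_time   # 3e6 cycles / 1e9 Hz = 0.003 s
```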
Lecture 1
• Average CPI vs. Average IPC
  – Arithmetic mean of CPI = (1/n) x (CPI_1 + CPI_2 + ... + CPI_n)
  – Harmonic mean of IPC = n / (1/IPC_1 + 1/IPC_2 + ... + 1/IPC_n)

• Amdahl's law
  – Make the common case fast
  – Overall Speedup = 1 / ((1 - f) + f/s), where f is the fraction of execution time that is enhanced and s is the speedup of that fraction
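Amdahl's law as stated above, with an illustrative fraction and speedup:

```python
# Overall speedup when a fraction f of execution time is sped up by factor s.

def amdahl(f, s):
    return 1 / ((1 - f) + f / s)

# speeding up 80% of the time by 4x: 1 / (0.2 + 0.8/4) = 2.5x overall,
# far below 4x — the unenhanced 20% limits the gain
speedup = amdahl(0.8, 4)
```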
Lecture 2
• ISA
  – The ISA is the interface between software and hardware
• ISA classification
  – CISC
    • Stack, Accumulator, Memory-Memory, Register-Memory
  – RISC
    • Register-Register (Load-Store)
    • e.g., MIPS
Lecture 2 - MIPS
• Address/data size
  – 4 bytes
• Endianness
  – Both big and little endian are supported, but we used little endian
• Instruction Types
  – R-type
  – I-type
    • LD/ST: sign-extension of the immediate
    • Branch: sign-extension + shift-left-2 of the immediate
  – J-type
    • shift-left-2 of the target field
Lecture 3
• Instructions go through the following steps
  – Fetch
    • Fetch the instruction from I-mem
  – Decode
    • Identify the opcode and read the operands
  – Execute
    • R-type : specified function
    • LD/ST : Add (address calculation)
    • Beq : Sub (comparison)
    • Jump : Nothing
  – Mem
    • Read/Write data from/to D-mem
  – Writeback
    • Update the register file
Lecture 4
• The entire datapath is broken into five pipeline stages
  – Shorter cycle period
  – Multiple cycles per instruction
    • 5 CPI in a 5-stage pipeline without overlap
  – But, can achieve CPI = 1
    • by overlapping multiple instructions

[Figure: the single-cycle datapath (CPI = 1, long cycle), the five-stage version without overlap (CPI = 5), and the pipelined version with overlapped instructions (CPI = 1)]
Lecture 4
• Pipeline stage registers
  – act as temporary registers storing intermediate results, allowing the previous stage to be reused by another instruction
Lecture 4
• Total execution cycles without any hazards
  – (K pipeline stages + N instructions - 1) cycles

• Hazards
  – Structural : HW organization (e.g., unified i- and d-memory)
  – Data : stall pipeline stages until the operand value becomes ready
    • caused by data dependencies
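The cycle-count formula above comes from filling the pipeline for K cycles, after which one instruction completes per cycle:

```python
# N instructions on a K-stage pipeline with no hazards: K + N - 1 cycles.

def total_cycles(k_stages, n_insts):
    return k_stages + n_insts - 1

# 8 instructions on the 5-stage MIPS pipeline: 5 + 8 - 1 = 12 cycles
cycles = total_cycles(5, 8)
```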
Lecture 5
• Dependencies among instructions
  – RAW (Read-After-Write) : true dependency → causes hazards
    • $1 of add → sub, $2 of sub → sll, $2 of sll → addi
  – WAW (Write-After-Write) : false dependency
    • $1 of add → addi, $2 of sub → sll
  – WAR (Write-After-Read) : false dependency
    • $2 of add → sub, $1 of sub → addi

  add  $1, $2, $3
  sub  $2, $1, $3
  sll  $2, $2, 3
  addi $1, $2, 5
  (the same code sequence illustrates the RAW, WAW, and WAR cases above)
Lecture 5
• As long as an earlier instruction's result is in the pipeline, the result value can be forwarded to the following instructions
• In arithmetic operations, no stall is needed
  – cf. LW needs one stall cycle even with forwarding

  ADD $t3,$t1,$t2
  SUB $t5,$t3,$t4
  XOR $t7,$t5,$t3

[Pipeline diagram: ADD, SUB, and XOR each enter the pipeline one cycle apart (IF ID EXE MEM WB) with no stalls; forwarding delivers $t3 and $t5 directly from the EXE/MEM stage outputs]
Lecture 5
• Forwarding from MEM/WB to EXE
• One cycle stall is needed even with forwarding

  LW  $t1,4($s0)
  ADD $t5,$t1,$t4

[Pipeline diagram: LW goes IF ID EXE MEM WB; ADD stalls one cycle in ID (IF ID ID EXE MEM WB) so that $t1 can be forwarded from LW's MEM stage; the following instructions are each delayed by one cycle]
Lecture 5
• The compiler can reorder code to avoid the stalls
• C code for A = B + E; C = B + F;
  * assume that a negative-edge-clocked register file and forwarding are used

  Original (2 stall cycles):       Reordered (no stalls):
  lw  $t1, 0($t0)                  lw  $t1, 0($t0)
  lw  $t2, 4($t0)                  lw  $t2, 4($t0)
  (stall)                          lw  $t4, 8($t0)
  add $t3, $t1, $t2                add $t3, $t1, $t2
  sw  $t3, 12($t0)                 sw  $t3, 12($t0)
  lw  $t4, 8($t0)                  add $t5, $t1, $t4
  add $t5, $t1, $t4                sw  $t5, 16($t0)
  (stall)
  sw  $t5, 16($t0)

  Total execution time: 5 cycles for the first inst + 6 cycles for the remaining insts + 2 stall cycles = 13 cycles, vs. 5 + 6 + 0 stall cycles = 11 cycles after reordering
Lecture 5
• Control Hazard

      BEQ $a0,$a1,L1   (NT)
  L2: ADD $s1,$t1,$t2
      SUB $t3,$t0,$s0
      OR  $s0,$t6,$t7
      BNE $a0,$s1,L2   (T)
  L1: AND $t3,$t6,$t7
      SW  $t5,0($s1)
      LW  $s2,0($s5)

[Pipeline diagram: the actual branch outcome of BNE is known in MEM; the three instructions fetched after BNE (AND, SW, LW) must be flushed when BNE resolves taken]
Lecture 5
• Control Hazard (continued)

[Pipeline diagram: same code; when BNE resolves taken in MEM, the AND, SW, and LW already in the pipeline become nops, and the instructions at the target address (L2: ADD, SUB) are fetched next — a 3-cycle penalty]
Lecture 5
• Early Branch Determination
  – The branch target address can be calculated earlier than MEM by moving the shift-left-2 unit and the adder into the ID stage
  – But this may need extra data forwarding and stalls, because now the operand values are needed in the ID stage (not in the EXE stage)

[Figure: the pipelined datapath with the branch adder (PC + sign-extended, shift-left-2 offset) and the register comparator moved into the ID stage]
Lecture 5
• Early Branch Determination w/ predicted NT

[Pipeline diagram: with the branch resolved in ID, only the one instruction fetched after the taken BNE (AND) is flushed; the target instructions (L2: ADD, SUB) are fetched one cycle later — a 1-cycle penalty instead of 3]
Lecture 5
• Early Branch Determination w/ predicted NT
  – If BNE has a dependency on its preceding instruction (here BNE $s0,$s1,L2 needs OR's result $s0), one extra stall cycle is required to get OR's result forwarded to BNE's ID stage

[Pipeline diagram: BNE repeats ID for one cycle; the wrongly fetched AND is flushed on the taken branch, and the target instructions (L2: ADD, SUB) follow]
Lecture 6
• Dynamic Branch Predictor
  – 1-bit and 2-bit saturating counters in each entry of a branch prediction buffer
    • Could have more than two bits, but two bits cover most patterns (e.g., loops)

[Figure: the last-outcome (1-bit) FSM with states 0 and 1, and the 2-bit saturating counter FSM with states 00, 01, 10, 11 — states 00/01 predict NT, 10/11 predict T; the counter moves toward 11 on a taken outcome and toward 00 on a not-taken outcome. A branch prediction buffer holds one prediction entry per branch (e.g., T, NT, T, T, NT). T: Taken, NT: Not Taken]
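The 2-bit counter's behavior, and why it tolerates the single not-taken outcome at a loop exit, can be sketched directly from the FSM:

```python
# 2-bit saturating counter: states 0..3; 0-1 predict not-taken, 2-3 predict
# taken; increment on a taken outcome, decrement on a not-taken outcome.

def predict(counter):
    return counter >= 2                     # True -> predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)          # saturate at strongly-taken
    return max(counter - 1, 0)              # saturate at strongly-not-taken

# Loop branch pattern T T T NT: one mispredict at loop exit only weakens the
# counter from 3 to 2, so the branch is still predicted taken on re-entry.
c = 3
for taken in [True, True, True, False]:
    c = update(c, taken)
```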
Lecture 6
• The two-bit predictor is good for branches in loops
• How can we improve the prediction rate for branches other than those in loops?

  Example:
  if (a==2) then a = 0;   ← if these two conditions succeed (not taken),
  if (b==2) then b = 0;
  if (a!=b) then ...      ← this third if condition will fail (taken)

  The third branch is correlated with the first two branches.
  How can we apply correlations to branch prediction?
Lecture 6
• Two-level Predictors
  – The branch predictor buffer maintains one or multi-bit counters
  – Each branch can either share a global branch predictor buffer or have its own branch predictor buffer
  – Four combinations:
    • Global history and global predictor
    • Global history and private predictor
    • Private history and global predictor
    • Private history and private predictor
Lecture 6
• Global history and global predictor
  – The outcomes of the last N branches are used to index the global branch predictor buffer
  – All branches share the same branch predictor buffer
  – For the cases when all branches are strongly correlated with each other

  Example: a 2-bit history register and 2-bit predictors with the following initial values.
  There are four branches in the code (BNE, BEQ, BEQ, BNE) and their actual outcomes are T, T, NT, NT, respectively.

  History: 00   Predictor buffer — entry 0: 00, entry 1: 10, entry 2: 00, entry 3: 01

  The first BNE is predicted using history 00 → entry 0 (predictor value 00 → predicted NT); its actual outcome is T, so the predictor entry and the history register are updated.
Lecture 6
• Global history and global predictor — updating the history register
  – Shift left by 1 and add the most recent branch outcome to the LSB of the history register; discard the shifted-out MSB to keep 2 bits only
  – The shift happens regardless of the branch outcome (here 00 → 01 after the taken BNE)
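The shift-and-insert history update described above can be written in one line:

```python
# Global-history update: shift the N-bit history left by one, insert the
# newest outcome (1 = taken) in the LSB, drop the shifted-out MSB.

HISTORY_BITS = 2

def update_history(history, taken):
    return ((history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)

h = 0b00
h = update_history(h, True)   # 00 -> 01 after a taken branch
h = update_history(h, True)   # 01 -> 11
h = update_history(h, False)  # 11 -> 10 (old MSB discarded)
```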
Lecture 6
• Global history and global predictor — state after the first branch
  – After the taken BNE, the history register becomes 01 and predictor entry 0 is incremented from 00 to 01
  – Predictor buffer: entry 0: 01, entry 1: 10, entry 2: 00, entry 3: 01
  – The next branch (BEQ) is now predicted using history 01 → entry 1 (predictor value 10 → predicted T)
Lecture 6
• In MIPS, exceptions are managed by the System Control Coprocessor (CP0)
• The CPU provides the address of the instruction where the event occurred
  – The Exception Program Counter (EPC) contains the address of the instruction
  – The CPU might undo the addition of 4 from the fetch cycle
• The Cause register contains a value that indicates what type of event occurred
  – For example:
    Invalid Instruction: Cause = 0x0000000A
    Arithmetic Overflow: Cause = 0x0000000C
Lecture 6
[Figure: a pipeline diagram in which an instruction raises an arithmetic overflow in the EXE stage]
Lecture 6
[Figure: on the overflow exception, three instructions (including the add) are flushed and the exception handler code is fetched]
Lecture 7
• Instruction Level Parallelism (ILP)

• Basic idea: execute several instructions in parallel
• We already do pipelining...
  – But it can only push through at most 1 inst/cycle
• We want multiple insts/cycle
  – Yes, it gets a bit complicated
    • More transistors/logic
  – That's how we got from the 486 (pipelined) to the Pentium and beyond
Lecture 7
• N-way Superscalar
  – Fetch/Decode/Execute N instructions concurrently
• Execute 2 instructions per cycle (2-way)
  – One integer/branch/memory instruction and one floating-point instruction
  – The instructions in a pair must be independent
  – Easy to implement because MIPS has separate register files for integer and floating point
• N-way superscalar
  – 3~8 ways
    • 3-way: int, branch, fp
    • 6-way: branch, mem, 3 int, fp

  IF ID EX  ME  WB
  IF ID FP1 FP2 FP3 FP4 FP5
Lecture 7
• Now let's see the timing in the 2-way superscalar

  IF ID EX  ME  WB
  IF ID FP1 FP2 FP3 FP4 FP5

  ld.s  $f0, 0($t1)
  ld.s  $f1, 0($t2)
  subi  $t3, $t3, #1
  add.s $f2, $f1, $f0
  addi  $t1, $t1, #4
  addi  $t2, $t2, #4
  st.s  $f2, -4($t1)
  bnez  $t3, Loop

[Pipeline diagram: the two ld.s instructions and subi issue first; add.s stalls one cycle in ID waiting for $f1 and then occupies FP1-FP5; st.s stalls several cycles in ID waiting for $f2, delaying bnez in IF. Not enough floating-point instructions plus the dependencies → no performance improvement]
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations — remove the loop index calculation and the
  branch in the middle; instead, calculate the loop index once for the
  combined iterations (subi … #2 instead of #1):

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        addi  $t1, $t1, #4
        addi  $t2, $t2, #4
        ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        addi  $t1, $t1, #4
        addi  $t2, $t2, #4
        subi  $t3, $t3, #2
        bnez  $t3, Loop
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations — remove the address calculations in the
  middle as well; update the offsets of the mem ops in the second iteration
  (0 → 4) and calculate the addresses once for the combined iterations
  (addi … #8):

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        ld.s  $f0, 4($t1)
        ld.s  $f1, 4($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 4($t1)
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        subi  $t3, $t3, #2
        bnez  $t3, Loop
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        ld.s  $f0, 4($t1)
        ld.s  $f1, 4($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 4($t1)
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        subi  $t3, $t3, #2
        bnez  $t3, Loop

  Now, can we reorder insts across the original two iterations? (i.e., can
  insts in the 2nd iteration be moved up to the 1st iteration region?)
  No, because the same registers are used in the two original iterations
  (the second ld.s can’t be moved above add.s, because add.s should read the
  first ld.s’ $f0 value).
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations — rename registers! In the original 2nd
  iteration code: $f0 → $f3, $f1 → $f4, $f2 → $f5

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        ld.s  $f3, 4($t1)
        ld.s  $f4, 4($t2)
        add.s $f5, $f4, $f3
        st.s  $f5, 4($t1)
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        subi  $t3, $t3, #2
        bnez  $t3, Loop
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Now, let’s reorder insts across the two iterations: move the loads to the
  beginning of the new loop and the stores to the end, adjusting the store
  offsets (the stores now execute after addi … #8, so 0 → -8 and 4 → -4):

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        ld.s  $f3, 4($t1)
        ld.s  $f4, 4($t2)
        add.s $f2, $f1, $f0
        add.s $f5, $f4, $f3
        subi  $t3, $t3, #2
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        st.s  $f2, -8($t1)
        st.s  $f5, -4($t1)
        bnez  $t3, Loop
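The effect of the full transformation can be sketched in a high-level language (a Python illustration of the idea, not the lecture’s MIPS code; the function and variable names are illustrative): the body handles two elements per trip, distinct temporaries play the role of the renamed registers, loads are hoisted and stores sunk, and the index is updated once per unrolled body.

```python
def add_arrays_unrolled(a, b):
    # Unrolled by 2: the two load/add/store groups use distinct temporaries
    # (the "renamed registers" f0/f3, f1/f4, f2/f5), so they are independent
    # and could issue together on a 2-way superscalar machine.
    n = len(a)
    c = [0] * n
    i = 0
    while i + 1 < n:
        f0, f3 = a[i], a[i + 1]      # loads hoisted to the top of the body
        f1, f4 = b[i], b[i + 1]
        f2 = f1 + f0                 # two independent adds
        f5 = f4 + f3
        c[i], c[i + 1] = f2, f5      # stores sunk to the bottom of the body
        i += 2                       # index updated once per unrolled body
    if i < n:                        # cleanup iteration when n is odd
        c[i] = a[i] + b[i]
    return c
```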
Lecture 7
• Limitations of Loop-Unrolling
  – Works well if loop iterations are independent
    • No loop-carried dependencies
  – Problem with loop-carried RAW dependency
    Example:
      for (i=5; i<100; i++)
          A[i] = A[i-5] + B[i];
    • Store and read on the same A array space in every 5 iterations
    • Unroll limit: up to 5 iterations
    • Unrolling more than 5 iterations → loop-carried dependency limits
      code reordering
  – Consumes more registers due to renaming
  – Bigger code size
    • Affects I-cache and memory
  – Problem when the # of iterations is unknown at compile time
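The loop-carried dependency above can be sketched directly (a minimal Python rendering of the example; array contents are illustrative): iteration i reads what iteration i-5 wrote, which is why only groups of up to 5 consecutive iterations are free to reorder.

```python
def run_loop(a, b):
    # A[i] = A[i-5] + B[i]: iteration i reads the value iteration i-5 wrote,
    # so any group of more than 5 consecutive iterations contains a RAW
    # dependency and cannot be freely reordered after unrolling.
    for i in range(5, len(a)):
        a[i] = a[i - 5] + b[i]
    return a
```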
Lecture 8
• Tomasulo organization (the figure shows only the floating-point execution
  units; Copyright © 2012, Elsevier Inc. All rights reserved.)
  – Front-end: fetched insts are stored in an instruction queue; insts are
    dispatched from this inst queue to execution units in FIFO fashion
  – Address unit calculates the target mem address for mem operations
  – Reservation stations are used for register renaming: architected
    registers are renamed to the entry id of the reservation station
  – Common data bus (CDB): an execution result is propagated through the CDB
    to the reservation stations, ld/st buffers, and register file in the WB
    stage
Lecture 8
• Basic Tomasulo Algorithm’s Three Steps:
  – Issue
    • Get next instruction from instruction queue
    • If an RS is available, issue the instruction to the RS, with operand
      values if available
    • If no RS is available, stall the instruction (structural hazard)
  – Execute
    • When all operands are ready, execute the instruction
    • Loads and stores maintained in program order through effective address
    • No instruction allowed to initiate execution until all branches that
      precede it in program order have completed
  – Write result
    • Write result on CDB into reservation stations, register file and store
      buffers
Lecture 8
  Assume we have: 1 MUL/DIV unit, 1 LD/ST unit, 1 Arithmetic unit
  Instruction takes: Load: 2 cycles, Add/Sub: 2 cycles,
                     Mult: 10 cycles, Divide: 40 cycles

  no.  Instruction            ISSUE  EXE   WB
  I1   ld.s  $f6, 34($t2)     1      2-3
  I2   ld.s  $f2, 45($t3)     2
  I3   mul.s $f0, $f2, $f4
  I4   sub.s $f8, $f2, $f6
  I5   div.s $f10, $f0, $f6
  I6   add.s $f6, $f8, $f2

  Reservation Stations:
    LD1: Busy, I1, Vk = $t2, A = 34 + Regs[$t2]
    LD2: Busy, I2, Vk = $t3, A = 45 + Regs[$t3]
    AD1–AD3, ML1–ML2: empty

  Register Status (Qi): F2 → LD2, F6 → LD1

  If a reg value is already in the reg file, fill the operand field with the
  reg id.
Lecture 8
  (Same assumptions: 1 MUL/DIV unit, 1 LD/ST unit, 1 Arithmetic unit;
  Load: 2 cycles, Add/Sub: 2 cycles, Mult: 10 cycles, Divide: 40 cycles)

  no.  Instruction            ISSUE  EXE   WB
  I1   ld.s  $f6, 34($t2)     1      2-3
  I2   ld.s  $f2, 45($t3)     2
  I3   mul.s $f0, $f2, $f4    3
  I4   sub.s $f8, $f2, $f6
  I5   div.s $f10, $f0, $f6
  I6   add.s $f6, $f8, $f2

  Reservation Stations:
    LD1: Busy, I1, Vk = $t2, A = 34 + Regs[$t2]
    LD2: Busy, I2, Vk = $t3, A = 45 + Regs[$t3]
    ML1: Busy, I3, Qj = LD2, Vk = $f4

  Register Status (Qi): F0 → ML1, F2 → LD2, F6 → LD1

  If an operand value is being computed by another inst, fill the operand
  field with the RS id.
Lecture 8
  (Same assumptions: 1 MUL/DIV unit, 1 LD/ST unit, 1 Arithmetic unit;
  Load: 2 cycles, Add/Sub: 2 cycles, Mult: 10 cycles, Divide: 40 cycles)

  no.  Instruction            ISSUE  EXE    WB
  I1   ld.s  $f6, 34($t2)     1      2-3    4
  I2   ld.s  $f2, 45($t3)     2      4-5    6
  I3   mul.s $f0, $f2, $f4    3      6-15
  I4   sub.s $f8, $f2, $f6    4      6-7
  I5   div.s $f10, $f0, $f6   5
  I6   add.s $f6, $f8, $f2    6

  Reservation Stations:
    LD2: Busy, I2, Vk = $t3, A = 45 + Regs[$t3]
    AD1: Busy, I4, Vj = $f2, Vk = $f6
    AD2: Busy, I6, Qj = AD1, Vk = $f2
    ML1: Busy, I3, Vj = $f2, Vk = $f4
    ML2: Busy, I5, Qj = ML1, Vk = $f6

  Register Status (Qi): F0 → ML1, F6 → AD2, F8 → AD1, F10 → ML2

  If an operand value is written back in the same cycle, the newly issued
  inst fetches the operand from the reg file, so use the reg id.
Lecture 8
• Tomasulo Algorithm with ROB
  – The oldest inst in the ROB can update the register file
    (commit in program order)
Lecture 8
• Issue
  – An inst can be issued only when an appropriate RS entry and a ROB entry
    are available
  – Registers are renamed by using ROB id
• Write Results
  – Write result back to your ROB entry
  – Register file is only updated by the oldest instruction in the ROB
  – Mark ready/finished bit in ROB
Lecture 8
• Commit
  – When an inst is the oldest in the ROB
    • i.e. ROB-head points to it
  – Write result (if ready/finished bit is set)
    • If register producing instruction: write to architected register file
    • If store: write to memory
  – Advance ROB-head to next instruction
Tomasulo with ROB
  Assume we have: 2 MUL/DIV units, 1 Arithmetic unit
  Instruction takes: Add/Sub: 1 cycle,
                     Mult: 10 cycles, Divide: 40 cycles
  You can bypass and begin execution in the same cycle by using a value in
  the ROB.

  no.  Instruction           ISSUE  EXE    WB  COMMIT
  I1   div $t2, $t3, $t4     1      2-41
  I2   mul $t1, $t5, $t6     2      3-12   13
  I3   add $t3, $t7, $t8     3      4      5
  I4   mul $t1, $t1, $t3     14     15-24
  I5   sub $t4, $t1, $t5     15
  I6   add $t1, $t4, $t2

  Even though an operand value has been updated to the ROB (i.e., I2 and
  I3), if the value is not yet updated to the reg file, a dependent inst
  uses the ROB id (see I4).

  Register file (value / Qi):
    $t1 = -23 / ROB4   $t2 = 16 / ROB1   $t3 = 45 / ROB3   $t4 = 5 / ROB5
    $t5 = 3            $t6 = 4           $t7 = 1           $t8 = 2

  ROB:
    1: dest $t2, not ready        2: dest $t1, val 12, ready
    3: dest $t3, val 3, ready     4: dest $t1 (I4), not ready
    5: dest $t4 (I5), not ready   6: (free)

  Reservation Stations:
    AD1: Busy, I5, Qj = ROB4, Vk = 3 ($t5)
    ML1: Busy, I1, Vj = 45 ($t3), Vk = 5 ($t4)
    ML2: Busy, I4, Vj = 12 (from ROB2), Vk = 3 (from ROB3)
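The commit rule can be sketched as follows (a hedged Python illustration; the dict fields are assumed names, not the lecture’s notation): only the ROB head may update architected state, and commit stops at the first unfinished entry so updates stay in program order.

```python
from collections import deque

def commit(rob):
    # Commit from the ROB head while the head entry is ready; stop at the
    # first unfinished instruction so register-file updates happen strictly
    # in program order, even if younger entries are already finished.
    done = []
    while rob and rob[0]['ready']:
        e = rob.popleft()
        done.append((e['dest'], e['val']))
    return done
```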
Lecture 9
• Memory disambiguation
  – Figuring out whether two addresses are equal to avoid hazards
• Conservative approach
  – A ready load must wait until addresses of all preceding stores are known
• Optimistic approach: speculative disambiguation
  – Speculatively assume that addresses of a load and its preceding stores
    are different and just execute the load!
  – Later, when a store’s address has been computed, check all the following
    loads in the load/store buffer
    • If a load has the same address and is already executed, then the load
      and all following instructions must be replayed.
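The check a store performs under speculative disambiguation can be sketched like this (a minimal Python illustration; the tuple layout is an assumption, not a real buffer format): when the store’s address resolves, any already-executed younger load with a matching address is flagged for replay.

```python
def loads_to_replay(store_addr, speculated_loads):
    # speculated_loads: (inst_id, addr) pairs for loads younger than the
    # store that already executed before the store's address was known.
    # A load whose address matches the store read a stale value, so it
    # (and everything after it) must be replayed; here we report matches.
    return [inst for inst, addr in speculated_loads if addr == store_addr]
```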
Lecture 9
• Conservative Approach
  – Load should wait until all preceding stores’ addresses are computed

  no.  Instruction           ISSUE  EXE     WB  COM
  I1   ld   $t3, 0($t6)      1      2-3     4   5
  I2   mul  $t4, $t3, $t5    2      4-13    14  15
  I3   addi $t7, $t6, 4      3      4       5   16
  I4   st   $t7, 0($t4)      4      14-15       17
  I5   sub  $t1, $t1, $t2    5      6-7     8   18
  I6   ld   $t8, 4($t1)      6      16-17   18  19

  I6 can begin execution at cycle 8 but waits until the st address is
  computed
Lecture 9
• Optimistic Approach
  – Speculatively assume that addresses of a load and its preceding stores
    are different and just execute the load!

  no.  Instruction           ISSUE  EXE     WB  COM
  I1   ld   $t3, 0($t6)      1      2-3     4   5
  I2   mul  $t4, $t3, $t5    2      4-13    14  15
  I3   addi $t7, $t6, 4      3      4       5   16
  I4   st   $t7, 0($t4)      4      14-15       17
  I5   sub  $t1, $t1, $t2    5      6-7     8   18
  I6   ld   $t8, 4($t1)      6      8-9     10

  If the st address turns out to match, flush the load and following insts
  and replay them
Lecture 9
• DRAM = Dynamic RAM
• SRAM = Static RAM
• SRAM: 6T (transistors) per bit
  – built with normal high-speed CMOS technology
• DRAM: 1T + 1 capacitor per bit
  – built with special DRAM process optimized for density
Lecture 9
• Hit: Data appears in some block of the cache
  – Hit Rate: # hits / total accesses on the cache
  – Hit Time: Time to access the cache
• Miss: Data needs to be retrieved from a block in the lower level (e.g., Block Y)
  – Miss Rate: 1 - (Hit Rate)
  – Miss Penalty: Average delay in the processor caused by each miss
• Average memory-access time (AMAT): Hit time + Miss rate x Miss penalty
  (Figure: on a hit, the upper-level memory (cache) returns the data to the
  processor directly; on a miss, the block can’t be found in the cache and
  is retrieved from the lower-level memory.)
Lecture 9
  First-level cache (L1):  hit time 1 cycle    (on processor die)
  Second-level cache (L2): 10 cycles           (on processor die)
  Third-level cache (L3):  20 cycles           (on processor die)
  Main memory (DRAM):      300 cycles          (off-chip)

• AMAT in multi-level cache organization
  = Thit(L1) + Miss_rate(L1) x
      [ Thit(L2) + Miss_rate(L2) x
        { Thit(L3) + Miss_rate(L3) x T(memory) } ]
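The nested AMAT formula above can be evaluated by folding from the last level back to L1 (a small Python sketch; the example numbers below reuse the hit times from the slide with assumed 10% miss rates):

```python
def amat(hit_times, miss_rates, mem_time):
    # AMAT = Thit(L1) + MR(L1) * [Thit(L2) + MR(L2) * {... + MR(Ln) * Tmem}]
    # Evaluate inside-out: start at memory and fold each level backwards.
    t = mem_time
    for hit, mr in zip(reversed(hit_times), reversed(miss_rates)):
        t = hit + mr * t
    return t
```

With the slide’s latencies (1, 10, 20, 300 cycles) and 10% miss rates at every level, the average access costs only 2.5 cycles, which is the point of the hierarchy.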
Lecture 9
• Memory blocks are mapped to cache lines
• Cache types w.r.t. Mapping
– Direct Mapped
• A memory value can be placed at a single corresponding location in the cache
• Fast indexing mechanism
– Set-Associative
• A memory value can be placed in any of a set of locations in the cache
• Slightly more complex search mechanism
– Fully-Associative
• A memory value can be placed in any location in the cache
• Extensive hardware resources required to search
Lecture 10
• A direct-mapped cache consists of a Tag RAM and a Data RAM
  – Each line of both RAMs corresponds to a cache block (line)
• When a new data block is fetched to cache, the tag field is stored to the
  Tag RAM and the fetched data is stored to the Data RAM

  Ex. (8 blocks of 32 bytes each)
  0x77FF1C68 = 0111 0111 1111 1111 0001 1100 0110 1000
    – Tag:         0111 0111 1111 1111 0001 1100
    – Index:       011 → find the proper block location (block 3 of 8)
    – Word offset: 010 → find the requested word location (2nd word)
    – Byte offset: 00
  The index selects one line; its stored tag (with the valid bit) is
  compared against the address tag, and the word offset picks the word
  within the 32-byte data block.
Lecture 10
• Given a 2MB, direct-mapped cache, line (block) size = 64 bytes
• Data address is 52 bits (52-bit address = Tag | Index | Block offset)
• Tag size?
  – block offset: 6 bits
  – # blocks = 2^21/2^6 = 2^15 → # bits in index: 15 bits
  – # bits in an address: 52 bits
  – # bits in tag = # bits in an address - # bits in index - # bits in block offset
    = 52 – 15 – 6 = 31 bits
• Now change it to 16-way set associative. Tag size?
  – # sets = # blocks/16 = 2^15/2^4 = 2^11 → # bits in index: 11 bits
  – # bits in tag = 52 – 11 – 6 = 35 bits
• How about if it’s fully associative? Tag size?
  – # bits in tag = # bits in an address - # bits in block offset = 52 – 6 = 46 bits
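The three cases above follow from one formula, sketched here in Python (function name is illustrative): a direct-mapped cache is ways = 1, and a fully associative cache has ways equal to the number of blocks, so the index shrinks to 0 bits.

```python
import math

def tag_index_offset(cache_bytes, block_bytes, ways, addr_bits):
    # offset bits cover one block; index bits cover the number of sets;
    # whatever address bits remain become the tag.
    offset_bits = int(math.log2(block_bytes))
    sets = cache_bytes // (block_bytes * ways)
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits
```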
Lecture 10
• The 3 C’s
– Compulsory (cold) Misses
• On the 1st reference to a block
• Related to # blocks accessed by a code, not related to the configuration of a cache
– Capacity Misses
• Happen when the cache space is not sufficient to hold data
• Can be reduced by increasing cache size
– Conflict Misses
• Happen when multiple memory blocks map to the same line in direct-mapped or map to
the same set in set-associative caches
• Can be reduced by increasing the associativity or cache size
Lecture 10
• Cache replacement policy
  – When loading a new block (line), if the cache is already full, which
    block (line) should be replaced?
• Random
  – Replace a randomly chosen line
• FIFO
  – Replace the oldest line
• LRU (Least Recently Used)
  – Replace the least recently used line
• pseudo-LRU
  – LRU but with less overhead
  – Cache blocks (e.g., ways A, B, C, D) become leaf nodes of a binary
    tree; each intermediate node holds one bit pointing toward the LRU
    group, so following the pointers from the root finds the line to
    replace.
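True LRU for one set can be sketched with an ordered map (a minimal Python illustration; the class name is an assumption): a hit moves the line to the most-recently-used position, and a miss on a full set evicts the line at the least-recently-used end.

```python
from collections import OrderedDict

class LRUSet:
    # One set of an N-way set-associative cache with LRU replacement.
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()      # insertion order = LRU ... MRU

    def access(self, tag):
        if tag in self.lines:           # hit: move line to the MRU end
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # miss on a full set: evict LRU
        self.lines[tag] = None          # fill the line; it is now MRU
        return False
```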
Lecture 10
• Write through
  – The value is written to both the cache line and to the lower-level memory.
  – (Figure: all accesses go to L1; all stores also go through a write
    buffer to L2; L1 misses go to L2.)
• Write back
  – The value is written only to the cache line. The modified cache line is
    written to main memory only when it has to be replaced.
  – To distinguish a modified cache line, a dirty bit is used.
Lecture 11
• Assumption
  – A cache block size: 4 words
  – Consecutive words accessed
  – Average memory access latency is 200 cycles
  (Notation in the timeline: a request on block N causes a primary miss, a
  secondary miss, or a hit.)

• Blocking Cache: each miss (200 cycles) must complete before the next
  access is accepted, so the 1st and 2nd misses are serviced one after
  another.
• Non-Blocking Cache with 2 MSHRs: the 1st miss on a block (primary miss)
  allocates MSHR-1; further accesses to the same in-flight block are
  secondary misses tracked in the same MSHR. A miss on a different block
  (another primary miss) allocates MSHR-2. When both MSHRs are occupied, no
  further accesses are acceptable; once an MSHR is released, the next
  primary miss can allocate it again.
Lecture 11
• Page replacement
  – FIFO/LRU
• Write strategy
  – Write-back
• Size of the page table?
  – 32-bit virtual address
  – 4 KB page
  – page table entry: 4 B

  # bits in page offset: 12 bits
  # bits in virtual page number: 32 – 12 = 20 bits
  # entries in the page table: 2^20
  → 2^20 x 4 B = 4 MB ... Too large!
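The 4 MB figure follows mechanically from the parameters, as this small Python sketch shows (function name is illustrative):

```python
import math

def flat_page_table_bytes(va_bits, page_bytes, pte_bytes):
    offset_bits = int(math.log2(page_bytes))   # 12 bits for 4 KB pages
    vpn_bits = va_bits - offset_bits           # 20 bits for a 32-bit VA
    return (1 << vpn_bits) * pte_bytes         # one PTE per virtual page
```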
Lecture 11
• Size of the page table?
  – 32-bit virtual address (V1 | V2 | V3 | page offset)
  – 4 KB page
  – page table entry: 4 B
  – 3-level page table
    • v1: 8 bits, v2: 6 bits, v3: 6 bits
  – # active page tables per level
    • level 1: 1, level 2: 4, level 3: 5

  # entries in the 1st level page table: 2^8
  Size of the 1st level page table: 2^8 x 4 B = 1 KB
  # entries in each 2nd and 3rd level page table: 2^6
  Size of the 2nd level tables: 4 tables x 2^6 x 4 B = 1 KB
  Size of the 3rd level tables: 5 tables x 2^6 x 4 B = 1.25 KB
  → Total of 3.25 KB — way smaller than the 4 MB single-level page table!
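The same arithmetic generalizes to any level split (a Python sketch; the function name is illustrative): only tables that are actually allocated contribute, which is exactly why the hierarchical table wins for sparse address spaces.

```python
def multilevel_table_bytes(level_bits, active_tables, pte_bytes):
    # Each active table at a level holds 2**bits entries of pte_bytes each;
    # sum only the tables actually allocated at every level.
    return sum(n * (1 << bits) * pte_bytes
               for bits, n in zip(level_bits, active_tables))
```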
Lecture 11
• TLB
  (Figure: the virtual page number (bits 31–12) is looked up in the TLB,
  which holds recently used translations; on a TLB hit the physical page
  number is produced without walking the page table, while the page offset
  (bits 11–0) passes through unchanged.)
Lecture 12
• Write-back cache updates memory only when the block is replaced
  – When a processor updates a block, the other processors should know
    their copies are stale → Invalidate
• Block States:
  – Invalid (I)
    • This copy is stale
  – Shared (S)
    • One or more caches are sharing the same clean copy
    • Memory is clean
  – Modified (M)
    • There is one cache that has the most updated copy of the block
    • Other caches that have the same block copy have old values (in
      invalid state)
    • Memory is stale
• MSI transitions (event / bus action):
  – I: PrRd/BusRd → S; PrWr/BusRdX → M; BusRd/--, BusRdX/--, BusUpgr/-- → I
  – S: PrRd/--, BusRd/-- → S; PrWr/BusUpgr → M; BusRdX/--, BusUpgr/-- → I
  – M: PrRd/--, PrWr/-- → M; BusRd/Flush → S; BusRdX/Flush → I
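The MSI transitions can be written down as a lookup table, which makes them easy to check (a hedged per-block sketch of the protocol in the slide, not a full cache simulator):

```python
# Events PrRd/PrWr come from the local processor; BusRd/BusRdX/BusUpgr are
# snooped from other caches. Each entry maps to (next state, bus action).
MSI = {
    ('I', 'PrRd'):    ('S', 'BusRd'),   # read miss: fetch a shared copy
    ('I', 'PrWr'):    ('M', 'BusRdX'),  # write miss: fetch exclusive copy
    ('S', 'PrRd'):    ('S', None),
    ('S', 'PrWr'):    ('M', 'BusUpgr'), # upgrade: invalidate other copies
    ('S', 'BusRdX'):  ('I', None),
    ('S', 'BusUpgr'): ('I', None),
    ('M', 'PrRd'):    ('M', None),
    ('M', 'PrWr'):    ('M', None),
    ('M', 'BusRd'):   ('S', 'Flush'),   # supply dirty data, demote to S
    ('M', 'BusRdX'):  ('I', 'Flush'),   # supply dirty data, invalidate
}

def msi_next(state, event):
    # Unlisted (state, event) pairs leave the state unchanged with no
    # action (e.g., bus traffic observed while already in I).
    return MSI.get((state, event), (state, None))
```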
Lecture 12
• MESI — red-colored transitions in the lecture’s diagram are the unique
  transitions compared to MSI; the Exclusive (E) state is added:
  – On a PrRd miss, when the “Shared” signal is zero (there are no cached
    copies of the block in other caches), go to E; otherwise go to S.
  – E: PrRd/-- stays in E; on BusRd, go to S (no flush needed — memory is
    clean); on PrWr, silently transition to M (no bus transaction, since no
    other cache has a copy); BusRdX/Flush → I.
  – The M, S, and I transitions are the same as in MSI.
Lecture 12
• MOESI — red-colored transitions in the lecture’s diagram are the unique
  transitions compared to MESI; the Owned (O) state is added:
  – M on BusRd → O with Flush: the owner supplies the data, but Flush here
    does not update memory unless the block is evicted → most block
    communication is via cache-to-cache transfer.
  – O: any request on this block is responded to by this cache; PrRd/--
    stays in O; BusRd/Flush stays in O; PrWr/BusUpgr → M; BusRdX/Flush → I
    (this cache loses ownership).
  – State summary:
                Modified   Clean
    Not shared  M          E
    Shared      O          S
    (I is invalid)