
CMPE 200

Lecture 12. Multi-Processors
(P&H 5.10, 6.5, H&P 5.1-5.5)

Hyeran Jeon

SJSU SAN JOSÉ STATE UNIVERSITY

Recap of previous lecture
• An application, at compile time, doesn't know the size of the underlying main memory
• Main memory may be too small to hold all of the code/data needed to run an application
  – Some parts of the code are loaded into main memory
  – Other parts of the code may remain on disk

[Figure: the 4 GB virtual memory space seen by an application (0x00000000-0xFFFFFFFF), mapped onto 256 MB of DRAM (0x00000000-0x0FFFFFFF)]
Pages vs. Segments
• The unit of memory to fetch from storage
  – similar to the concept of "block" in cache
• A page has a fixed size
  – Most systems use paging
• A segment has a flexible size
  – Some systems use paged segments
    • Code and data are stored in segment units, but each segment is divided into pages

[Figure: a memory space holding code/data in segments, each segment divided into pages]

                          Page                      Segment
  Addressing fields       One (offset)              Two (segment base and offset)
  Programmer visible?     Invisible                 May be visible
  Replacing a block       Trivial                   Hard
  Memory use efficiency   Internal fragmentation    External fragmentation
  Efficient disk traffic  Yes                       Not always
Translation Using a Page Table
• A page table in main memory
  – accessed to get the physical address of a given virtual address
• Page table register
  – contains the base address of the page table
• Table entry address =
  Page table register value + Virtual page number x Page table entry size in bytes
• Metadata per entry
  – RWX : Read/Write/eXecute
  – V : Valid (if valid, the page is in memory; otherwise, page fault exception!)
  – M : Modified
  – R : Referenced (used for page replacement)

[Figure: a 32-bit virtual address split into a virtual page number (bits 31-12) and page offset (bits 11-0); the virtual page number, scaled by the entry size and added to the page table register, selects a page table entry holding the metadata bits and a physical page number (bits 29-12), which is concatenated with the page offset to form a 30-bit physical address]
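The entry-address formula and the concatenation step can be sketched as follows. This is a minimal single-level walk with made-up values; the 4-byte entry size, base address, and mappings are assumptions for illustration.

```python
# Sketch of single-level page table translation:
# entry address = page table register + VPN * entry size; the PPN from the
# entry is concatenated with the 12-bit page offset.

PAGE_OFFSET_BITS = 12
ENTRY_SIZE = 4  # bytes per page table entry (assumed)

def translate(vaddr, page_table_base, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    entry_addr = page_table_base + vpn * ENTRY_SIZE  # where the PTE lives
    ppn, valid = page_table[vpn]                     # PTE contents: (PPN, V bit)
    if not valid:
        raise Exception("page fault")                # V == 0: page not in memory
    return entry_addr, (ppn << PAGE_OFFSET_BITS) | offset

# Hypothetical mapping: VPN 0x12 -> PPN 0x345, table base at 0x100000
table = {0x12: (0x345, True)}
entry_addr, paddr = translate(0x00012ABC, 0x100000, table)
# entry_addr = 0x100000 + 0x12 * 4 = 0x100048; paddr = 0x345ABC
```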
Translation Using a Hierarchical Page Table
• A multi-level page table can effectively reduce the page table size
• Each level's page table maintains the base addresses of the next-level page tables
• For each entry in the level-N page table, there can be up to one level-N+1 page table
  – If a memory region is never accessed, the corresponding page table doesn't need to be allocated
• A lower-level page table is allocated only when the corresponding entry in the upper-level page table is accessed

[Figure: a 32-bit virtual address split into three index fields V1 (bits 31-24), V2 (bits 23-18), and V3 (bits 17-12) plus a page offset (bits 11-0); each field, multiplied by the table entry size, indexes one step of the walk (page table register -> level 1 table -> level 2 table -> level 3 table), and the final entry supplies the physical page number (bits 29-12) that is concatenated with the page offset]
Example
• We use a three-level page table and the virtual address fields are like below:

  [41-32: V1 | 31-22: V2 | 21-12: V3 | 11-0: Page offset]

• What is the address coverage of an entry in each level's page table?
  – third-level: 2^12 bytes (one page)
  – second-level: 2^22 bytes = 2^10 entries per third-level page table x third-level entry coverage (2^12)
  – first-level: 2^32 bytes = 2^10 entries per second-level page table x second-level entry coverage (2^22)
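The coverage arithmetic for a 10/10/10/12 split can be checked mechanically: each entry one level up covers 2^width entries of the level below it.

```python
# Coverage of one entry at each level for a 10/10/10/12 virtual address split.
OFFSET_BITS = 12

coverages = {}
cov = 1 << OFFSET_BITS                 # a level-3 entry maps one 2^12-byte page
for level, width in [(3, 10), (2, 10), (1, 10)]:
    coverages[level] = cov
    cov <<= width                      # an entry one level up covers 2^width of these

# coverages: level 3 -> 2^12 bytes, level 2 -> 2^22 bytes, level 1 -> 2^32 bytes
```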
Translation Lookaside Buffer (TLB)
• Page tables are stored in main memory
• At least two memory accesses per memory access by a program
  – One for obtaining the physical address and a second for getting the data

• How can we improve the performance?

• Let's exploit the locality here too!
  – When a translation for a virtual page number is used, it will probably be needed again in the near future
  – The references to the words on that page have both temporal and spatial locality

→ Cache the physical addresses of the frequently accessed pages (Translation Lookaside Buffer)
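The idea above, caching recent translations so the page table walk is skipped on a hit, can be sketched as follows. This is an unbounded dictionary standing in for a real TLB, which is a small set-associative hardware structure; the page table contents are made up.

```python
# Minimal TLB sketch in front of a (single-level, assumed) page table.

PAGE_OFFSET_BITS = 12

class TLB:
    def __init__(self, page_table):
        self.entries = {}              # VPN -> PPN: the cached translations
        self.page_table = page_table
        self.hits = self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_OFFSET_BITS
        offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
        if vpn in self.entries:        # TLB hit: no page table walk needed
            self.hits += 1
        else:                          # TLB miss: walk the table, then cache it
            self.misses += 1
            self.entries[vpn] = self.page_table[vpn]
        return (self.entries[vpn] << PAGE_OFFSET_BITS) | offset

tlb = TLB({0x12: 0x345})               # hypothetical mapping
tlb.translate(0x12000)                 # miss: first touch of the page
tlb.translate(0x12ABC)                 # hit: same page, translation cached
# temporal/spatial locality on the page turns the second walk into a hit
```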
TLB and Cache Interaction
• If the cache tag uses the physical address
  – Need to translate before the cache lookup

[Figure: the virtual page number (bits 31-12) is translated into a physical page number (bits 29-12); the resulting 30-bit physical address is then split into cache tag (bits 29-8), index (bits 7-5), and block offset (bits 4-0) for the cache lookup; a tag match means a hit and the data is returned]
Multiprocessor
• ILP is an optimization within a processor (or core)
• Parallelism across processor units → Multiprocessor!
  – Multiple CPUs, or multiple cores within a CPU
• Two representative types:
  – Shared-memory (SMP/CMP): shared data management is a key issue
  – Message-passing: explicit exchange of data among processors that use their own memory

[Figure: a shared-memory multiprocessor (SMP) with several CPUs sharing one memory, and a quad-core chip-multiprocessor (CMP) enabled by Moore's Law]
Cache Coherence
• A multiprocessor is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order

• Write propagation
  – Writes are visible to other processors
• Write serialization
  – All writes to the same location are seen in the same order by all processors

• Simple solution: inform all copies on every update!
FSM for Cache Blocks
• Each cache block's status can be represented by a finite state machine

• e.g., a simple protocol for write-through caches

• Valid (V): This copy is up-to-date
• Invalid (I): This copy is stale because a remote copy (in another processor's cache) was modified
• One "Valid" bit per cache block
• Transition labels read "action of this cache / message sent to bus" or "message received from bus / --"

[FSM: in V, PrRd/--, PrWr/BusWr, and BusRd/-- are self-loops; BusWr/-- moves V to I (a remote copy was modified). In I, BusRd/--, BusWr/--, and PrWr/BusWr are self-loops; PrRd/BusRd moves I to V (this processor reads an up-to-date copy).]
Write-back Caches: MSI-Invalidate Protocol
• A write-back cache updates memory only when the block is replaced
  – When a processor updates a block, the other processors should know their copies are stale → Invalidate
• Block States:
  – Invalid (I)
    • This copy is stale
  – Shared (S)
    • One or more caches are sharing the same clean copy
    • Memory is clean
  – Modified (M)
    • There is one cache that has the most up-to-date copy of the block
    • Other caches holding the same block have old values (in the Invalid state)
    • Memory is stale

[FSM: M self-loops on PrRd/-- and PrWr/--; BusRd/Flush moves M to S; BusRdX/Flush moves M to I.
 S self-loops on PrRd/-- and BusRd/--; PrWr/BusUpgr moves S to M; BusRdX/-- and BusUpgr/-- move S to I.
 I self-loops on BusRd/--, BusUpgr/--, and BusRdX/--; PrRd/BusRd moves I to S; PrWr/BusRdX moves I to M.]
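The MSI transitions can be written as two lookup tables, one for processor actions and one for snooped bus messages; this is a sketch of the protocol for a single block, not a full bus model.

```python
# MSI state transitions for one block in one cache.

PROC = {  # (state, processor action) -> (next state, bus message issued)
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusUpgr"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}
SNOOP = {  # (state, snooped bus message) -> (next state, data response)
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRdX"): ("I", None),
    ("S", "BusUpgr"): ("I", None),
}

def proc(state, action):   return PROC[(state, action)]
def snoop(state, message): return SNOOP.get((state, message), (state, None))

# A reads (I->S), A writes (S->M, broadcasts BusUpgr), B's copy is invalidated.
a, msg = proc("I", "PrRd")     # a == "S"
a, msg = proc(a, "PrWr")       # a == "M", msg == "BusUpgr"
b, _ = snoop("S", msg)         # b == "I": B's stale copy is invalidated
```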
The Problem with MSI-Invalidate
• Typically many blocks are private (not shared with any other caches)
• On a read, the block immediately goes to the S state even though it may be the only cached copy (i.e., no other processor will cache it)

• Why is this a problem?
  – Whenever a cache that has the block wants to write to it, the cache must broadcast "invalidate" even though it is the only cache holding a copy of the block.
  – How can we reduce this unnecessary broadcasting overhead?
    • In the next class...
Solution: MESI
• Add a new state indicating that this is the only cached copy of the block and it is clean
  – Exclusive (E) state
    • There is only one cache that holds the block
    • Memory is clean

• A block is placed into the E state if no other cache holds the same block
  – A "Shared" signal on the bus detects whether the copy is unique; snooping caches assert the signal if they also have a copy
  – On a read miss, go to E or S depending on the value returned on the shared line
  – A silent transition from E to M is possible on a write

• Mark S. Papamarcos and Janak H. Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA'84
• Employed in Intel, PowerPC, and MIPS processors
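The two MESI-specific decisions, choosing E vs. S on a read miss and writing silently from E, can be sketched as follows; this models only the requesting cache's side.

```python
# MESI sketch: on a read miss the requester goes to E if the bus "Shared"
# line stays low, else to S; an E -> M write is silent (no bus message),
# avoiding MSI's unnecessary BusUpgr broadcast for private blocks.

def read_miss(shared_line_asserted):
    """State after PrRd/BusRd, chosen by the snooped Shared signal."""
    return "S" if shared_line_asserted else "E"

def write(state):
    """Processor write: returns (next state, bus message or None)."""
    if state == "E":
        return "M", None          # silent transition: we hold the only copy
    if state == "S":
        return "M", "BusUpgr"     # others must invalidate their copies
    return state, None            # already M: write hit, nothing to do

s, msg = write(read_miss(shared_line_asserted=False))
# private data: read miss -> E, then write -> M with no bus traffic at all
```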
MESI
• Red-colored transitions: transitions unique to MESI compared to MSI

[FSM: M self-loops on PrRd/-- and PrWr/--; BusRd/Flush moves M to S; BusRdX/Flush moves M to I.
 E self-loops on PrRd/--; PrWr/-- moves E to M silently (silent transition on write); BusRd/-- moves E to S; BusRdX/-- moves E to I.
 S self-loops on PrRd/-- and BusRd/--; PrWr/BusUpgr moves S to M; BusRdX/-- and BusUpgr/-- move S to I.
 I self-loops on BusRd/--, BusUpgr/--, and BusRdX/--; PrWr/BusRdX moves I to M; on PrRd/BusRd, go to E when the "Shared" signal is zero (no other cache holds the block), otherwise go to S.]
Example 1
Example: Data block X is only in memory. Processor A reads block X. Then, Processor B reads block X.

[Processor A] Read on X
  Action: PrRd (Shared signal = 0) / BusRd
  Transition: I (initial) → E

[Processor B] Read on X
  Action(B): PrRd (Shared signal = 1) / BusRd
  Transition(B): I (initial) → S
  Action(A): BusRd/--
  Transition(A): E → S

[Figure: both processors' MESI FSMs with their caches connected by a bus to memory holding X; after both reads, each cache holds a copy of X in state S]
Example 2
Example: Data block X is only in memory. Processor A reads then writes block X. Then, Processor B reads block X.

[Processor A] Read on X
  Action: PrRd (Shared signal = 0) / BusRd
  Transition: I (initial) → E

[Processor A] Write on X
  Action: PrWr/-- (silent transition)
  Transition: E → M

[Processor B] Read on X
  Action(B): PrRd (Shared signal = 1) / BusRd
  Transition(B): I (initial) → S
  Action(A): BusRd/Flush
  Transition(A): M → S

[Figure: both processors' MESI FSMs; after the sequence, A and B each hold X in state S, and memory was updated by A's Flush]
Snoopy Invalidation Tradeoffs
• Cache-to-cache vs. memory-to-cache transfer
  – On a BusRd, should the data come from another cache or from memory?
  – Another cache
    • might be faster if memory is slow or highly contended
    • but what if several caches share the same block? Who will provide the block?
  – Memory
    • would be simpler because there is no need to identify who (among several sharing caches) will provide the block
    • requires a writeback on the M → S transition

• Writeback on M → S
  – Is this necessary? What if the block is updated multiple times by several caches for a while? Do we need to update memory on every update of the block?
Solution: MOESI
• Add a new state indicating the owner of a block
  – Owner (O) state
    • There can be only one cache in the O state for each block
      – All others must hold the data in the S state
    • Provides the up-to-date copy of the block to requesting caches (cache-to-cache transfer)
    • Memory writeback only when the block is evicted
• The cache in state E or M is implicitly the owner of the block
  – On BusRd, the cache in state E or M moves to the O state
• The S state in MOESI is "Shared and potentially dirty"
  – The block may stay dirty until the one owned copy is evicted
• AMD Opteron uses MOESI

[Figure: the five MOESI states classified along three axes — Valid (M, O, E, S) vs. Invalid (I); Modified/dirty (M, O) vs. Clean (E, S); Shared (O, S) vs. Not shared (M, E)]
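The owner's responsibilities can be sketched as two small rules: answer snooped reads with a cache-to-cache Flush while keeping ownership, and write memory back only when a dirty owned copy is evicted. This follows the state list above, not any particular vendor's implementation.

```python
# MOESI ownership sketch for one block in one cache.

def snoop_bus_rd(state):
    """Snooped BusRd: returns (next state, data supplier or None)."""
    if state in ("M", "E", "O"):      # E/M are implicit owners; O stays owner
        return "O", "Flush"           # cache-to-cache transfer, no memory update
    return state, None                # S or I: someone else responds

def evict(state):
    """Eviction: only a dirty copy (M or O) must be written back to memory."""
    return state in ("M", "O")

st, supplier = snoop_bus_rd("M")      # st == "O": owner keeps serving readers
# memory is touched only when the O (or M) copy is finally evicted
```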
MOESI
• Red-colored transitions: transitions unique to MOESI compared to MESI
• Flush here does not update memory unless the block is evicted → most block communication is cache-to-cache

[FSM: M moves to O on BusRd/Flush; any further request on the block is responded to by this cache. O self-loops on PrRd/-- and BusRd/Flush, moves to M on PrWr/BusUpgr, and moves to I on BusRdX/Flush (this cache loses ownership).
 E self-loops on PrRd/--; PrWr/-- moves E to M silently; BusRd/Flush moves E to O; BusRdX/Flush moves E to I.
 S self-loops on PrRd/-- and BusRd/--; PrWr/BusUpgr moves S to M; BusRdX/-- and BusUpgr/-- move S to I.
 I self-loops on BusRd/--, BusUpgr/--, and BusRdX/--; PrWr/BusRdX moves I to M; on PrRd/BusRd, go to E or S depending on the Shared signal.]
Multi-level Cache Hierarchy
• Processors typically have a multi-level cache hierarchy (i.e., L1, L2, ...)
• When a coherence request arrives, the caches at all levels should be checked → long latency

[Figure: three processors, each with private L1 and L2 caches, connected through an interconnection network to memory]
Multi-level Cache Hierarchy
• How can we reduce the latency of checking all levels of caches?
• Inclusive cache
  – Data in an upper-level cache is also maintained in the lower-level caches
  – Snoop only the lowermost-level cache
  – The lowermost cache needs to know when the upper-level caches have write hits
    • Use write-through to keep consistency among the caches
    • Or, use write-back but maintain a flag bit that indicates the dirtiness of the data
4C’s
• 3C’s
– Compulsory miss
– Capacity miss
– Conflict miss
• 4C’s
– 3C’s + Coherence miss
• Coherence miss: Cache misses caused by cache
coherency

SJSU SAN JOSÉ STATE


28 UNIVERSITY
Coherence Miss Example
Example (MSI protocol): Data block X is only in memory. Processor A reads block X. Then, Processor B writes block X. Lastly, Processor A reads block X.

[Processor A] Read on X
  Action: PrRd/BusRd
  Transition: I (initial) → S

[Processor B] Write on X
  Action(B): PrWr/BusRdX
  Transition(B): I (initial) → M
  Action(A): BusRdX/--
  Transition(A): S → I

[Processor A] Read on X — A had X in its cache but encounters a miss, as its copy of X is outdated
  Action(A): PrRd/BusRd
  Transition(A): I → S
  Action(B): BusRd/Flush
  Transition(B): M → S

[Figure: both processors' MSI FSMs; A's second read misses solely because of B's intervening write — a coherence miss]
Types of Coherence Misses
• Two types
  – True sharing: a cache miss caused by an update to a word in a cache block that your processor actually uses

  – False sharing: a cache miss caused by an update to a word in a cache block that your processor does not actually use
True sharing vs. False sharing
• Suppose that cache block X consists of two words
• Assume that we are using the MSI protocol

  Time  Event                   Block X in A (state)   Block X in B (state)   Result
        Initial                 Word1  Word2  (S)      Word1  Word2  (S)
    1   A writes Word1 (hit)    Word1' Word2  (M)      Word1  Word2  (I)
    2   B reads Word1 (miss)    Word1' Word2  (S)      Word1' Word2  (S)      miss due to true sharing
    3   A writes Word2 (hit)    Word1' Word2' (M)      Word1' Word2  (I)
    4   B reads Word1 (miss)    Word1' Word2' (S)      Word1' Word2' (S)      miss due to false sharing (B never reads Word2 but misses due to the updated Word2)
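The timeline above can be reproduced with a per-block MSI state in two caches: because both words live in the same block, A's write to Word2 still invalidates B's copy even though B only ever reads Word1.

```python
# Block-granularity MSI states for caches A and B sharing one two-word block.

def write(states, who, other):
    states[who] = "M"
    states[other] = "I"                  # invalidate the other cache's copy

def read(states, who, other):
    miss = states[who] == "I"            # invalid copy -> coherence miss
    if miss:
        states[who] = "S"
        if states[other] == "M":
            states[other] = "S"          # owner flushes; both end up shared
    return miss

st = {"A": "S", "B": "S"}
write(st, "A", "B")          # time 1: A writes Word1
m1 = read(st, "B", "A")      # time 2: B reads Word1 -> miss (true sharing)
write(st, "A", "B")          # time 3: A writes Word2, a word B never uses
m2 = read(st, "B", "A")      # time 4: B reads Word1 -> miss (false sharing)
```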
Final Exam
• Covers from Lecture 1 to 12
  – From the performance part of Lecture 1

• Format
  – Similar to Midterms 1 and 2
  – Calculator/pen/eraser are allowed

• Review
  – Lecture slides
  – Homework 1 to 6 solutions
  – Quiz solutions
  – Midterm solutions
Lecture 1
• Peak performance metric
  – MIPS
• Speedup Calculation
  – Speedup of X over Y = Execution time of Y / Execution time of X
  – Average speedup
• CPU Time
  = CPU Clock Cycles x Clock Cycle Time
  = Instruction Count x CPI x Clock Cycle Time
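A worked instance of the CPU time formula above, with made-up numbers:

```python
# CPU time = instruction count x CPI x clock cycle time (illustrative values).

insts = 2_000_000          # instructions executed (assumed)
cpi = 1.5                  # average cycles per instruction (assumed)
clock_hz = 1_000_000_000   # 1 GHz clock
cycle_time = 1 / clock_hz  # seconds per cycle

cpu_time = insts * cpi * cycle_time   # 3e6 cycles / 1e9 Hz = 0.003 s
```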
Lecture 1
• Average CPI vs. Average IPC
  – Arithmetic mean of CPI = (1/n) x (CPI_1 + CPI_2 + ... + CPI_n)
  – Harmonic mean of IPC = n / (1/IPC_1 + 1/IPC_2 + ... + 1/IPC_n)

• Amdahl's law
  – Make the common case fast
  – Overall Speedup = 1 / ((1 - f) + f/s), where f is the fraction of execution time that is enhanced and s is the speedup of that fraction
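Amdahl's law as stated above, with an illustrative fraction and speedup:

```python
# Overall speedup when a fraction f of execution time is sped up by factor s.

def amdahl(f, s):
    return 1 / ((1 - f) + f / s)

# speeding up 80% of the time by 4x: 1 / (0.2 + 0.8/4) = 2.5x overall,
# far below 4x — the unenhanced 20% limits the gain
speedup = amdahl(0.8, 4)
```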
Lecture 2
• ISA
  – The ISA is the interface between software and hardware
• ISA classification
  – CISC
    • Stack, Accumulator, Memory-Memory, Register-Memory
  – RISC
    • Register-Register (Load-Store)
    • e.g., MIPS
Lecture 2 - MIPS
• Address/data size
  – 4 bytes
• Endianness
  – Both big and little endian are supported, but we used little endian
• Instruction Types
  – R-type
  – I-type
    • LD/ST: sign-extension of the immediate
    • Branch: sign-extension + shift-left-2 of the immediate
  – J-type
    • shift-left-2 of the target field
Lecture 3
• Instructions go through the following steps
  – Fetch
    • Fetch the instruction from I-mem
  – Decode
    • Identify the opcode and read the operands
  – Execute
    • R-type : specified function
    • LD/ST : Add (address calculation)
    • Beq : Sub (comparison)
    • Jump : Nothing
  – Mem
    • Read/Write data from/to D-mem
  – Writeback
    • Update the register file
Lecture 4
• The entire datapath is broken into five pipeline stages
  – Shorter cycle period
  – Multiple cycles per instruction
    • 5 CPI in a 5-stage pipeline without overlap
  – But, can achieve CPI = 1
    • by overlapping multiple instructions

[Figure: the single-cycle datapath (CPI = 1, long cycle), the five-stage version without overlap (CPI = 5), and the pipelined version with overlapped instructions (CPI = 1)]
Lecture 4
• Pipeline stage registers
  – act as temporary registers storing intermediate results, allowing the previous stage to be reused by another instruction
Lecture 4
• Total execution cycles without any hazards
  – (K pipeline stages + N instructions - 1) cycles

• Hazards
  – Structural : HW organization (e.g., unified i- and d-memory)
  – Data : stall pipeline stages until the operand value becomes ready
    • caused by data dependencies
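The cycle-count formula above comes from filling the pipeline for K cycles, after which one instruction completes per cycle:

```python
# N instructions on a K-stage pipeline with no hazards: K + N - 1 cycles.

def total_cycles(k_stages, n_insts):
    return k_stages + n_insts - 1

# 8 instructions on the 5-stage MIPS pipeline: 5 + 8 - 1 = 12 cycles
cycles = total_cycles(5, 8)
```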
Lecture 5
• Dependencies among instructions
  – RAW (Read-After-Write) : true dependency → causes hazards
    • $1 of add → sub, $2 of sub → sll, $2 of sll → addi
  – WAW (Write-After-Write) : false dependency
    • $1 of add → addi, $2 of sub → sll
  – WAR (Write-After-Read) : false dependency
    • $2 of add → sub, $1 of sub → addi

  add  $1, $2, $3
  sub  $2, $1, $3
  sll  $2, $2, 3
  addi $1, $2, 5
  (the same code sequence illustrates the RAW, WAW, and WAR cases above)
Lecture 5
• As long as an earlier instruction's result is in the pipeline, the result value can be forwarded to the following instructions
• In arithmetic operations, no stall is needed
  – cf. LW needs one stall cycle even with forwarding

  ADD $t3,$t1,$t2
  SUB $t5,$t3,$t4
  XOR $t7,$t5,$t3

[Pipeline diagram: ADD, SUB, and XOR each enter the pipeline one cycle apart (IF ID EXE MEM WB) with no stalls; forwarding delivers $t3 and $t5 directly from the EXE/MEM stage outputs]
Lecture 5
• Forwarding from MEM/WB to EXE
• One cycle stall is needed even with forwarding

  LW  $t1,4($s0)
  ADD $t5,$t1,$t4

[Pipeline diagram: LW goes IF ID EXE MEM WB; ADD stalls one cycle in ID (IF ID ID EXE MEM WB) so that $t1 can be forwarded from LW's MEM stage; the following instructions are each delayed by one cycle]
Lecture 5
• The compiler can reorder code to avoid the stalls
• C code for A = B + E; C = B + F;
  * assume that a negative-edge-clocked register file and forwarding are used

  Original (2 stall cycles):       Reordered (no stalls):
  lw  $t1, 0($t0)                  lw  $t1, 0($t0)
  lw  $t2, 4($t0)                  lw  $t2, 4($t0)
  (stall)                          lw  $t4, 8($t0)
  add $t3, $t1, $t2                add $t3, $t1, $t2
  sw  $t3, 12($t0)                 sw  $t3, 12($t0)
  lw  $t4, 8($t0)                  add $t5, $t1, $t4
  add $t5, $t1, $t4                sw  $t5, 16($t0)
  (stall)
  sw  $t5, 16($t0)

  Total execution time: 5 cycles for the first inst + 6 cycles for the remaining insts + 2 stall cycles = 13 cycles, vs. 5 + 6 + 0 stall cycles = 11 cycles after reordering
Lecture 5
• Control Hazard

      BEQ $a0,$a1,L1   (NT)
  L2: ADD $s1,$t1,$t2
      SUB $t3,$t0,$s0
      OR  $s0,$t6,$t7
      BNE $a0,$s1,L2   (T)
  L1: AND $t3,$t6,$t7
      SW  $t5,0($s1)
      LW  $s2,0($s5)

[Pipeline diagram: the actual branch outcome of BNE is known in MEM; the three instructions fetched after BNE (AND, SW, LW) must be flushed when BNE resolves taken]
Lecture 5
• Control Hazard (continued)

[Pipeline diagram: same code; when BNE resolves taken in MEM, the AND, SW, and LW already in the pipeline become nops, and the instructions at the target address (L2: ADD, SUB) are fetched next — a 3-cycle penalty]
Lecture 5
• Early Branch Determination
  – The branch target address can be calculated earlier than MEM by moving the shift-left-2 unit and the adder into the ID stage
  – But this may need extra data forwarding and stalls, because now the operand values are needed in the ID stage (not in the EXE stage)

[Figure: the pipelined datapath with the branch adder (PC + sign-extended, shift-left-2 offset) and the register comparator moved into the ID stage]
Lecture 5
• Early Branch Determination w/ predicted NT

[Pipeline diagram: with the branch resolved in ID, only the one instruction fetched after the taken BNE (AND) is flushed; the target instructions (L2: ADD, SUB) are fetched one cycle later — a 1-cycle penalty instead of 3]
Lecture 5
• Early Branch Determination w/ predicted NT
  – If BNE has a dependency on its preceding instruction (here BNE $s0,$s1,L2 needs OR's result $s0), one extra stall cycle is required to get OR's result forwarded to BNE's ID stage

[Pipeline diagram: BNE repeats ID for one cycle; the wrongly fetched AND is flushed on the taken branch, and the target instructions (L2: ADD, SUB) follow]
Lecture 6
• Dynamic Branch Predictor
  – 1-bit and 2-bit saturating counters in each entry of a branch prediction buffer
    • Could have more than two bits, but two bits cover most patterns (e.g., loops)

[Figure: the last-outcome (1-bit) FSM with states 0 and 1, and the 2-bit saturating counter FSM with states 00, 01, 10, 11 — states 00/01 predict NT, 10/11 predict T; the counter moves toward 11 on a taken outcome and toward 00 on a not-taken outcome. A branch prediction buffer holds one prediction entry per branch (e.g., T, NT, T, T, NT). T: Taken, NT: Not Taken]
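The 2-bit counter's behavior, and why it tolerates the single not-taken outcome at a loop exit, can be sketched directly from the FSM:

```python
# 2-bit saturating counter: states 0..3; 0-1 predict not-taken, 2-3 predict
# taken; increment on a taken outcome, decrement on a not-taken outcome.

def predict(counter):
    return counter >= 2                     # True -> predict taken

def update(counter, taken):
    if taken:
        return min(counter + 1, 3)          # saturate at strongly-taken
    return max(counter - 1, 0)              # saturate at strongly-not-taken

# Loop branch pattern T T T NT: one mispredict at loop exit only weakens the
# counter from 3 to 2, so the branch is still predicted taken on re-entry.
c = 3
for taken in [True, True, True, False]:
    c = update(c, taken)
```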
Lecture 6
• The two-bit predictor is good for branches in loops
• How can we improve the prediction rate for branches other than those in loops?

  Example:
  if (a==2) then a = 0;   ← if these two conditions succeed (not taken),
  if (b==2) then b = 0;
  if (a!=b) then ...      ← this third if condition will fail (taken)

  The third branch is correlated with the first two branches.
  How can we apply correlations to branch prediction?
Lecture 6
• Two-level Predictors
  – The branch predictor buffer maintains one or multi-bit counters
  – Each branch can either share a global branch predictor buffer or have its own branch predictor buffer
  – Four combinations:
    • Global history and global predictor
    • Global history and private predictor
    • Private history and global predictor
    • Private history and private predictor
Lecture 6
• Global history and global predictor
  – The outcomes of the last N branches are used to index the global branch predictor buffer
  – All branches share the same branch predictor buffer
  – For the cases when all branches are strongly correlated with each other

  Example: a 2-bit history register and 2-bit predictors with the following initial values.
  There are four branches in the code (BNE, BEQ, BEQ, BNE) and their actual outcomes are T, T, NT, NT, respectively.

  History: 00   Predictor buffer — entry 0: 00, entry 1: 10, entry 2: 00, entry 3: 01

  The first BNE is predicted using history 00 → entry 0 (predictor value 00 → predicted NT); its actual outcome is T, so the predictor entry and the history register are updated.
Lecture 6
• Global history and global predictor — updating the history register
  – Shift left by 1 and add the most recent branch outcome to the LSB of the history register; discard the shifted-out MSB to keep 2 bits only
  – The shift happens regardless of the branch outcome (here 00 → 01 after the taken BNE)
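The shift-and-insert history update described above can be written in one line:

```python
# Global-history update: shift the N-bit history left by one, insert the
# newest outcome (1 = taken) in the LSB, drop the shifted-out MSB.

HISTORY_BITS = 2

def update_history(history, taken):
    return ((history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)

h = 0b00
h = update_history(h, True)   # 00 -> 01 after a taken branch
h = update_history(h, True)   # 01 -> 11
h = update_history(h, False)  # 11 -> 10 (old MSB discarded)
```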
Lecture 6
• Global history and global predictor — state after the first branch
  – After the taken BNE, the history register becomes 01 and predictor entry 0 is incremented from 00 to 01
  – Predictor buffer: entry 0: 01, entry 1: 10, entry 2: 00, entry 3: 01
  – The next branch (BEQ) is now predicted using history 01 → entry 1 (predictor value 10 → predicted T)
Lecture 6
• In MIPS, exceptions are managed by the System Control Coprocessor (CP0)
• The CPU provides the address of the instruction where the event occurred
  – The Exception Program Counter (EPC) contains the address of the instruction
  – The CPU might undo the addition of 4 from the fetch cycle
• The Cause register contains a value that indicates what type of event occurred
  – For example:
    Invalid Instruction: Cause = 0x0000000A
    Arithmetic Overflow: Cause = 0x0000000C
Lecture 6
[Figure: a pipeline diagram in which an instruction raises an arithmetic overflow in the EXE stage]
Lecture 6
[Figure: on the overflow exception, three instructions (including the add) are flushed and the exception handler code is fetched]
Lecture 7
• Instruction Level Parallelism (ILP)

• Basic idea: execute several instructions in parallel
• We already do pipelining...
  – But it can only push through at most 1 inst/cycle
• We want multiple insts/cycle
  – Yes, it gets a bit complicated
    • More transistors/logic
  – That's how we got from the 486 (pipelined) to the Pentium and beyond
Lecture 7
• N-way Superscalar
  – Fetch/Decode/Execute N instructions concurrently
• Execute 2 instructions per cycle (2-way)
  – One integer/branch/memory instruction and one floating-point instruction
  – The instructions in a pair must be independent
  – Easy to implement because MIPS has separate register files for integer and floating point
• N-way superscalar
  – 3~8 ways
    • 3-way: int, branch, fp
    • 6-way: branch, mem, 3 int, fp

  IF ID EX  ME  WB
  IF ID FP1 FP2 FP3 FP4 FP5
Lecture 7
• Now let's see the timing in the 2-way superscalar

  IF ID EX  ME  WB
  IF ID FP1 FP2 FP3 FP4 FP5

  ld.s  $f0, 0($t1)
  ld.s  $f1, 0($t2)
  subi  $t3, $t3, #1
  add.s $f2, $f1, $f0
  addi  $t1, $t1, #4
  addi  $t2, $t2, #4
  st.s  $f2, -4($t1)
  bnez  $t3, Loop

[Pipeline diagram: the two ld.s instructions and subi issue first; add.s stalls one cycle in ID waiting for $f1 and then occupies FP1-FP5; st.s stalls several cycles in ID waiting for $f2, delaying bnez in IF. Not enough floating-point instructions plus the dependencies → no performance improvement]
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations — remove the loop index calculation and the
  branch in the middle; instead, calculate the loop index once for the
  combined iterations (subi … #2 instead of #1):

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        addi  $t1, $t1, #4
        addi  $t2, $t2, #4
        ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        addi  $t1, $t1, #4
        addi  $t2, $t2, #4
        subi  $t3, $t3, #2
        bnez  $t3, Loop
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations — remove the address calculations in the
  middle as well; update the offsets of the mem ops in the second iteration
  (0 → 4) and calculate the addresses once for the combined iterations
  (addi … #8):

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        ld.s  $f0, 4($t1)
        ld.s  $f1, 4($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 4($t1)
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        subi  $t3, $t3, #2
        bnez  $t3, Loop
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        ld.s  $f0, 4($t1)
        ld.s  $f1, 4($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 4($t1)
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        subi  $t3, $t3, #2
        bnez  $t3, Loop

  Now, can we reorder insts across the original two iterations? (i.e., can
  insts in the 2nd iteration be moved up to the 1st iteration region?)
  No, because the same registers are used in the two original iterations
  (the second ld.s can’t be moved above add.s, because add.s should read the
  first ld.s’ $f0 value).
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Ex. Unrolling 2 loop iterations — rename registers! In the original 2nd
  iteration code: $f0 → $f3, $f1 → $f4, $f2 → $f5

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        add.s $f2, $f1, $f0
        st.s  $f2, 0($t1)
        ld.s  $f3, 4($t1)
        ld.s  $f4, 4($t2)
        add.s $f5, $f4, $f3
        st.s  $f5, 4($t1)
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        subi  $t3, $t3, #2
        bnez  $t3, Loop
Lecture 7
• Loop Unrolling helps to find instructions that can be executed concurrently
  Now, let’s reorder insts across the two iterations: move the loads to the
  beginning of the new loop and the stores to the end, adjusting the store
  offsets (the stores now execute after addi … #8, so 0 → -8 and 4 → -4):

  Loop: ld.s  $f0, 0($t1)
        ld.s  $f1, 0($t2)
        ld.s  $f3, 4($t1)
        ld.s  $f4, 4($t2)
        add.s $f2, $f1, $f0
        add.s $f5, $f4, $f3
        subi  $t3, $t3, #2
        addi  $t1, $t1, #8
        addi  $t2, $t2, #8
        st.s  $f2, -8($t1)
        st.s  $f5, -4($t1)
        bnez  $t3, Loop
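The effect of the full transformation can be sketched in a high-level language (a Python illustration of the idea, not the lecture’s MIPS code; the function and variable names are illustrative): the body handles two elements per trip, distinct temporaries play the role of the renamed registers, loads are hoisted and stores sunk, and the index is updated once per unrolled body.

```python
def add_arrays_unrolled(a, b):
    # Unrolled by 2: the two load/add/store groups use distinct temporaries
    # (the "renamed registers" f0/f3, f1/f4, f2/f5), so they are independent
    # and could issue together on a 2-way superscalar machine.
    n = len(a)
    c = [0] * n
    i = 0
    while i + 1 < n:
        f0, f3 = a[i], a[i + 1]      # loads hoisted to the top of the body
        f1, f4 = b[i], b[i + 1]
        f2 = f1 + f0                 # two independent adds
        f5 = f4 + f3
        c[i], c[i + 1] = f2, f5      # stores sunk to the bottom of the body
        i += 2                       # index updated once per unrolled body
    if i < n:                        # cleanup iteration when n is odd
        c[i] = a[i] + b[i]
    return c
```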
Lecture 7
• Limitations of Loop-Unrolling
  – Works well if loop iterations are independent
    • No loop-carried dependencies
  – Problem with loop-carried RAW dependency
    Example:
      for (i=5; i<100; i++)
          A[i] = A[i-5] + B[i];
    • Store and read on the same A array space in every 5 iterations
    • Unroll limit: up to 5 iterations
    • Unrolling more than 5 iterations → loop-carried dependency limits
      code reordering
  – Consumes more registers due to renaming
  – Bigger code size
    • Affects I-cache and memory
  – Problem when the # of iterations is unknown at compile time
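The loop-carried dependency above can be sketched directly (a minimal Python rendering of the example; array contents are illustrative): iteration i reads what iteration i-5 wrote, which is why only groups of up to 5 consecutive iterations are free to reorder.

```python
def run_loop(a, b):
    # A[i] = A[i-5] + B[i]: iteration i reads the value iteration i-5 wrote,
    # so any group of more than 5 consecutive iterations contains a RAW
    # dependency and cannot be freely reordered after unrolling.
    for i in range(5, len(a)):
        a[i] = a[i - 5] + b[i]
    return a
```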
Lecture 8
• Tomasulo organization (the figure shows only the floating-point execution
  units; Copyright © 2012, Elsevier Inc. All rights reserved.)
  – Front-end: fetched insts are stored in an instruction queue; insts are
    dispatched from this inst queue to execution units in FIFO fashion
  – Address unit calculates the target mem address for mem operations
  – Reservation stations are used for register renaming: architected
    registers are renamed to the entry id of the reservation station
  – Common data bus (CDB): an execution result is propagated through the CDB
    to the reservation stations, ld/st buffers, and register file in the WB
    stage
Lecture 8
• Basic Tomasulo Algorithm’s Three Steps:
  – Issue
    • Get next instruction from instruction queue
    • If an RS is available, issue the instruction to the RS, with operand
      values if available
    • If no RS is available, stall the instruction (structural hazard)
  – Execute
    • When all operands are ready, execute the instruction
    • Loads and stores maintained in program order through effective address
    • No instruction allowed to initiate execution until all branches that
      precede it in program order have completed
  – Write result
    • Write result on CDB into reservation stations, register file and store
      buffers
Lecture 8
  Assume we have: 1 MUL/DIV unit, 1 LD/ST unit, 1 Arithmetic unit
  Instruction takes: Load: 2 cycles, Add/Sub: 2 cycles,
                     Mult: 10 cycles, Divide: 40 cycles

  no.  Instruction            ISSUE  EXE   WB
  I1   ld.s  $f6, 34($t2)     1      2-3
  I2   ld.s  $f2, 45($t3)     2
  I3   mul.s $f0, $f2, $f4
  I4   sub.s $f8, $f2, $f6
  I5   div.s $f10, $f0, $f6
  I6   add.s $f6, $f8, $f2

  Reservation Stations:
    LD1: Busy, I1, Vk = $t2, A = 34 + Regs[$t2]
    LD2: Busy, I2, Vk = $t3, A = 45 + Regs[$t3]
    AD1–AD3, ML1–ML2: empty

  Register Status (Qi): F2 → LD2, F6 → LD1

  If a reg value is already in the reg file, fill the operand field with the
  reg id.
Lecture 8
  (Same assumptions: 1 MUL/DIV unit, 1 LD/ST unit, 1 Arithmetic unit;
  Load: 2 cycles, Add/Sub: 2 cycles, Mult: 10 cycles, Divide: 40 cycles)

  no.  Instruction            ISSUE  EXE   WB
  I1   ld.s  $f6, 34($t2)     1      2-3
  I2   ld.s  $f2, 45($t3)     2
  I3   mul.s $f0, $f2, $f4    3
  I4   sub.s $f8, $f2, $f6
  I5   div.s $f10, $f0, $f6
  I6   add.s $f6, $f8, $f2

  Reservation Stations:
    LD1: Busy, I1, Vk = $t2, A = 34 + Regs[$t2]
    LD2: Busy, I2, Vk = $t3, A = 45 + Regs[$t3]
    ML1: Busy, I3, Qj = LD2, Vk = $f4

  Register Status (Qi): F0 → ML1, F2 → LD2, F6 → LD1

  If an operand value is being computed by another inst, fill the operand
  field with the RS id.
Lecture 8
  (Same assumptions: 1 MUL/DIV unit, 1 LD/ST unit, 1 Arithmetic unit;
  Load: 2 cycles, Add/Sub: 2 cycles, Mult: 10 cycles, Divide: 40 cycles)

  no.  Instruction            ISSUE  EXE    WB
  I1   ld.s  $f6, 34($t2)     1      2-3    4
  I2   ld.s  $f2, 45($t3)     2      4-5    6
  I3   mul.s $f0, $f2, $f4    3      6-15
  I4   sub.s $f8, $f2, $f6    4      6-7
  I5   div.s $f10, $f0, $f6   5
  I6   add.s $f6, $f8, $f2    6

  Reservation Stations:
    LD2: Busy, I2, Vk = $t3, A = 45 + Regs[$t3]
    AD1: Busy, I4, Vj = $f2, Vk = $f6
    AD2: Busy, I6, Qj = AD1, Vk = $f2
    ML1: Busy, I3, Vj = $f2, Vk = $f4
    ML2: Busy, I5, Qj = ML1, Vk = $f6

  Register Status (Qi): F0 → ML1, F6 → AD2, F8 → AD1, F10 → ML2

  If an operand value is written back in the same cycle, the newly issued
  inst fetches the operand from the reg file, so use the reg id.
Lecture 8
• Tomasulo Algorithm with ROB
  – The oldest inst in the ROB can update the register file
    (commit in program order)
Lecture 8
• Issue
  – An inst can be issued only when an appropriate RS entry and a ROB entry
    are available
  – Registers are renamed by using ROB id
• Write Results
  – Write result back to your ROB entry
  – Register file is only updated by the oldest instruction in the ROB
  – Mark ready/finished bit in ROB
Lecture 8
• Commit
  – When an inst is the oldest in the ROB
    • i.e. ROB-head points to it
  – Write result (if ready/finished bit is set)
    • If register producing instruction: write to architected register file
    • If store: write to memory
  – Advance ROB-head to next instruction
Tomasulo with ROB
  Assume we have: 2 MUL/DIV units, 1 Arithmetic unit
  Instruction takes: Add/Sub: 1 cycle,
                     Mult: 10 cycles, Divide: 40 cycles
  You can bypass and begin execution in the same cycle by using a value in
  the ROB.

  no.  Instruction           ISSUE  EXE    WB  COMMIT
  I1   div $t2, $t3, $t4     1      2-41
  I2   mul $t1, $t5, $t6     2      3-12   13
  I3   add $t3, $t7, $t8     3      4      5
  I4   mul $t1, $t1, $t3     14     15-24
  I5   sub $t4, $t1, $t5     15
  I6   add $t1, $t4, $t2

  Even though an operand value has been updated to the ROB (i.e., I2 and
  I3), if the value is not yet updated to the reg file, a dependent inst
  uses the ROB id (see I4).

  Register file (value / Qi):
    $t1 = -23 / ROB4   $t2 = 16 / ROB1   $t3 = 45 / ROB3   $t4 = 5 / ROB5
    $t5 = 3            $t6 = 4           $t7 = 1           $t8 = 2

  ROB:
    1: dest $t2, not ready        2: dest $t1, val 12, ready
    3: dest $t3, val 3, ready     4: dest $t1 (I4), not ready
    5: dest $t4 (I5), not ready   6: (free)

  Reservation Stations:
    AD1: Busy, I5, Qj = ROB4, Vk = 3 ($t5)
    ML1: Busy, I1, Vj = 45 ($t3), Vk = 5 ($t4)
    ML2: Busy, I4, Vj = 12 (from ROB2), Vk = 3 (from ROB3)
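The commit rule can be sketched as follows (a hedged Python illustration; the dict fields are assumed names, not the lecture’s notation): only the ROB head may update architected state, and commit stops at the first unfinished entry so updates stay in program order.

```python
from collections import deque

def commit(rob):
    # Commit from the ROB head while the head entry is ready; stop at the
    # first unfinished instruction so register-file updates happen strictly
    # in program order, even if younger entries are already finished.
    done = []
    while rob and rob[0]['ready']:
        e = rob.popleft()
        done.append((e['dest'], e['val']))
    return done
```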
Lecture 9
• Memory disambiguation
  – Figuring out whether two addresses are equal to avoid hazards
• Conservative approach
  – A ready load must wait until addresses of all preceding stores are known
• Optimistic approach: speculative disambiguation
  – Speculatively assume that addresses of a load and its preceding stores
    are different and just execute the load!
  – Later, when a store’s address has been computed, check all the following
    loads in the load/store buffer
    • If a load has the same address and is already executed, then the load
      and all following instructions must be replayed.
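The check a store performs under speculative disambiguation can be sketched like this (a minimal Python illustration; the tuple layout is an assumption, not a real buffer format): when the store’s address resolves, any already-executed younger load with a matching address is flagged for replay.

```python
def loads_to_replay(store_addr, speculated_loads):
    # speculated_loads: (inst_id, addr) pairs for loads younger than the
    # store that already executed before the store's address was known.
    # A load whose address matches the store read a stale value, so it
    # (and everything after it) must be replayed; here we report matches.
    return [inst for inst, addr in speculated_loads if addr == store_addr]
```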
Lecture 9
• Conservative Approach
  – Load should wait until all preceding stores’ addresses are computed

  no.  Instruction           ISSUE  EXE     WB  COM
  I1   ld   $t3, 0($t6)      1      2-3     4   5
  I2   mul  $t4, $t3, $t5    2      4-13    14  15
  I3   addi $t7, $t6, 4      3      4       5   16
  I4   st   $t7, 0($t4)      4      14-15       17
  I5   sub  $t1, $t1, $t2    5      6-7     8   18
  I6   ld   $t8, 4($t1)      6      16-17   18  19

  I6 can begin execution at cycle 8 but waits until the st address is
  computed
Lecture 9
• Optimistic Approach
  – Speculatively assume that addresses of a load and its preceding stores
    are different and just execute the load!

  no.  Instruction           ISSUE  EXE     WB  COM
  I1   ld   $t3, 0($t6)      1      2-3     4   5
  I2   mul  $t4, $t3, $t5    2      4-13    14  15
  I3   addi $t7, $t6, 4      3      4       5   16
  I4   st   $t7, 0($t4)      4      14-15       17
  I5   sub  $t1, $t1, $t2    5      6-7     8   18
  I6   ld   $t8, 4($t1)      6      8-9     10

  If the st address turns out to match, flush the load and following insts
  and replay them
Lecture 9
• DRAM = Dynamic RAM
• SRAM = Static RAM
• SRAM: 6T (transistors) per bit
  – built with normal high-speed CMOS technology
• DRAM: 1T + 1 capacitor per bit
  – built with special DRAM process optimized for density
Lecture 9
• Hit: Data appears in some block of the cache
  – Hit Rate: # hits / total accesses on the cache
  – Hit Time: Time to access the cache
• Miss: Data needs to be retrieved from a block in the lower level (e.g., Block Y)
  – Miss Rate: 1 - (Hit Rate)
  – Miss Penalty: Average delay in the processor caused by each miss
• Average memory-access time (AMAT): Hit time + Miss rate x Miss penalty
  (Figure: on a hit, the upper-level memory (cache) returns the data to the
  processor directly; on a miss, the block can’t be found in the cache and
  is retrieved from the lower-level memory.)
Lecture 9
  First-level cache (L1):  hit time 1 cycle    (on processor die)
  Second-level cache (L2): 10 cycles           (on processor die)
  Third-level cache (L3):  20 cycles           (on processor die)
  Main memory (DRAM):      300 cycles          (off-chip)

• AMAT in multi-level cache organization
  = Thit(L1) + Miss_rate(L1) x
      [ Thit(L2) + Miss_rate(L2) x
        { Thit(L3) + Miss_rate(L3) x T(memory) } ]
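The nested AMAT formula above can be evaluated by folding from the last level back to L1 (a small Python sketch; the example numbers below reuse the hit times from the slide with assumed 10% miss rates):

```python
def amat(hit_times, miss_rates, mem_time):
    # AMAT = Thit(L1) + MR(L1) * [Thit(L2) + MR(L2) * {... + MR(Ln) * Tmem}]
    # Evaluate inside-out: start at memory and fold each level backwards.
    t = mem_time
    for hit, mr in zip(reversed(hit_times), reversed(miss_rates)):
        t = hit + mr * t
    return t
```

With the slide’s latencies (1, 10, 20, 300 cycles) and 10% miss rates at every level, the average access costs only 2.5 cycles, which is the point of the hierarchy.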
Lecture 9
• Memory blocks are mapped to cache lines
• Cache types w.r.t. Mapping
– Direct Mapped
• A memory value can be placed at a single corresponding location in the cache
• Fast indexing mechanism
– Set-Associative
• A memory value can be placed in any of a set of locations in the cache
• Slightly more complex search mechanism
– Fully-Associative
• A memory value can be placed in any location in the cache
• Extensive hardware resources required to search
Lecture 10
• A direct-mapped cache consists of a Tag RAM and a Data RAM
  – Each line of both RAMs corresponds to a cache block (line)
• When a new data block is fetched to cache, the tag field is stored to the
  Tag RAM and the fetched data is stored to the Data RAM

  Ex. (8 blocks of 32 bytes each)
  0x77FF1C68 = 0111 0111 1111 1111 0001 1100 0110 1000
    – Tag:         0111 0111 1111 1111 0001 1100
    – Index:       011 → find the proper block location (block 3 of 8)
    – Word offset: 010 → find the requested word location (2nd word)
    – Byte offset: 00
  The index selects one line; its stored tag (with the valid bit) is
  compared against the address tag, and the word offset picks the word
  within the 32-byte data block.
Lecture 10
• Given a 2MB, direct-mapped cache, line (block) size = 64 bytes
• Data address is 52 bits (52-bit address = Tag | Index | Block offset)
• Tag size?
  – block offset: 6 bits
  – # blocks = 2^21/2^6 = 2^15 → # bits in index: 15 bits
  – # bits in an address: 52 bits
  – # bits in tag = # bits in an address - # bits in index - # bits in block offset
    = 52 – 15 – 6 = 31 bits
• Now change it to 16-way set associative. Tag size?
  – # sets = # blocks/16 = 2^15/2^4 = 2^11 → # bits in index: 11 bits
  – # bits in tag = 52 – 11 – 6 = 35 bits
• How about if it’s fully associative? Tag size?
  – # bits in tag = # bits in an address - # bits in block offset = 52 – 6 = 46 bits
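The three cases above follow from one formula, sketched here in Python (function name is illustrative): a direct-mapped cache is ways = 1, and a fully associative cache has ways equal to the number of blocks, so the index shrinks to 0 bits.

```python
import math

def tag_index_offset(cache_bytes, block_bytes, ways, addr_bits):
    # offset bits cover one block; index bits cover the number of sets;
    # whatever address bits remain become the tag.
    offset_bits = int(math.log2(block_bytes))
    sets = cache_bytes // (block_bytes * ways)
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits
```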
Lecture 10
• The 3 C’s
– Compulsory (cold) Misses
• On the 1st reference to a block
• Related to # blocks accessed by a code, not related to the configuration of a cache
– Capacity Misses
• Happen when the cache space is not sufficient to hold data
• Can be reduced by increasing cache size
– Conflict Misses
• Happen when multiple memory blocks map to the same line in direct-mapped or map to
the same set in set-associative caches
• Can be reduced by increasing the associativity or cache size
Lecture 10
• Cache replacement policy
  – When loading a new block (line), if the cache is already full, which
    block (line) should be replaced?
• Random
  – Replace a randomly chosen line
• FIFO
  – Replace the oldest line
• LRU (Least Recently Used)
  – Replace the least recently used line
• pseudo-LRU
  – LRU but with less overhead
  – Cache blocks (e.g., ways A, B, C, D) become leaf nodes of a binary
    tree; each intermediate node holds one bit pointing toward the LRU
    group, so following the pointers from the root finds the line to
    replace.
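True LRU for one set can be sketched with an ordered map (a minimal Python illustration; the class name is an assumption): a hit moves the line to the most-recently-used position, and a miss on a full set evicts the line at the least-recently-used end.

```python
from collections import OrderedDict

class LRUSet:
    # One set of an N-way set-associative cache with LRU replacement.
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()      # insertion order = LRU ... MRU

    def access(self, tag):
        if tag in self.lines:           # hit: move line to the MRU end
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # miss on a full set: evict LRU
        self.lines[tag] = None          # fill the line; it is now MRU
        return False
```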
Lecture 10
• Write through
  – The value is written to both the cache line and to the lower-level memory.
  – (Figure: all accesses go to L1; all stores also go through a write
    buffer to L2; L1 misses go to L2.)
• Write back
  – The value is written only to the cache line. The modified cache line is
    written to main memory only when it has to be replaced.
  – To distinguish a modified cache line, a dirty bit is used.
Lecture 11
• Assumption
  – A cache block size: 4 words
  – Consecutive words accessed
  – Average memory access latency is 200 cycles
  (Notation in the timeline: a request on block N causes a primary miss, a
  secondary miss, or a hit.)

• Blocking Cache: each miss (200 cycles) must complete before the next
  access is accepted, so the 1st and 2nd misses are serviced one after
  another.
• Non-Blocking Cache with 2 MSHRs: the 1st miss on a block (primary miss)
  allocates MSHR-1; further accesses to the same in-flight block are
  secondary misses tracked in the same MSHR. A miss on a different block
  (another primary miss) allocates MSHR-2. When both MSHRs are occupied, no
  further accesses are acceptable; once an MSHR is released, the next
  primary miss can allocate it again.
Lecture 11
• Page replacement
  – FIFO/LRU
• Write strategy
  – Write-back
• Size of the page table?
  – 32-bit virtual address
  – 4 KB page
  – page table entry: 4 B

  # bits in page offset: 12 bits
  # bits in virtual page number: 32 – 12 = 20 bits
  # entries in the page table: 2^20
  → 2^20 x 4 B = 4 MB ... Too large!
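The 4 MB figure follows mechanically from the parameters, as this small Python sketch shows (function name is illustrative):

```python
import math

def flat_page_table_bytes(va_bits, page_bytes, pte_bytes):
    offset_bits = int(math.log2(page_bytes))   # 12 bits for 4 KB pages
    vpn_bits = va_bits - offset_bits           # 20 bits for a 32-bit VA
    return (1 << vpn_bits) * pte_bytes         # one PTE per virtual page
```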
Lecture 11
• Size of the page table?
  – 32-bit virtual address (V1 | V2 | V3 | page offset)
  – 4 KB page
  – page table entry: 4 B
  – 3-level page table
    • v1: 8 bits, v2: 6 bits, v3: 6 bits
  – # active page tables per level
    • level 1: 1, level 2: 4, level 3: 5

  # entries in the 1st level page table: 2^8
  Size of the 1st level page table: 2^8 x 4 B = 1 KB
  # entries in each 2nd and 3rd level page table: 2^6
  Size of the 2nd level tables: 4 tables x 2^6 x 4 B = 1 KB
  Size of the 3rd level tables: 5 tables x 2^6 x 4 B = 1.25 KB
  → Total of 3.25 KB — way smaller than the 4 MB single-level page table!
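The same arithmetic generalizes to any level split (a Python sketch; the function name is illustrative): only tables that are actually allocated contribute, which is exactly why the hierarchical table wins for sparse address spaces.

```python
def multilevel_table_bytes(level_bits, active_tables, pte_bytes):
    # Each active table at a level holds 2**bits entries of pte_bytes each;
    # sum only the tables actually allocated at every level.
    return sum(n * (1 << bits) * pte_bytes
               for bits, n in zip(level_bits, active_tables))
```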
Lecture 11
• TLB
  (Figure: the virtual page number (bits 31–12) is looked up in the TLB,
  which holds recently used translations; on a TLB hit the physical page
  number is produced without walking the page table, while the page offset
  (bits 11–0) passes through unchanged.)
Lecture 12
• Write-back cache updates memory only when the block is replaced
  – When a processor updates a block, the other processors should know
    their copies are stale → Invalidate
• Block States:
  – Invalid (I)
    • This copy is stale
  – Shared (S)
    • One or more caches are sharing the same clean copy
    • Memory is clean
  – Modified (M)
    • There is one cache that has the most updated copy of the block
    • Other caches that have the same block copy have old values (in
      invalid state)
    • Memory is stale
• MSI transitions (event / bus action):
  – I: PrRd/BusRd → S; PrWr/BusRdX → M; BusRd/--, BusRdX/--, BusUpgr/-- → I
  – S: PrRd/--, BusRd/-- → S; PrWr/BusUpgr → M; BusRdX/--, BusUpgr/-- → I
  – M: PrRd/--, PrWr/-- → M; BusRd/Flush → S; BusRdX/Flush → I
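The MSI transitions can be written down as a lookup table, which makes them easy to check (a hedged per-block sketch of the protocol in the slide, not a full cache simulator):

```python
# Events PrRd/PrWr come from the local processor; BusRd/BusRdX/BusUpgr are
# snooped from other caches. Each entry maps to (next state, bus action).
MSI = {
    ('I', 'PrRd'):    ('S', 'BusRd'),   # read miss: fetch a shared copy
    ('I', 'PrWr'):    ('M', 'BusRdX'),  # write miss: fetch exclusive copy
    ('S', 'PrRd'):    ('S', None),
    ('S', 'PrWr'):    ('M', 'BusUpgr'), # upgrade: invalidate other copies
    ('S', 'BusRdX'):  ('I', None),
    ('S', 'BusUpgr'): ('I', None),
    ('M', 'PrRd'):    ('M', None),
    ('M', 'PrWr'):    ('M', None),
    ('M', 'BusRd'):   ('S', 'Flush'),   # supply dirty data, demote to S
    ('M', 'BusRdX'):  ('I', 'Flush'),   # supply dirty data, invalidate
}

def msi_next(state, event):
    # Unlisted (state, event) pairs leave the state unchanged with no
    # action (e.g., bus traffic observed while already in I).
    return MSI.get((state, event), (state, None))
```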
Lecture 12
• MESI — red-colored transitions in the lecture’s diagram are the unique
  transitions compared to MSI; the Exclusive (E) state is added:
  – On a PrRd miss, when the “Shared” signal is zero (there are no cached
    copies of the block in other caches), go to E; otherwise go to S.
  – E: PrRd/-- stays in E; on BusRd, go to S (no flush needed — memory is
    clean); on PrWr, silently transition to M (no bus transaction, since no
    other cache has a copy); BusRdX/Flush → I.
  – The M, S, and I transitions are the same as in MSI.
Lecture 12
• MOESI — red-colored transitions in the lecture’s diagram are the unique
  transitions compared to MESI; the Owned (O) state is added:
  – M on BusRd → O with Flush: the owner supplies the data, but Flush here
    does not update memory unless the block is evicted → most block
    communication is via cache-to-cache transfer.
  – O: any request on this block is responded to by this cache; PrRd/--
    stays in O; BusRd/Flush stays in O; PrWr/BusUpgr → M; BusRdX/Flush → I
    (this cache loses ownership).
  – State summary:
                Modified   Clean
    Not shared  M          E
    Shared      O          S
    (I is invalid)