Cache Memory

Basic Philosophy
• Temporal Locality
• Spatial Locality
• Temporal Locality: if a particular memory location is referenced at one
point in time, it is likely that the same location will be referenced again
in the near future. There is temporal proximity between adjacent references
to the same memory location. In this case it is common to store a copy of
the referenced data in special memory storage that can be accessed faster.
Temporal locality is a special case of spatial locality, namely when the
prospective location is identical to the present location.
• Spatial Locality: if a particular memory location is referenced at a
particular time, it is likely that nearby memory locations will be referenced
in the near future. There is spatial proximity between memory locations
referenced at almost the same time. In this case it is common to estimate
how large a neighbourhood around the current reference is worth preparing
for faster access.
Basic Terms
• Cache Block
• Miss/Hit
• Miss Rate/Hit Rate
• Miss Penalty
• Hit Time
• 3-Cs of caches
– Conflict
– Compulsory
– Capacity
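Direct Mapped Cache
Assume a 5-bit address bus and a cache with 8 entries.
[Figure: the processor address is split into a tag (D4 – D3) and an index (D2 – D0). The index selects one of 8 entries (000 to 111), each holding Valid, TAG, and DATA fields. A comparator (=) checks the stored tag against the address tag to produce HIT, and DATA drives the data bus.]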
Direct Mapped Cache
First Load
LD R1, (01010) ; remember the 5-bit address bus; assume data is 8-bit and AA (hex) is stored at this location
TAG (D4 – D3) = 01, Index (D2 – D0) = 010

Index  Valid  TAG  DATA
000    0
001    0
010    0
011    0
100    0
101    0
110    0
111    0

The first access causes a MISS: the data is loaded from memory and the cache HIT bit is set to 1.
Direct Mapped Cache
After first load
LD R1, (01010) ; AA (hex) is stored at this location; the cache HIT bit is set to 1
TAG (D4 – D3) = 01, Index (D2 – D0) = 010

Index  Valid  TAG  DATA
000    0
001    0
010    1      01   AA
011    0
100    0
101    0
110    0
111    0
Direct Mapped Cache
Second Load
LD R1, (11010) ; assume 99 is stored at address 11010
TAG (D4 – D3) = 11, Index (D2 – D0) = 010

Index  Valid  TAG  DATA
000    0
001    0
010    1      01   AA
011    0
100    0
101    0
110    0
111    0

The same index with a different TAG causes a MISS; the data is loaded from memory.
Direct Mapped Cache
After Second Load
LD R1, (11010) ; remember the 5-bit address bus; assume 99 is stored at this location
TAG (D4 – D3) = 11, Index (D2 – D0) = 010

Index  Valid  TAG  DATA
000    0
001    0
010    1      11   99
011    0
100    0
101    0
110    0
111    0

On the first access to 11010, the same index with a different TAG caused a MISS, and the data loaded from memory replaced the previous entry.
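The two loads above can be traced with a minimal C sketch of this 5-bit, 8-entry direct-mapped cache. The type and function names (`cache_entry`, `dm_cache`, `dm_cache_load`) are hypothetical, and the `mem_value` parameter simply stands in for the byte that main memory would return on a miss; none of this comes from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INDEX_BITS  3                    /* 8 entries, address bits D2-D0 */
#define NUM_ENTRIES (1u << INDEX_BITS)

/* One cache entry: valid bit, 2-bit tag (D4-D3), 8-bit data. */
typedef struct {
    bool    valid;
    uint8_t tag;
    uint8_t data;
} cache_entry;

typedef struct {
    cache_entry entry[NUM_ENTRIES];
} dm_cache;

/* Look up a 5-bit address. Returns true on a HIT.
   On a MISS the indexed entry is overwritten with mem_value,
   an assumed stand-in for the byte fetched from main memory. */
bool dm_cache_load(dm_cache *c, uint8_t addr, uint8_t mem_value, uint8_t *out)
{
    assert(addr < 32);                                   /* 5-bit address bus */
    uint8_t index = (uint8_t)(addr & (NUM_ENTRIES - 1)); /* D2-D0 */
    uint8_t tag   = (uint8_t)(addr >> INDEX_BITS);       /* D4-D3 */
    cache_entry *e = &c->entry[index];

    bool hit = e->valid && e->tag == tag;
    if (!hit) {                          /* miss: fill or replace the entry */
        e->valid = true;
        e->tag   = tag;
        e->data  = mem_value;
    }
    *out = e->data;
    return hit;
}
```

Starting from a zeroed `dm_cache`, loading address 01010 misses and fills entry 010 with tag 01. Loading 11010 afterwards maps to the same index with tag 11, so it misses again and replaces the entry.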
Cache Size Example
Direct Mapped
Processor address bus = 32 bits (A)
Cache storage = 128KB = 32K words (2^N) with N = 15
Number of blocks in cache (entries) = 32K
Tag size = A – N – 2 (byte offset) = 32 – 15 – 2 = 15
Cache size = 128KB (data) + 32K × 15-bit (tag) + 32K × 1-bit (Hit bit) = 32K × 48 bits = 192KB
[Figure: a 32K × 48-bit memory with 32K entries, each holding Valid (1 bit), TAG (15 bits), and DATA (32 bits). Address bits A16 – A2 form the index; address bits A31 – A17 (15 bits) are compared (=) with the stored tag; Data Out is 32 bits.]
Cache Size Example (1)
Two-Way Set Associative
• Assume the same processor (A = 32, D = 32)
• Assume the same total data storage = 128KB
• Two ways (labelled SET 1 and SET 2 in the figure) means two direct-mapped caches of 64KB (128/2) each.
• 64KB = 16K words
• To address a 16K × 32-bit memory we need a 14-bit index.
• Hence tag size = 32 – 14 – 2 = 16
Two-Way Set Associative
[Figure: two 16K × 49-bit memories, labelled SET 1 and SET 2, each with 16K entries holding Valid (1 bit), TAG (16 bits), and DATA (32 bits). The 14-bit index (address bits A15 – A2) selects an entry in each memory; the 16-bit tag (address bits A31 – A16) is compared (=) with both stored tags, and a 2:1 MUX selects Data Out.]
Size = 2 (sets) × 16K × (32-bit + 16-bit + 1-bit) = 196KB
4-Way Set Associative
• Assume the same processor (A = 32, D = 32)
• Assume the same total data storage = 128KB
• Four ways (labelled SET 1 to SET 4 in the figure) means four direct-mapped caches of 32KB (128/4) each.
• 32KB = 8K words
• To address an 8K × 32-bit memory we need a 13-bit index.
• Hence tag size = 32 – 13 – 2 = 17
4-Way Set Associative
[Figure: four 8K × 50-bit memories, labelled SET 1 to SET 4, each with 8K entries holding Valid (1 bit), TAG (17 bits), and DATA (32 bits). The 13-bit index (address bits A14 – A2) selects an entry in each memory; the 17-bit tag (address bits A31 – A15) is compared (=) with all four stored tags, and a 4:1 MUX selects Data Out to the processor.]
Size = 4 (sets) × 8K × (32-bit + 17-bit + 1-bit) = 200KB
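The three size calculations above follow one pattern, which the C sketch below reproduces under the slides' assumptions: 32-bit words, one word per entry, and one valid bit per entry. The function names (`log2_exact`, `tag_bits`, `cache_total_bytes`) are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 for exact powers of two. */
unsigned log2_exact(uint64_t x)
{
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Tag width in bits, assuming 32-bit words (2 byte-offset bits)
   and one word per cache entry. */
unsigned tag_bits(unsigned addr_bits, uint64_t data_bytes, unsigned ways)
{
    assert(ways > 0);
    uint64_t entries_per_way = data_bytes / ways / 4;
    return addr_bits - log2_exact(entries_per_way) - 2;
}

/* Total storage in bytes: 32 data bits + tag bits + 1 valid bit per entry. */
uint64_t cache_total_bytes(unsigned addr_bits, uint64_t data_bytes, unsigned ways)
{
    uint64_t entries = data_bytes / 4;              /* summed over all ways */
    uint64_t bits_per_entry = 32 + tag_bits(addr_bits, data_bytes, ways) + 1;
    return entries * bits_per_entry / 8;
}
```

For a 32-bit address and 128KB of data, calling `cache_total_bytes` with 1, 2, and 4 ways gives 192KB, 196KB, and 200KB. Each doubling of associativity removes one index bit and adds one tag bit to every entry.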
Organization of the data cache in the Alpha 21264
[Figure: a two-way cache (SET 1 and SET 2) with 512 entries in each set, holding Valid (1 bit), TAG (29 bits), and DATA (one 64-byte block) fields. The 44-bit address from the processor is split into a byte offset (A5 – A0), an index (A14 – A6), and a 29-bit tag (A43 – A15). The tag is compared (=) with the stored tag in both sets, and a 2:1 MUX selects Data Out.]
Data storage = 2 (sets) × 512 entries × 64 bytes = 64KB
Four Memory Hierarchy Questions
• Where can a block be placed?
– Direct Mapped to Fully Associative
• How is a block found?
– Tag Comparison
• Which block should be replaced on a cache miss? (only for associative caches)
– LRU, Random, FIFO
4 Qs (Contd..)
• What happens on a write?
– Write Back: main memory is updated only when the data is replaced from the cache.
– Write Through: the information is updated in the upper as well as the lower level.
– Write Allocate: allocate the data in the cache on a write miss.
– Write No-Allocate: only write to the next level.
3 Cs of Caches
• Classifying Misses: 3 Cs
– Compulsory — The first access to a block is not in the cache,
so the block must be brought into the cache.
Also called cold start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity — If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due to
blocks being discarded and later retrieved.
(Misses in a Fully Associative Cache of Size X)
– Conflict — If block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be discarded
and later retrieved if too many blocks map to its set. Also called
collision misses or interference misses.
Example (Single Level Cache)
Given Statistics
Load/Store Instructions: 50%
Hit Time = 2 Clock Cycles, Hit Rate = 90%
Miss Penalty = 40 CC, Miss Rate = 10%
Average Memory Accesses/Instruction = 1.5

Ave. Mem Access Time = Hit Time + Miss Rate × Miss Penalty
= 2 + 0.1 × 40 = 2 + 4 = 6 ; 4 is the penalty cycles

CPI = ?
CPI (with perfect cache) = 2
CPI (overall) = CPI (perfect) + Extra Memory Stall Cycles/Instruction (penalty cycles)
= 2 + (6 – 2) × 1.5
= 2 + 6 = 8

Number of Instructions Method
Assume total instructions = 1000
Perfect Cache
Each instruction takes 2 clock cycles, hence 1000 × 2 = 2000 clock cycles
CPI (perfect) = CC/IC = 2000/1000 = 2
Imperfect Cache
Calculate the extra clock cycles
Number of memory accesses = 1000 × 1.5 (1000 for I$ and 500 for D$) = 1500 memory accesses in a 1000-instruction program
Cache misses (at 10%) = 1500 × 0.1 = 150
Extra (penalty) clock cycles for cache misses = 150 × 40 = 6000
Which is in fact: IC × (Mem Accesses/Instruction) × Miss Rate × Miss Penalty
Total clock cycles for the instructions with a perfect cache = 2000 clock cycles
Total for program = 2000 + 6000 = 8000
CPI = 8000/1000 = 8.0
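The single-level calculation can be checked with a short C sketch. The function names `amat` and `cpi_with_stalls` are hypothetical; the formulas are the ones used in the example, with all times in clock cycles.

```c
#include <assert.h>

/* Average memory access time = hit time + miss rate * miss penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Overall CPI = perfect-cache CPI + memory stall cycles per instruction,
   where stall cycles per instruction =
   (memory accesses per instruction) * miss rate * miss penalty. */
double cpi_with_stalls(double cpi_perfect, double accesses_per_instr,
                       double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return cpi_perfect + accesses_per_instr * miss_rate * miss_penalty;
}
```

With the statistics above, `amat(2, 0.1, 40)` evaluates to about 6 cycles and `cpi_with_stalls(2, 1.5, 0.1, 40)` to about 8, matching both methods in the example.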
Pentium Family Cache Sizes (Aug 2007)
Processor | Cache (data for L1, unified for L2) | Size | Number of Ways (Set Size) | Cache Line Size
Pentium®, Pentium® Processor with MMX* Technology | L1 | 16KB (8KB on early Pentium® processors) | 4 (2 on early processors) | 32 bytes
Pentium®, Pentium® Processor with MMX* Technology | L2 | 256KB or 512KB | 4 | 32 bytes
Pentium® Pro, Pentium® II, Pentium® III, Pentium®, Xeon®, Celeron* | L1 | 16KB (8KB on early P6 family processors) | 4 (2 on early processors) | 32 bytes
Pentium® Pro, Pentium® II, Pentium® III, Pentium®, Xeon®, Celeron* | L2 | 128KB, 256KB, 512KB, 1MB, 2MB | 4 | 32 bytes
Pentium® 4 | L1 | 8KB | 4 | 64 bytes
Pentium® 4 | L2 | 256KB | 8 | 128 bytes
Example with 2-level Cache
Stats:
L1: Hit Time = 2 Clock Cycles, Hit Rate = 90%, Miss Penalty to L2 = 10 CC (hit time for L2)
L2: Local Hit Rate = 80%, Miss Penalty (L2) = 40 CC
Load/Store Instructions: 50%
[Figure: CPU → L1 (HT = 2 CC, hit rate = 90%; out of 1000 memory accesses, 100 miss) → L2 (out of 1000 memory accesses, 200 miss) → Main Memory (HT = 40 CC). Global Miss Rate = ?]
Example 2
Once again, perfect-cache CPI = 2.0
AMAT = Hit Time (L1) + Miss Rate (L1) × (Hit Time (L2) + Miss Rate (L2) × Miss Penalty (L2))
= 2 + 0.1 × (10 + 0.2 × 40) = 3.8
CPI = CPI (perfect) + Extra Memory Stall Cycles/Instruction
= 2.0 + (3.8 – 2) × 1.5 = 4.7
Example 2 (contd..)
1000 Instruction Method
(Calculate the extra clock cycles, starting from the misses in L1)
Step 2 (Hit on L2)
Total accesses to L2 = 150 (misses from L1: 1500 memory accesses × 10%)
Extra CC on a miss in L1 and a hit in L2 = 150 × 10 = 1500
(all 150 accesses pay the L2 hit time, since each one eventually gets a hit in L2; very important)
Step 3 (Miss on L2)
Miss rate = (100 – 80) = 20%
Accesses missed in L2 = 150 × 0.2 = 30
Extra CC on a miss in L2 = 30 × 40 = 1200
Total extra clock cycles = 1500 + 1200 = 2700
Total clock cycles for the program = 2000 + 2700 = 4700
CPI = 4700/1000 = 4.7
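The two-level numbers can be reproduced with the C sketch below. The function names are hypothetical. The global miss rate helper uses the standard definition (L1 miss rate × L2 local miss rate), which the slides pose as a question but do not work out.

```c
#include <assert.h>

/* AMAT = HitTime(L1) + MissRate(L1) * (HitTime(L2) + LocalMissRate(L2) * MissPenalty(L2)). */
double amat_two_level(double hit_time_l1, double miss_rate_l1,
                      double hit_time_l2, double local_miss_rate_l2,
                      double miss_penalty_l2)
{
    assert(miss_rate_l1 >= 0.0 && miss_rate_l1 <= 1.0);
    assert(local_miss_rate_l2 >= 0.0 && local_miss_rate_l2 <= 1.0);
    return hit_time_l1
         + miss_rate_l1 * (hit_time_l2 + local_miss_rate_l2 * miss_penalty_l2);
}

/* CPI = perfect-cache CPI + (AMAT - L1 hit time) * memory accesses per instruction. */
double cpi_from_amat(double cpi_perfect, double amat_cycles,
                     double hit_time_l1, double accesses_per_instr)
{
    return cpi_perfect + (amat_cycles - hit_time_l1) * accesses_per_instr;
}

/* Fraction of all memory accesses that miss in both L1 and L2. */
double global_miss_rate(double miss_rate_l1, double local_miss_rate_l2)
{
    return miss_rate_l1 * local_miss_rate_l2;
}
```

With the statistics of this example, `amat_two_level(2, 0.1, 10, 0.2, 40)` is about 3.8 cycles and `cpi_from_amat(2, 3.8, 2, 1.5)` is about 4.7, in agreement with the 1000-instruction method. Under the stated definition, `global_miss_rate(0.1, 0.2)` is about 0.02, which corresponds to the 30 L2 misses out of 1500 memory accesses.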
Access Times Vs Size and Associativity
[Figure: access time (ns) versus cache size (4KB to 256KB) for 1-way (direct mapped), 2-way, 4-way, and fully associative caches.]
Reducing Cache Misses
Loop Interchange
• Motivation: some programs have nested loops that access data in non-sequential order.
• Solution: simply exchanging the nesting of the loops can make the code access the data in the order it is stored. This reduces misses by improving spatial locality; reordering maximizes the use of data in a cache block before it is discarded.
Loop Interchange example
/* Before: the inner loop varies i, so consecutive accesses are 100 words apart */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];
/* After: the inner loop varies j, so consecutive accesses are adjacent in memory */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
• Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
• Reduces misses if the arrays do not fit in the cache.
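As a self-contained illustration of loop interchange, the C sketch below wraps both loop orders in functions and checks that they compute the same result. The names (`scale_strided`, `scale_sequential`, `interchange_preserves_result`) and the test data are assumptions. The outer k loop of the slide example is omitted so that a single doubling pass cannot overflow `int`.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 5000
#define COLS 100

/* Before: the inner loop walks down a column, so consecutive
   accesses are COLS words apart in the row-major array. */
void scale_strided(int x[ROWS][COLS])
{
    assert(x != NULL);
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: the inner loop walks along a row, so consecutive
   accesses touch adjacent words. */
void scale_sequential(int x[ROWS][COLS])
{
    assert(x != NULL);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Returns 1 if both loop orders produce identical arrays, else 0. */
int interchange_preserves_result(void)
{
    static int a[ROWS][COLS], b[ROWS][COLS];

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = b[i][j] = i + j;   /* arbitrary test data */

    scale_strided(a);
    scale_sequential(b);

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}
```

The interchange is legal here because each element is updated independently, so only the order of memory accesses changes. Timing the two functions on a real machine would show the effect on misses; the sketch itself only verifies equivalence.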
Motorola’s PowerPC 604
Pentium
