
Basic Philosophy

• Temporal Locality
• Spatial Locality
• Temporal Locality: if a particular memory location is referenced at
one point in time, it is likely that the same location will be
referenced again in the near future. There is temporal proximity
between adjacent references to the same memory location. In this
case it is common to store a copy of the referenced data in special
memory storage that can be accessed faster.
Temporal locality can be viewed as a special case of spatial locality,
namely when the prospective location is identical to the present location.

• Spatial Locality: if a particular memory location is referenced at a
particular time, it is likely that nearby memory locations will be
referenced in the near future. There is spatial proximity between
memory locations referenced at almost the same time. In this case it is
common to try to guess how large a neighbourhood around the current
reference is worth preparing for faster access.
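As a small illustration of the two kinds of locality, the sketch below sums an array. The function name sum_array is hypothetical and not part of the slides; it only shows where each kind of locality would typically appear in ordinary code.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n integers.
   Temporal locality: the accumulator `sum` and the counter `i` are
   referenced again on every iteration.
   Spatial locality: a[0], a[1], ... are adjacent in memory and are
   read one after another. */
long sum_array(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

A cache that fetches a whole block on the first access to a[i] would already hold the next few elements when the loop reaches them.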
Basic Terms
• Cache Block
• Miss/Hit
• Miss Rate/Hit Rate
• Miss Penalty
• Hit Time
• 3-Cs of caches
– Conflict
– Compulsory
– Capacity
Direct Mapped Cache
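Assume 5-bit address bus and cache with 8 entries

Processor address: D4 – D3 = TAG, D2 – D0 = Index

Valid  TAG  DATA  Index
                  000
                  001
                  010
                  011
                  100
                  101
                  110
                  111

The TAG stored in the indexed entry is compared (=) with D4 – D3 to produce HIT.
DATA is returned to the processor on the Data Bus.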
Direct Mapped Cache
First Load

Processor address: D4 – D3 = TAG = 01, D2 – D0 = Index = 010

Valid  TAG  DATA  Index
0                 000
0                 001
0                 010
0                 011
0                 100
0                 101
0                 110
0                 111

LD R1, (01010) ; remember 5-bit address bus, assume data is 8-bit and
               ; AA (hex) is stored at this location
First time, causes a MISS: data is loaded from memory and the cache HIT bit is
set to 1
Direct Mapped Cache
After first load

Processor address: D4 – D3 = TAG = 01, D2 – D0 = Index = 010

Valid  TAG  DATA  Index
0                 000
0                 001
1      01   AA    010
0                 011
0                 100
0                 101
0                 110
0                 111

LD R1, (01010) ; AA (hex) is stored at this location, Cache HIT bit is
               ; set to 1
Direct Mapped Cache
Second Load

Processor address: D4 – D3 = TAG = 11, D2 – D0 = Index = 010

Valid  TAG  DATA  Index
0                 000
0                 001
1      01   AA    010
0                 011
0                 100
0                 101
0                 110
0                 111

LD R1, (11010) ; assume 99 at address 11010
Same index but different TAG will cause a MISS, data loaded from
memory
Direct Mapped Cache
After Second Load

Processor address: D4 – D3 = TAG = 11, D2 – D0 = Index = 010

Valid  TAG  DATA  Index
0                 000
0                 001
1      11   99    010
0                 011
0                 100
0                 101
0                 110
0                 111

LD R1, (11010) ; remember 5-bit address bus, assume 99 at address 11010
First access to this address: same index but different TAG caused a MISS,
so the data was loaded from memory and replaced the previous entry
Cache Size Example
Direct Mapped
[Diagram: one 32K X 48-bit memory. Each of the 32K entries holds Valid (1 bit), TAG (15 bit) and DATA (32 bit). Address bus bits A16 – A2 (15-bit index) select the entry; the stored TAG is compared (=) with A31 – A17 (15-bit); the entry's DATA goes to Data Out.]

Processor Address bus = 32 bit (A)
Cache Storage = 128KB = 32K Words (2^N) with N = 15
Number of blocks in cache (entries) = 32K
Tag Size = A – N – 2 = 32 – 15 – 2 (Byte offset) = 15
Cache Size = 128KB (data) + 32K X 15-bit (tag) + 32K X 1-bit (Hit bit) =
32K X 48-bit = 192KB
Cache Size Example (1)
Two-Way Set Associative

• Assume same processor (A = 32, D = 32)
• Assume same total storage of data = 128KB
• Two ways (called sets on these slides) means we will have two
direct-mapped caches of 64KB (128/2) each.
• 64KB = 16K words
• To address 16K X 32-bit memory we need a 14-bit index.
• Hence Tag Size = 32 – 14 – 2 = 16
Cache Size Example (1)
Two-Way Set Associative
[Diagram: two 16K X 49-bit memories, SET 1 and SET 2. Each of the 16K entries holds Valid (1 bit), TAG (16 bit) and DATA (32 bit). Address bus bits A15 – A2 (14-bit index) select one entry in each memory; each stored TAG is compared (=) with A31 – A16 (16-bit); a 2:1 MUX selects Data Out.]

Size = 2 (Sets) X 16K X (32-bit + 16-bit + 1-bit) = 196KB
Cache Size Example (1)
4-Way Set Associative

• Assume same processor (A = 32, D = 32)
• Assume same total storage of data = 128KB
• Four ways (called sets on these slides) means we will have four
direct-mapped caches of 32KB (128/4) each.
• 32KB = 8K words
• To address 8K X 32-bit memory we need a 13-bit index.
• Hence Tag Size = 32 – 13 – 2 = 17
Cache Size Example (1)
4-Way Set Associative
[Diagram: four 8K X 50-bit memories, SET 1 to SET 4. Each of the 8K entries holds Valid (1 bit), TAG (17 bit) and DATA (32 bit). Address bus bits A14 – A2 (13-bit index) select one entry in each memory; each stored TAG is compared (=) with A31 – A15 (17-bit); a 4:1 MUX selects Data Out to processor.]

Size = 4 (Sets) X 8K X (32-bit + 17-bit + 1-bit) = 200KB
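The tag, index and total-size arithmetic of the three cache size examples can be checked with a short sketch. The helper cache_geometry and the CacheGeometry struct are hypothetical names; the code assumes 32-bit words, a 2-bit byte offset and one valid (Hit) bit per entry, as in the examples.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned index_bits;   /* bits used to select an entry */
    unsigned tag_bits;     /* bits stored and compared per entry */
    uint64_t total_bits;   /* data + tag + valid bit, over all ways */
} CacheGeometry;

/* addr_bits: address bus width; data_bytes: total data storage;
   ways: 1 for direct mapped, 2 or 4 for set associative. */
CacheGeometry cache_geometry(unsigned addr_bits, uint64_t data_bytes, unsigned ways)
{
    const unsigned word_bits = 32, offset_bits = 2;
    assert(ways > 0);
    uint64_t entries = data_bytes / 4 / ways;        /* words per way */
    assert(entries > 0 && (entries & (entries - 1)) == 0);

    CacheGeometry g = {0, 0, 0};
    for (uint64_t e = entries; e > 1; e >>= 1)       /* log2(entries) */
        g.index_bits++;
    assert(addr_bits >= g.index_bits + offset_bits);
    g.tag_bits   = addr_bits - g.index_bits - offset_bits;
    g.total_bits = (uint64_t)ways * entries * (word_bits + g.tag_bits + 1);
    return g;
}
```

For 128KB of data on a 32-bit address bus this gives 15, 16 and 17 tag bits for 1, 2 and 4 ways, so each doubling of associativity costs one extra tag bit per entry.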
Organization of the data cache: Alpha 21264
Alpha 21264 Processor, 44-Bit Virtual Address
512 Entries Cache (2 Sets), Block Size = 64 bytes

[Diagram: two memories of 512 entries each, SET 1 and SET 2. Each entry holds Valid (1 bit), TAG (29 bit) and DATA (one block). Byte Offset = A5 – A0; address bus bits A14 – A6 (9-bit index) select one entry in each memory; each stored TAG is compared (=) with A43 – A15 (29-bit Tag); a 2:1 MUX selects Data Out.]
Four Memory Hierarchy Questions
• Where can a block be placed
– Direct Mapped to Fully Associative
• How a block is found
– Tag Comparison
• Which block should be replaced on a cache miss (only for sets)
– LRU, Random, FIFO
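For a single two-way set, placement, tag comparison and LRU replacement can be sketched as follows. The names TwoWaySet and set_access are hypothetical, and with only two ways the LRU state reduces to remembering which way to replace next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool     valid[2];
    unsigned tag[2];
    int      lru;        /* way that was used least recently */
} TwoWaySet;

/* Looks up `tag` in the set. Returns true on a hit; on a miss the
   least recently used way is replaced. */
bool set_access(TwoWaySet *s, unsigned tag)
{
    assert(s != NULL);
    for (int w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {   /* tag comparison */
            s->lru = 1 - w;                      /* the other way is now LRU */
            return true;
        }
    }
    int victim = s->lru;                         /* LRU replacement */
    s->valid[victim] = true;
    s->tag[victim]   = tag;
    s->lru = 1 - victim;
    return false;
}
```

Starting from an empty set, the tag sequence A, B, A, C, A, B gives miss, miss, hit, miss, hit, miss: C replaces B because A was used more recently.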
4 Qs (Contd..)
• What Happens on a Write?
– Write Back – Main memory is updated only when the block is
replaced in the cache
– Write Through – The information is updated in the upper
as well as the lower level.
– Write Allocate: Allocate the block in the cache on a write miss
– Write No-Allocate: Only write to the next level.
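The difference between the two write policies can be seen by counting main-memory writes in a one-block cache. This is a minimal sketch with hypothetical names (OneLineCache, cache_write); it assumes write allocate for both policies and tracks no data, only block numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } WritePolicy;

typedef struct {
    WritePolicy policy;
    bool        valid;
    bool        dirty;       /* used by write back only */
    unsigned    block;       /* block number currently cached */
    unsigned    mem_writes;  /* writes that reached main memory */
} OneLineCache;

/* Performs one write to `block` and updates the memory-write counter. */
void cache_write(OneLineCache *c, unsigned block)
{
    assert(c != NULL);
    if (!c->valid || c->block != block) {   /* write miss: write allocate */
        if (c->valid && c->dirty)
            c->mem_writes++;                /* write back the replaced block */
        c->valid = true;
        c->block = block;
        c->dirty = false;
    }
    if (c->policy == WRITE_THROUGH)
        c->mem_writes++;                    /* every write updates memory */
    else
        c->dirty = true;                    /* memory updated on replacement */
}
```

Ten writes to one block followed by a write to another block reach memory eleven times with write through, but only once with write back.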
3 Cs of Caches
• Classifying Misses: 3 Cs
– Compulsory — The first access to a block is not in the cache,
so the block must be brought into the cache.
Also called cold start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity — If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due to
blocks being discarded and later retrieved.
(Misses even in a fully associative cache of the same size)
– Conflict — If block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be discarded
and later retrieved if too many blocks map to its set. Also called
collision misses or interference misses.
Example (Single Level Cache)
Given Statistics
Load/Store Instructions: 50%
Hit Time = 2 Clock Cycles, Hit rate = 90%
Miss Penalty = 40 CC, Miss Rate = 10%
Average Memory Access/instruction = 1.5

Ave. Mem Access Time = Hit time + Miss rate * Miss Penalty
= 2 + 0.1 * 40 = 2 + 4 = 6 ; 4 is Penalty Cycles

CPI = ?
CPI (with perfect cache) = 2
CPI (overall) = CPI (perfect) + Extra Memory Stall Cycles/Instruction (penalty Cycles)
= 2 + (6 – 2) * 1.5
= 2 + 6 = 8

Number of instructions Method
Assume total instructions = 1000

Perfect Cache
Each instruction takes 2 clock cycles, hence 1000 * 2 = 2000 Clock cycles
CPI (Perfect) = CC/IC = 2000/1000 = 2

Imperfect Cache
Calculate Extra Clock Cycles
Number of memory accesses = 1000 * 1.5 (1000 for I$ and 500 for D$)
= 1500 memory accesses in a 1000-instruction program.
Cache misses (at 10%) = 1500 * 0.1 = 150
Extra (Penalty) Clock Cycles for cache misses = 150 * 40 = 6000
Which is in fact:
= IC * (Mem Access/Instruc) * Miss Rate * Miss Penalty
Total clock cycles for instructions with perfect cache = 2000 Clock Cycles
Total for Program = 2000 + 6000 = 8000
CPI = 8000/1000 = 8.0
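The two formulas used in this example can be written as a pair of small functions. The names amat and cpi_with_stalls are hypothetical; the sketch simply restates the formulas from the slide so other numbers can be tried.

```c
#include <assert.h>

/* Average memory access time = hit time + miss rate * miss penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* CPI = CPI (perfect) + memory accesses per instruction
         * miss rate * miss penalty (the stall cycles per instruction). */
double cpi_with_stalls(double cpi_perfect, double accesses_per_instr,
                       double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return cpi_perfect + accesses_per_instr * miss_rate * miss_penalty;
}
```

With the statistics above, amat(2, 0.1, 40) evaluates to about 6 clock cycles and cpi_with_stalls(2, 1.5, 0.1, 40) to about 8, matching both methods on the slide.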
Pentium Family Cache Sizes (Aug 2007)

Cache column: data cache for L1, unified cache for L2

Processor                        Cache  Size                            Number of Ways (Set Size)   Cache Line Size
Pentium®, Pentium® Processor     L1     16KB (8KB on early Pentium®     4 (2 on early processors)   32 bytes
with MMX* Technology                    processors)
                                 L2     256KB or 512KB                  4                           32 bytes
Pentium® Pro, Pentium® II,       L1     16KB (8KB on early P6 family    4 (2 on early processors)   32 bytes
Pentium® III, Pentium®, Xeon®,          processors)
Celeron*                         L2     128KB, 256KB, 512KB, 1MB, 2MB   4                           32 bytes
Pentium® 4                       L1     8KB                             4                           64 bytes
                                 L2     256KB                           8                           128 bytes
Example with 2-level Cache
Stats: L1: Hit Time = 2 Clock Cycles, Hit rate = 90%, Miss
Penalty to L2 = 10 CC (Hit time for L2)
L2: Local Hit Rate = 80%, Miss Penalty (L2) = 40 CC
Load/Store Instructions: 50%

[Diagram: CPU -> L1 (HT = 2 CC, Hit rate = 90%; out of 1000 memory accesses, 100 miss) -> L2 (out of 1000 memory accesses, 200 miss) -> Main Memory (HT = 40 CC)]

Global Miss Rate = ?
Example 2
Once again Perfect Cache CPI = 2.0
AMAT = Hit Time (L1) + Miss Rate (L1) × (Hit Time (L2) + Miss Rate (L2) × Miss Penalty (L2))
= 2 + 0.1 × (10 + 0.2 × 40) = 3.8
CPI = CPI perfect + Extra Memory Stall Cycles/instruction
= 2.0 + (3.8 – 2) × 1.5 = 4.7
Example 2 (contd..)
1000 Instruction Method
(Calculate Extra Clock Cycles starting from missing from L1)
Step 2 (Hit on L2)
Total Accesses in L2 = 150 (Misses from L1)
Extra CC on miss in L1 and hit in L2 = 150 × 10 = 1500
(every L1 miss pays the L2 hit time, even if it then misses in L2: very important)
Step 3 (Miss on L2)
Miss rate = (100 – 80) = 20%
Accesses missed in L2 = 150 × 0.2 = 30
Extra CC on miss in L2 = 30 × 40 = 1200
Total Extra Clock Cycles = 1500 + 1200 = 2700
Total Clock Cycles for the program = 2000 + 2700 = 4700
CPI = 4700/1000 = 4.7
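The two-level calculation can be sketched the same way. The function names below are hypothetical; the global miss rate is taken as the product of the L1 miss rate and the L2 local miss rate, which is the standard definition and corresponds to the 30 L2 misses out of 1500 accesses computed above.

```c
#include <assert.h>

/* AMAT = HT(L1) + MR(L1) * (HT(L2) + MR(L2, local) * MP(L2)). */
double amat_two_level(double ht_l1, double mr_l1,
                      double ht_l2, double mr_l2_local, double mp_l2)
{
    assert(mr_l1 >= 0.0 && mr_l1 <= 1.0);
    assert(mr_l2_local >= 0.0 && mr_l2_local <= 1.0);
    return ht_l1 + mr_l1 * (ht_l2 + mr_l2_local * mp_l2);
}

/* Fraction of all memory accesses that miss in both L1 and L2. */
double global_miss_rate(double mr_l1, double mr_l2_local)
{
    return mr_l1 * mr_l2_local;
}

/* CPI = CPI (perfect) + (AMAT - HT(L1)) * memory accesses per instruction. */
double cpi_from_amat(double cpi_perfect, double amat_cycles,
                     double ht_l1, double accesses_per_instr)
{
    return cpi_perfect + (amat_cycles - ht_l1) * accesses_per_instr;
}
```

With the statistics of Example 2 these give an AMAT of about 3.8 clock cycles, a global miss rate of about 2% and a CPI of about 4.7.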
Access Times Vs
Size and Associativity
[Figure: Access Time (ns) versus Cache Size (4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB), with one curve each for 1-Way (direct mapped), 2-Way, 4-Way and Fully Associative]
Reducing Cache Misses
Loop Interchange

• Motivation: some programs have nested loops that access data in
non-sequential order
• Solution: simply exchanging the nesting of the loops can make the
code access the data in the order it is stored => reduce misses by
improving spatial locality; reordering maximizes use of data in a
cache block before it is discarded
Loop Interchange example
/* Before: the innermost loop walks down a column, so successive
   accesses are 100 words apart */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];
/* After: the innermost loop walks along a row, so successive
   accesses are adjacent in memory */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

• Sequential accesses instead of striding through memory every 100 words;
improved spatial locality.
• Reduces misses if the arrays do not fit in the cache.
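The effect of the interchange can be estimated by counting misses in a simulated cache. The sketch below is an illustration under stated assumptions, not a measurement: it models a hypothetical 8KB direct-mapped cache with 8-word blocks, a row-major x[5000][100] starting on a block boundary, and a single pass over the array (the outer k loop is omitted). The name count_misses is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define ROWS            5000
#define COLS            100
#define WORDS_PER_BLOCK 8      /* assumed: 32-byte blocks, 4-byte words */
#define NUM_BLOCKS      256    /* assumed: 8KB direct-mapped cache */

/* Counts misses for one pass over x[ROWS][COLS].
   i_innermost = true  models the "Before" order (i is the inner loop),
   i_innermost = false models the "After" order (j is the inner loop).
   Only block numbers are tracked, not data. */
long count_misses(bool i_innermost)
{
    static long resident[NUM_BLOCKS];
    long misses = 0;
    assert(COLS > 0 && ROWS > 0);
    for (int s = 0; s < NUM_BLOCKS; s++)
        resident[s] = -1;                      /* cache starts empty */

    int outer_n = i_innermost ? COLS : ROWS;
    int inner_n = i_innermost ? ROWS : COLS;
    for (int a = 0; a < outer_n; a++) {
        for (int b = 0; b < inner_n; b++) {
            int  i     = i_innermost ? b : a;
            int  j     = i_innermost ? a : b;
            long word  = (long)i * COLS + j;   /* row-major word address */
            long block = word / WORDS_PER_BLOCK;
            int  set   = (int)(block % NUM_BLOCKS);
            if (resident[set] != block) {      /* miss: load the block */
                resident[set] = block;
                misses++;
            }
        }
    }
    return misses;
}
```

Under these assumptions the "Before" order misses on every one of the 500,000 accesses, while the "After" order misses once per block, 62,500 times in total.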
Motorola’s PowerPC 604
Pentium
