Admin
Homework #4 Due November 2nd
Work on Projects
Midterm
Max 98
Min 50
Mean 80
Associativity (A)
Block size (B)
Capacity (C)
Number of sets S = C/(BA)
1-way (Direct-mapped)
A = 1, S = C/B
N-way set-associative
Fully associative
S = 1, C = BA
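The relation S = C/(BA) can be sketched in C as follows. This is a minimal illustration, not part of the original slides: the helper names num_sets, log2u, index_bits, and offset_bits are hypothetical, and all three parameters are assumed to be powers of two.

```c
#include <assert.h>

/* Hypothetical helper: number of sets S = C / (B * A). */
static unsigned num_sets(unsigned capacity, unsigned block_size, unsigned assoc)
{
    assert(block_size > 0 && assoc > 0);
    assert(capacity % (block_size * assoc) == 0);
    return capacity / (block_size * assoc);
}

/* log2 of an exact power of two. */
static unsigned log2u(unsigned x)
{
    assert(x > 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Address bits that select the set (0 for a fully associative cache). */
static unsigned index_bits(unsigned capacity, unsigned block_size, unsigned assoc)
{
    return log2u(num_sets(capacity, block_size, assoc));
}

/* Address bits that select the byte within a block. */
static unsigned offset_bits(unsigned block_size)
{
    return log2u(block_size);
}
```

For a 64KB cache with 32-byte blocks (values chosen only for illustration), num_sets gives 2048 sets when direct-mapped (A = 1), 1024 sets when 2-way, and a single set when A = C/B = 2048, which is the fully associative case.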
Write Policies
We know about write-through vs. write-back
Assume: a 16-bit write to memory location 0x00
causes a cache miss.
Do we change the cache tag and update data in the
block?
Yes: Write Allocate
No: Write No-Allocate
Write-around cache
Write-through no-write-allocate
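The write-miss decision above can be sketched with a toy direct-mapped tag array. This is an illustration under stated assumptions, not a model of any real cache: the sizes (32 lines of 32 bytes), the type names, and cache_write are all hypothetical, and data movement is left out so that only the tag update is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES  32u   /* assumed: 32 cache lines */
#define BLOCK_SIZE 32u   /* assumed: 32-byte blocks */

typedef struct {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
} cache_t;

typedef enum { WRITE_ALLOCATE, WRITE_NO_ALLOCATE } write_miss_policy_t;

/* Returns true on a write hit, false on a write miss.
 * On a miss the tag is installed only under write-allocate. */
static bool cache_write(cache_t *c, uint32_t addr, write_miss_policy_t policy)
{
    assert(c != NULL);
    uint32_t block = addr / BLOCK_SIZE;
    uint32_t index = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;

    if (c->valid[index] && c->tag[index] == tag)
        return true;                 /* hit: the cached block is updated */

    if (policy == WRITE_ALLOCATE) {  /* yes: change the tag, bring in the block */
        c->valid[index] = true;
        c->tag[index]   = tag;
    }
    /* WRITE_NO_ALLOCATE: the write goes around the cache to memory. */
    return false;
}
```

With an empty cache, two consecutive writes to 0x00 miss and then hit under WRITE_ALLOCATE, but miss both times under WRITE_NO_ALLOCATE.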
[Figure: sub-block cache organization. Each of the 32 cache lines (0 to 31) holds a cache tag, four sub-blocks (Sub-block0 to Sub-block3), and one valid bit per sub-block (SB0 to SB3). Each line holds bytes B0 to B31; the last line holds Byte 992 to Byte 1023.]
CPS 220
Cache Performance
CPU time = (CPU execution clock cycles + Memory stall
clock cycles) x clock cycle time
Memory stall clock cycles = (Reads x Read miss rate x
Read miss penalty + Writes x Write miss rate x Write
miss penalty)
Memory stall clock cycles = Memory accesses x Miss
rate x Miss penalty
Cache Performance
CPUtime = IC x (CPIexecution + (Mem accesses per
instruction x Miss rate x Miss penalty)) x Clock cycle
time
hits are included in CPIexecution
Misses per instruction = Memory accesses per
instruction x Miss rate
CPUtime = IC x (CPIexecution + Misses per instruction x
Miss penalty) x Clock cycle time
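The two CPU time formulas above translate directly into code. The sketch below is only an illustration: the struct and function names are hypothetical, and the miss penalty is assumed to be expressed in clock cycles.

```c
#include <assert.h>

/* Hypothetical container for the terms of the CPU time equation. */
typedef struct {
    double instruction_count;      /* IC */
    double cpi_execution;          /* CPI with a perfect cache (hits included) */
    double mem_accesses_per_instr; /* memory accesses per instruction */
    double miss_rate;              /* fraction of accesses that miss */
    double miss_penalty_cycles;    /* miss penalty in clock cycles (assumed unit) */
    double clock_cycle_time;       /* seconds per cycle */
} workload_t;

/* Misses per instruction = memory accesses per instruction x miss rate. */
static double misses_per_instruction(const workload_t *w)
{
    assert(w->miss_rate >= 0.0 && w->miss_rate <= 1.0);
    return w->mem_accesses_per_instr * w->miss_rate;
}

/* CPU time = IC x (CPIexecution + misses per instruction x miss penalty)
 *            x clock cycle time. */
static double cpu_time(const workload_t *w)
{
    double cpi = w->cpi_execution
               + misses_per_instruction(w) * w->miss_penalty_cycles;
    return w->instruction_count * cpi * w->clock_cycle_time;
}
```

As a made-up usage example, 10^9 instructions with CPIexecution 1.5, 1.2 accesses per instruction, a 2% miss rate, a 50-cycle penalty, and a 1ns cycle give an effective CPI of 1.5 + 1.2 x 0.02 x 50 = 2.7 and a CPU time of about 2.7 seconds.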
Example
Example 2
Two caches: both 64KB, 32 byte blocks, miss penalty
70ns, 1.3 references per instruction, CPI 2.0 w/ perfect
cache
direct mapped
Cycle time 2ns
Miss rate 1.4%
2-way associative
Cycle time increases by 10%
Miss rate 1.0%
Which is better?
Compute average memory access time
Compute CPU time
Example 2 Continued
Ave Mem Acc Time =
Hit time + (miss rate x miss penalty)
1-way: 2.0 + (0.014 x 70) = 2.98ns
2-way: 2.2 + (0.010 x 70) = 2.90ns
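The average memory access times above, and the CPU time step the example also asks for, can be checked with a short sketch. The function names are hypothetical. Two assumptions are made here that the slide does not state: the 70ns miss penalty is unchanged by the slower clock and is not rounded to whole cycles, and the longer cycle time applies to every instruction.

```c
#include <assert.h>

/* Average memory access time = hit time + miss rate x miss penalty (ns). */
static double amat_ns(double hit_time_ns, double miss_rate,
                      double miss_penalty_ns)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time_ns + miss_rate * miss_penalty_ns;
}

/* CPU time per instruction (ns) =
 *   CPI x cycle time + references per instruction x miss rate x miss penalty.
 * Multiply by IC to get total CPU time. */
static double cpu_time_per_instr_ns(double cpi, double cycle_ns,
                                    double refs_per_instr, double miss_rate,
                                    double miss_penalty_ns)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return cpi * cycle_ns + refs_per_instr * miss_rate * miss_penalty_ns;
}
```

Under those assumptions, the direct-mapped cache costs 2.0 x 2.0 + 1.3 x 0.014 x 70 = 5.27ns per instruction and the 2-way cache costs 2.0 x 2.2 + 1.3 x 0.010 x 70 = 5.31ns. The 2-way cache has the lower average memory access time, but the direct-mapped cache has the slightly lower CPU time because the slower clock is paid on every instruction.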
Reducing Misses
Classifying Misses: 3 Cs
Compulsory: The first access to a block is not in the cache,
so the block must be brought into the cache. These are also
called cold start misses or first reference misses.
(Misses in Infinite Cache)
[Figure: 3 Cs breakdown into compulsory, capacity, and conflict components for 1-way, 2-way, 4-way, and 8-way associativity; y-axis ticks 0.02 to 0.14, x-axis ticks 16, 32, 64, 128. The plot appears on two consecutive slides.]
[Figure: 3 Cs breakdown in relative terms: compulsory, capacity, and conflict shares for 2-way, 4-way, and 8-way associativity; y-axis ticks 0% to 60%, x-axis ticks 16, 32, 64, 128.]
[Figure: Miss Rate (y-axis ticks 0% to 20%) for 4K, 16K, 64K, and 256K caches; x-axis ticks 16, 32, 64, 128, 256.]
Associativity
2-way   4-way   8-way
2.15    2.07    2.01
1.86    1.76    1.68
1.67    1.61    1.53
1.48    1.47    1.43
1.32    1.32    1.32
1.24    1.25    1.27
1.20    1.21    1.23
1.17    1.18    1.20
[Figure: cache diagram with TAG and DATA arrays and a connection to memory (Mem).]
Alvin R. Lebeck 2006
[Figure: diagram with axes labeled Miss Penalty and Time.]
Data
Merging Arrays: improve spatial locality by using a single array of
compound elements instead of 2 separate arrays
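A minimal sketch of merging arrays, assuming two parallel int arrays named val and key that are always accessed together. The names, the array length, and the helper functions are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 1024   /* assumed array length */

/* Before: two parallel arrays. val[i] and key[i] sit in different
 * cache blocks, so using both can cost two misses. */
static int val[SIZE];
static int key[SIZE];

/* After: one array of compound elements. Both fields of element i
 * are adjacent in memory and usually share a cache block. */
struct merge {
    int val;
    int key;
};
static struct merge merged_array[SIZE];

/* Fill both layouts with the same illustrative data. */
static void fill(void)
{
    for (size_t i = 0; i < SIZE; i++) {
        val[i] = (int)i * 2;
        key[i] = (int)i + 7;
        merged_array[i].val = val[i];
        merged_array[i].key = key[i];
    }
}

/* Touches two separate arrays. */
static int sum_split(size_t i)
{
    assert(i < SIZE);
    return val[i] + key[i];
}

/* Touches one compound element. */
static int sum_merged(size_t i)
{
    assert(i < SIZE);
    return merged_array[i].val + merged_array[i].key;
}
```

Both functions return the same value; the difference is only in how many cache blocks an access to element i is likely to touch. The benefit depends on both fields being used together, so merging may hurt a loop that scans only one of the fields.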
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }
Blocking Example
/* After: B is the blocking factor; x must be zeroed before the loops.
   The bounds use min(jj+B,N) so that no column or k index is skipped. */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }
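A self-contained sketch that runs both loop nests and checks that they agree. The matrix size (50), the blocking factor (16), the input values, and all function names are assumptions made for illustration; N is deliberately not a multiple of B so that the min() bounds are exercised. The inputs are small integers stored as doubles, so the two results can be compared exactly.

```c
#include <assert.h>

#define N 50   /* assumed matrix dimension, not a multiple of B */
#define B 16   /* assumed blocking factor */

static double x_naive[N][N], x_blocked[N][N], y[N][N], z[N][N];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Illustrative inputs; both result matrices start at zero. */
static void init(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = i + j;
            z[i][j] = i - j;
            x_naive[i][j] = 0.0;
            x_blocked[i][j] = 0.0;
        }
}

/* Unblocked version: each x[i][j] is computed in one pass over k. */
static void matmul_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x_naive[i][j] = r;
        }
}

/* Blocked version: works on a B x B tile of z at a time and
 * accumulates partial sums into x. */
static void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x_blocked[i][j] += r;
                }
}

/* Returns 1 when both versions produced identical results. */
static int results_match(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (x_naive[i][j] != x_blocked[i][j])
                return 0;
    return 1;
}
```

Calling init(), matmul_naive(), and matmul_blocked() in that order should leave results_match() returning 1. The blocked version does the same arithmetic; only the order of memory accesses changes.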
[Figure: plot against Blocking Factor; x-axis ticks 50, 100, 150, y-axis tick 0.05.]
[Figure: Performance Improvement (axis ticks 1.5 and 2.5) from merged arrays, loop interchange, loop fusion, and blocking.]
Cache Mapping
[Figure: mapping between Memory and Cache, shown on two consecutive slides.]
[Figure: element numbering (0 to 15) of a matrix under three layouts: 4-D blocked, Morton order, and Hilbert order.]
Performance Improvement
Machine       Ultra 10        Ultra 60        Miata
CPU           UltraSPARC 2i   UltraSPARC 2    Alpha 21164
Clock rate    300MHz          300MHz          500MHz
L1 cache      16KB/32B/1      16KB/32B/1      8KB/32B/1
L2 cache      512KB/64B/1     2MB/64B/1       96KB/64B/3
L3 cache      -               -               2MB/64B/1
RAM           320MB           512MB           512MB
TLB entries   64              64              64
Page size     8KB             8KB             8KB
[Table: results for the 4D (4-D blocked) and MO (Morton order) layouts on BLKMXM, RECMXM, STRASSEN, CHOL, STDHAAR, and NONHAAR. Values in extraction order, cell assignment unclear. Ultra 10: 0.93 0.78 0.68 0.62. Ultra 60: 1.06 0.95 0.94 0.87 0.85 0.67 0.64 0.61 0.58. Miata: 1.05 0.97 0.95 0.94 0.95 0.79 0.91 0.67 0.64 0.42 0.43 0.58 0.40 0.40.]
[Figure: TSS/Ultra 10 plot; y-axis ticks 0 to 8, x-axis ticks 500 to 600.]
Summary
CPUtime = IC x (CPIexecution + (Memory accesses per
instruction x Miss rate x Miss penalty)) x Clock cycle
time
Program Transformations
Change Algorithm
Change Data Layout