
Computer Architecture and Organization

Lecture 16: Cache Performance

Majid Khabbazian mkhabbazian@ualberta.ca


Electrical and Computer Engineering University of Alberta

April 9, 2013

CPU execution time: revisited


Let's now account for cycles during which the processor is stalled waiting for a memory access:
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time

Memory stall cycles


Simplifying assumptions:
CPU clock cycles include the time to handle a cache hit
The processor is stalled during a cache miss

Memory stall cycles = Number of misses x Miss penalty
= IC x (Memory accesses / Instruction) x Miss rate x Miss penalty

Miss rates and miss penalties are often different for reads and writes!
Memory stall cycles = IC x Reads per instruction x Read miss rate x Read miss penalty
+ IC x Writes per instruction x Write miss rate x Write miss penalty
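
As a quick illustration, here is a minimal Python sketch of this split formula; the instruction count, access rates, miss rates, and penalties below are made-up example values, not figures from the lecture.

# Memory stall cycles split into read and write contributions (hypothetical inputs)
IC = 1_000_000            # instruction count
reads_per_instr = 1.2     # includes the instruction fetch itself
writes_per_instr = 0.3
read_miss_rate, read_miss_penalty = 0.02, 25
write_miss_rate, write_miss_penalty = 0.05, 25

stall_cycles = (IC * reads_per_instr * read_miss_rate * read_miss_penalty
                + IC * writes_per_instr * write_miss_rate * write_miss_penalty)
print(stall_cycles)       # 975000.0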

Example 15.2
Assume we have a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
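
A sketch of the arithmetic in Python, assuming each instruction makes one instruction-fetch access plus the stated data accesses (1.5 memory accesses per instruction):

# Example 15.2 worked out
cpi_ideal = 1.0
mem_accesses_per_instr = 1.0 + 0.5            # fetch + 50% loads/stores
miss_rate, miss_penalty = 0.02, 25

stall_cycles_per_instr = mem_accesses_per_instr * miss_rate * miss_penalty   # 0.75
cpi_real = cpi_ideal + stall_cycles_per_instr                                # 1.75
print(cpi_real / cpi_ideal)                   # 1.75: the all-hit machine is 1.75x faster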

Average memory access time


A measure of memory hierarchy performance

Average memory access time = Hit time + Miss rate x Miss penalty
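
For example, with a 1-cycle hit time, a 2% miss rate, and a 200-cycle miss penalty (illustrative numbers only):

amat = 1 + 0.02 * 200     # = 5 cycles per memory access on average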

Impact on Processor Performance


What is the impact of average memory access time on processor performance?
Assumptions:
The processor stalls during misses.
We have an in-order execution processor.
The memory hierarchy dominates other sources of stalls.

Then, average memory access time can somewhat predict processor performance.


The impact is higher on processors with low CPI

Example 15.3
Assume that the cache miss penalty is 200 clock cycles, and all instructions normally take 1.0 clock cycles (ignoring memory stalls). Assume that the average miss rate is 2%, and there is an average of 1.5 memory references per instruction.
What is the impact on performance when the behavior of the cache is included? Compare this to the case where there is no cache.
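
A sketch of the calculation, treating "no cache" as every memory reference paying the full 200-cycle penalty:

# Example 15.3 worked out
cpi_base = 1.0
mem_refs_per_instr = 1.5
miss_rate, miss_penalty = 0.02, 200

cpi_with_cache = cpi_base + mem_refs_per_instr * miss_rate * miss_penalty   # 1 + 6 = 7
cpi_no_cache = cpi_base + mem_refs_per_instr * 1.0 * miss_penalty           # 1 + 300 = 301
print(cpi_with_cache, cpi_no_cache, cpi_no_cache / cpi_with_cache)          # 7.0 301.0 ~43x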

Some basic cache optimization

Intel Core i7 Cache Hierarchy


Figure: processor package with Core 0 through Core 3; each core has its own registers, split L1 d-cache and i-cache, and a private L2 unified cache; all cores share an L3 unified cache, which is backed by main memory.

L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles
L2 unified cache: 256 KB, 8-way, Access: 11 cycles
L3 unified cache (shared by all cores): 8 MB, 16-way, Access: 30-40 cycles
Block size: 64 bytes for all caches.

Intel Smart Cache


Improve Cache Performance


Improve cache and memory access times:
Average memory access time = Hit time + Miss rate x Miss penalty

Can we reduce each of these? Simultaneously?

Improve performance by:
1. Reducing the miss rate,
2. Reducing the miss penalty, or
3. Reducing the time to hit in the cache.


1) Larger Block Size


Larger block size to reduce miss rates
Take advantage of spatial locality

The block size should not be too large!


Extreme case: there is only one block
What if the working set is larger than the cache?
Also, a larger block size increases the miss penalty!

How to decide then?



Example 16.1
Assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Based on the following table, which block size has the smallest average memory access time? Assume the hit time is 1 clock cycle, independent of block size.

Block size (bytes)    Miss rate
16                    3.94%
32                    2.87%
64                    2.64%
128                   2.77%
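
A sketch of the comparison, using the miss penalty implied by the statement above (80 cycles of overhead plus 2 cycles per 16 bytes transferred):

# Example 16.1: average memory access time for each candidate block size
hit_time = 1
miss_rates = {16: 0.0394, 32: 0.0287, 64: 0.0264, 128: 0.0277}

for block_size, miss_rate in miss_rates.items():
    miss_penalty = 80 + 2 * (block_size // 16)     # cycles to fetch the whole block
    amat = hit_time + miss_rate * miss_penalty
    print(block_size, round(amat, 3))
# 16 -> 4.231, 32 -> 3.411, 64 -> 3.323 (smallest), 128 -> 3.659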


2) Larger Caches
Larger caches to reduce miss rate
The obvious way to reduce capacity misses

Drawbacks?
Potentially longer hit time
Higher cost
Higher power


3) Higher Associativity
Higher associativity to reduce miss rate
Reduces conflict misses

Drawbacks?
Longer hit time


4) Multilevel Caches
Multilevel caches to reduce miss penalty
Reducing the miss penalty can be just as beneficial as reducing the miss rate
The miss penalty keeps growing (DRAM becomes relatively slower than CPUs)

Average memory access time


How to analyze for a two-level cache?
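
The standard approach is to nest the single-level formula: the miss penalty of L1 is the average access time of L2.

Average memory access time = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2))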

