
CS 1104 Help Session I Caches

Colin Tan, ctank@comp.nus.edu.sg S15-04-15

Topics for Today


Session I - Caches; Session II - Virtual Memory. Basically we will pick out the important topics and elaborate on them. We will also have simple problems to solve.
Please have your calculator, paper and pen ready!

Help session notes are available at:


http://www.comp.nus.edu.sg/~ctank

Memory Hierarchy
Motivation
Not all memory is created equal
Cheap Memory => Slow. Fast Memory => Expensive.
DRAM: 70 ns access time, $5/MByte
SRAM: 8 ns access time, $100/MByte

The purpose of a memory hierarchy is that it allows us to have huge amounts of cheap memory that operates at close to the speed of fast but expensive memory.

How do we do that?
We can achieve this by exploiting locality. Locality is the tendency of programs to behave as follows:
Data or instructions that were recently accessed are likely to be accessed again in the near future (Temporal Locality). The variable x in this code fragment has temporal locality: for(int i=0; i<100; i++) y = y + x;
The neighbors of data or instructions that were recently accessed are likely to be accessed as well (Spatial Locality). Program execution tends to be sequential, so if an instruction was just executed, it is likely that the instruction next to it will be executed too (see the sketch below).
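To make spatial locality concrete, here is a small illustrative C sketch (my own example, not from the original notes). Summing an array touches consecutive addresses, so each access's neighbors are the very next things accessed:

    #include <stdio.h>

    int main(void)
    {
        int a[100];
        int sum = 0;

        for (int i = 0; i < 100; i++)
            a[i] = i;

        /* Sequential accesses: once a[i] is brought into the cache,
           its neighbors a[i+1], a[i+2], ... are touched next
           (spatial locality). The variable sum is reused on every
           iteration (temporal locality). */
        for (int i = 0; i < 100; i++)
            sum = sum + a[i];

        printf("sum = %d\n", sum);
        return 0;
    }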

Caches
Locality means that only small portions of main memory will be used in the near future. We can create a small memory system with fast but expensive devices to store these portions of main memory. We can then access these portions of memory very quickly. This is the concept behind caches.

How Do Caches Help?


The average time to access memory (AMAT) is given by:
AMAT = Tcache + miss_rate x Tmemory
Tcache = time to read the cache (8 ns for an SRAM cache)
Tmemory = time to read main memory (70 ns for DRAM)
miss_rate = probability of not finding what we want in the cache

Because of locality, miss_rate is very small:

Typically about 3% to 5%.

On average, the time to access memory is therefore very close to that of the cache (8 ns) rather than that of main memory (70 ns).
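As a quick worked example using the figures above: with Tcache = 8 ns, Tmemory = 70 ns and a 5% miss rate,

    AMAT = 8 + 0.05 x 70 = 11.5 ns

which is far closer to the 8 ns SRAM than to the 70 ns DRAM. With a 3% miss rate, AMAT = 8 + 0.03 x 70 = 10.1 ns.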

How do Caches Help?


Yet at the same time, we get the benefit of having large amounts of memory
This is because most of our memory is cheap DRAM!

The net effect is that caches allow us to have huge amounts of cheap, slow memory, yet have access times of fast, expensive memory.

Basics of Addressing
Memory Addresses
Memory is a huge array of data. Just like arrays, we must specify the index of the piece of data that we want, so that we can get it out of the array. In memory systems, this index is called an Address.

Where do Addresses Come From?


For instruction fetches, the address of the instruction (i.e. the location in memory where the instruction is) comes from the Program Counter.
For data accesses, the address comes from the ALU stage of the pipeline whenever we do a lw or sw operation.
In the MIPS architecture, addresses are 32-bit numbers.

Cache Architecture
Caches consist of blocks (or lines). Each block stores data from memory.

Block allocation problem:


Given data from an address A, how do we decide which block of the cache the data should go to?

The Block Allocation Problem


There are 3 possible solutions:
Data from each address A goes to one fixed block: Direct Mapped Cache.
Data from each address A may go to any block: Fully Associative Cache.
Data from each address A goes to a fixed set of blocks, and may be put into any block within that set: Set Associative Cache.

Direct Mapped Caches


The value of a portion of the memory address is used to decide which block to send the data to:
Address A: [ Tag | Block Index | Block Offset | Byte Offset ]

The Block Index portion is used to decide which block data from this address should go to.

Example
The number of bits in the block index is log2N, where N is the total number of blocks. For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the value of 00, 01, 10 or 11. The exact value of these 2 bits will determine which block the data for that address will go to.

Example
The value of the two block index bits will determine which block the data will go to, following the scheme shown below:
[Diagram: a 4-block cache, with blocks labeled 00, 01, 10 and 11]

Solving Direct-Mapped Cache Problems


Question 7.7
Basic formula:
Blk_Addr = floor(word_address/words_per_block) mod N
N here is the total number of blocks in the cache. This is the mathematical version of taking the value of the Block Index bits from the address (see the sketch below).
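The formula translates directly into integer arithmetic. Below is a minimal C sketch (the function name and example values are mine, not from the question):

    #include <stdio.h>

    /* Direct-mapped placement:
       Blk_Addr = floor(word_address / words_per_block) mod N */
    unsigned block_index(unsigned word_address,
                         unsigned words_per_block,
                         unsigned num_blocks)
    {
        unsigned block_address = word_address / words_per_block; /* integer division = floor */
        return block_address % num_blocks;                       /* mod N */
    }

    int main(void)
    {
        /* e.g. word address 22, 1-word blocks, 8-block cache -> 22 mod 8 = block 6 */
        printf("%u\n", block_index(22, 1, 8));
        return 0;
    }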

A Complication: Multiple Word Blocks


Single-word blocks do not support spatial locality:
Spatial locality: the likelihood of accessing the neighbor of a piece of data that was just accessed is high. But with single-word blocks, none of the neighbors are in the cache!
All accesses to neighbors that were not accessed before will miss!

An Example: Question 7.8

Accessing Individual Words


In our example, each block has 4 words. But we always access memory 1 word at a time! (e.g. lw) Use the Block Offset to specify which of the 4 words in a block we want to read:
Address A: [ Tag | Block Index | Block Offset | Byte Offset ]

The Block Offset


Number of block offset bits = log2M, where M is the number of words per block. For our example, M = 4, so the number of block offset bits is 2. These two bits can take on the values 00, 01, 10 and 11, and they determine exactly which word within a block address A is referring to. Note that for single-word blocks, the number of block offset bits is log2 1 = 0; i.e. there are no block offset bits for single-word blocks.

The Block Offset


[Diagram: a 4-block cache, 4 words per block; each block holds Word 00, Word 01, Word 10 and Word 11]

The value of the 2 block offset bits (see previous slide) determines whether our address A is referring to Word 00, Word 01, Word 10 or Word 11.

Who am I? Purpose of the Tag


Many different addresses may map to the same block, e.g. (Block Index portions shown in the middle column):

Tag   | Block Index | Offset bits
01000 | 00010010    | 00000000 00
01010 | 00010010    | 00000000 00
11011 | 00010010    | 00000000 00

All 3 addresses are different, but all map to block 00010010.

Disambiguation
We need a way to disambiguate this situation:
Otherwise, how do we know that the data in a given block actually comes from address A, and not from another address A' that has the same block index bit value?

The portion of address A to the left of the Block Index can be used for disambiguation. This portion is called the tag, and the tag of address A is stored in the cache together with the data from address A.

The Tag
[Diagram: each of the 4 blocks (00, 01, 10, 11) stores a Tag alongside Word 00, Word 01, Word 10 and Word 11]

When we access the cache, the Tag and Block Index portions of address A are extracted. The Block Index portion tells the cache controller which cache block to look at. The Tag portion is compared against the tag stored in that block. If the tags match, we have a cache hit, and the data is read from the cache.
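The hit check just described can be sketched in C (names and sizes are illustrative; a real cache would also check the valid bit, which is introduced later in these notes):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BLOCKS 4                   /* the 4-block example cache */

    static uint32_t cache_tag[NUM_BLOCKS]; /* the tag stored alongside each block */

    /* Direct mapped: go straight to the indexed block and compare one tag. */
    int dm_is_hit(uint32_t block_index, uint32_t tag)
    {
        return cache_tag[block_index] == tag;
    }

    int main(void)
    {
        cache_tag[2] = 0x1A;                /* pretend block 10 holds data tagged 0x1A */
        printf("%d\n", dm_is_hit(2, 0x1A)); /* 1: tags match -> hit */
        printf("%d\n", dm_is_hit(2, 0x2B)); /* 0: tags differ -> miss */
        return 0;
    }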

Accessing Individual Bytes


Address A: [ Tag | Block Index | Block Offset | Byte Offset ]

MIPS addresses are byte addresses, and actually index individual bytes rather than words. Each MIPS word consists of 4 bytes. The byte offset tells us exactly which byte within a word we are referring to.

Advantages & Disadvantages of Direct Mapped Caches


Advantages:
Simple to implement.
Fast performance:
Less time to detect a cache hit => less time to get data from the cache => faster performance.

Disadvantages
Poor temporal locality.
Many addresses may map to the same block. By the next time address A is accessed, its data may have been replaced by the contents of another address A' that maps to the same block.

Improving Temporal Locality: The Fully Associative Cache


In the fully associative cache, data from an address A can go to any block in cache.
In practice, data will go into the first available cache block. When the cache is full, a replacement policy is invoked to choose which block of cache to throw out.

Example
Question 7.22

Searching the Cache


In the fully associative cache, an address A is split into the following parts:
Address A: [ Tag | Block Offset | Byte Offset ]

The cache controller will search the entire cache to see if it can find a block with the same tag value as the tag portion of A. If it can find such a block, we have a cache hit, and the controller reads the data from the cache.
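A minimal C sketch of this search (array names and cache size are mine; the valid flag is the V bit discussed later in these notes):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BLOCKS 8 /* a small fully associative cache */

    static uint32_t cache_tag[NUM_BLOCKS];
    static int      cache_valid[NUM_BLOCKS];

    /* Fully associative: every block must be checked for a matching tag.
       Returns the matching block index, or -1 on a miss. */
    int fa_lookup(uint32_t tag)
    {
        for (int i = 0; i < NUM_BLOCKS; i++)
            if (cache_valid[i] && cache_tag[i] == tag)
                return i;
        return -1;
    }

    int main(void)
    {
        cache_tag[3]   = 0x1234;           /* pretend block 3 holds tag 0x1234 */
        cache_valid[3] = 1;
        printf("%d\n", fa_lookup(0x1234)); /* 3: hit */
        printf("%d\n", fa_lookup(0x9999)); /* -1: miss */
        return 0;
    }

Note that hardware does all these comparisons in parallel, one comparator per block; the loop is just the software analogue of that search.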

Advantages and Disadvantages: Fully Associative Cache


Advantages
Good temporal locality properties
Flexible block placement allows smart replacement policies such that blocks that are likely to be referenced again will not be replaced. E.g. LRU, LFU.

Disadvantages
Complex and too expensive for large caches
Each block needs a comparator to check the tag. With 8192 blocks, we need 8192 comparators!

A Compromise: Set Associative Caches


Represents a compromise between direct-mapped and fully associative caches. The cache is divided into sets of blocks. An address A is mapped directly to a set, using a scheme similar to that of direct-mapped caches. Once the set has been determined, the data from A may be stored in any block within that set - fully associative within a set!

Set Associative Cache


An n-way set associative cache will have n blocks per set. For example, for a 16-block cache that is implemented as a 2-way set associative cache, each set has 2 blocks, and we have a total of 8 sets.

Set Associative Cache


[Diagram: a 2-way set associative cache with 8 sets (Set 000 through Set 111), each containing Block 0 and Block 1]

An address A will be divided into:

Address A: [ Tag | Set Index | Block Offset | Byte Offset ]

Accessing a Set Associative Cache


Address A: [ Tag | Set Index | Block Offset | Byte Offset ]

The Set Index portion of address A is extracted. This is used to index the sets (i.e. If the Set Index portion is 010, then this address is mapped to Set 010). The tag portion of A is extracted and compared against the tags stored in Block 0 and Block 1 of Set 010.

Accessing a Set Associative Cache


If a match is made either in Block 0 or Block 1 of Set 010, then we have a cache hit, and the data for A is read from the cache block. If we have a miss, then the data for A is fetched from main memory, and placed in the first available block in Set 010. If no blocks are available, a replacement policy is invoked to choose a block to replace.
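A minimal C sketch of this lookup (sizes and names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_SETS 8 /* a small 2-way set associative cache with 8 sets */
    #define WAYS     2

    static uint32_t set_tag[NUM_SETS][WAYS];
    static int      set_valid[NUM_SETS][WAYS];

    /* Index straight to the set, then compare only WAYS tags.
       Returns the matching block within the set, or -1 on a miss. */
    int sa_lookup(uint32_t set_index, uint32_t tag)
    {
        for (int way = 0; way < WAYS; way++)
            if (set_valid[set_index][way] && set_tag[set_index][way] == tag)
                return way;
        return -1; /* miss: fetch from memory, place in a free or victim block */
    }

    int main(void)
    {
        set_tag[2][1]   = 0x42;             /* pretend Set 010, Block 1 holds tag 0x42 */
        set_valid[2][1] = 1;
        printf("%d\n", sa_lookup(2, 0x42)); /* 1: hit in Block 1 */
        printf("%d\n", sa_lookup(2, 0x17)); /* -1: miss */
        return 0;
    }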

Example
Question 7.20
Basic formula:
Blk_Addr = floor(word_address/words_per_block) mod N
Here N is the number of sets, NOT the number of blocks! This is the mathematical version of taking the value of the Set Index bits from the address.
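For instance (my own numbers, not from the question): word address 22, with 1-word blocks, in a 2-way set associative cache with 8 sets maps to set floor(22/1) mod 8 = 6, and may then occupy either block of that set.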

Multi-block Set vs. Multi-word blocks?


Confusion often arises between multi-block sets (i.e. n-way set associativity) and multi-word blocks. Each block in a set can itself have multiple words, like the blocks in Question 7.8. Each block also has its own tag.

Advantages and Disadvantages: Set Associative Cache


Advantages
Almost as simple to build as a direct-mapped cache. Only n comparators are needed for an n-way set associative cache. For 2-way set-associative, only 2 comparators are needed to compare tags. Supports temporal locality by having full associativity within a set.

Advantages and Disadvantages: Set Associative Cache


Disadvantages
Not as good as a fully associative cache at supporting temporal locality. With LRU and such small associativity, it is actually possible to have a 0% hit rate on temporally local data. E.g. if our accesses are A1 A2 A3 A1 A2 A3, and A1, A2 and A3 all map to the same 2-way set, then the hit rate is 0%, as each access replaces a block that is just about to be accessed again.

Writing to Cache
Remember that data in the cache is merely a copy of data in main memory. When data stored in a cache block is modified (e.g. by a sw to address A), the copy in the cache becomes inconsistent with the copy in memory. We need a way to maintain consistency.

Memory/Cache Consistency: 2 Solutions


Write-through cache
In the write-through cache, consistency between cache data and memory data is maintained by updating both main memory and the cache. This is very slow:
We must wait for both the cache and memory writes to complete before the CPU can proceed, and memory writes are very slow!

Memory/Cache Consistency: 2 Solutions


Write-back Cache
Only the cache copy of the data is updated. When the data in a block is updated, a special flag called the dirty bit is set to indicate that the cache copy is now inconsistent with the memory copy. If the block is later chosen for replacement (either by the replacement policy or because another address A' maps to the same block), the memory copy is updated first if the dirty bit is set. If the dirty bit is not set, the block is simply replaced.
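The dirty-bit bookkeeping can be sketched in C as follows (all names are mine, and only the consistency logic is shown):

    #include <stdint.h>

    /* One write-back cache block; valid and dirty are the V and D flags
       discussed in the housekeeping section later in these notes. */
    struct wb_block {
        int      valid;
        int      dirty;
        uint32_t tag;
        uint32_t data;
    };

    /* Store hit: update only the cache copy and mark the block dirty. */
    void wb_store(struct wb_block *blk, uint32_t value)
    {
        blk->data  = value;
        blk->dirty = 1;               /* cache copy now differs from memory */
    }

    /* Replacement: write back to memory only if the block is dirty. */
    void wb_evict(struct wb_block *blk, uint32_t *memory_word)
    {
        if (blk->valid && blk->dirty)
            *memory_word = blk->data; /* bring the memory copy up to date */
        blk->valid = 0;
        blk->dirty = 0;
    }

    int main(void)
    {
        struct wb_block b = { 1, 0, 0x5, 100 };
        uint32_t mem = 100;

        wb_store(&b, 200);  /* sw hits: cache holds 200, memory still holds 100 */
        wb_evict(&b, &mem); /* block replaced: dirty, so memory becomes 200 */
        return 0;
    }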

Nitty-Gritty: Use of Addresses by Cache


Addresses are used to access the cache. For a Direct Mapped Cache:
Address A: [ Tag | Block Index | Block Offset | Byte Offset ]

Nitty-Gritty: Use of Addresses by Cache


Direct Mapped Cache
Address A: [ Tag | Block Index | Block Offset | Byte Offset ]

Number of byte offset bits B = log2(number of bytes per word); on MIPS this is usually 2 bits.
Number of block offset bits W = log2(number of words per block); 0 bits for 1-word blocks.
Number of block index bits I = log2(number of blocks).
Number of tag bits = address_length - B - W - I, where address_length is 32 bits on MIPS.
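These widths map directly onto shifts and masks. A C sketch for one hypothetical configuration (4 bytes per word, 4 words per block, 256 blocks; the address value is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    #define B 2 /* byte offset bits:  log2(4 bytes per word)  */
    #define W 2 /* block offset bits: log2(4 words per block) */
    #define I 8 /* block index bits:  log2(256 blocks)        */

    int main(void)
    {
        uint32_t addr = 0x0040A2F4; /* arbitrary example address */

        uint32_t byte_offset  =  addr             & ((1u << B) - 1);
        uint32_t block_offset = (addr >> B)       & ((1u << W) - 1);
        uint32_t block_index  = (addr >> (B + W)) & ((1u << I) - 1);
        uint32_t tag          =  addr >> (B + W + I); /* remaining 32 - B - W - I bits */

        printf("tag=0x%x index=%u word=%u byte=%u\n",
               (unsigned)tag, (unsigned)block_index,
               (unsigned)block_offset, (unsigned)byte_offset);
        return 0;
    }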

Nitty-Gritty: Use of Addresses by Cache


Fully Associative Cache
Address A: [ Tag | Block Offset | Byte Offset ]

Number of byte offset bits B = log2(number of bytes per word); on MIPS this is usually 2 bits.
Number of block offset bits W = log2(number of words per block); 0 bits for 1-word blocks.
Number of tag bits = address_length - B - W, where address_length is 32 bits on MIPS.
Note that there are no index bits for fully associative caches.

Nitty-Gritty: Use of Addresses by Cache


Set-Associative Cache
Address A: [ Tag | Set Index | Block Offset | Byte Offset ]

Number of byte offset bits B = log2(number of bytes per word); on MIPS this is usually 2 bits.
Number of block offset bits W = log2(number of words per block); 0 bits for 1-word blocks.
Number of set index bits S = log2(number of sets).
Number of tag bits = address_length - B - W - S, where address_length is 32 bits on MIPS.

Example
A cache built for the MIPS architecture has a total size of 128 KB. Find the total number of tag, set index, block index, block offset, and byte offset bits in a given address A for each of the following cache architectures:
Direct mapped, 1 word per block.
Direct mapped, 8 words per block.
Fully associative, 2 words per block.
2-way set associative, 4 words per block.

Example
Basic things you first need to work out:
What types of information do I need to determine for each cache architecture?
E.g. for set-associative, need to determine byte-offset, block offset, set index and tag bits.

What is the cache size in terms of words? What is the total number of blocks that we would have, or the total number of sets?
This will give us the number of index bits.

Any other important information?
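As a sketch of the first case (assuming, as usual, that the 128 KB refers to data capacity): 128 KB = 2^17 bytes = 2^15 words. With 1 word per block and direct mapping there are 2^15 blocks, so B = 2, W = 0, I = 15, and the tag is 32 - 2 - 0 - 15 = 15 bits. The other three cases follow the same pattern with their own values of W, I or S.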

Nitty-Gritty: Cache Housekeeping Flags


Other than the data and tag bits, cache blocks need to store housekeeping flags. The dirty bit (D) we saw earlier is an example. Other bits include:
Valid bit (V)
When a cache first starts up, the tag and data bits are random. It would then be possible to get a cache hit just because the tag from an address happens to match a random number in the tag field of a block. But the data there is random and invalid! The Valid bit is therefore initially off, and is set only when valid data is written to a block.

Nitty-Gritty: Cache Housekeeping Flags


Use Bit (U)
This is used by the LRU replacement algorithm to determine which block is least recently used. It is present only in fully associative and set-associative caches that use LRU replacement policies.

A cache block may thus look as complicated as this:
[ V | D | U | Tag | Word 00 | Word 01 | Word 10 | Word 11 ]
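The block pictured above can also be written down as a C struct (a sketch; the field names are mine, and the data width follows the 4-words-per-block example):

    #include <stdint.h>

    struct cache_block {
        unsigned valid : 1; /* V: the block holds real data                   */
        unsigned dirty : 1; /* D: cache copy differs from memory (write-back) */
        unsigned use   : 1; /* U: bookkeeping for the LRU replacement policy  */
        uint32_t tag;       /* disambiguates addresses that share this block  */
        uint32_t word[4];   /* Word 00, Word 01, Word 10, Word 11             */
    };

    int main(void)
    {
        struct cache_block b = {0}; /* at "power-up" V = 0: contents invalid */
        b.valid = 1;                /* set only once real data is written in */
        return 0;
    }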

Total Number of Bits in Cache?


When we speak of cache size, we normally refer to how many bytes of main memory data the cache can hold
E.g. a 64KB cache can hold up to 64KB of main memory data

But as we can see, data is not the only thing stored in a cache block.
We also have the tag and housekeeping flags!

Thus the total number of bits needed to implement a cache can be much bigger than the specified cache size!

Example
We want to implement a 256 KB write-back cache on the MIPS architecture. The cache will be 4-way set associative, with 4-word blocks. The LRU replacement policy will be used. Find the total number of bits of SRAM required to implement this cache.

Example
Analysis
What housekeeping flags will be needed?
What is the size of the data portion of each block?
What is the number of blocks? The number of sets?
What is the number of tag bits?
What is the number of byte offset, block offset and set index bits required?

Based on this analysis, you should be able to get the answer.
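One possible working, assuming a single use bit per block as in the block layout shown earlier (a real 4-way LRU scheme actually needs more state per set): 256 KB of data = 2^18 bytes = 2^16 words = 2^14 four-word blocks = 2^12 sets, so B = 2, W = 2, S = 12, and the tag is 32 - 2 - 2 - 12 = 16 bits. Each block then stores 4 x 32 = 128 data bits, plus 16 tag bits, plus the V, D and U flags: 147 bits in all. The total is 2^14 x 147 = 2,408,448 bits, i.e. about 294 KB of SRAM to implement a "256 KB" cache.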

Summary
Caches
Make use of locality so that a small amount of fast, expensive memory can hold a copy of the main memory data that is likely to be accessed soon. This allows fast access to huge amounts of memory.

Cache types
Direct Mapped
Simple, fast.
Poor temporal locality.

Summary
Fully Associative
Flexible block placement allows smart replacement algorithms that promote temporal locality.
Expensive and slow.

Set Associative
Simpler to build than fully associative, yet gives good temporal locality through flexible placement of blocks (just like fully associative).
Limited associativity can sometimes give poor performance.

Summary
Writing policies
Write-through
Simple to implement.
Slow.

Write-back
Fast.
Difficult to implement.

Housekeeping flags
Extra information (e.g. valid, dirty and use bits) is needed for the running of the cache.

Total Cache Sizes vs. Cache Sizes


Not the same thing! The total size includes the tag and housekeeping bits, not just the data.
