Memory Hierarchy
Motivation
Not all memory is created equal
Cheap memory => slow. Fast memory => expensive.
DRAM: 70 ns access time, $5/MByte
SRAM: 8 ns access time, $100/MByte
The purpose of a memory hierarchy is to give us huge amounts of cheap, slow memory that operates close to the speed of expensive but fast memory.
How do we do that?
We can achieve this by using locality. Locality is the behavior of programs such that:
Data or instructions that were recently accessed are likely to be accessed again in the near future (Temporal Locality). For example, the variable x in this code fragment has temporal locality:
for (int i = 0; i < 100; i++) y = y + x;
The neighbors of recently accessed data or instructions are likely to be accessed as well (Spatial Locality). Program execution tends to be sequential, so if an instruction was just executed, the instruction next to it is likely to be executed soon.
Caches
Locality means that only small portions of main memory will be used in the near future. We can create a small memory system with fast but expensive devices to store these portions of main memory. We can then access these portions of memory very quickly. This is the concept behind caches.
T_cache = time to read the cache (8 ns for SRAM)
T_memory = time to read main memory (70 ns for DRAM)
miss_rate = probability of not finding what we want in the cache
Average access time = T_cache + miss_rate × T_memory
On the average, time to access memory is very close to that of the cache (8ns) rather than that of the main memory (70ns).
The net effect is that caches allow us to have huge amounts of cheap, slow memory, yet have access times of fast, expensive memory.
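The average-access-time argument above can be checked numerically. The sketch below uses the standard formula T_cache + miss_rate × T_memory with the SRAM/DRAM timings from the slides; the 2% miss rate is an illustrative assumption, not a figure from the slides.

```python
# Average memory access time: a hit costs T_cache; a miss additionally
# pays the main-memory access time T_memory.
def amat(t_cache_ns, t_memory_ns, miss_rate):
    return t_cache_ns + miss_rate * t_memory_ns

# With the figures above and an assumed 2% miss rate:
# 8 + 0.02 * 70 = about 9.4 ns, much closer to the 8 ns cache
# than to the 70 ns main memory.
print(amat(8, 70, 0.02))
```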
Basics of Addressing
Memory Addresses
Memory is a huge array of data. Just like arrays, we must specify the index of the piece of data that we want, so that we can get it out of the array. In memory systems, this index is called an Address.
Cache Architecture
Caches consist of blocks (or lines). Each block stores data from memory:
The Block Index portion is used to decide which block data from this address should go to.
Example
The number of bits in the block index is log2N, where N is the total number of blocks. For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the value of 00, 01, 10 or 11. The exact value of these 2 bits will determine which block the data for that address will go to.
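The block-index calculation described above can be sketched in a few lines. This is an illustrative helper (the names are mine, not from the slides), assuming 4-byte MIPS words:

```python
# For a direct-mapped cache with N blocks, the block index is the low
# log2(N) bits of the word address (the address above the byte offset).
def block_index(address, num_blocks, bytes_per_word=4):
    word_address = address // bytes_per_word   # drop the byte offset
    return word_address % num_blocks           # low log2(N) bits

# In a 4-block cache, consecutive word addresses cycle through
# blocks 0, 1, 2, 3, 0, 1, ...
print(block_index(0x0C, 4))  # word address 3 -> block 3
print(block_index(0x10, 4))  # word address 4 -> block 0
```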
Example
The value of the two block index bits will determine which block the data will go to, following the scheme shown below:
[Diagram: a 4-block cache, blocks indexed 00 to 11. Each block stores a Tag and four data words (Word 00 to Word 11); the address is split into Tag, Block Index, Block Offset and Byte Offset fields.]
The value of the 2 block offset bits (see previous slide) determines whether our address A refers to word 00, word 01, word 10 or word 11.
[Worked example: a 32-bit address written in binary and split into its Tag, Block Index, Block Offset and Byte Offset fields.]
Disambiguation
We need a way to disambiguate the situation
Otherwise, how do we know that the data in block x actually comes from address A and not from some other address B that has the same block index bit value?
The portion of address A to the left of the Block Index can be used for disambiguation. This portion is called the tag, and the tag for address A is stored in the cache together with the data for address A.
The Tag
[Diagram: the 4-block cache again, blocks indexed 00 to 11; each block now stores a Tag alongside its four data words (Word 00 to Word 11).]
When we access the cache, the Tag portion and Block Index portions of address A are extracted. The Block Index portion will tell the cache controller which block of cache to look at. The Tag portion is compared against the tag stored in the block. If the tags match, we have a cache hit. The data is read from the cache.
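The lookup steps just described (split the address, index the block, compare tags) can be sketched as a tiny simulation. This is a teaching sketch, not how the MIPS hardware is built; all names are illustrative, and a 4-block direct-mapped cache with 4-byte words is assumed:

```python
# Minimal direct-mapped cache lookup.
NUM_BLOCKS = 4
BYTES_PER_WORD = 4

# Each block holds (valid, tag, data); all blocks start invalid.
blocks = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_BLOCKS)]

def split(address):
    word_addr = address // BYTES_PER_WORD      # drop the byte offset
    index = word_addr % NUM_BLOCKS             # block index bits
    tag = word_addr // NUM_BLOCKS              # everything to the left
    return tag, index

def lookup(address):
    tag, index = split(address)
    blk = blocks[index]
    return blk["valid"] and blk["tag"] == tag  # True on a cache hit

def fill(address, data):
    tag, index = split(address)
    blocks[index] = {"valid": True, "tag": tag, "data": data}

fill(0x40, "hello")
print(lookup(0x40))  # True: same index, matching tag -> hit
print(lookup(0x80))  # False: same index, different tag -> miss
```

Note how 0x40 and 0x80 map to the same block: without the stored tag, the cache could not tell their data apart.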
[Diagram: address fields Tag | Block Index | Byte Offset]
MIPS addresses are byte addresses, and actually index individual bytes rather than words. Each MIPS word consists of 4 bytes. The byte offset tells us exactly which byte within a word we are referring to.
Disadvantages
Poor temporal locality.
Many addresses may map to the same block. The next time address A is accessed, its data may have been replaced by the contents of another address B that maps to the same block.
Example
Question 7.22
[Diagram: address fields for a fully associative cache: Tag | Block Offset | Byte Offset. There is no index field.]
The cache controller will search the entire cache to see if it can find a block with the same tag value as the tag portion of A. If it can find such a block, we have a cache hit, and the controller reads the data from the cache.
Disadvantages
Complex and too expensive for large caches
Each block needs a comparator to check the tag. With 8192 blocks, we need 8192 comparators!
[Diagram: a 2-way set-associative cache with eight sets (Set 000 to Set 111), each holding Block 0 and Block 1.]
The Set Index portion of address A is extracted. This is used to index the sets (i.e. If the Set Index portion is 010, then this address is mapped to Set 010). The tag portion of A is extracted and compared against the tags stored in Block 0 and Block 1 of Set 010.
Example
Question 7.20
Basic formula:
Blk_Addr = floor(word_address/words_per_block) mod N
Here N is the number of sets, NOT the number of blocks! (For a direct-mapped cache, the number of sets equals the number of blocks.) This is the mathematical version of taking the value of the index bits from the address.
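The Blk_Addr formula translates directly into code. A quick sketch, with illustrative numbers (an 8-set cache with 2-word blocks; these parameters are my assumption, not the textbook question's):

```python
# Blk_Addr = floor(word_address / words_per_block) mod N,
# where N is the number of sets.
def blk_addr(word_address, words_per_block, num_sets):
    return (word_address // words_per_block) % num_sets

# Word address 22 with 2-word blocks: block address 11,
# and 11 mod 8 = 3, so the block maps to set 3.
print(blk_addr(22, 2, 8))
```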
Writing to Cache
Remember that the data in a cache is merely a copy of data in main memory. When data stored in a cache block is modified (e.g. by a sw to address A), the copy in the cache becomes inconsistent with the copy in memory. We need a way to maintain consistency.
[Diagram: address fields Tag | Block Index | Block Offset | Byte Offset]
Number of Byte Offset bits: B = log2(number of bytes per word); on MIPS this is usually 2 bits.
Number of Block Offset bits: W = log2(number of words per block); 0 bits for 1-word blocks.
Number of Block Index bits: I = log2(number of blocks).
Number of Tag bits = address_length - B - W - I, where address_length is 32 bits on MIPS.
Number of Byte Offset bits: B = log2(number of bytes per word); on MIPS this is usually 2 bits.
Number of Block Offset bits: W = log2(number of words per block); 0 bits for 1-word blocks.
Number of Tag bits = address_length - B - W, where address_length is 32 bits on MIPS.
Note that there are no index bits for fully associative caches.
Number of Byte Offset bits: B = log2(number of bytes per word); on MIPS this is usually 2 bits.
Number of Block Offset bits: W = log2(number of words per block); 0 bits for 1-word blocks.
Number of Set Index bits: S = log2(number of sets).
Number of Tag bits = address_length - B - W - S, where address_length is 32 bits on MIPS.
Example
A cache built for the MIPS architecture has a total size of 128 KB. Find the total number of tag, set index, block index, block offset, and byte offset bits for a given address A for each of the following cache architectures:
Direct mapped, 1 word per block
Direct mapped, 8 words per block
Fully associative, 2 words per block
2-way set associative, 4 words per block
Example
Basic things you first need to work out:
What types of information do I need to determine for each cache architecture?
E.g. for set-associative, need to determine byte-offset, block offset, set index and tag bits.
What is the cache size in terms of words? What is the total number of blocks that we would have, or the total number of sets?
This will give us the number of index bits.
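The analysis steps above can be carried out mechanically. The sketch below works through the four configurations of the 128 KB example; treat it as a worked check of the bit formulas rather than the official solution (the function and its return order are mine):

```python
from math import log2

ADDRESS_LENGTH = 32          # bits in a MIPS address
CACHE_BYTES = 128 * 1024     # 2**17 bytes = 2**15 words

def fields(words_per_block, associativity):
    """Return (tag, index, block_offset, byte_offset) bit counts.
    associativity=1 means direct mapped; None means fully associative."""
    byte_offset = 2                                  # log2(4 bytes per word)
    block_offset = int(log2(words_per_block))
    num_blocks = CACHE_BYTES // (4 * words_per_block)
    if associativity is None:                        # fully associative: no index
        index = 0
    else:
        index = int(log2(num_blocks // associativity))
    tag = ADDRESS_LENGTH - byte_offset - block_offset - index
    return tag, index, block_offset, byte_offset

print(fields(1, 1))     # direct mapped, 1 word/block:      (15, 15, 0, 2)
print(fields(8, 1))     # direct mapped, 8 words/block:     (15, 12, 3, 2)
print(fields(2, None))  # fully associative, 2 words/block: (29, 0, 1, 2)
print(fields(4, 2))     # 2-way set assoc, 4 words/block:   (16, 12, 2, 2)
```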
A cache block may thus look as complicated as this:
[Diagram: a block holding V (valid), D (dirty) and U (use/LRU) flags, the Tag, and four data words (Word 00 through Word 11).]
But as we can see, data is not the only thing stored in a cache block.
We also have the tag and housekeeping flags!
Thus the total number of bits needed to implement a cache can be much bigger than the specified cache size!
Example
We want to implement a 256KB write-back cache on the MIPS architecture. The cache will be 4-way set associative, with 4 word blocks. The LRU replacement policy will be used. Find the total number of bits of SRAM required to implement this cache.
Example
Analysis
What housekeeping flags will be needed?
What is the size of the data portion of each block?
What is the number of blocks? What is the number of sets?
What is the number of tag bits?
What is the number of byte offset, block offset and set index bits required?
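Those questions can be answered step by step for the 256 KB cache. One common convention, assumed here, is 2 LRU bits per block for 4-way LRU; textbooks differ on how LRU state is counted, so check your course's convention before relying on the exact total:

```python
CACHE_BYTES = 256 * 1024             # 2**18 bytes = 2**16 words
WORDS_PER_BLOCK = 4
ASSOCIATIVITY = 4

num_blocks = CACHE_BYTES // (4 * WORDS_PER_BLOCK)   # 2**14 = 16384 blocks
num_sets = num_blocks // ASSOCIATIVITY              # 2**12 = 4096 sets

byte_offset = 2                                     # log2(4 bytes per word)
block_offset = 2                                    # log2(4 words per block)
set_index = 12                                      # log2(4096 sets)
tag_bits = 32 - byte_offset - block_offset - set_index   # 16 tag bits

data_bits = WORDS_PER_BLOCK * 32     # 128 data bits per block
valid, dirty, lru = 1, 1, 2          # housekeeping flags per block (LRU assumed)
bits_per_block = data_bits + tag_bits + valid + dirty + lru  # 148 bits

total_bits = num_blocks * bits_per_block
print(total_bits)  # 2424832 bits = 296 KB of SRAM for a "256 KB" cache
```

The overhead beyond the 256 KB of data illustrates the earlier point: tags and housekeeping flags make the implemented cache noticeably bigger than its nominal size.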
Summary
Caches
Make use of locality so that small amounts of fast, expensive memory can hold a copy of the main memory data that is likely to be accessed soon. This allows fast access to huge amounts of memory.
Cache types
Direct Mapped
Simple and fast. Poor temporal locality.
Summary
Fully Associative
Flexibility of block placement allows smart replacement algorithms that promote temporal locality. Expensive and slow.
Set Associative
Simpler to build than fully associative, yet gives good temporal locality through flexible placement of blocks (just like fully associative). Limited associativity can sometimes give poor performance.
Summary
Writing policies
Write-through
Simple to implement. Slow.
Write-back
Fast. Difficult to implement.
Housekeeping flags
Extra information (valid, dirty and use flags) is needed for the running of the cache.