Академический Документы
Профессиональный Документы
Культура Документы
o o o o o o o
N-Way Set Associative Cache Sizes of Fields for N-Way Set Associative 2-Way Set Associative Cache 3-Way Set Associative Cache Set Associative vs. Direct Mapped
Fully Associative Cache Unifying Theory Cache Design and Other Details
o o o o
The purpose of this document is to help people have a more complete understanding of what memory cache is and how it works. I discuss the implementation and comparative advantages of direct mapped cache, N-way set associative cache, and fully-associative cache. Also included are details about cache design. I try to give a complete treatment of the more important aspects of cache. This is a supplement for the notes and text, not a replacement. Although this text can be read on its own without reference to the book or the notes, this is more text heavy, while the notes provide more valuable visual aides and a far more concise overview than I provide here. Also, there are many aspects of cache described in the notes that I do not describe in this version of the text, like sub-blocks. The reason you should care is because a programmer can design software to take advantage of cache. If a programmer is aware of cache and its trappings, it is a simple matter to arrange memory accesses to take advantage of cache and have faster run times. A great proprotion of computing is accessing memory, and anything that can be done to speed memory accesses can only be a good thing. Because of this, I consider cache to be one of the more important topics of computer hardware. Because cache is such a multifaceted subject, and because learning about cache requires the introduction of a lot of new ideas, this document is necessarily long. As with all of these notes, I tried to make each section portable, so that if you only need to firm up your understanding on one subject, you may skip to the relevant section.
Overview
The purpose of cache is to speed up memory accesses by storing recently used chunks of memory. Now, main memory (RAM) is nice and cheap. Not as cheap as disk memory, but cheap nevertheless. The disadvantage of this sort of memory is that it is a bit slow. An access from main memory into a
register may take 20-50 clock cycles. However, there are other sorts of memory that are very fast, that can be accessed within a single clock cycle. This fast memory is what we call cache. However, this fast memory is prohibitively expensive, so we can't compose our main memory entirely of it. However, we can have a small amount, say 8K to 128K on average, in which we store some of our data. But what data do we choose? The way it happens, we choose data that has been most recently used. This is a policy that takes advantage of something called spatial and temporal locality. Spatial and temporal locality of memory accesses refers to the idea that if you access a certain chunk of memory, in the near future you're likely to access that memory, or memory near it, again. For example, you execute consecutive instructions in memory, so the next dozen of so instructions that you access are probably contained in a given contiguous block of data, so the next instruction would be accessed close by. Similarly, if you're accessing elements of an array, they will be located consecutively, so it makes sense to store a block of data that has recently been used. Separate variables called in a function are usually close by as well. That is what cache does. When we want to access memory at a certain address, we look in the cache to see if it is there. If it is there, we get it from cache instead of going all the way to memory; that situation is called a "cache hit." If the data is not in cache, then we bring the data from memory to cache, which takes longer since the data must be brought in from the smaller memory; that situation is called a "cache miss," and that delay that we endure because that memory must be brought from main memory is called a "miss penalty."
more cache blocks up here... V T L N N+1 N+2 N+3 N+4 more cache blocks down here...
You can think about the direct mapped cache this way. Each row in the table to the left represents a cache block. We have our valid bit which tells us if this cache block currently holds data. Next is our tag, a number that tell us where in memory this bit is from. After that, we have our line, which is the data that we have stored in cache. The number to the right is just the cache index. Cache blocks are indexed like elements in an array, and there is a cache index associated with each cache block. As you see, I show blocks N through N+4, where N is some integer such that these are valid cache blocks for this particular cache.
As a nota bene, you must know that for dirct mapped cache, cache sizes and line sizes (and thus the number of cache blocks) are made in powers of two. It works out better for the hardware design if this is the case. Therefore, we're not going to see direct mapped cache's with 15 kilobytes, or 700 bytes, or what have you. We're working only in powers of two for direct mapped cache. For a direct mapped cache, suppose we have a cache size of 2M bytes with a line size of 2L bytes. That means that there are 2M-L cache blocks. Now, we use the address of the data we're trying to access to determine which of these cache blocks we should look at to determine if the data we want is already in memory.
Tag
Index
Byte Select
We use the address in this way. Certain bits in our 32 bit memory address have special significance. We group the bits into three distinct groups we call fields. In this discussion, M and L are defined just as they were a paragraph ago. The rightmost L bits of the address is the byte select field. If data is found in a cache block, since addresses are byte addressed, we use this field to determine which of the bytes in a line we're trying to access. The M-L bits just to the left of the byte select field is called the index field. As we said before, we have 2M-L cache blocks in our cache. These cache blocks are ordered, just as in an array. There is a first one (with index 0) and a last one (with index M-L-1). If we're given an address, the value represented by the bits of the index field tell us which of these cache blocks we should examine. The leftmost 32-M bits of the address, just to the left of the index field, is called the tag field. This field is put into the tag part of the cache block, and that uniquely identifies from where in main memory the data in the line of this cache block came from. If we're looking at a cache block, we presumably know what index of this cache is, and therefore in conjunction with the cache tag we can compose the addresses from which the data in the line came from. When we store a line of data into a cache block, we store the value represented by the tag field into the tag part of the cache block. As a side note, it is not written in stone that the index field must be composed of the bits immediately to the left of the byte select field. However, the way that memory accesses tend to work, it happens that if the index field is in the bits immediately to the left of the byte select field, when we look for data, it is in cache rather than in main memory a larger proportion of the time. While there might be some special cases in which having the index field elsewhere would yield a higher hit ratio, in general it is best to have the index field immediately to the left of the byte select field. When we're checking cache to determine if the memory at the address we're looking for is actually in cache, we take the address, and extract the index field. We find the corresponding cache block. We then check to see if this cache block is valid. If it's not valid, then nothing is stored in this cache block, so we obviously must reference main memory. However, if the valid bit is set, then we compare the tag in the cache block to the tag field of our address. If the tags are the same, then we know that data we're looking for and the data in the cache block are the same. Therefore, we may access this data from cache. At this point we take the value in the byte select field of the address. Suppose this value is N. We are then looking for byte N in this line (hence the name byte select).
Example
However, these things are always more clear when we talk about a specific example. Let's talk about my first computer, a Macintosh IIcx. Suppose we have a direct mapped cache whose cache lines are 32 bytes (25 bytes) in length, and suppose that the cache size is 512 bytes (29 bytes). Since there are 512 bytes in the cache, and 32 bytes per cache block, that means that there are 512/32 = 16 cache blocks. In reality, a 512 byte cache is a very small cache; you'd never find a cache that small in today's computers. However, it makes the example that follows managable. Now, suppose we have addresses that are thirty-two bits in length. In this example, taking M and L to have the same definitions as they did at the beginning of our discussion on direct mapped cache, M = 9 and L = 5. So, our byte select field is the rightmost 5 bits. Our index field is the next ML = 4 bits. Our tag field is the leftmost 32-M = 23 bits. This actually makes sense. Since we have 16 cache blocks, we can uniquely represent the 16 possibilities of cache blocks we may address by 4 bits, which is the size of our index field. Similarly, we have 32 bytes that we can access in the data potion of our cache block since line size is 32 bytes, and we may uniquely address these 32 bytes with 5 bits, and indeed that is the size of our byte select field. To illustrate the working of this, I'll go step by step through a series of memory accesses. In my table represenation of cache I omitted the cache line to save space, because it really doesn't matter what the line holds. The data in cache is the data in cache, and that's all there is to it. We're interested in what memory is accessed where, because that is what affects the workings of the cache. That changes the state of the valid bits and the tag numbers of our cache blocks. For this example, I provide a series of tables, which represents the cache blocks, and their corresponding valid bits and tags. The first number in a row in the table is the index of this particular cache block. The next number is the valid bit. The rightmost number is the tag, which tells us where the data in the line (which I do not show) is from. Initially The table for the initial conditions shows that the valid bits are all set to zero, because nothing has yet been stored in cache. Step 1 For step one, suppose we access the memory at address 0x0023AF7C. The looking at that in binary, that is 00000000001000111010111101111100 2. If we separate these bits into the lengths of the fields we have determined with dashes between each field, that is more easily read as 00000000001000111010111-1011-11100. So, our index is 1011 2=1110. We look at index 11. Is there anything in there yet? We see there is not. Therefore, we load the data contained at memory addresses 0x0023AF60 through 0x0023AF7F into the 32 byte line of the cache block with index 11.
Note that the memory addresses between those two addresses are the memory addresses that would have the same tag and index fields if we broke the addresses into their respective fields. Now that the data is loaded into the line, we set the tag for this cache block to be the same as the tag of our address, since this cache block will now hold a line of data from the addresses with the same index and tag. This tag is 1000111010111 2=456710. We also must set our valid bit to true since this cache block now holds a block of data. These changes are reflected in the table for step 1. After these changes and the memory loads, we know that the data we want is in cache block 11. Our byte select field is 111002 = 2810, so we get the 29th byte of our line. Since 0 would address the first byte, 28 addresses the 29th byte. It's exactly as though the cache line is an array of chars (a byte length quantity), and the byte select is the index of the particular element of that array that we're looking at right now. Now, this is a cache miss, but there are different sorts of cache misses. Notice that no data has been loaded into cache yet. That is because no memory accesses have taken place. Because we're just starting out, this is called a "compulsory miss." We cannot help but not have data in this cache block yet, because we just started. Step 2 After step 1, suppose we have another memory access, this time at memory address 0x0023AE4F. Ignoring the intervening binary operations, this works out to have a byte select field of 15, an index field of 2, and a tag field of 4567. If we look at cache block 2, we see that it is not valid. Similar to last time, we load the 32 bytes of memory from addresses 0x0023AE40 to 0x0023AE5F into the line of this cache block, change the tag to 4567, and set the valid bit to true. The tag field is the same as in the previous operation; this was intentional on my part. As we see, the fact that the tag is the same here as it was last time is of no significance, since the two are part of completely different indexed cache blocks; they are unrelated insofar as cache is concerned. Similar to last time, the data we want is in the cache line. Our byte select field is 15, so we access the 16th byte (since 0 would address the first byte) of the line in cache block 2. Even though we've already had a memory access in step two, this miss is also called a compulsory miss because the cache block that we're loading into hasn't yet had any data loaded into it. Step 3 Suppose now we have a memory access at address 0x148FE054 (in binary, 00010100100011111110000 0010 10100). This works out to have a byte select field of 10100 2=2010, index field of 0010 2=210, and a tag field of 000101001000111111100002=67377610. We look at cache block 2. We see that it is valid. We then compare the tags. 673776 is not equal to 4567, which is the tag of cache block 2. This means that, even though there is data in this cache block, the data we want is not in here. We must load the data, and we do so just as if this block had not been valid. We load the data at addresses 0x147FE040 to 0x147FE05F into the line at cache block 2. We set the tag to 673776. We don't need to set the valid bit to one since it already is one. Now this cache block holds a new portion of data. As you see, we had two pieces of data that wanted to be in the same cache block: there was the data we loaded in the previous step, and the data we just loaded in this step. Now, this is not a conflict miss. What would be a conflict miss is if we tried to load the data that used to be in the cache (the data that we just replaced) back into cache. When this sort of miss occurs, that is called a "conflict miss," or rather a "collision." We also see one of the primary weaknesses of the direct mapped cache. At the end of step two we had two blocks used up, which means that we're using 64 bytes of our 512 byte cache. However, in this step, step three, because of a conflict miss we overwrote one of those blocks even though we had 448 bytes unused in our cache. Though this example is completely contrived, this sort of cache inefficiency is a common problem with direct mapped cache. Once we load the data into this cache and set the tag field appropriately, we can find the byte that we want using the byte select field, at index 2 (the 3rd byte). Step 4 Now suppose we try to access memory address 0x0023AF64. This works out to have byte select field of 001002=410, index field of 10112=1110, and tag field of 10001110101112=456710. We look at cache block 11, and it is valid, so we compare the tags. The tags are the same, it turns out, so we finally have a cache hit! The data that we want is in this cache block. Since our byte select field is 4, we access the fifth byte of this block's cache line. We don't modify cache at all; tags and valid bits remain the same, and we needn't pull any data from main main memory since the memory we want has already been loaded into cache.
This block was loaded with data in step one. The memory access for step one was 0x0023AF7C. Now, this step's memory access, 0x0023AF64, is obviously a different address than that, but 0x0023AF64 and 0x0023AF7C are pretty close. Since in the memory access for 0x0023AF7C we loaded that cache block's line with memory from 0x0023AF60 to 0x0023AF7F (32 bytes in all), the data for 0x0023AF64 could be found there. So, two memory accesses may access the same data in cache even though they are not exactly the same.
This set of tables trace the 16 cache blocks of the cache example through each of the memory accesses described above. Each "step" table reflects the state of cache at the end of each step. Initially 0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 Step 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 Step 2 0 0 0 0 1 4567 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 Step 3 0 0 0 0 1 673776 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 Step 4 0 0 0 0 1 673776 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 11 0 0 12 0 0 13 0 0 14 0 0 15 0 0
10 0 0 11 1 4567 12 0 0 13 0 0 14 0 0 15 0 0
10 0 0 11 1 4567 12 0 0 13 0 0 14 0 0 15 0 0
10 0 0 11 1 4567 12 0 0 13 0 0 14 0 0 15 0 0
10 0 0 11 1 4567 12 0 0 13 0 0 14 0 0 15 0 0
Practice Problems
These are just a couple of questions that demonstrate the most basic concepts of direct mapped cache. The answers are provided after all the questions. These aren't terribly difficult, but they should get you thinking about cache. All of these questions pertain to direct mapped cache, in a computer with 32-bit addressing. Questions 1. 2. If we have a cache size of 4KB with a 16 byte line size, what will be the sizes of the three fields (tag, line, and byte select) which we divide our address into? Suppose our fields are arranged as described above (first tag field, then index field, then byte select field) in a cache of 16KB with 32 byte line sizes. What would be the values of each of the three fields for the following addresses? a. b. c. d. 0x00B248AC 0x5002AEF3 0x10203000 0x0023AF7C
Answers
1. 2.
Our cache size is 4KB = 212 bytes, with a line size of 16 bytes = 24 bytes. Therefore, our byte select field is 4 bits, our index field is 12-4 = 8 bits, which leaves 20 bits for our tag field. This is actually two problems in one. We must first determine the sizes of each of the fields. Our cache size is 16KB = 214 bytes, with a line size of 32 bytes = 25 bytes. So, our byte select field is 5 bits, our index field is 14-5 = 9 bits, which leaves 18 bits for our tag field. a. b. c. d. For 0x00B248AC, tag is 0x2C9, index is 0x45, and byte select is 0xC. For 0x5002AEF3, tag is 0x1400A, index is 0x177, and byte select is 0x13. For 0x10203000, tag is 0x4080, index is 0x180, and byte select is 0x0. For 0x0023AF7C, tag is 0x8E, index is 0x17B, and byte select is 0x1C.
Block 1 No tag
Block 2 No tag
Now, suppose we have a series of these memory accesses to the same index set as described, with the following tags for each memory access: 0x00010, 0x8A02F, 0x00010, 0xF1EEE. For the first access (tag 0x00010), we're assuming that this index set is empty. So, we load the data with tag 0x00010 into the first of our two cache blocks.
Block 1 0x00010
Block 2 No tag
It is important to remember that in N-way set associative caches there is a structure for each index set that tells us the order in which the index set's cache blocks were last accessed. For our two-way cache, the possible states for the order of access are "first, then second" or "second, then first." For a three-way cache, the possible states are "first, then second, then third," "first, then third, then second," etc. (There are six states in all for a three-way cache.) This tells our hardware just which block to replace if we must replace a block, so that our more recently used blocks are not disposed of before their time. For the second access (tag 0x8A02F), we compare the tags. 0x00010 does not equal 0x8A02F, but we have the second cache block unused (that is, invalid), so we stick the data with tag 0x8A02F into the second of the two cache blocks. At this point we have another structure in this index set that lets the hardware know that the second cache block has been used most recently.
Block 1 0x00010
Block 2 0x8A02F
For the third access (tag 0x00010), we compare the tags. The tags for the first of our two cache blocks match. Our tags don't change and we don't load any more data, but we change our structure that describes the order in which the blocks were last accessed to reflect that the first block was more recently accessed.
Block 1 0x00010
Block 2 0x8A02F
For the fourth and last access (tag 0xF1EEE), we compare the tags. There is no match, and there are no blocks free for this index set. So, what happens? We replace the block that has been least recently used. We have stored the information that, for this index set, the first cache block was more recently used. Therefore, we replace the second block. We keep the data with tag 0x00010, and discard the data with tag 0x8A02F and load the data with tag 0xF1EEE in its place into the second block. Now we change the order of access to reflect that the second block was more recently accessed than the first block.
Block 1 0x00010
Block 2 0xF1EEE
A simplified view of a 3-way set associative cache with 16 index sets. Index Block 1 0 Block 2 Block 3 Third, 0x49206 0xC6F76 0x65204
0x04272 0x61756
Second, then 0xE2C20 first, then third. Third, then second, then first. First, then second, then third.
No tag
No tag
No tag
0x62757 0x42073
0x68652
0x0646F 0x65732
Second, then 0x06E6F third, then first. Second, then first, then third. First, then second, then third.
0x65206 No tag
No tag
second. Second, then first, then third. First, then second, then third.
10
0x78A7F No tag
No tag
11
First, then 0x22448 0x37CAD 0x3DA1E third, then second. Second, then 0x5D56C third, then first. First, then 0x202E1 third, then second. Second, then first, then third. Third, then second, then first.
12
0x33C1A 0x14E66
13
0x3414D 0x2585F
14
15
Unifying Theory
If this all seems too complex, the important thing to realize about all these cache types is that they are all really the same cache type: N-way set associative. The direct mapped cache is actually the special case of one-way set associative. For each "index set" there is only one cache block. You may notice that there is no structure in direct mapped cache that keeps track of the access history of the blocks of each index set, but that is because there is only one possibility; that the one block we have was accessed least recently (or most recently depending on how you like to look at it). Fully associative cache is equivalent to the special case of N-way set associative cache where N is chosen such that N equals the total number of blocks in the cache, i.e., such that there is only one index set. It is true that N-way set associative cache has an index field and fully associative cache does not, but in this case the index field works out to have a length of 0 bits, i.e. there is no index field. This is consistent with our discussion of fully associative cache.
necessarily have a miss. If we have a program that jumps around between two distinct memory locations, accessing one area, then accessing the other, back and forth, we'll necessarily have a miss every time, whereas if we had two cache blocks, after the data is initially loaded we'll have a hit every time. The point is that line sizes are best taken in moderation.
Types of Misses
A cache miss occurs when the data we're looking for is not in cache, and so we must load it from main memory. In the example for direct mapped cache, we alluded to the fact that there are different types of cache misses. We mentioned two, but there are four, and how a system designer goes about minimizing misses depends greatly on the type of misses involved. In this explanation I provide an analogous situation. Suppose we're writing a paper in Perkins. We can take books (data) from stacks (main memory) and put them on our desk for easy reference (cache). Compulsory Miss A compulsory miss occurs at the first access of that memory location, so memory location has not yet been loaded into cache. These are also called cold start misses, or first reference misses. There isn't much that can be done about this problem. However, since a compulsory miss only occurs when data is first loaded, in the long run with a program with many memory accesses, compulsory misses are not a problem. This situation is analogous to when you're just starting your research in Perkins. You desk is empty, so you must first get some of the books that you initially want. You can't help this, since the books won't be at the desk waiting for you when you arrive. Conflict Miss (Collision) A conflict miss occurs when there are many memory accesses mapped to the same index set in a cache, so data in a block of in this index set may be discarded even though another block in another relatively unused index set might be even older. If this said block is discarded and later retrieved, that is called a conflict miss. This doesn't have a clean analogy to my Perkin's library situation, but suppose that there is a policy in the library that you may only have a certain number of books of the same color at any one time. For example, you may have a maximum of three blue books, and three red books, and three yellow books, and three fuchsia books. You may have space on the desk for more, but since you have three fuchsia books, you'll have to put a book back into stacks before you get another. In this analogy, the index sets are like the different colors, and the amount of books is like the set associativity. In this example we had a maximum of three books of a certain color at one time, so this was like a 3-way set associative cache. We can alleviate this problem if we increase the number of books of a certain color we can have at once. That is, if we increase the set associativity. We can also increase the size of the cache. All other things being equal, this has the effect of increasing the amount of index sets in the cache. In our analogy that would be the same as not only enlarging the des, but also refining the spectrum of colors. For example, at first we may only consider three colors: red, yellow, and blue. However, we may include other colors, and instead our spectrum may be red, orange, brown, yellow, aquamarine, and blue. Some colors that were considered red may now be considered orange, some colors that used to be considered yellow we may now call brown, et cetera. We increase the number of books we may have on our desk in this sneaky way. Looking outside of the analogy for a second, if we double the cache size, our index field grows by one bit. Instead of an index of 10010, for example, we could have indexes of 010010 OR 110010; our "spectrum" of indexes has increased. As you may see, a fully associative cache would not have this problem because we ignore the nonsense of index sets entirely. We only have one "index set" (if you can call it that) and you can store as many cache blocks in that index set as there is space in the cache. Capacity Miss A capacity miss occurs when a memory location is accessed once, but later because the cache fills up, that data is discarded, and then when we get a miss when accessing that memory location again because that data is no longer in memory. The analogy for this is simple. Your desk fills up, so you must send back a book to make room for new books on the desk, even though you may need the book you're sending back later. The solution to this is to increase the cache size; that is, get a bigger desk. Invalidation Miss What happens if main memory whose data is also loaded in cache is modified? In that case the cache data has been invalidated, and if the computer looks for it again it will have to be reloaded from main memory to get the up-to-date information. You might say, "that could never happen; all memory accesses are piped through the memory system, so cache would catch it." That's not always the case. I/O devices often write directly to memory (especially in older machines), so it is possible that we may have to reload data from main memory even though it may be in cache. This is called an invalidation miss.
The analogy for this is that a new edition of a book you have on your desk comes out, and it has new or corrected information. Regardless of how recently you brought this now out-of-date book to your desk, you must get the new book rather than refer to dated material.
Writing to Memory
I alluded to the potential problems of memory writing in my brief discussion of the invalidation miss, but for the most part throughout this text I have begged the issue of what happens to cache in the event of a memory write. Memory writes are difficult to implement because something in memory changes when you do a memory write, and it is difficult to keep cache and main memory consistent with whatever change was made by the program. Memory reads, on the other hand, are no problem. If all a program does is read memory, memory cannot change, so there is no need to make sure that cache and memory are consistent with each other. Before we go about making cache and memory consistent, however, the primary question is where our data will go; will writes first go in cache, in which case we'd have to bring main memory up to date later (called a "write back" policy), or write directly to main memory (called a "write through" policy)? The answer is that we can choose from either of these policies. Write Back Write back is the policy of writing changes to cache. This is simple enough; our cache blocks stay valid since they have the latest data. The problem with that is that if modified blocks ever were overwritten in the cache, we'd have to write the now discarded data back to main memory or changes will be lost. One inefficient solution would be to simply write all discarded data back to main memory, but this is undesirable; if every cache miss involved not only a read from main memory but in addition a write to main memory, there would be a dramatic increase in the miss penalty. A more efficient solution is to assign a "dirty bit" to each cache block. In the course of a program's execution if a block of data is modified we set the dirty bit, and write the data back to main memory if the dirty bit for the block we're discarding has been set. The advantage of a write back policy is that instead of writing to slower memory, we're writing to cache. Just as we're more likely to read from the same location in memory in a given chunk of time, the same can be said for memory writes. Instead of writing to memory every time we have a memory write, we only write back once we're done with that block of data. One disadvantage of this is that its control is somewhat complicated compared to write through. Write Through Write through is the policy of writing directly to cache, but also writing to memory. The problem is that writing data back to RAM slows things down. The solution is to have something called a write buffer. We stick our changed data in the write buffer, and whenever we have a DRAM write cycle the write buffer is emptied and the data is changed. A problem arises here. A write buffer exists so a computer can continue to execute instructions faster than the time it takes to write new data through to main memory. It is therefore conceivable that a whole bunch of write instructions could fill the buffer faster than the buffer's capacity to get rid of data already in the buffer. This is not an insurmountable problem. We can install a level two cache to hold writes. This also has the effect of speeding up the process of writing back to memory, since this L2 cache will be faster than DRAM. By appropriately regulating the flow of data between DRAM and the L2 cache, write overflow may be overcome.
Sub-blocks
This revision does not cover sub-blocks. Sorry. If it's any comfort, I don't even remember them being mentioned in lecture, and we never did anything with them. I'm writing this on vacation, so I'm working from memory; I can't remember much about them aside from the fact that sub-blocks reduce the miss penalty.