Вы находитесь на странице: 1из 12

Term Paper

Cahe Coherence Schemes:

An analysis of some practically used scheme. [With the advent of parallel processors that can have multi level of memories and is being coherence is being managed. Support your answers with some real numerical facts]multilevel caches, Cache consistency and Cache Coherence becomes crucial issue. Taking example of any dual core machine from Intel explain how cache

Submitted by:Vinay kumar

Regd. NO-11112270 Roll no:-RD1107A47

Guided By:

Mr. Avinash Bhagat.

Department of Computer Science and Applicationm Lovely Professional University Phagwara

CPU cache
A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

When the processor needs to read from or write to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory. Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data. The data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.; see Multi-level caches).

Cache Entries
Memory is split into "locations," which correspond to cache "lines". Each data access involving the cache uses this size, which tends to be larger than the largest CPU request size. (This is usually the size of a CPU register: 2 bytes for a PDP-11, 4 bytes for a general purpose register/8 bytes for a floating-point register in the MIPS architecture, and 16 bytes for an XMM register in x86 processors with SSE). Each location in memory can be identified by a physical memory address. When memory is copied to the cache, a cache entry is created. It can include:

the requested memory location (now called a tag) a copy of the data

When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. The cache checks for the contents of the requested memory location in any cache lines that might contain that address. If the processor finds that the memory location is in the cache, a cache hit has occurred (otherwise, a cache miss).

In case of a cache hit, the processor immediately reads or writes the data in the cache line. In case of a cache miss, the cache allocates a new entry, and copies in data from main memory. Then, the request is fulfilled from the contents of the cache.

Cache Performance
The proportion of accesses that result in a cache hit is known as the hit rate, and can be a measure of the effectiveness of the cache for a given program or algorithm. Read misses delay execution because they require data to be transferred from memory much slower than the cache itself. Write misses may occur without such penalty, since the processor can continue execution while data is copied to main memory in the background. Instruction caches are similar to data caches, but the CPU only performs read accesses (instruction fetches) to the instruction cache. (With Harvard-architecture CPUs, instruction and data caches can be separated for higher performance, but they can also be combined to reduce the hardware overhead.)

Cache coherence
In computing, cache coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource.

Multiple Caches of Shared Resource When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. Referring to the "Multiple Caches of Shared Resource" figure, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts and maintain consistency between cache and memory. I will refer to the multi-core processor shown in figure 1. Imagine that there are two threads running through the processor; one in core 1 and one in core 2. Now imagine that each core accesses, from the main memory, variable 'x' and places that variable in its cache. Now, if core 1 modifies the value of variable 'x', then, the value that core 2

has in its cache for variable 'x' is out of sync with the value core 1 has in its cache. This is an important issue with multi-core processors. Actually, this problem is not very different from multi processor (multiple chips) cache coherency problems.

Figure 1.

The cache coherence problem is:

Multiple copies of the same data can exist in different caches simultaneously, and if processors are allowed to update their own copies freely, an inconsistent view of memory can result. There are two write policies: Write back: Write operations are usually made only to the cache. Main memory is only updated when the corresponding cache line is flushed from the cache. Write through: All write operations are made to main memory as well as to the cache, ensuring that main memory is always valid. It is clear that a write back policy can result in inconsistency. If two caches contain the same line, and the line is updated in one cache, the other cache will unknowingly have an invalid value. Subsequently read to that invalid line produce invalid results. Even with the write through policy, inconsistency can occur unless other cache monitor the memory traffic or receive some direct notification of the update. For any cache coherence protocol, the objective is to let recently used local variables get into the appropriate cache and stay there through numerous reads and write, while using the protocol to maintain consistency of shared variables that might be in multiple caches at the same time.

Write through protocol:

A write through protocol can be implemented in two fundamental versions.

Write through with update protocol:

When a processor writes a new value into its cache, the new value is also written into the memory module that holds the cache block being changed. Some copies of this block may exist in other caches, these copies must be updated to reflect the change caused by the write operation. The simplest way of doing this is to broadcast the written data to all processor modules in the system. As each processor module receives the broadcast data, it updates the contents of the affected cache block if this block is present in its cache.

Write through with invalidation of copies:

When a processor writes a new value into its cache, this value is written into the memory module, and all copies in the other caches are invalidated. Again broadcasting can be used to send the invalidation requests through the system.

Write back protocol:

In the write-back protocol, multiple copies of a cache block may exist if different processors have loaded (read) the block into their caches. If some processor wants to change this block, it must first become an exclusive owner of this block.

When the ownership is granted to this processor by the memory module that is the home location of the block. All other copies, including the one in the memory module, are invalidated. Now the owner of the block may change the contents of the memory. When another processor wishes to read this block, the data are sent to this processor by the current owner. The data are also sent to the home memory module, which requires ownership and updates the block to contain the latest value. There are software and hardware solutions for cache coherence problem.

There are software and hardware solutions for cache coherence problem.
Software solution:

In software approach, the detecting of potential cache coherence problem is transferred from run time to compile time, and the design complexity is transferred from hardware to software. On the other hand, compile time; software approaches generally make conservative decisions. Leading to inefficient cache utilization. Compiler-based cache coherence mechanism perform an analysis on the code to determine which data items may become unsafe for caching, and they mark those items accordingly. So, there are some more cacheable items, and the operating system or hardware does not cache those items. The simplest approach is to prevent any shared data variables from being cached. This is too conservative, because a shared data structure may be exclusively used during some periods and may be effectively read-only during other periods. It is only during periods when at least one process may update the variable and at least one other process may access the variable then cache coherence is an issue More efficient approaches analyze the code to determine safe periods for shared variables. The compiler then inserts instructions into the generated code to enforce cache coherence during the critical periods. In other words:-

Compiler tags data as cacheable and non-cacheable.Only read-only data is considered cachable and put in private cache. All other data are non-cachable, and can be put in a global cache, if available.

Hardware Solutions. Hardware-based solutions are generally referred to as cache coherence protocols. These solutions provide dynamic recognition at run time of potential inconsistency conditions. Because the problem is only dealt with when it actually arises, there is more effective use of caches, leading to improved performance over a software approach. In addition, these approaches are transparent to the programmer and the compiler, reducing the software development burden. Hardware schemes differ in a number of particulars, including where the state information about data lines is held, how that information is organized, where coherence is enforced, and the enforcement mechanisms. In general, hardware schemes can be divided into two categories: directory protocols and snoopy protocols.

DIRECTORY PROTOCOLS:- Directory protocols collect and maintain information

about where copies of lines reside. Typically, there is a centralized controller that is part of the main memory controller, and a directory that is stored in main memory. The directory contains global state information about the contents of the various local caches. When an individual cache controller makes a request, the centralized controller checks and issues necessary commands for data transfer between memory and caches or between caches. It is also responsible for keeping the state information up to date; therefore, every local action that can affect the global state of a line must be reported to the central controller. Typically, the controller maintains information about which processors have a copy of which lines. Before a processor can write to a local copy of a line, it must request exclusive access to the line from the controller. Before granting this exclusive access, the controller sends a message to all processors with a cached copy of this line, forcing each processor to invalidate its copy. After receiving acknowledgments back from each such processor, the controller grants exclusive access to the requesting processor.When another processor tries to read a line that is exclusively granted to another processor, it will send a miss notification to the controller.The controller then issues a command to the processor holding that line that requires the processor

to do a write back to main memory.The line may now be shared for reading by the original processor and the requesting processor. Directory schemes suffer from the drawbacks of a central bottleneck and the overhead of communication between the various cache controllers and the central controller. However, they are effective in large-scale systems that involve multiple buses or some other complex interconnection scheme. SNOOPY PROTOCOLS:- Snoopy protocols distribute the responsibility for maintaining Cache coherence among all of the cache controllers in a multiprocessor. A cache must recognize when a line that it holds is shared with other caches.When an update action is performed on a shared cache line, it must be announced to all other caches by a broadcast mechanism. Each cache controller is able to snoop on the network to observe these broadcasted notifications, and react accordingly. Snoopy protocols are ideally suited to a bus-based multiprocessor, because the shared bus provides a simple means for broadcasting and snooping. However, because one of the objectives of the use of local caches is to avoid bus accesses, care must be taken that the increased bus traffic required for broadcasting and snooping does not cancel out the gains from the use of local caches. Two basic approaches to the snoopy protocol have been explored: write invalidate write update (or write broadcast) With a write-invalidate protocol, there can be multiple readers but only one writer at a time. Initially, a line may be shared among several caches for reading purposes.When one of the caches wants to perform a write to the line, it first issues a notice that invalidates that line in the other caches, making the line exclusive to the writing cache. Once the line is exclusive, the owning processor can make cheap local writes until some other processor requires the same line. With a write-update protocol, there can be multiple writers as well as multiple readers.When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it. Neither of these two approaches is superior to the other under all circumstances. Performance depends on the number of local caches and the pattern of memory reads and writes. Some systems implement adaptive protocols that employ both write-invalidate and write-update mechanisms. The write-invalidate approach is the most widely used in commercial multiprocessor systems, such as the Pentium 4 and PowerPC. It marks the state of every cache line (using two extra bits in the cache tag) as modified, exclusive, shared, or invalid. For this reason, the write-invalidate protocol is called MESI. For simplicity in the presentation, we do not examine the mechanisms involved in coordinating among both level 1 and level 2 locally as well as at the same time coordinating across the distributed multiprocessor. This would not add any new principles but would greatly complicate the discussion.

The MESI Protocol

To provide cache consistency on an SMP, the data cache often supports a protocol known as MESI. For MESI, the data cache includes two status bits per tag, so that each line can be in one of four states: Modified: The line in the cache has been modified (different from main memory) and is available only in this cache.

MESI Cache Line States.

Exclusive: The line in the cache is the same as that in main memory and is not present in any other cache. Shared: The line in the cache is the same as that in main memory and may be present in another cache.

Invalid: The line in the cache does not conta in valid data.

Table 17.1 summarizes the meaning of the four states. Figure 17.7 displays a state diagram for the MESI protocol. Keep in mind that each line of the cache has its own state bits and therefore its own realization of the state diagram. Figure 17.7a shows the transitions that occur due to actions initiated by the processor attached to this cache. Figure 17.7b shows the transitions that occur due to events that are snooped on the common bus. This presentation of separate state diagrams for processor-initiated and bus-initiated actions helps to clarify the logic of the MESI protocol. At any time a cache line is in a single state. If the next event is from the attached processor, then the transition is dictated by Figure 17.7a and if the next event is from the bus, the transition is dictated by Figure 17.7b. Let us look at these transitions in more detail. READ MISS When a read miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address. The processor inserts a signal on the bus that alerts all other processor/cache units to snoop the transaction.There are a number of possible outcomes: If one other cache has a clean (unmodified since read from memory) copy of the line in the exclusive state, it returns a signal indicating that it shares this line. The responding processor then transitions the state of its copy from

exclusive to shared, and the initiating processor reads the line from main memory and transitions the line in its cache from invalid to shared. If one or more caches have a clean copy of the line in the shared state, each of them signals that it shares the line. The initiating processor reads the line and transitions the line in its cache from invalid to shared. If one other cache has a modified copy of the line, then that cache blocks the memory read and provides the line to the requesting cache over the shared bus.The responding cache then changes its line from modified to shared.1 The line sent to the requesting cache is also received and processed by the memory controller, which stores the block in memory. If no other cache has a copy of the line (clean or modified), then no signals are returned. The initiating processor reads the line and transitions the line in its cache from invalid to exclusive. READ HIT When a read hit occurs on a line currently in the local cache, the processor simply reads the required item.There is no state change:The state remains modified, shared, or exclusive. WRITE MISS When a write miss occurs in the local cache, the processor initiates a memory read to read the line of main memory containing the missing address. For this purpose, the processor issues a signal on the bus that means read-with-intent-tomodify (RWITM).When the line is loaded, it is immediately marked modified.With respect to other caches, two possible scenarios precede the loading of the line of data. First, some other cache may have a modified copy of this line In this case, the alerted processor signals the initiating processor that another processor (state = modify). has a modified copy of the line. The initiating processor surrenders the bus and waits. The other processor gains access to the bus, writes the modified cache line back to main memory, and transitions the state of the cache line to invalid (because the initiating processor is going to modify this line). Subsequently, the initiating processor will again issue a signal to the bus of RWITM and then read the line from main memory, modify the line in the cache, and mark the line in the modified state. The second scenario is that no other cache has a modified copy of the requested line. In this case, no signal is returned, and the initiating processor proceeds to read in the line and modify it. Meanwhile, if one or more caches have a clean copy of the line in the shared state, each cache invalidates its copy of the line, and if one cache has a clean copy of the line in the exclusive state, it invalidates its copy of the line. WRITE HIT When a write hit occurs on a line currently in the local cache, the effect depends on the current state of that line in the local cache: Shared: Before performing the update, the processor must gain exclusive ownership of the line. The processor signals its intent on the bus. Each processor that has a shared copy of the line in its cache transitions the sector from shared to invalid.The initiating processor then performs the update and transitions its copy of the line from shared to modified. Exclusive: The processor already has exclusive control of this line, and so it simply performs the update and transitions its copy of the line from exclusive to modified.

Modified: The processor already has exclusive control of this line and has the line marked as modified, and so it simply performs the update.

Reference: Computer Organisation and Architecure, By William Stallings Publisher PHI 2 David A Patterson, Computer Architecture A Quantitative Approach, Pearson Education Asia.
http://www.techopedia.com/definition/300/cache-coherence http://www.intel.com/content/www/us/en/homepage.html http://en.wikipedia.org/wiki/Write-once_(cache_coherence) http://dl.acm.org/citation.cfm?id=79810