Directory Protocol
Broadcasting is expensive
Maintain the cache state explicitly
List of caches that have a copy: many read-only copies, but only one writable copy
[Figure: each block X in memory has a directory entry; caches such as P1 and P2 hold copies that the directory tracks]
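The per-block state the directory keeps can be made concrete with a small C struct. This is only a sketch under assumed parameters (64 nodes, hypothetical field names), not the layout of any particular machine:

    #include <stdint.h>

    /* Hypothetical directory entry for one memory block (sketch only). */
    #define NUM_NODES 64

    typedef struct {
        uint64_t presence;   /* bit i set => cache i holds a copy        */
        uint8_t  dirty;      /* 1 => exactly one (writable) copy exists  */
    } dir_entry_t;

With this representation a read-only block may have many presence bits set, while a dirty block has exactly one.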
Terminology
Home node: the node in whose main memory the block is allocated
Dirty node: the node whose cache holds a modified (dirty) copy of the block
Owner node: the node that currently holds the valid copy of the block and must supply it (either the home node or the dirty node)
Exclusive node: the node whose cache holds the only valid cached copy of the block
Local node: the node containing the processor that issues a request for the block
Local block: a block whose home is local to the issuing processor
Basic Operations
Read miss to a block in modified state
[Figure: requestor, home, and owner nodes, each with a processor, cache, communication assist (CA), and memory/directory]
1. Read request from the requestor to the home
2. Response from the home identifying the owner
3. Read request from the requestor to the owner
4a. Data reply from the owner to the requestor
4b. Revision message from the owner to the home
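As a rough illustration of the sequence above, the following self-contained C sketch (hypothetical node ids and a toy send helper, not a real protocol engine) prints the five messages of a read miss to a remotely modified block:

    #include <stdio.h>

    /* Hypothetical node ids for the three parties in the transaction. */
    enum { REQUESTOR = 0, HOME = 1, OWNER = 2 };

    static void send(int from, int to, const char *msg) {
        printf("node %d -> node %d : %s\n", from, to, msg);
    }

    /* Read miss to a block the directory records as modified: the home
       has no valid copy, so it points the requestor at the owner, which
       supplies the data and updates the home. */
    int main(void) {
        send(REQUESTOR, HOME,  "1.  read request");
        send(HOME, REQUESTOR,  "2.  response: owner identity");
        send(REQUESTOR, OWNER, "3.  read request");
        send(OWNER, REQUESTOR, "4a. data reply");
        send(OWNER, HOME,      "4b. revision message (data + state update)");
        return 0;
    }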
Write (read-exclusive) miss to a block with sharers
[Figure: requestor, home, and sharer nodes, each with a processor, cache, communication assist (CA), and memory/directory]
1. RdEx request from the requestor to the home
2. Response from the home identifying the sharers
3. Invalidation requests to the sharers
4. Invalidation acknowledgments
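The invalidation phase can be sketched as a walk over a presence bit vector. The directory layout, node count, and helper names below are assumptions for illustration, not the specifics of any machine:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: send an invalidation and pretend it is acked. */
    static int send_invalidation(int node) {
        printf("3. invalidation request -> node %d\n", node);
        return 1; /* 4. invalidation ack */
    }

    /* On a write (RdEx) miss, invalidate every sharer recorded in the
       presence vector and wait until all acks have arrived. */
    static void invalidate_sharers(uint64_t presence, int requestor) {
        int acks_expected = 0, acks_received = 0;
        for (int node = 0; node < 64; node++) {
            if (((presence >> node) & 1u) && node != requestor) {
                acks_expected++;
                acks_received += send_invalidation(node);
            }
        }
        if (acks_received == acks_expected)
            printf("all %d sharers invalidated; write may proceed\n", acks_expected);
    }

    int main(void) {
        invalidate_sharers(0x16, /*requestor=*/1);  /* sharers: nodes 2 and 4 */
        return 0;
    }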
[Figure: taxonomy of directory schemes: centralized vs. flat vs. hierarchical; flat schemes keep the directory with the memory (memory-based) or distributed among the caches as a linked list (cache-based); hierarchical schemes keep directory information in a hierarchy of caches]
Full bit vector (full map)
Most straightforward
Low latency: invalidations are sent to all sharers in parallel
Main disadvantage: storage overhead of P presence bits for each of the M/B memory blocks
Reducing the overhead
Increase the cache block size: access time and network traffic increase due to false sharing
Use a hierarchical (two-level) protocol; in Stanford DASH each node is a bus-based 4-processor multiprocessor, so the directory only tracks nodes
Limited Pointer Directories (Dir_i X)
Motivation: mostly, only a few caches have a copy of a block at any time
Keep i pointers of log P bits each, so storage per directory entry is i x log P bits rather than P bits
Overflow methods are needed when more than i caches share a block
Broadcast (Dir_i B): set a broadcast bit on overflow and broadcast invalidation messages to all nodes on a write; simple, but it increases write latency and wastes communication bandwidth
No broadcast (Dir_i NB): on overflow, invalidate the copy of one existing sharer to free a pointer; bad for widely shared, read-mostly data, since intensive sharing of read-only and read-mostly data degrades performance through an increased miss ratio
Coarse vector (Dir_i CV_r): as long as the number of sharers is at most i, keep i exact pointers; once it exceeds i, each bit stands for a region of r processors (coarse vector)
On overflow, invalidations are sent to whole regions of caches
Used in the SGI Origin
Robust to different sharing patterns: about 70% less memory message traffic than broadcast and at least 8% less than the other overflow schemes
[Figure: Dir_i CV_r example with 8 pointers over processors P0 to P15; after overflow, the exact pointers are replaced by a coarse vector whose bits each cover a region of processors]
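A possible representation of a Dir_i CV_r entry is sketched below in C, assuming i = 8 pointers, P = 64 nodes, and region size r = 4; the field names and conversion routine are hypothetical:

    #include <stdint.h>

    /* Hypothetical Dir_i CV_r entry, i = 8 pointers, P = 64 nodes, r = 4:
       up to 8 sharers the entry stores exact node ids; on overflow each
       bit of the coarse vector stands for a region of r = 4 nodes. */
    #define I_PTRS  8
    #define P_NODES 64
    #define REGION  4

    typedef struct {
        uint8_t overflow;             /* 0: pointer mode, 1: coarse vector */
        union {
            uint8_t  ptr[I_PTRS];     /* exact sharer ids (pointer mode)   */
            uint16_t coarse;          /* P/r = 16 region bits (overflow)   */
        } u;
        uint8_t nptr;                 /* sharers recorded in pointer mode  */
    } dir_cv_entry_t;

    /* Record a new sharer, converting to coarse-vector form on overflow. */
    static void add_sharer(dir_cv_entry_t *e, int node) {
        if (!e->overflow && e->nptr < I_PTRS) {
            e->u.ptr[e->nptr++] = (uint8_t)node;
            return;
        }
        if (!e->overflow) {           /* convert the existing pointers */
            uint16_t cv = 0;
            for (int k = 0; k < e->nptr; k++)
                cv |= (uint16_t)(1u << (e->u.ptr[k] / REGION));
            e->u.coarse = cv;
            e->overflow = 1;
        }
        e->u.coarse |= (uint16_t)(1u << (node / REGION));
    }

On a write, invalidations would then be sent to every node in each region whose bit is set, which is the loss of accuracy mentioned in the conclusions.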
Software overflow (Dir_i SW)
On overflow, the current i pointers and a pointer to the new sharer are saved into a special portion of the local main memory by software
MIT Alewife (LimitLESS); the cost of interrupts and software handling is high
Dynamic Pointers (Dir_i DP)
The directory entry contains a hardware pointer into the local memory
Similar to the software mechanism but without the software overhead: list manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor
Nodes connected by a scalable interconnect
Partitioned (distributed) shared memory
Processing nodes are themselves multiprocessors
Distributed directory-based cache coherence
[Figure: per-node processors and caches with memory; each memory block's directory entry holds presence bits and a dirty bit]
Conclusions
A full-map directory is most appropriate up to a modest number of processors; Dir_i CV_r and Dir_i DP are the most likely candidates beyond that
Coarse vector: loses accuracy on overflow
Dynamic pointers: processing cost due to hardware list manipulation
Sparse Directory
Motivation: the total amount of cache memory is much less than the total main memory, so most directory entries describe blocks that are not cached anywhere
Reduce the total number of directory entries by organizing the directory as a cache
This cache has no need for a backing store
Spatial locality is not an issue: one entry per block
The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches
With a directory size factor of 8, associativity of 4, and LRU replacement, performance is very close to that of a full-map directory
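One way to picture a sparse directory is as an ordinary set-associative lookup structure. The sketch below uses example sizes (1024 sets, 4 ways) and a trivial victim choice instead of the LRU policy mentioned above; all names are hypothetical:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sparse directory: a 4-way set-associative cache of
       directory entries with no backing store (an evicted entry simply
       has its remaining copies invalidated). Sizes are example values. */
    #define DIR_SETS 1024
    #define DIR_WAYS 4

    typedef struct {
        uint64_t tag;        /* block address tag  */
        uint64_t presence;   /* sharer bit vector  */
        uint8_t  valid;
    } sparse_entry_t;

    static sparse_entry_t dir[DIR_SETS][DIR_WAYS];

    /* Look up (or allocate) the entry for a block address. */
    static sparse_entry_t *dir_lookup(uint64_t block_addr) {
        unsigned set = (unsigned)(block_addr % DIR_SETS);
        for (int w = 0; w < DIR_WAYS; w++)
            if (dir[set][w].valid && dir[set][w].tag == block_addr)
                return &dir[set][w];
        /* Miss: victimize way 0 here; a real design would use LRU and
           send invalidations for the victim's remaining sharers. */
        sparse_entry_t *e = &dir[set][0];
        memset(e, 0, sizeof *e);
        e->tag = block_addr;
        e->valid = 1;
        return e;
    }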
Protocol Optimization
Two major goals, plus one: reduce the latency of memory operations and reduce the number of network transactions per operation; additionally, reduce the occupancy of the communication assist
Latency Reduction
[Figure: read-miss latency reduction among the local node L, home H, and remote owner R. In the strict request-response scheme the requestor asks the home, receives a response naming the owner, and then contacts the owner; with intervention/reply forwarding the home forwards the intervention directly to the owner, which responds and sends a revision message, shortening the critical path.]
Cache-based (linked-list) schemes: basic operations
Read miss: insert the requestor at the head of the sharing list
Write miss: insert the requestor at the head of the list, then invalidate the other sharers by traversing the list: long latency!
Write back: the node deletes itself from the list
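These list operations can be sketched with an ordinary doubly linked list in C; the structure and function names are hypothetical and ignore the messaging and mutual exclusion a real cache-based protocol needs:

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical per-node sharing-list entry for one block (sketch of a
       cache-based, doubly linked scheme such as the one SCI uses). */
    typedef struct sharer {
        int            id;
        struct sharer *prev, *next;   /* neighbors in the sharing list */
    } sharer_t;

    /* Read or write miss: insert the requestor at the head of the list. */
    static void insert_head(sharer_t **head, sharer_t *req) {
        req->prev = NULL;
        req->next = *head;
        if (*head) (*head)->prev = req;
        *head = req;
    }

    /* Write miss: the (new) head traverses the list and invalidates the
       other sharers one by one; this is where the long latency comes from. */
    static void purge(sharer_t *head) {
        for (sharer_t *s = head->next; s; s = s->next)
            printf("invalidate sharer %d\n", s->id);
        head->next = NULL;
    }

    /* Write back / replacement: a sharer deletes itself from the list. */
    static void rollout(sharer_t **head, sharer_t *s) {
        if (s->prev) s->prev->next = s->next; else *head = s->next;
        if (s->next) s->next->prev = s->prev;
        s->prev = s->next = NULL;
    }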
Advantages
Small memory overhead: a single head pointer per block
Easier to provide fairness and to avoid livelock
Sending invalidations is not centralized at the home, but rather distributed among the sharers
Disadvantages
Long latency and long occupancy of the communication assist
Modifying the sharing list requires careful coordination and mutual exclusion
Latency Reduction
[Figure: reducing invalidation latency over a sharing list with head H and sharers S1, S2, S3. (a) Strict request-response: H invalidates each sharer in turn and waits for its acknowledgment. (b) Intervention forwarding: each sharer forwards the invalidation to the next one, with acknowledgments returned to H.]
Sequent NUMA-Q
Uses the IEEE standard Scalable Coherent Interface (SCI) protocol
Targeted toward commercial workloads
Databases and transaction processing
Commodity hardware: Intel SMP (quad) as the processing node
DataPump network interface from Vitesse Semiconductor
Custom IQ-Link board: directory logic and remote cache (32 MB, 4-way)
Snooping within the quad
[Figure: a quad with four processors, memory, PCI I/O and peripheral interface, and the IQ-Link connecting the quad to the SCI interconnect]
IQ Link Implementation
Local directory: directory for locally allocated data
Remote tags: tags for remotely allocated but locally cached data
Orion bus controller (OBIC): manages the snooping and requesting logic on the quad bus
DataPump: GaAs chip implementing the transport protocol of the SCI standard
[Figure: IQ-Link board with the network-side tag SDRAM, the OBIC, the remote cache data and tags, and the local directory between the quad bus and the DataPump]
Directory States
HOME: no remote cache (quad) in the system contains a copy of the block; a processor cache in the home quad itself may have a copy, since that is not visible to the SCI protocol but is managed by the bus protocol within the quad
FRESH: one or more remote caches may have a read-only copy, and the copy in memory is valid
GONE: another remote cache contains a writable (exclusive or dirty) copy; no valid copy exists on the local node
Remote cache states combine a position in the sharing list (ONLY, HEAD, TAIL, MID) with a block state: dirty, clean (like the exclusive state in MESI), fresh (data may not be written until memory is informed), copy (unmodified and readable), and so on
SCI Standard
Three primitive operations
List construction: add a new node (sharer) at the head of the list
Rollout: remove a node from the list
Purge (invalidation): the head node invalidates all other nodes
Levels of protocol
Minimal protocol: does not permit read sharing; only one cached copy at a time
Typical protocol: the level implemented by NUMA-Q
Full protocol: implements the entire SCI standard
Read miss when the block is in the HOME state: the home updates the block's state to FRESH and sends the data; the requestor updates its state from PENDING to ONLY_FRESH
Read miss when the block is FRESH (a sharing list exists): insert the requestor at the head of the list; the previous head changes its state from HEAD_FRESH to MID_VALID or from ONLY_FRESH to TAIL_VALID; the requestor changes its state from PENDING to HEAD_FRESH
Read miss when the block is GONE: the home stays in the GONE state and sends a pointer to the previous head, which supplies the data; the previous head changes its state from HEAD_DIRTY to MID_VALID or from ONLY_DIRTY to TAIL_VALID; the requestor sets its state to HEAD_DIRTY
[Figure: two-round list construction for a read miss to a FRESH block. 1st round: the requestor (PENDING) contacts the home memory (FRESH) and receives a pointer to the old head (ONLY_FRESH). 2nd round: the requestor attaches itself in front of the old head, which becomes TAIL_VALID, while the home memory stays FRESH.]
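A compact way to summarize these read-miss transitions is a switch over the home's directory state, using the state names from the slides. This sketch ignores the pending handshakes and the actual messages; the function and variable names are hypothetical:

    #include <stdio.h>

    /* Memory (home) directory states and cache states named on the slides. */
    typedef enum { HOME, FRESH, GONE } mem_state_t;
    typedef enum { PENDING, ONLY_FRESH, HEAD_FRESH, HEAD_DIRTY } cache_state_t;

    /* Requestor side of an SCI read miss: it sits in PENDING, asks the
       home, and its final state depends on the home's directory state. */
    static cache_state_t read_miss(mem_state_t *mem) {
        cache_state_t me = PENDING;
        switch (*mem) {
        case HOME:            /* no remote copies: home supplies the data    */
            *mem = FRESH;
            me = ONLY_FRESH;
            break;
        case FRESH:           /* attach at the head of the fresh list        */
            /* old head: ONLY_FRESH -> TAIL_VALID or HEAD_FRESH -> MID_VALID */
            me = HEAD_FRESH;
            break;
        case GONE:            /* home only returns a pointer to the head     */
            /* old head: ONLY_DIRTY -> TAIL_VALID or HEAD_DIRTY -> MID_VALID */
            me = HEAD_DIRTY;  /* data comes from the previous head           */
            break;
        }
        return me;
    }

    int main(void) {
        mem_state_t m = HOME;
        printf("cache state after first read miss: %d\n", read_miss(&m));
        return 0;
    }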
When the writer is in the middle of the list, it removes itself from the list (rollout) and adds itself again at the head (list construction)
If the writer is in the HEAD_FRESH state:
The writer goes into a pending state and sends a request to the home, which changes from FRESH to GONE and replies
The writer then goes into a different pending state and purges the sharing list
The sharing list is purged in a strict request-response manner, and the writer stays in the pending state while the purging is in progress
Hierarchical Coherence
Snoop-Snoop System
Simplest way to build large-scale cache-coherent MPs: add a coherence monitor to each node
Remote (access) cache: caches remotely allocated data that is accessed locally; should be larger than the sum of the processor caches and quite associative; should be lockup-free; issues an invalidation request on the local bus when a block is replaced
Local state monitor: keeps state information on data that is locally allocated but remotely cached
Remote cache
Highest performance: SDRAM caches
B1 follows a standard snooping protocol
Much larger than the L1 caches (set associative); must maintain inclusion
The L2 cache acts as a filter for the B1 bus and the L1 caches
The L2 cache can be DRAM based, since fewer references get to it
[Figure: processors with private caches on a B1 bus in each node; a coherence monitor connects each node to the global B2 bus]
Advantages
Misses to main memory require only a single traversal to the root of the hierarchy
Placement of shared data is not an issue
Disadvantages
Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency
Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale
Alternative: main memory is distributed among the clusters
Reduces global bus traffic (local data and suitably placed shared data)
Reduces latency (less contention, and local accesses are faster)
Example machine: Encore Gigamax
[Figure: each cluster has processors and caches on a B1 bus together with local memory (M); coherence monitors connect the clusters to the global B2 bus]
Summary
Advantages:
Conceptually simple to build (apply snooping recursively)
Can get merging and combining of requests in hardware
Disadvantages:
Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-d and 3-d grid problems)
Latencies are often larger than in direct networks
The L1 directory tracks which of its children (processing nodes) have a copy of the memory block. The L2 directory tracks which of its children (L1 directories) have a copy of the memory block; it also tracks which local memory blocks are cached outside the subtree. Inclusion is maintained between the processor caches and the L1 directory.
[Figure: a tree of directories with processing nodes as leaves and L1/L2 directories at the internal nodes; all three circles in the rectangle represent the same processing node]
Storage overhead
Each level has about the same amount of directory memory
C: cache size, b: branch factor, M: memory size, B: block size
Storage overhead ≈ (C × log_b P) / (M × B)
Performance overhead
Locality effect: the hierarchy can reduce the number of network hops, but it increases the number of end-to-end transactions and hence the latency
The root becomes the bottleneck
Disadvantages
Long uncontended latency
Bandwidth requirements near the root of the hierarchy