Directory Protocol
Broadcasting is expensive
Maintain the cache state explicitly
List of caches that have a copy: many read-only copies, but only one writable copy
[Figure: each block X in memory has a directory entry; caches such as P1 and P2 hold copies that the directory tracks]
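The per-block state the directory keeps can be made concrete with a small C struct. This is only a sketch under assumed parameters (64 nodes, hypothetical field names), not the layout of any particular machine:

    #include <stdint.h>

    /* Hypothetical directory entry for one memory block (sketch only). */
    #define NUM_NODES 64

    typedef struct {
        uint64_t presence;   /* bit i set => cache i holds a copy        */
        uint8_t  dirty;      /* 1 => exactly one (writable) copy exists  */
    } dir_entry_t;

With this representation a read-only block may have many presence bits set, while a dirty block has exactly one.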
Terminology
Home node: the node in whose main memory the block is allocated
Dirty node: the node whose cache holds a modified (dirty) copy of the block
Owner node: the node that currently holds the valid copy of the block and must supply it (either the home node or the dirty node)
Exclusive node: the node whose cache holds the only valid cached copy of the block
Local node: the node containing the processor that issues a request for the block
Local block: a block whose home is local to the issuing processor
Basic Operations
Read miss to a block in modified state
[Figure: requestor, home, and owner nodes, each with a processor, cache, communication assist (CA), and memory/directory]
1. Read request from the requestor to the home
2. Response from the home identifying the owner
3. Read request from the requestor to the owner
4a. Data reply from the owner to the requestor
4b. Revision message from the owner to the home
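As a rough illustration of the sequence above, the following self-contained C sketch (hypothetical node ids and a toy send helper, not a real protocol engine) prints the five messages of a read miss to a remotely modified block:

    #include <stdio.h>

    /* Hypothetical node ids for the three parties in the transaction. */
    enum { REQUESTOR = 0, HOME = 1, OWNER = 2 };

    static void send(int from, int to, const char *msg) {
        printf("node %d -> node %d : %s\n", from, to, msg);
    }

    /* Read miss to a block the directory records as modified: the home
       has no valid copy, so it points the requestor at the owner, which
       supplies the data and updates the home. */
    int main(void) {
        send(REQUESTOR, HOME,  "1.  read request");
        send(HOME, REQUESTOR,  "2.  response: owner identity");
        send(REQUESTOR, OWNER, "3.  read request");
        send(OWNER, REQUESTOR, "4a. data reply");
        send(OWNER, HOME,      "4b. revision message (data + state update)");
        return 0;
    }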
Write (read-exclusive) miss to a block with sharers
[Figure: requestor, home, and sharer nodes, each with a processor, cache, communication assist (CA), and memory/directory]
1. RdEx request from the requestor to the home
2. Response from the home identifying the sharers
3. Invalidation requests to the sharers
4. Invalidation acknowledgments
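The invalidation phase can be sketched as a walk over a presence bit vector. The directory layout, node count, and helper names below are assumptions for illustration, not the specifics of any machine:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: send an invalidation and pretend it is acked. */
    static int send_invalidation(int node) {
        printf("3. invalidation request -> node %d\n", node);
        return 1; /* 4. invalidation ack */
    }

    /* On a write (RdEx) miss, invalidate every sharer recorded in the
       presence vector and wait until all acks have arrived. */
    static void invalidate_sharers(uint64_t presence, int requestor) {
        int acks_expected = 0, acks_received = 0;
        for (int node = 0; node < 64; node++) {
            if (((presence >> node) & 1u) && node != requestor) {
                acks_expected++;
                acks_received += send_invalidation(node);
            }
        }
        if (acks_received == acks_expected)
            printf("all %d sharers invalidated; write may proceed\n", acks_expected);
    }

    int main(void) {
        invalidate_sharers(0x16, /*requestor=*/1);  /* sharers: nodes 2 and 4 */
        return 0;
    }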
[Figure: taxonomy of directory schemes: centralized vs. flat vs. hierarchical; flat schemes keep the directory with the memory (memory-based) or distributed among the caches as a linked list (cache-based); hierarchical schemes keep directory information in a hierarchy of caches]
Full bit vector (full map)
Most straightforward
Low latency: invalidations are sent to all sharers in parallel
Main disadvantage: storage overhead of P presence bits for each of the M/B memory blocks
Reducing the overhead
Increase the cache block size: access time and network traffic increase due to false sharing
Use a hierarchical (two-level) protocol; in Stanford DASH each node is a bus-based 4-processor multiprocessor, so the directory only tracks nodes
Limited Pointer Directories (Dir_i X)
Motivation: mostly, only a few caches have a copy of a block at any time
Keep i pointers of log P bits each, so storage per directory entry is i x log P bits rather than P bits
Overflow methods are needed when more than i caches share a block
Broadcast (Dir_i B): set a broadcast bit on overflow and broadcast invalidation messages to all nodes on a write; simple, but it increases write latency and wastes communication bandwidth
No broadcast (Dir_i NB): on overflow, invalidate the copy of one existing sharer to free a pointer; bad for widely shared, read-mostly data, since intensive sharing of read-only and read-mostly data degrades performance through an increased miss ratio
Coarse vector (Dir_i CV_r): as long as the number of sharers is at most i, keep i exact pointers; once it exceeds i, each bit stands for a region of r processors (coarse vector)
On overflow, invalidations are sent to whole regions of caches
Used in the SGI Origin
Robust to different sharing patterns: about 70% less memory message traffic than broadcast and at least 8% less than the other overflow schemes
[Figure: Dir_i CV_r example with 8 pointers over processors P0 to P15; after overflow, the exact pointers are replaced by a coarse vector whose bits each cover a region of processors]
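A possible representation of a Dir_i CV_r entry is sketched below in C, assuming i = 8 pointers, P = 64 nodes, and region size r = 4; the field names and conversion routine are hypothetical:

    #include <stdint.h>

    /* Hypothetical Dir_i CV_r entry, i = 8 pointers, P = 64 nodes, r = 4:
       up to 8 sharers the entry stores exact node ids; on overflow each
       bit of the coarse vector stands for a region of r = 4 nodes. */
    #define I_PTRS  8
    #define P_NODES 64
    #define REGION  4

    typedef struct {
        uint8_t overflow;             /* 0: pointer mode, 1: coarse vector */
        union {
            uint8_t  ptr[I_PTRS];     /* exact sharer ids (pointer mode)   */
            uint16_t coarse;          /* P/r = 16 region bits (overflow)   */
        } u;
        uint8_t nptr;                 /* sharers recorded in pointer mode  */
    } dir_cv_entry_t;

    /* Record a new sharer, converting to coarse-vector form on overflow. */
    static void add_sharer(dir_cv_entry_t *e, int node) {
        if (!e->overflow && e->nptr < I_PTRS) {
            e->u.ptr[e->nptr++] = (uint8_t)node;
            return;
        }
        if (!e->overflow) {           /* convert the existing pointers */
            uint16_t cv = 0;
            for (int k = 0; k < e->nptr; k++)
                cv |= (uint16_t)(1u << (e->u.ptr[k] / REGION));
            e->u.coarse = cv;
            e->overflow = 1;
        }
        e->u.coarse |= (uint16_t)(1u << (node / REGION));
    }

On a write, invalidations would then be sent to every node in each region whose bit is set, which is the loss of accuracy mentioned in the conclusions.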
Software overflow (Dir_i SW)
On overflow, the current i pointers and a pointer to the new sharer are saved into a special portion of the local main memory by software
MIT Alewife (LimitLESS); the cost of interrupts and software handling is high
Dynamic Pointers (Dir_i DP)
The directory entry contains a hardware pointer into the local memory
Similar to the software mechanism but without the software overhead: list manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor
Nodes connected by a scalable interconnect
Partitioned (distributed) shared memory
Processing nodes are themselves multiprocessors
Distributed directory-based cache coherence
[Figure: per-node processors and caches with memory; each memory block's directory entry holds presence bits and a dirty bit]
Conclusions
A full-map directory is most appropriate up to a modest number of processors; Dir_i CV_r and Dir_i DP are the most likely candidates beyond that
Coarse vector: loses accuracy on overflow
Dynamic pointers: processing cost due to hardware list manipulation
Sparse Directory
Motivation: the total amount of cache memory is much less than the total main memory, so most directory entries describe blocks that are not cached anywhere
Reduce the total number of directory entries by organizing the directory as a cache
This cache has no need for a backing store
Spatial locality is not an issue: one entry per block
The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches
With a directory size factor of 8, associativity of 4, and LRU replacement, performance is very close to that of a full-map directory
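One way to picture a sparse directory is as an ordinary set-associative lookup structure. The sketch below uses example sizes (1024 sets, 4 ways) and a trivial victim choice instead of the LRU policy mentioned above; all names are hypothetical:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical sparse directory: a 4-way set-associative cache of
       directory entries with no backing store (an evicted entry simply
       has its remaining copies invalidated). Sizes are example values. */
    #define DIR_SETS 1024
    #define DIR_WAYS 4

    typedef struct {
        uint64_t tag;        /* block address tag  */
        uint64_t presence;   /* sharer bit vector  */
        uint8_t  valid;
    } sparse_entry_t;

    static sparse_entry_t dir[DIR_SETS][DIR_WAYS];

    /* Look up (or allocate) the entry for a block address. */
    static sparse_entry_t *dir_lookup(uint64_t block_addr) {
        unsigned set = (unsigned)(block_addr % DIR_SETS);
        for (int w = 0; w < DIR_WAYS; w++)
            if (dir[set][w].valid && dir[set][w].tag == block_addr)
                return &dir[set][w];
        /* Miss: victimize way 0 here; a real design would use LRU and
           send invalidations for the victim's remaining sharers. */
        sparse_entry_t *e = &dir[set][0];
        memset(e, 0, sizeof *e);
        e->tag = block_addr;
        e->valid = 1;
        return e;
    }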
Protocol Optimization
Two major goals, plus one: reduce the latency of memory operations and reduce the number of network transactions per operation; additionally, reduce the occupancy of the communication assist
Latency Reduction
[Figure: read-miss latency reduction among the local node L, home H, and remote owner R. In the strict request-response scheme the requestor asks the home, receives a response naming the owner, and then contacts the owner; with intervention/reply forwarding the home forwards the intervention directly to the owner, which responds and sends a revision message, shortening the critical path.]
Cache-based (linked-list) schemes: basic operations
Read miss: insert the requestor at the head of the sharing list
Write miss: insert the requestor at the head of the list, then invalidate the other sharers by traversing the list: long latency!
Write back: the node deletes itself from the list
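These list operations can be sketched with an ordinary doubly linked list in C; the structure and function names are hypothetical and ignore the messaging and mutual exclusion a real cache-based protocol needs:

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical per-node sharing-list entry for one block (sketch of a
       cache-based, doubly linked scheme such as the one SCI uses). */
    typedef struct sharer {
        int            id;
        struct sharer *prev, *next;   /* neighbors in the sharing list */
    } sharer_t;

    /* Read or write miss: insert the requestor at the head of the list. */
    static void insert_head(sharer_t **head, sharer_t *req) {
        req->prev = NULL;
        req->next = *head;
        if (*head) (*head)->prev = req;
        *head = req;
    }

    /* Write miss: the (new) head traverses the list and invalidates the
       other sharers one by one; this is where the long latency comes from. */
    static void purge(sharer_t *head) {
        for (sharer_t *s = head->next; s; s = s->next)
            printf("invalidate sharer %d\n", s->id);
        head->next = NULL;
    }

    /* Write back / replacement: a sharer deletes itself from the list. */
    static void rollout(sharer_t **head, sharer_t *s) {
        if (s->prev) s->prev->next = s->next; else *head = s->next;
        if (s->next) s->next->prev = s->prev;
        s->prev = s->next = NULL;
    }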
Advantages
Small memory overhead: a single head pointer per block
Easier to provide fairness and to avoid livelock
Sending invalidations is not centralized at the home, but rather distributed among the sharers
Disadvantages
Long latency and long occupancy of the communication assist
Modifying the sharing list requires careful coordination and mutual exclusion
Latency Reduction
[Figure: reducing invalidation latency over a sharing list with head H and sharers S1, S2, S3. (a) Strict request-response: H invalidates each sharer in turn and waits for its acknowledgment. (b) Intervention forwarding: each sharer forwards the invalidation to the next one, with acknowledgments returned to H.]
Sequent NUMA-Q
Uses the IEEE standard Scalable Coherent Interface (SCI) protocol
Targeted toward commercial workloads
Databases and transaction processing
Commodity hardware: Intel SMP (quad) as the processing node
DataPump network interface from Vitesse Semiconductor
Custom IQ-Link board: directory logic and remote cache (32 MB, 4-way)
Snooping within the quad
[Figure: a quad with four processors, memory, PCI I/O and peripheral interface, and the IQ-Link connecting the quad to the SCI interconnect]
IQ Link Implementation
Local directory: directory for locally allocated data
Remote tags: tags for remotely allocated but locally cached data
Orion bus controller (OBIC): manages the snooping and requesting logic on the quad bus
DataPump: GaAs chip implementing the transport protocol of the SCI standard
[Figure: IQ-Link board with the network-side tag SDRAM, the OBIC, the remote cache data and tags, and the local directory between the quad bus and the DataPump]
Directory States
HOME: no remote cache (quad) in the system contains a copy of the block; a processor cache in the home quad itself may have a copy, since that is not visible to the SCI protocol but is managed by the bus protocol within the quad
FRESH: one or more remote caches may have a read-only copy, and the copy in memory is valid
GONE: another remote cache contains a writable (exclusive or dirty) copy; no valid copy exists on the local node
Remote cache states combine a position in the sharing list (ONLY, HEAD, TAIL, MID) with a block state: dirty, clean (like the exclusive state in MESI), fresh (data may not be written until memory is informed), copy (unmodified and readable), and so on
SCI Standard
Three primitive operations
List construction: add a new node (sharer) at the head of the list
Rollout: remove a node from the list
Purge (invalidation): the head node invalidates all other nodes
Levels of protocol
Minimal protocol: does not permit read sharing; only one cached copy at a time
Typical protocol: the level implemented by NUMA-Q
Full protocol: implements the entire SCI standard
Read miss when the block is in the HOME state: the home updates the block's state to FRESH and sends the data; the requestor updates its state from PENDING to ONLY_FRESH
Read miss when the block is FRESH (a sharing list exists): insert the requestor at the head of the list; the previous head changes its state from HEAD_FRESH to MID_VALID or from ONLY_FRESH to TAIL_VALID; the requestor changes its state from PENDING to HEAD_FRESH
Read miss when the block is GONE: the home stays in the GONE state and sends a pointer to the previous head, which supplies the data; the previous head changes its state from HEAD_DIRTY to MID_VALID or from ONLY_DIRTY to TAIL_VALID; the requestor sets its state to HEAD_DIRTY
[Figure: two-round list construction for a read miss to a FRESH block. 1st round: the requestor (PENDING) contacts the home memory (FRESH) and receives a pointer to the old head (ONLY_FRESH). 2nd round: the requestor attaches itself in front of the old head, which becomes TAIL_VALID, while the home memory stays FRESH.]
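A compact way to summarize these read-miss transitions is a switch over the home's directory state, using the state names from the slides. This sketch ignores the pending handshakes and the actual messages; the function and variable names are hypothetical:

    #include <stdio.h>

    /* Memory (home) directory states and cache states named on the slides. */
    typedef enum { HOME, FRESH, GONE } mem_state_t;
    typedef enum { PENDING, ONLY_FRESH, HEAD_FRESH, HEAD_DIRTY } cache_state_t;

    /* Requestor side of an SCI read miss: it sits in PENDING, asks the
       home, and its final state depends on the home's directory state. */
    static cache_state_t read_miss(mem_state_t *mem) {
        cache_state_t me = PENDING;
        switch (*mem) {
        case HOME:            /* no remote copies: home supplies the data    */
            *mem = FRESH;
            me = ONLY_FRESH;
            break;
        case FRESH:           /* attach at the head of the fresh list        */
            /* old head: ONLY_FRESH -> TAIL_VALID or HEAD_FRESH -> MID_VALID */
            me = HEAD_FRESH;
            break;
        case GONE:            /* home only returns a pointer to the head     */
            /* old head: ONLY_DIRTY -> TAIL_VALID or HEAD_DIRTY -> MID_VALID */
            me = HEAD_DIRTY;  /* data comes from the previous head           */
            break;
        }
        return me;
    }

    int main(void) {
        mem_state_t m = HOME;
        printf("cache state after first read miss: %d\n", read_miss(&m));
        return 0;
    }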
When the writer is in the middle of the list, it removes itself from the list (rollout) and adds itself again at the head (list construction)
If the writer is in the HEAD_FRESH state:
The writer goes into a pending state and sends a request to the home, which changes from FRESH to GONE and replies
The writer then goes into a different pending state and purges the sharing list
The sharing list is purged in a strict request-response manner, and the writer stays in the pending state while the purging is in progress
Hierarchical Coherence
Snoop-Snoop System
Simplest way to build large-scale cache-coherent MPs: add a coherence monitor to each node
Remote (access) cache: caches remotely allocated data that is accessed locally; should be larger than the sum of the processor caches and quite associative; should be lockup-free; issues an invalidation request on the local bus when a block is replaced
Local state monitor: keeps state information on data that is locally allocated but remotely cached
Remote cache
Highest performance: SDRAM caches
B1 follows a standard snooping protocol
Much larger than the L1 caches (set associative); must maintain inclusion
The L2 cache acts as a filter for the B1 bus and the L1 caches
The L2 cache can be DRAM based, since fewer references get to it
[Figure: processors with private caches on a B1 bus in each node; a coherence monitor connects each node to the global B2 bus]
Advantages
Misses to main memory require only a single traversal to the root of the hierarchy
Placement of shared data is not an issue
Disadvantages
Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency
Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale
Alternative: main memory is distributed among the clusters
Reduces global bus traffic (local data and suitably placed shared data)
Reduces latency (less contention, and local accesses are faster)
Example machine: Encore Gigamax
[Figure: each cluster has processors and caches on a B1 bus together with local memory (M); coherence monitors connect the clusters to the global B2 bus]
Summary
Advantages:
Conceptually simple to build (apply snooping recursively)
Can get merging and combining of requests in hardware
Disadvantages:
Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-d and 3-d grid problems)
Latencies are often larger than in direct networks
The L1 directory tracks which of its children (processing nodes) have a copy of the memory block. The L2 directory tracks which of its children (L1 directories) have a copy of the memory block; it also tracks which local memory blocks are cached outside the subtree. Inclusion is maintained between the processor caches and the L1 directory.
[Figure: a tree of directories with processing nodes as leaves and L1/L2 directories at the internal nodes; all three circles in the rectangle represent the same processing node]
Storage overhead
Each level has about the same amount of directory memory
C: cache size, b: branch factor, M: memory size, B: block size
Storage overhead ≈ (C × log_b P) / (M × B)
Performance overhead
Locality effect: the hierarchy can reduce the number of network hops, but it increases the number of end-to-end transactions and hence the latency
The root becomes the bottleneck
Disadvantages
Long uncontended latency
Bandwidth requirements near the root of the hierarchy