
Distributed Shared Memory

Directory Based Cache Coherence


Why is snooping a bad idea?

- Broadcasting is expensive.
- Maintain the cache state explicitly instead: keep a list of the caches that have a copy (many read-only copies are allowed, but only one writable copy).

[Figure: each node has a processor and cache (P, $), a communication assist (CA), and a memory module with its directory, connected by a scalable interconnection network.]

Directory Protocol
[Figure: block X lives in memory and its directory entry at the home node; remote caches hold copies of X and are reached over the interconnection network.]

Terminology

- Home node: the node in whose main memory the block is allocated.
- Dirty node: the node that has a copy of the block in its cache in modified (dirty) state.
- Owner node: the node that currently hosts the valid copy of the block.
- Exclusive node: the node that has a copy of the block in exclusive state.
- Local node (requesting node): the node containing the processor that issues a request for the block.
- Local block: a block whose home is local to the issuing processor.

Basic Operations
Read miss to a block in modified (dirty) state:
1. The requestor sends a read request to the home node.
2. The home responds with the identity of the owner.
3. The requestor sends a read request to the owner.
4a. The owner sends a data reply to the requestor.
4b. The owner sends a revision message to the home.
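A minimal sketch of what the home node's directory controller does for this case, in the strict request-response style of the figure above. The types, message helpers, and bit-vector sharer encoding are assumptions for illustration, not any particular machine's implementation.

```c
/* Home-node directory action on a read miss (strict request-response). */
#include <stdint.h>
#include <stdio.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state;

typedef struct {
    dir_state state;
    unsigned  owner;    /* meaningful only when state == DIR_MODIFIED */
    uint64_t  sharers;  /* full bit vector: one presence bit per node */
} dir_entry;

/* Stand-ins for the communication assist's send operations. */
static void send_owner_id(unsigned to, unsigned owner) {
    printf("-> node %u: owner is node %u, re-request there\n", to, owner);
}
static void send_data(unsigned to) {
    printf("-> node %u: data reply from home memory\n", to);
}

static void home_read_miss(dir_entry *e, unsigned requestor) {
    if (e->state == DIR_MODIFIED) {
        /* Memory is stale: step 2 of the figure. The requestor then
         * contacts the owner (step 3); the owner's data reply (4a) and
         * revision message (4b) complete the transaction. */
        send_owner_id(requestor, e->owner);
    } else {
        /* Block is clean at home: reply with data, record the sharer. */
        send_data(requestor);
        e->sharers |= 1ull << requestor;
        e->state = DIR_SHARED;
    }
}

int main(void) {
    dir_entry e = { DIR_MODIFIED, 3, 1ull << 3 };
    home_read_miss(&e, 1);   /* node 1 read-misses on a dirty block */
    return 0;
}
```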

Basic Operations (Contd)


Write miss to a block with two sharers:
1. The requestor sends a read-exclusive (RdEx) request to the home node.
2. The home responds with the identities of the sharers.
3. The requestor sends invalidation requests to the sharers.
4. The sharers send invalidation acknowledgements to the requestor.
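A matching sketch of the requestor side: once the home has returned the sharer set (step 2), the requestor issues invalidations (step 3) and counts acknowledgements (step 4) before performing the write. The bit-vector encoding and helpers are illustrative assumptions.

```c
/* Requestor-side invalidation fan-out and acknowledgement counting. */
#include <stdint.h>
#include <stdio.h>

static void send_invalidation(unsigned to) {
    printf("inval -> node %u\n", to);
}

/* Returns the number of acknowledgements to wait for. */
static unsigned invalidate_sharers(uint64_t sharers, unsigned self) {
    unsigned pending = 0;
    for (unsigned n = 0; n < 64; n++) {
        if (((sharers >> n) & 1) && n != self) {
            send_invalidation(n);
            pending++;
        }
    }
    return pending;   /* decremented as acks arrive; write proceeds at 0 */
}

int main(void) {
    /* The home reported nodes 2 and 5 as sharers; we are node 1. */
    unsigned pending = invalidate_sharers((1ull << 2) | (1ull << 5), 1);
    printf("waiting for %u acks before performing the write\n", pending);
    return 0;
}
```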

Alternatives for Organizing Directories

Directory storage schemes:
- Flat: directory information for a block is kept in a fixed place, its home.
  - Memory-based: directory information is co-located with the memory module that is the home (Stanford DASH/FLASH, SGI Origin, etc.).
  - Cache-based: the caches holding a copy of the memory block form a linked list (IEEE SCI, Sequent NUMA-Q).
- Centralized.
- Hierarchical: directory information is held in a hierarchy of caches that guarantees the inclusion property.

The key questions for any scheme: how to find the source of the directory information, and how to locate all the copies?

Flat Directory Schemes

- Full-map directory
- Limited directory
- Chained directory

Memory-Based Directory Schemes

Full bit vector (full-map) directory:
- The most straightforward scheme.
- Low latency: invalidations can be sent in parallel.
- Main disadvantage: storage overhead, which grows as P * B (a P-bit presence vector for each of the B memory blocks).
- Ways to reduce the overhead:
  - Increase the cache block size; but access time and network traffic increase due to false sharing.
  - Use a hierarchical protocol: in Stanford DASH, each node is a bus-based 4-processor multiprocessor.
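A rough worked example with illustrative numbers (not from the slides): with P = 256 processors and 64-byte (512-bit) blocks, a full-map entry adds 256 presence bits to every block,

\[
\frac{256\ \text{presence bits}}{512\ \text{data bits}} = 50\%\ \text{storage overhead per block,}
\]

which is what motivates the width- and height-reducing optimizations that follow.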

Storage Reducing Optimization: Directory Width

- Directory width: the number of bits per directory entry.
- Motivation: in practice only a few caches have a copy of any given block.
- Limited (pointer) directory: Dir_i X
  - Keeps i pointers instead of a full bit vector; storage overhead is log P * k, where k is the number of copies (pointers) recorded.
  - i indicates the number of pointers (i < P); X indicates the invalidation method used on overflow: broadcast or non-broadcast.
  - Overflow methods are needed when more than i nodes share a block.
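To make the width saving concrete (illustrative numbers): with P = 256 nodes, each pointer needs log2 256 = 8 bits, so a Dir_4 entry needs about

\[
4 \times \log_2 256 = 32\ \text{bits}
\]

versus 256 bits for a full-map entry, an 8x reduction in directory width.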

Overflow Methods for Limited Protocol

Dir_i B (Broadcast):
- On overflow, set the broadcast bit; invalidations are then broadcast to all nodes.
- Simple, but it increases write latency and wastes communication bandwidth.

Dir_i NB (Not Broadcast):
- On overflow, invalidate the copy of one existing sharer to free a pointer.
- Bad for widely shared, read-mostly data: intensive sharing of read-only and read-mostly data degrades due to the increased miss ratio.


Overflow Methods for Limited Protocol (Contd)

Dir_i CV_r (Coarse Vector):
- On overflow, the representation changes to a coarse bit vector.
- With at most i sharers the entry holds i pointers; after overflow, each bit stands for a region of r processors (coarse vector).
- Invalidations are sent to the regions of caches.
- Used in the SGI Origin; robust to different sharing patterns.
- About 70% less memory message traffic than broadcast, and at least 8% less than the other schemes.

Coarse Bit Vector Scheme

[Figure: a 16-processor example (P0-P15). With the overflow bit clear, the 8-bit entry holds two 4-bit pointers naming individual sharers. Once more than two nodes share the block, the overflow bit is set and the same 8 bits are reinterpreted as a coarse vector in which each bit covers a region of two processors (P0-P1, P2-P3, ...), so invalidations go to regions rather than to exact sharers.]
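A compact sketch of how such an entry might be encoded and queried, using i = 2 four-bit pointers, r = 2, and 16 nodes to match the figure. The field names and layout are assumptions for illustration, not the SGI Origin's actual format.

```c
/* Dir_i CV_r sketch: i = 2 pointers, r = 2, 16 nodes. */
#include <stdint.h>
#include <stdio.h>

#define NODES  16
#define PTRS    2   /* i */
#define REGION  2   /* r */

typedef struct {
    uint8_t overflow;     /* 0: pointer mode, 1: coarse-vector mode */
    uint8_t ptr[PTRS];    /* sharer ids, valid in pointer mode      */
    uint8_t nptr;         /* pointers in use                        */
    uint8_t coarse;       /* coarse vector: one bit per region      */
} cv_entry;

static void add_sharer(cv_entry *e, unsigned node) {
    if (!e->overflow && e->nptr < PTRS) {
        e->ptr[e->nptr++] = (uint8_t)node;
        return;
    }
    if (!e->overflow) {
        /* Overflow: re-encode the existing pointers as region bits. */
        e->overflow = 1;
        e->coarse = 0;
        for (unsigned k = 0; k < e->nptr; k++)
            e->coarse |= 1u << (e->ptr[k] / REGION);
    }
    e->coarse |= 1u << (node / REGION);
}

/* List every node that must receive an invalidation. */
static void invalidation_targets(const cv_entry *e) {
    if (!e->overflow) {
        for (unsigned k = 0; k < e->nptr; k++)
            printf("inval -> P%u\n", e->ptr[k]);
    } else {
        for (unsigned n = 0; n < NODES; n++)
            if (e->coarse & (1u << (n / REGION)))
                printf("inval -> P%u (region %u)\n", n, n / REGION);
    }
}

int main(void) {
    cv_entry e = {0};
    add_sharer(&e, 1);
    add_sharer(&e, 6);
    add_sharer(&e, 11);   /* third sharer triggers coarse-vector mode */
    invalidation_targets(&e);
    return 0;
}
```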

Overflow Methods for Limited Protocol (Contd)

Dir_i SW (Software):
- On overflow, the current i pointers and a pointer to the new sharer are saved into a special portion of the local main memory by software.
- MIT Alewife: LimitLESS.
- The cost of interrupts and software handling is high.

Dir_i DP (Dynamic Pointers):
- Similar to the software mechanism, but without the software overhead: the directory entry contains a hardware pointer into the local memory.
- List manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor.
- Stanford FLASH; directory overhead is 7-9% of main memory.

Dynamic Pointers

- Circular sharing list.
- Directory: a contiguous region in main memory.
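A minimal sketch of the idea, with sharer records allocated from a free list kept in a contiguous region of local main memory (a singly linked, non-circular list here for brevity; names and sizes are illustrative, not the Stanford FLASH layout).

```c
/* Dynamic-pointer sketch: per-block sharer lists built from a pool of
 * pointer records stored in one contiguous region of main memory. */
#include <stdio.h>

#define POOL 1024
#define NIL  (-1)

typedef struct { int node; int next; } ptr_rec;

static ptr_rec pool[POOL];   /* the contiguous pointer store */
static int free_head;        /* head of the free list        */

static void pool_init(void) {
    for (int i = 0; i < POOL - 1; i++) pool[i].next = i + 1;
    pool[POOL - 1].next = NIL;
    free_head = 0;
}

/* Record one more sharer for a block; returns the new list head. */
static int add_sharer(int list, int node) {
    if (free_head == NIL) return list;   /* pool exhausted: overflow case */
    int r = free_head;
    free_head = pool[r].next;
    pool[r].node = node;
    pool[r].next = list;
    return r;
}

int main(void) {
    pool_init();
    int dir_head = NIL;                  /* directory header for one block */
    dir_head = add_sharer(dir_head, 3);
    dir_head = add_sharer(dir_head, 9);
    for (int p = dir_head; p != NIL; p = pool[p].next)
        printf("sharer: node %d\n", pool[p].node);
    return 0;
}
```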

Stanford DASH Architecture


DASH => Directory Architecture for SHared memory

- Nodes connected by a scalable interconnect.
- Partitioned shared memory.
- Processing nodes are themselves (bus-based) multiprocessors.
- Distributed directory-based cache coherence.

[Figure: each node holds four processors with caches on a shared bus, plus memory and a directory whose entries carry presence bits and a dirty bit; nodes are connected by the interconnection network.]

Conclusions
- Full-map is most appropriate up to a modest number of processors.
- Dir_i CV_r and Dir_i DP are the most likely candidates beyond that.
  - Coarse vector: loses accuracy on overflow.
  - Dynamic pointers: processing cost due to the hardware list manipulation.

Storage Reducing Optimization: Directory Height

- Directory height: the total number of directory entries.
- Motivation: the total amount of cache memory is much less than the total main memory, so at any time most memory blocks have no cached copies.
- Sparse directory: organize the directory as a cache.
  - This cache needs no backing store: when an entry is replaced, invalidations are simply sent to the nodes holding copies.
  - Spatial locality is not an issue: one entry per block.
  - The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches.
  - With a directory size factor of 8, 4-way associativity, and LRU replacement, performance is very close to that of a full-map directory.
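A quick sizing illustration, assuming the usual definition of size factor as the ratio of directory entries to cached blocks in the machine, and a made-up cache-to-memory ratio: if the total cache amounts to 1/64 of main memory, a sparse directory with size factor 8 needs entries for only

\[
8 \times \frac{1}{64} = \frac{1}{8}
\]

of the memory blocks, an 8x reduction in directory height compared with one entry per memory block.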

Protocol Optimization

Two major goals, plus one:
- Reduce the number of network transactions per memory operation: reduces the bandwidth demand.
- Reduce the number of actions on the critical path: reduces the uncontended latency.
- (+1) Reduce the endpoint assist occupancy per transaction: reduces the uncontended latency as well as endpoint contention.

Latency Reduction

A read request to a block in exclusive state (L = requestor, H = home, R = owner):
(a) Strict request-response: 1. req L->H; 2. response (owner identity) H->L; 3. intervention L->R; 4a. revise R->H; 4b. response R->L.
(b) Intervention forwarding: 1. req L->H; 2. intervention H->R; 3. response R->H; 4. response H->L.
(c) Reply forwarding: 1. req L->H; 2. intervention H->R; 3a. revise R->H; 3b. response R->L.
Reply forwarding shortens the critical path to three serialized transactions (the revision message is off the critical path), compared with four in the other two schemes.

Cache-Based Directory Schemes

- The directory is a doubly linked list of entries: the home keeps a head pointer, and each cache holding a copy links to the next and previous sharers.
- Read miss: insert the requestor at the head of the list.
- Write miss: insert the requestor at the head of the list, then invalidate the sharers by traversing the list: long latency!
- Write back: the node deletes itself from the list.

[Figure: main memory at the home holds the head pointer; the caches of Node 0, Node 1, and Node 2 are chained into a doubly linked sharing list.]
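A small sketch of the list manipulations involved, with the home holding only a head pointer and each cache entry holding forward and backward pointers. Structure and function names are illustrative, not the SCI encoding.

```c
/* Doubly linked sharing list for one block (cache-based directory). */
#include <stdio.h>

#define NIL (-1)

typedef struct {
    int head;            /* home's directory: head of the sharing list */
} home_entry;

typedef struct {
    int fwd, bwd;        /* next sharer (toward tail), previous sharer */
    int valid;
} cache_entry;

static cache_entry cache[8];   /* one entry per node, same block */

/* Read or write miss: the new sharer becomes the head of the list. */
static void list_construct(home_entry *home, int node) {
    cache[node].fwd = home->head;
    cache[node].bwd = NIL;
    cache[node].valid = 1;
    if (home->head != NIL)
        cache[home->head].bwd = node;
    home->head = node;
}

/* Rollout (replacement/writeback): unlink the node from the list. */
static void rollout(home_entry *home, int node) {
    int f = cache[node].fwd, b = cache[node].bwd;
    if (b != NIL) cache[b].fwd = f; else home->head = f;
    if (f != NIL) cache[f].bwd = b;
    cache[node].valid = 0;
}

int main(void) {
    home_entry home = { NIL };
    list_construct(&home, 2);      /* node 2 read-misses        */
    list_construct(&home, 0);      /* node 0 read-misses        */
    rollout(&home, 2);             /* node 2 replaces the block */
    printf("head of sharing list: node %d\n", home.head);
    return 0;
}
```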

Tradeoffs with Cache-Based Schemes

Advantages:
- Small storage overhead: a single pointer per block at the home.
- Easier to provide fairness and to avoid livelock.
- Sending invalidations is not centralized at the home, but distributed among the sharers.

Disadvantages:
- Long latency and long occupancy of the communication assist.
- Modifying the sharing list requires careful coordination and mutual exclusion.

Latency Reduction

Invalidating a sharing list with three sharers (H = home/writer, S1-S3 = sharers):
(a) Strict request-response: H invalidates each sharer in turn and waits for its acknowledgement (1. inv, 2. ack, 3. inv, 4. ack, 5. inv, 6. ack).
(b) Intervention forwarding: each sharer forwards the invalidation down the list and acknowledges the home (1. inv H->S1; 2a. inv S1->S2, 2b. ack S1->H; 3a. inv S2->S3, 3b. ack S2->H; 4b. ack S3->H).
(c) Reply forwarding: the invalidation is passed down the list and only the tail acknowledges (1. inv H->S1; 2. inv S1->S2; 3. inv S2->S3; 4. ack S3->H).

Sequent NUMA-Q

- Uses the IEEE standard Scalable Coherent Interface (SCI) protocol.
- Targeted toward commercial workloads: databases and transaction processing.
- Commodity hardware: Intel SMP "quads" as the processing nodes, with snooping within each quad.
- DataPump network interface from Vitesse Semiconductor.
- Custom IQ-Link board: directory logic and a remote cache (32 MB, 4-way).

[Figure: each quad contains four processors, memory, PCI I/O with a peripheral interface, and an IQ-Link; the quads are connected in a ring.]

IQ Link Implementation

- Holds the directory for locally allocated data and the tags for remotely allocated but locally cached data.
- Orion bus controller (OBIC): interfaces to the quad bus; manages the snooping and requesting logic; keeps the bus-side tags in SRAM.
- Directory controller (SCLIC): programmable; manages the SCI coherence protocol; keeps the network-side tags in SDRAM together with the local directory and remote tags.
- DataPump: GaAs chip implementing the transport protocol of the SCI standard; connects to the SCI ring (1 GB/s).

[Figure: the IQ-Link board sits between the quad bus and the SCI ring, with the OBIC on the bus side, the DataPump on the ring side, and the SCLIC plus the remote-cache data and tags between them.]

Directory States

- HOME: no remote cache (quad) in the system contains a copy of the block. A processor cache within the home quad itself may still have a copy; this is not visible to the SCI protocol but is managed by the bus protocol within the quad.
- FRESH: one or more remote caches may have a read-only copy, and memory is up to date.
- GONE: a remote cache contains a writable (exclusive or dirty) copy; no valid copy exists on the home node.

Remote Cache States

- 7 bits encode 29 stable states plus many pending (transient) states.
- Each stable state has two parts:
  - First part: where the cache entry is located in the sharing list: ONLY, HEAD, TAIL, MID.
  - Second part: the actual state of the block: dirty, clean (like the exclusive state in MESI), fresh (the data may not be written until memory is informed), copy (unmodified and readable), and so on.

SCI Standard

Three primitive operations:
- List construction: add a new node at the head of the list.
- Rollout: remove a node from the list.
- Purge (invalidation): the head node invalidates all other nodes.

Levels of protocol:
- Minimal protocol: does not permit read sharing; only one copy at a time.
- Typical protocol: the level NUMA-Q implements.
- Full protocol: implements all options of the SCI standard.

Handling Read Requests

- If the directory state is HOME:
  - The home updates the block's state to FRESH and sends the data.
  - The requestor updates its state from PENDING to ONLY_FRESH.
- If the directory state is FRESH:
  - The requestor is inserted at the head of the list.
  - The previous head changes its state from HEAD_FRESH to MID_VALID, or from ONLY_FRESH to TAIL_VALID.
  - The requestor changes its state from PENDING to HEAD_FRESH.
- If the directory state is GONE:
  - The home stays in the GONE state and sends a pointer to the previous head.
  - The previous head changes its state from HEAD_DIRTY to MID_VALID, or from ONLY_DIRTY to TAIL_VALID.
  - The requestor sets its state to HEAD_DIRTY.
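A condensed sketch of the home-side transitions just listed. Only the directory state and the head pointer are modeled; the reply helper and the "fetch data from the old head" step are illustrative stand-ins, not the SCI message formats.

```c
/* Home-side handling of an SCI read request. */
#include <stdio.h>

typedef enum { HOME, FRESH, GONE } sci_dir_state;

typedef struct {
    sci_dir_state state;
    int head;              /* head of the sharing list, -1 if none */
} sci_dir_entry;

static void reply(int to, const char *what, int old_head) {
    printf("to node %d: %s (old head %d)\n", to, what, old_head);
}

static void home_read_request(sci_dir_entry *d, int requestor) {
    int old_head = d->head;
    switch (d->state) {
    case HOME:                 /* no remote copies: send data        */
        d->state = FRESH;
        d->head = requestor;   /* requestor becomes ONLY_FRESH       */
        reply(requestor, "data, no previous head", old_head);
        break;
    case FRESH:                /* read sharing: prepend to the list  */
        d->head = requestor;   /* requestor becomes HEAD_FRESH       */
        reply(requestor, "data plus pointer to old head", old_head);
        break;
    case GONE:                 /* dirty copy elsewhere: stay GONE    */
        d->head = requestor;   /* requestor will become HEAD_DIRTY   */
        reply(requestor, "pointer to old head, fetch data there", old_head);
        break;
    }
}

int main(void) {
    sci_dir_entry d = { HOME, -1 };
    home_read_request(&d, 4);   /* first reader                     */
    home_read_request(&d, 7);   /* second reader joins the list     */
    return 0;
}
```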


An Example of Read Miss

[Figure: a read miss to a FRESH block with one existing sharer. Initially the requestor is INVALID, the old head is ONLY_FRESH, and home memory is FRESH. 1st round (requestor and home): the requestor goes to PENDING; the home, still FRESH, returns the data and a pointer to the old head. 2nd round (requestor and old head): the requestor attaches itself at the head and becomes HEAD_FRESH, the old head becomes TAIL_VALID, and home memory stays FRESH.]

Handling Write Requests

- Only the head node is allowed to write a block and issue invalidations.
  - A writer in the middle of the list first removes itself (rollout) and adds itself again at the head (list construction).
- If the writer's cache block is in the HEAD_DIRTY state:
  - Purge the sharing list in a strict request-response manner.
  - The writer stays in a pending state while the purging is in progress.
- If it is in the HEAD_FRESH state:
  - The writer goes into a pending state and sends a request to the home, which changes from FRESH to GONE and replies.
  - The writer then goes into a different pending state and purges the list.
- Eventually, the writer's state becomes ONLY_DIRTY.
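A sketch of this write path for a writer that is already at the head of the list (the states follow the slides; the helper bodies are illustrative prints only):

```c
/* Writer-side SCI write handling, for a node already at the list head. */
#include <stdio.h>

typedef enum { HEAD_FRESH, HEAD_DIRTY, PENDING, ONLY_DIRTY } rc_state;

static void ask_home_fresh_to_gone(void) { printf("home: FRESH -> GONE\n"); }
static void purge_sharing_list(void)     { printf("purge list, strict request-response\n"); }

static rc_state head_write(rc_state s) {
    /* A writer in the middle of the list would first roll out and
     * re-attach at the head (not shown here). */
    if (s == HEAD_FRESH) {
        s = PENDING;             /* wait for the home's reply        */
        ask_home_fresh_to_gone();
        s = HEAD_DIRTY;          /* home replied; now purge the list */
    }
    if (s == HEAD_DIRTY) {
        s = PENDING;             /* a different pending state        */
        purge_sharing_list();
        s = ONLY_DIRTY;          /* all other copies invalidated     */
    }
    return s;
}

int main(void) {
    rc_state final_state = head_write(HEAD_FRESH);
    printf("writer final state: %s\n",
           final_state == ONLY_DIRTY ? "ONLY_DIRTY" : "?");
    return 0;
}
```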

Handling Writeback and Replacements

- Rollouts are needed on replacement, invalidation, and write.
- A node rolling out first sets itself to a pending state to prevent races; if two adjacent nodes roll out simultaneously, the node closer to the tail has priority and is rolled out first.
- Middle node rollout: unlink from the neighbors, then set the state to invalid.
- Head node rollout: the head goes into a pending state, the downstream node changes its state (e.g., MID_VALID -> HEAD_DIRTY), and the pointer in the home node is updated. If it was the only node on the list, the home state changes back to HOME.
- Writeback upon a miss: serve the writeback first, since buffering it is complicated and misses in the remote cache are infrequent (unlike the write buffer in a bus-based system).

Hierarchical Coherence

Snoop-Snoop System

- The simplest way to build large-scale cache-coherent MPs.
- Coherence monitor:
  - Remote (access) cache.
  - Local state monitor: keeps state information on data that is locally allocated but remotely cached.
- Remote cache:
  - Should be larger than the sum of the processor caches and quite associative.
  - Should be lockup-free.
  - Issues an invalidation request on the local bus when a block is replaced.

Snoop-Snoop with Global Memory

- First-level caches:
  - Highest performance SRAM caches.
  - The B1 bus follows a standard snooping protocol.
- Second-level cache:
  - Much larger than the L1 caches (set associative).
  - Must maintain inclusion.
  - Acts as a filter for the B1 bus and the L1 caches.
  - Can be DRAM based, since fewer references get to it.

[Figure: processors with L1 caches share bus B1 within a cluster; each cluster's coherence monitor with its L2 cache connects to the global bus B2 and the global memory.]

Snoop-Snoop with Global Memory (Contd)

Advantages:
- Misses to main memory require just a single traversal to the root of the hierarchy.
- Placement of shared data is not an issue.

Disadvantages:
- Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency.
- Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale.

Cluster Based Hierarchies

Key idea: distribute main memory among the clusters.
- Reduces global bus traffic (local data and suitably placed shared data stay within a cluster).
- Reduces latency (less contention, and local accesses are faster).
- Example machine: Encore Gigamax.
- The L2 cache can then be replaced by a tag-only router/coherence switch.

[Figure: each cluster has processors with caches and a local memory on bus B1, plus a coherence monitor; the coherence monitors of all clusters connect to the global bus B2.]

Summary

Advantages:
- Conceptually simple to build (apply snooping recursively).
- Requests can be merged and combined in hardware.

Disadvantages:
- Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-D and 3-D grid problems).
- Latencies are often larger than in direct networks.

Hierarchical Directory Scheme

- The internal nodes contain only directory information; the processing nodes sit at the leaves.
- An L1 directory tracks which of its children (processing nodes) have a copy of a memory block; an L2 directory tracks which of its children (L1 directories) have a copy. Each also tracks which of its local memory blocks are cached outside.
- Inclusion is maintained between the processor caches and the L1 directory.
- Logical trees may be embedded in any physical hierarchy.

[Figure: processing nodes at the leaves, L1 directories above them, and an L2 directory at the root.]

A Multirooted Hierarchical Directory

[Figure: processing nodes p0-p7 at the leaves, directory-only internal nodes above them. Each processing node is the root of the directory tree for its own memory (e.g., the directory tree for p0's memory), so the same physical processing node appears in several places across the logical trees.]

Organization and Overhead

- Organization: a separate directory structure for every block.
- Storage overhead:
  - Each level of the hierarchy holds about the same amount of directory memory.
  - With C the cache size, b the branching factor, P the number of processing nodes, M the memory size, and B the block size, the overhead is roughly (C * log_b P) / (M * B).
- Performance overhead:
  - The hierarchy can reduce the number of network hops, but it increases the number of end-to-end transactions and thus the latency.
  - The root becomes a bottleneck.
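For a feel of the tree-depth term in that expression (illustrative values): with a branching factor b = 8 and P = 512 processing nodes,

\[
\log_b P = \log_8 512 = 3,
\]

so a request may climb three directory levels before the needed information is found.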

Performance Implication of Hierarchical Coherence

Advantages:
- Combining of requests for a block: reduces traffic and contention.
- Locality effect: reduces transit latency and contention.

Disadvantages:
- Long uncontended latency.
- High bandwidth requirements near the root of the hierarchy.
