
Memory Consistency Models

cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html

Note: There is a really good tutorial on memory consistency models at
ftp://gatekeeper.dec.com/pub/DEC/WRL/research-reports/WRL-TR-95.7.pdf. A great deal of the information in these
notes comes from that paper.

These notes describe some of the important memory consistency models which have been considered in recent
years. The basic point is going to be that trying to implement our intuitive notion of what it means for memory to be
consistent is really hard and terribly expensive, and isn't necessary to get a properly written parallel program to run
correctly. So we're going to produce a series of weaker definitions that will be easier to implement, but will still allow
us to write a parallel program that runs predictably.

Notation
In describing the behavior of these memory models, we are only interested in the shared memory behavior - not
anything else related to the programs. We aren't interested in control flow within the programs, data manipulations
within the programs, or behavior related to local (in the sense of non-shared) variables. There is a standard notation
for this, which we'll be using in what follows.

In the notation, there will be a line for each processor in the system, and time proceeds from left to right. Each
shared-memory operation performed will appear on the processor's line. The two main operations are Read and
Write, which are expressed as

W(var)value

which means "write value to shared variable var", and

R(var)value

which means "read shared variable var, obtaining value."

So, for instance, W(x)1 means "write a 1 to x" and R(y)3 means "read y, and get the value 3."

More operations (especially synchronization operations) will be defined as we go on. For simplicity, variables are
assumed to be initialized to 0.

An important thing to notice about this is that a single high-level language statement (like x = x + 1;) will typically
appear as several memory operations. If x previously had a value of 0, then that statement becomes (in the
absence of any other processors)

P1: R(x)0 W(x)1



On a RISC-style processor, it's likely that C statement would have turned into three instructions: a load, an add,
and a store. Of those three instructions, two affect memory and are shown in the diagram.

On a CISC-style processor, the statement would probably have turned into a single, in-memory add instruction.
Even so, the processor would have executed the instruction by reading memory, doing the addition, and then writing
memory, so it would still appear as two memory operations.
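
To make that breakdown concrete, here is roughly how a compiler might expand x = x + 1 on a load/store machine (a sketch only; the temporary tmp is ours, not something the notes define):

int tmp = x;      /* R(x)0 -- the load; this touches memory           */
tmp = tmp + 1;    /* register-only arithmetic; no memory operation    */
x = tmp;          /* W(x)1 -- the store; this touches memory          */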

Notice that the actual memory operations performed could equally well have been produced by some completely
different high level language code; maybe an if-then-else statement that checked and then set a flag. If I ask
for the memory operations and your answer contains anything that looks like a transformation of the data, then
something is wrong!

Strict Consistency
The intuitive notion of memory consistency is the strict consistency model. In the strict model, any read to a
memory location X returns the value stored by the most recent write operation to X. If we have a bunch of
processors, with no caches, talking to memory through a bus then we will have strict consistency. The point here is
the precise serialization of all memory accesses.

We can give an example of what is, and what is not, strict consistency and also show an example of the notation for
operations in the memory system. As we said before, we assume that all variables have a value of 0 before we
begin. An example of a scenario that would be valid under the strict consistency model is the following:

P1: W(x)1
-----------------------
P2:        R(x)1    R(x)1

This says, ``processor P1 writes a value of 1 to variable x; at some later time processor P2 reads x and obtains a
value of 1. Then it reads it again and gets the same value.''

Here's another scenario which would be valid under strict consistency:

P1:         W(x)1
-------------------------------
P2: R(x)0           R(x)1

This time, P2 got a little ahead of P1; its first read of x got a value of 0, while its second read got the 1 that was
written by P1. Notice that these two scenarios could be obtained in two runs of the same program on the same
processors.

Here's a scenario which would not be valid under strict consistency:

P1: W(x)1
-----------------------
P2:        R(x)0    R(x)1

In this scenario, the new value of x had not been propagated to P2 yet when it did its first read, but it did reach it
eventually.

I've also seen this model called atomic consistency.

Sequential Consistency
Sequential consistency is a slightly weaker model than strict consistency. It was defined by Lamport as follows: the
result of any execution is the same as if the reads and writes occurred in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by its program.

In essence, any ordering that could have been produced by a strict ordering regardless of processor speeds is valid
under sequential consistency. The idea is that by expanding from the sets of reads and writes that actually
happened to the sets that could have happened, we can reason more effectively about the program (since we can
ask the far more useful question, "could the program have broken?"). We can reason about the program itself, with
less interference from the details of the hardware on which it is running. It's probably fair to say that if we have a
computer system that really uses strict consistency, we'll want to reason about it using sequential consistency.

The third scenario above would be valid under sequential consistency. Here's another scenario that would be valid
under sequential consistency:

P1: W(x)1
-----------------------
P2:        R(x)1       R(x)2
-----------------------
P3:        R(x)1       R(x)2
-----------------------
P4: W(x)2

This one is valid under sequential consistency because the following alternate interleaving would have been valid
under strict consistency:

P1: W(x)1
-----------------------------
P2:        R(x)1              R(x)2
-----------------------------
P3:        R(x)1              R(x)2
-----------------------------
P4:               W(x)2

Here's a scenario that would not be valid under sequential consistency:

P1: W(x)1
-----------------------
P2:        R(x)1    R(x)2
-----------------------
P3:        R(x)2    R(x)1
-----------------------
P4: W(x)2

Oddly enough, the precise definition, as given by Lamport, doesn't even require that ordinary notions of causality be
maintained; it's possible to see the result of a write before the write itself takes place, as in:

P1:           W(x)1
-----------------------
P2:  R(x)1

This is valid because there is a different ordering which, in strict consistency, would yield P2 reading x as having a
value of 1. This isn't a flaw in the model; if your program can indeed violate causality like this, you're missing some
synchronization operations in your program. Note that we haven't talked about synchronization operations yet; we
will soon.

Cache Coherence
Most authors treat cache coherence as being virtually synonymous with sequential consistency; it is perhaps
surprising that it isn't. Sequential consistency requires a globally (i.e. across all memory locations) consistent view
of memory operations, while cache coherence only requires a locally (i.e. per-location) consistent view. Here's an
example of a scenario that would be valid under cache coherence but not sequential consistency:

P1: W(x)1  W(y)2
-----------------------
P2:   R(x)0  R(x)2  R(x)1  R(y)0  R(y)1
-----------------------
P3:   R(y)0  R(y)1  R(x)0  R(x)1
-----------------------
P4: W(x)2  W(y)1

P2 and P3 both saw P1's write to x as occurring after P4's (and in fact P3 never saw P4's write to x at all), and saw
P4's write to y as occurring after P1's (this time, neither saw P1's write as occurring at all). But P2 saw P4's write to
y as occurring after P1's write to x, while P3 saw P1's write to x occurring after P4's write to y.

This couldn't happen with a snoopy-cache based scheme. But it certainly could with a directory-based scheme.

Do We Really Need Such a Strong Model?


Consider the following situation in a shared memory multiprocessor: processes running on two processors each
change the value of a shared variable x, like this:

P1              P2

x = x + 1;      x = x + 2;

What happens? Without any additional information, there are four different orders in which the two processes can
execute these statements, resulting in three different results:

P1 executes first
x will get a new value of 3.
P2 executes first
x will get a new value of 3.
P1 and P2 both read the data; P1 writes the modified version before P2 does.
x will get a new value of 2.
P1 and P2 both read the data; P2 writes the modified version before P1 does.
x will get a new value of 1.

We can characterize a program like this pretty easily and concisely: it's got a bug. With a bit more precision, we can
say it has a data race: there is a variable modified by more than one process in a way such that the results depend
on who gets there first. For this program to behave reliably, we have to have locks guaranteeing that one of the
processes performs its entire operation before the other one starts.
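
As a concrete illustration, here is a minimal sketch of that race using POSIX threads (the names shared_x, add1, and add2 are ours, not from the notes):

#include <pthread.h>
#include <stdio.h>

int shared_x = 0;                 /* the shared variable "x" */

void *add1(void *arg) { shared_x = shared_x + 1; return NULL; }   /* P1 */
void *add2(void *arg) { shared_x = shared_x + 2; return NULL; }   /* P2 */

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add1, NULL);
    pthread_create(&t2, NULL, add2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Depending on how the R(x)/W(x) operations interleave, this may
       print 1, 2, or 3 -- the data race described above. */
    printf("x = %d\n", shared_x);
    return 0;
}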

So... given that we have a data race, and the program's behavior is going to be unpredictable anyway, does it really
matter if all the processors see all the changes in the same order? Attempting to achieve strict or sequential
consistency might be regarded as trying to support the semantics of buggy programs -- since the result of the
program is random anyway, why should we care whether it results in the right random value? But it gets worse, as
we consider in the next sections...

Optimizations and Consistency


Even if the program contains no bugs as written, compilers generally don't support sequential consistency (a compiler
doesn't see the existence of other processors at all, let alone a consistency model. We can argue that this points to a
need for languages with parallel semantics, but as long as programmers are going to use C and Java for parallel
programs we're going to have to support them). Most languages support a semantics in which
program order is maintained for each memory location, but not across memory locations; this gives compilers
freedom to reorder code. So, for instance, if a program writes two variables x and y, and they do not depend on
each other, the compiler is free to write these two values to memory in either order without affecting the correctness
of the program. In a parallel environment, however, it is quite likely that a process running on some other processor
does depend on the order in which x and y were written.

Two-process mutual exclusion gives a good example of this. Remember the code to enter a critical section is given
by

flag[i] = true;                       /* announce that process i wants in   */
turn = 1-i;                           /* give the other process priority    */
while (flag[1-i] && (turn == (1-i)))
    ;                                 /* busy wait until it's safe to enter */

If the compiler decides (for whatever reason) to reverse the order of the writes to flag[i] and turn, this is
perfectly correct code in a single-process environment but broken in a multiprocessing environment (and, of course,
that's the situation that matters).

Worse, since processors support out-of-order execution, there's no guarantee that the program, as executed, will
perform its memory accesses in the order specified by the machine code! And as processors and caches get
ever more tightly coupled, and as machines use more and more aggressive instruction reordering, these sorts of
optimizations can end up happening in hardware with little or no control (it's very easy to imagine a machine
finishing the update to turn while it's still setting flag[i] up above, since accessing flag[i] involves access to
an array).

This is a little bit of a red herring, since we can require that our compiler perform accesses of shared memory in the
order specified by the program (the volatile keyword specifies this). In the case of Intel processors, we can also
force some ordering on memory accesses by using the lock prefix on instructions. But notice that what we are
doing by adding these keywords and prefixes is establishing places in the code where we care about the precise
ordering, and places where we do not. The following memory models expand on this idea.
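
As one illustration of marking the places where ordering matters, here is how the entry code above could be written with C11 atomics, whose (by default sequentially consistent) operations keep both the compiler and the hardware from reordering the two writes. This is a sketch of an alternative, not the notes' own solution:

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];
atomic_int  turn;

void enter_region(int i)                  /* i is 0 or 1 */
{
    atomic_store(&flag[i], true);         /* sequentially consistent store   */
    atomic_store(&turn, 1 - i);           /* cannot be reordered before it   */
    while (atomic_load(&flag[1 - i]) && atomic_load(&turn) == (1 - i))
        ;                                 /* busy wait                       */
}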

Processor Consistency
This model is also called PRAM (an acronym for Pipelined Random Access Memory, not the Parallel Random
Access Machine model from computability theory) consistency. It is defined as Writes done by a single processor
are received by all other processors in the order in which they were issued, but writes from different
processors may be seen in a different order by different processors. The basic idea of processor consistency
is to better reflect the reality of networks in which the latency between different nodes can be different.

The last scenario in the sequential consistency section, which wasn't valid for sequential consistency, would be
valid for processor consistency. Here's how it could come about, in a machine in which the processors are
connected by something more complex than a bus:

1. The processors are connected in a linear array: P1 - P2 - P3 - P4.
2. On the first cycle, P1 and P4 write their values and propagate them.
3. On the second cycle, the value from P1 has reached P2, and the value from P4
has reached P3. They read the values, seeing 1 and 2 respectively.
4. On the third cycle, the values have made it two hops. So now P2 sees 2 and P3 sees 1.

So you can see we meet the "hard" part of the definition (the part requiring writes from a single processor getting
seen in-order) somewhat vacuously: P1 and P4 only make one write each, so P2 and P3 end up seeing P1's writes,
and P4's writes, in order. But the point of the example is the counterintuitive part of the definition: they don't see the
writes from P1 and from P4 as being in the same order.

Here's a scenario which would not be valid for processor consistency:

P1: W(x)1  W(x)2
----------------------------------
P2:                  R(x)2  R(x)1

P2 has seen the writes from P1 in an order different than they were issued.

It turns out that the two-process mutual exclusion code above is broken under processor consistency: each process's write to its flag may not yet be visible to the other process, so both can read the other's flag as false and enter the critical section at the same time.

One final note on processor consistency and PRAM consistency: some authors make processor consistency
slightly stronger than PRAM by requiring it to be both PRAM consistent and cache coherent.

Synchronization Accesses vs. Ordinary Accesses


A correctly written shared-memory parallel program will use mutual exclusion to guard access to shared variables. In
the first buggy example above, we can guarantee deterministic behavior by adding a barrier to the code, which we'll
denote as S for reasons that will become apparent later:

P1              P2

x = x + 1;
S;              S;
                x = x + 2;

In general, in a correct parallel program we obtain exclusive access to a set of shared variables, manipulate them
any way we want, and then relinquish access, distributing the new values to the rest of the system. The other
processors don't need to see any of the intermediate values; they only need to see the final values.
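
A minimal sketch of the barrier version above, assuming POSIX barriers are available (pthread_barrier_t); the thread names and the separate init call are illustrative:

#include <pthread.h>

int x = 0;
pthread_barrier_t S;                   /* plays the role of "S" above */

void *thread1(void *arg)               /* P1 */
{
    x = x + 1;
    pthread_barrier_wait(&S);          /* S */
    return NULL;
}

void *thread2(void *arg)               /* P2 */
{
    pthread_barrier_wait(&S);          /* S */
    x = x + 2;                         /* guaranteed to see P1's update, so x ends up 3 */
    return NULL;
}

/* elsewhere, before starting the threads: pthread_barrier_init(&S, NULL, 2); */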

With this in mind, we can look at the different types of memory accesses more carefully. Here's a classification of
shared memory accesses [Gharachorloo]; the various types are defined as follows:

Shared Access
    Actually, we can have shared access to variables vs. private access. But the questions we're
    considering are only relevant for shared accesses, so that's all we're showing.
Competing vs. Non-Competing
    If we have two accesses from different processors, and at least one is a write, they are
    competing accesses. They are considered competing because the result depends on which
    access occurs first (if there are two accesses but they're both reads, it doesn't matter
    which is first).
Synchronizing vs. Non-Synchronizing
    Ordinary competing accesses, such as variable accesses, are non-synchronizing accesses.
    Accesses used in synchronizing the processes are (of course) synchronizing accesses.
Acquire vs. Release
    Finally, we can divide synchronization accesses into accesses to acquire locks, and accesses
    to release locks.

Remember that synchronization accesses should be much less common than other competing accesses (if you're
spending all your time performing synchronization accesses there's something seriously wrong with your program!).
So we can further weaken the memory models we use by treating sync accesses differently from other accesses.

Weak Consistency
Weak consistency results if we only consider competing accesses as being divided into synchronizing and non-
synchronizing accesses, and require the following properties:

1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable is allowed to be performed until all previous writes have completed
   everywhere.
3. No data access (read or write) is allowed to be performed until all previous accesses to synchronization
   variables have been performed.

Here's a valid scenario under weak consistency, which shows its real strength:

P1: W(x)1  W(x)2  S
------------------------------------
P2:    R(x)0  R(x)2       S   R(x)2
------------------------------------
P3:           R(x)1       S   R(x)2

In other words, there is no requirement that a processor broadcast the changed values of variables at all until the
synchronization accesses take place. In a distributed system based on a network instead of a bus, this can
dramatically reduce the amount of communication needed (notice that nobody would deliberately write a program
that behaved like this in practice; you'd never want to read variables that somebody else is updating. The only reads
would be after the S. I've mentioned in lecture that there are a few parallel algorithms, such as relaxation algorithms,
that don't require normal notions of memory consistency. These algorithms wouldn't work in a weakly consistent
system that really deferred all data communications until sync points).
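
The programming pattern weak consistency assumes looks roughly like the following sketch, where the synchronization access S is modeled with a C11 sequentially consistent atomic flag (our illustration, not the notes' notation):

#include <stdatomic.h>

int x;                                 /* ordinary shared data              */
atomic_int s;                          /* the synchronization variable      */

void writer(void)                      /* plays the role of P1              */
{
    x = 1;                             /* W(x)1 -- may stay local for now   */
    x = 2;                             /* W(x)2 -- may stay local for now   */
    atomic_store(&s, 1);               /* S: previous writes must now become
                                          visible everywhere                */
}

void reader(void)                      /* plays the role of P3              */
{
    while (atomic_load(&s) == 0)       /* S: wait for the synchronization   */
        ;
    int v = x;                         /* after the sync, this must read 2  */
    (void)v;
}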

Release Consistency
Having a single synchronization access type means that, whenever a synchronization occurs, we must globally
update memory - our local changes need to be propagated to all the other processors with copies of the shared
variable, and we need to obtain their changes. Release consistency considers locks on areas of memory, and
propagates only the locked memory as needed. It's defined as follows:

1. Before an ordinary access to a shared variable is performed, all previous acquires done by the
process must have completed successfully.
2. Before a release is allowed to be performed, all previous reads and writes done by the
process must have completed.
3. The acquire and release accesses must be sequentially consistent.
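
The acquire/release split maps naturally onto C11's memory_order_acquire and memory_order_release. Here is a hedged sketch of a spinlock written that way (our example, not part of the original notes):

#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;
int shared_data;                       /* protected by the lock             */

void update(void)
{
    /* acquire: no later ordinary access may be performed before this      */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;                              /* spin until we get the lock        */
    shared_data++;                     /* ordinary accesses, visible only
                                          to the holder of the lock         */
    /* release: all earlier ordinary accesses must complete before this    */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}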

One Last Point


It should be pretty clear that a sync access is a pretty heavyweight operation, since it requires globally synchronizing
memory. But the strength of these memory models is that the cost of a sync operation is no worse than the cost that
every memory access pays in a sequentially consistent system.

References
Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, ``Memory consistency and
event ordering in scalable shared-memory multiprocessors,'' in Proceedings of the 17th International
Symposium on Computer Architecture (1990), pp. 15-26.
