February 14, 2017
This article was contributed by Jade Alglave,
Paul E. McKenney, Alan Stern, Luc Maranget, and Andrea Parri
Introduction
This article is organized as follows, with the intended audience
for each section in parentheses:
- Introduction to the Strong Linux-Kernel Memory Model
(people interested in understanding the memory model).
- Design of the Strong Model
(people interested in the ideas behind the formal memory model).
- Adjustments for the DEC Alpha
(people interested in how the memory model had to be changed to
match the Alpha architecture).
- Adjustments for ARM
(people interested in how the memory model had to be changed to
match the ARM architecture).
- Adjustments for other architectures
(people interested in the future evolution of the memory model).
- Strong-Model Bell File
(masochists and other people interested in a deep understanding
of the Linux-kernel memory model).
- Strong-Model Cat File
(masochists and other people interested in a deep understanding
of the Linux-kernel memory model).
- (More TBD.)
This is followed by the inevitable
answers to the quick quizzes.
This section is mostly concerned with the strong memory model.
The weak (but still fairly strong as these things go)
model is derived from the strong one by relaxing several of the
less-important constraints.
The strong Linux-kernel memory model started out as an operational
model, based on the PPCMEM model for PowerPC as presented in
two papers (“Understanding POWER Multiprocessors”
[pdf]
and “Synchronising C/C++ and POWER”
[pdf])
by Susmit Sarkar, Peter Sewell, and others.
Our model was a modified version of theirs, changed to take into account
the requirements of the kernel.
herd-style Bell and Cat files
were developed as a formal axiomatization of this model.
The model then was modified to handle the peculiarities of the DEC Alpha,
and the Cat file was modified accordingly.
Some time later we incorporated ideas from the
Flowing and POP models for ARM, as presented in “Modelling the
ARMv8 Architecture, Operationally: Concurrency and ISA”
[pdf]
by Shaked Flur, Peter Sewell, and others
(together with the supplementary material
[pdf]).
This design proved to be so different from the PowerPC-oriented
operational model that there was no reasonable way to unify the two.
Instead, we abandoned our operational model and concentrated on the
formal herd model, weakening it so that it would accept litmus
tests allowed by the ARM architecture as defined by the Flowing model.
Nevertheless, the original operational model offers a very good basis
for understanding our formal model and so we present it here,
along with a discussion of the changes needed to adapt it to Alpha and
the issues raised by ARM.
The operational model divides a computer system into two parts:
the processors (or CPUs), which execute instructions,
and the memory subsystem,
which propagates information about writes and barriers among the
CPUs and is also responsible for determining the coherence order.
When a CPU executes a write or certain kinds of barriers,
it tells the memory subsystem.
And when a CPU needs to load a value from memory or cache to execute a read,
it asks the memory subsystem to provide the value.
The Processor Subsystem
Although the underlying operations involved in executing an instruction
on a modern CPU can be quite complicated,
nevertheless there always comes a point where the CPU has finished
evaluating all of the instruction's inputs and outputs and commits
itself irrevocably to using those values.
Conceptually, each instruction that gets executed is committed
at a single, precise moment in time.
(Instructions that don't get executed, such as those started speculatively
in what turns out to be the unused arm of a conditional branch,
are not committed.)
Instructions may commit in any order and at any rate, subject to certain
constraints.
For example, an instruction controlled by a conditional branch can't
be committed before the branch is, because until that time
the CPU hasn't decided for certain whether the branch will be taken.
The full set of constraints on the order of instruction execution
(which is almost but not quite the same as committal, differing only
for read instructions) is listed
below.
For instructions involving only quantities that are local to the CPU,
such as those computing register-to-register arithmetic,
that's all there is to it:
The CPU carries out the operations required
by the instruction when the inputs are available,
eventually commits to the result, and moves on.
But some instructions need more.
In particular, some require the CPU to communicate with the memory subsystem.
Writes and memory barriers are the simplest case.
When a CPU commits a write instruction, it tells the memory subsystem
the target address of the write and the value to be stored there.
It can't do this before the write commits, because once the information
has been sent to the memory subsystem there's no way to take it back.
Similarly, when a CPU commits one of the barriers that affect
write-propagation order, it informs the memory subsystem, which then
uses that information to control the way writes get propagated.
Quick Quiz 1:
But couldn't a CPU designer create a memory subsystem that did
allow writes to be taken back?
Answer
Reads are more complicated.
When a CPU starts to execute a read instruction,
it first has to calculate the target address, which may involve
adding index or base register values to a constant offset.
It then checks to see if the most recent write (in program order)
to that target address is still uncommitted;
if it is then the CPU takes the value to be stored by that write
and uses it as the value for the read.
This is called store forwarding, and it is a form of out-of-order
execution (the read can be committed before the program-earlier write).
But if there was no prior write to that address or the most recent one
has already been committed, then the CPU has to
ask the memory subsystem to retrieve the value at the target address.
Either way, we say that the read is satisfied,
and this also takes place at a precise moment in time.
A read instruction cannot commit until it has been satisfied.
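As an aside, store forwarding has visible consequences.
In the following sketch of a litmus test (the name and the
exists clause are ours, not from the article's test suite),
interposing a read of the just-written variable does not restore
ordering in the store-buffering pattern, because that read can be
satisfied by forwarding from the uncommitted store and can therefore
commit early, allowing the listed outcome:
 C C-SB+forward-sketch

 {
 }

 P0(int *x, int *y)
 {
 	int r1;
 	int r2;

 	WRITE_ONCE(*x, 1);
 	r1 = READ_ONCE(*x); /* may be satisfied by forwarding */
 	r2 = READ_ONCE(*y);
 }

 P1(int *x, int *y)
 {
 	int r3;
 	int r4;

 	WRITE_ONCE(*y, 1);
 	r3 = READ_ONCE(*y); /* may be satisfied by forwarding */
 	r4 = READ_ONCE(*x);
 }

 exists
 (0:r1=1 /\ 0:r2=0 /\ 1:r3=1 /\ 1:r4=0)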
There's more to it than that, however.
The act of satisfying a read is not irrevocable.
It may turn out, for example, that the values used in calculating the
target address were themselves not yet committed and hence are still
subject to change.
If that happens, the read instruction will need to be restarted:
The target address must be recalculated and the read must be satisfied again.
This can happen several times before the read is committed.
In fact, it can even happen several times without the read ever being
committed, if the read was started speculatively and then abandoned.
Of course, once a read has been committed then it can no longer be restarted.
Thus, a CPU carries out a read instruction by satisfying it
(perhaps more than once) and eventually committing it.
For most other instruction types, execution only involves committing the
instruction, but there is one exception.
A strong memory barrier (such as smp_mb())
is not finished when it commits.
Instead, the CPU has to wait for the strong barrier to be
acknowledged by the memory subsystem.
This doesn't happen until the memory subsystem has propagated the
barrier to all the other CPUs in the system,
and the CPU is not allowed to begin executing any instructions that
come after the strong barrier in program order until then.
This is what makes these barriers so strong (and so slow!).
Quick Quiz 2:
Why can't CPU designers use speculation to hide the slowness of
strong barriers?
Answer
The Memory Subsystem
The memory subsystem accepts write, read, and barrier requests from the CPUs.
It propagates the write and barrier requests to all the CPUs in the system
and provides responses to read requests.
It also determines the coherence order of all writes to each variable
and provides a mechanism for making certain operations atomic.
Handling read requests is quite simple.
When a CPU submits a read request for a specified target address,
the memory subsystem finds the latest write
(in the target address's coherence order) that has propagated to the
CPU and returns the value stored by that write.
This means, among other things, that a CPU cannot read from a write
until the write has propagated to that CPU, as you would expect.
It's important that the write be the coherence-latest;
otherwise the system could violate the read-read
coherence rule
if a po- (program order-) earlier read had already read from
the coherence-latest write.
Accepting a write request from a CPU is a little more complicated.
To begin with, the memory subsystem has to decide where the write
will fit into the coherence order for the target address.
Specifically, it must ensure that the write is assigned to a position
in the coherence order that is after any other writes to the same address
which have already propagated to that CPU.
This is necessary because the CPU might already have executed a po-earlier
instruction which read from one of those
other writes; if the new write were to come before that other write
in the coherence order then it would violate the read-write
coherence rule.
In addition, the memory subsystem has to
propagate the write to all the other CPUs
(a write or barrier is considered to “propagate” to its own CPU
at the time it is committed) and to the coherence point.
The coherence point is a notional place in the system where
writes and barriers get sent,
in much the same way that they are propagated to CPUs.
If you like, you can think of the coherence point as being the place
where writes finally pass out of all the internal caches and buffers,
down into memory for storage.
The key aspect of the coherence point is that different writes
to the same address arrive at the coherence point in their coherence order.
In effect, the order of their arrival at the coherence point defines
the coherence order.
Whether this is because the memory subsystem first decides on a
coherence order and then sends writes to the coherence point in that
order, or because it sends writes willy-nilly to the coherence point
and then uses the order of their arrival as the coherence order,
doesn't matter.
What does matter is that once a write has reached the coherence point,
its position in the coherence order is fixed; it is impossible for any
future writes to be assigned an earlier position.
Quick Quiz 3:
Isn't this single coherence point a huge bottleneck on large systems?
Answer
This fact is crucial for atomic operations.
The memory model represents an atomic read-modify-write (RMW) operation as
two events: a read R followed by a write W
(or conditionally followed by a write, for operations like cmpxchg()).
What makes the operation atomic is that no other writes to that address,
from any CPU, are allowed to intervene between R and W.
In other words, the memory subsystem guarantees that the write immediately
preceding W in the coherence order is the write which R
reads from.
The operational model specifies that it does this, in part, by arranging for
W to reach the coherence point at the time when it commits
(as opposed to some arbitrarily later time, like an ordinary write).
As a result, no future write will be able to sneak in before W
in the coherence order, and avoiding such “sneak writes”
is what enforces the atomic property.
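For instance, consider this sketch of a litmus test (the name and
the condition are ours).
If P1's write could squeeze into x's coherence order between
the initial value read by the xchg_relaxed() and that operation's
own write, we could see r1=0 with a final value of x=1;
atomicity forbids exactly this outcome:
 C C-atomic-sketch

 {
 }

 P0(int *x)
 {
 	int r1;

 	r1 = xchg_relaxed(x, 1);
 }

 P1(int *x)
 {
 	WRITE_ONCE(*x, 2);
 }

 exists
 (0:r1=0 /\ x=1)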
Quick Quiz 4:
But how could the system possibly prevent some other write on some other
CPU from taking place between the time the RMW's read and write execute?
Is there some Big System Lock implemented in hardware that will totally
destroy scalability???
Answer
Other than these requirements and the constraints imposed by memory barriers,
the order in which writes are propagated to CPUs and reach the coherence point
is unrestricted.
In particular, these orders don't have to bear any resemblance to the order
in which the write requests were originally sent to the memory subsystem.
It's entirely possible for two CPUs to write to the same address at
different times and have the second write come before the first in the
coherence order or be propagated before the first to a third CPU.
The kernel memory model has two broad categories of memory barriers:
those whose effects are entirely local to a single CPU and those that
interact with the memory subsystem.
Barriers in the first category can only constrain the order in which
the CPU executes instructions, whereas those in the second group
can also affect the order of propagation of writes.
Associated with each memory barrier are two sets of instructions,
called the barrier's pre-set and its post-set.
These sets vary according to the type of barrier, but all barriers
other than the three “read-dependency” barriers
share these features:
- A barrier cannot commit until every instruction in its pre-set
has committed, and
- An instruction in the barrier's post-set cannot commit or be
satisfied until the barrier has committed.
The various barriers included in this memory model,
and their types and pre- and post-sets, are listed
in the following table in order of increasing strength.
(Contrary to what you might expect, smp_load_acquire(),
rcu_dereference(), and lockless_dereference()
are represented in the memory subsystem as a read request followed by a
separate barrier request.
Similarly, smp_store_release() and rcu_assign_pointer()
are represented as a barrier request followed by a separate write request.
Thus, each requires the CPU to issue two requests to the memory subsystem.)
Barrier | Type | Pre-set | Post-set
rcu_dereference(), lockless_dereference() | Read-dependency | itself | all po-later reads with a dependency from this read
smp_read_barrier_depends() | Read-dependency | all po-earlier reads | all po-later reads with a dependency from a po-earlier read
smp_load_acquire() | Execution-order | itself | all po-later memory accesses
smp_rmb() | Execution-order | all po-earlier reads | all po-later reads
smp_wmb() | B-cumulative | all po-earlier writes | all po-later writes (*)
smp_store_release(), rcu_assign_pointer() | A-cumulative (**) | all po-earlier memory accesses (*) | itself and members of its release sequence
smp_mb(), synchronize_rcu() | Strong (A- and B-cumulative) | all po-earlier memory accesses (*) | all po-later memory accesses (*)
(*) as modified by the cumulativity requirements described below.
(**) also B-cumulative when read by a load-acquire, as described
here.
Quick Quiz 5:
The terms “A-cumulativity” and “B-cumulativity”
aren't particularly mnemonic, are they?
Answer
The read-dependency and execution-order barriers are purely local to their
own CPU.
(In fact, in this model the read-dependency barriers have no effect at all.
We will see later that they matter only when the model is altered to
work with the DEC Alpha.)
However, when the CPU commits one of the others,
collectively referred to as “propagation-order” barriers,
it informs the memory subsystem about the barrier, and the memory subsystem
propagates the barrier to all the other CPUs.
This is where the barrier's propagation ordering effects come into play:
- The memory subsystem will not propagate a barrier to a CPU until
all the writes in the barrier's pre-set have been propagated to
that CPU.
- The memory subsystem will not propagate a write in a barrier's
post-set to a CPU until the barrier has been propagated to that CPU.
The same is true for the order in which writes and barriers reach the
coherence point; in this respect barriers treat the coherence point
much like another CPU.
In addition, the memory subsystem does not
acknowledge a strong barrier
until the barrier has been propagated to every CPU
and has reached the coherence point
(and as mentioned above,
the CPU will not satisfy or commit any instructions po-after a strong
barrier until the barrier has been acknowledged).
The propagation-order barriers enjoy varying degrees
of cumulativity.
This means that the barriers affect the propagation order not just
of writes issued by the barrier's own CPU, but also of writes issued
by other CPUs.
In effect, the barriers' pre- and post-sets are enlarged:
- The pre-set of an A-cumulative barrier includes
all writes that have propagated to the barrier's CPU before
the barrier is committed.
- The post-set of a B-cumulative barrier includes
all memory accesses on another CPU that execute after the barrier has
propagated to that CPU.
Quick Quiz 6:
By symmetry, shouldn't a B-cumulative barrier's post-set include all writes
that propagate to the barrier's CPU after the barrier is committed?
Answer
The memory model adopts the idea of
release sequences
(slightly altered) from C11.
For any store-release instruction (such as smp_store_release(),
rcu_assign_pointer(), or xchg_release()),
the release sequence headed
by that instruction includes the instruction itself as well as
all po-later writes to the same address.
The release sequence also includes, recursively,
any atomic RMW instruction accessing the same address,
on any CPU, that reads from a write in the release sequence.
Every write in the release sequence belongs to the
associated barrier's post-set.
Some examples of cumulativity and release sequences are presented
below.
To see how this works out in practice, consider this litmus test
(an example of the “Store-Buffering” pattern):
Strong Model Litmus Test #1
1 C C-SB+o-mb-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 int r1;
9
10 WRITE_ONCE(*x, 1);
11 smp_mb();
12 r1 = READ_ONCE(*y);
13 }
14
15 P1(int *x, int *y)
16 {
17 int r2;
18
19 WRITE_ONCE(*y, 1);
20 smp_mb();
21 r2 = READ_ONCE(*x);
22 }
23
24 exists
25 (0:r1=0 /\ 1:r2=0)
Using the features of the memory model already discussed, we can
show that this test's “exists” condition will never be satisfied.
When the test is executed, one of the two memory barriers must be
acknowledged before or at the same time as the other
(they are both strong barriers).
Suppose the barrier in P0 gets acknowledged first.
Then the following events have to occur in the order listed, for the
reasons shown:
- P0's write to x propagates to P1 before P0's memory barrier
does, because the write is in the barrier's pre-set.
- P0's memory barrier propagates to P1 before it is acknowledged,
because a strong memory barrier is not acknowledged until
it has propagated to every CPU.
- P0's memory barrier is acknowledged before or at the same time as
P1's barrier, by assumption.
- P1's read of x is satisfied after P1's barrier is
acknowledged, because the read comes after the barrier
in program order.
Hence the write to x propagates to P1 before P1's read is
satisfied.
Since that write is the last one in the coherence order for x,
it is the one that will be used to satisfy the read.
Therefore r2 will end up equal to 1, not 0.
The opposite case (where P1's barrier is acknowledged first) is
symmetrical.
In neither case is it possible for r1 and r2 both to
be equal to 0.
The strong model agrees:
Outcome for Strong Model Litmus Test #1
1 Test C-SB+o-mb-o+o-mb-o Allowed
2 States 3
3 0:r1=0; 1:r2=1;
4 0:r1=1; 1:r2=0;
5 0:r1=1; 1:r2=1;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (0:r1=0 /\ 1:r2=0)
10 Observation C-SB+o-mb-o+o-mb-o Never 0 3
11 Hash=a61f698662bb72c2ed1755812580d385
Processor-Local Ordering Requirements
While executing instructions, a CPU observes various ordering requirements.
Some of these are obvious (an instruction can't be executed before the CPU
knows how or whether to execute it, as mentioned
earlier).
Others are less obvious but are necessary to avoid violating the four
coherence rules.
The first and simplest requirement is that a CPU will not commit
an instruction that is po-after a conditional branch
until the branch itself is committed.
At the hardware level this is true even for trivial conditionals;
CPUs do not recognize that expressions like “x == x”
must always hold.
The situation in higher-level languages is not as simple, because
optimizing compilers do recognize such things.
They will happily eliminate the unnecessary test and conditional jump
from the object code entirely,
leaving no ordering requirement for the CPU to obey at runtime.
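For example (a hypothetical fragment of ours):
 int r1 = READ_ONCE(*x);
 if (r1 == r1)		/* always true; the compiler drops the test */
 	WRITE_ONCE(*y, 1);
The compiler is entitled to emit the store unconditionally, so the
CPU never sees a conditional branch and the intended control
dependency evaporates.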
Quick Quiz 7:
Given the steadily increasing number of transistors, why couldn't a CPU
analyze code to detect at least some classes of “x == x”
comparisons?
Answer
Compilers also realize that code following the end of an
“if (...) {...} else {...}” statement
will be executed regardless of which branch is taken,
and they are free to move such code up before the start of the
“if” in the absence of any reason not to.
Therefore our memory model applies this ordering requirement
only to the instructions that lie within the two branches
of the conditional, i.e., those that are directly under its control.
These instructions cannot be committed before the instructions that compute
the “if” condition and decide whether to take the branch.
The memory model has a weakness in this area.
If both branches of an “if” statement
write the same value to the same variable,
an optimizing compiler may replace both writes with a single write
that is executed before the “if” statement branches.
This resulting single write then would not be subject to the ordering
requirement at runtime, even though the model says it should be.
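For example (again a hypothetical fragment):
 int r1 = READ_ONCE(*x);
 if (r1)
 	WRITE_ONCE(*y, 1);	/* same value stored... */
 else
 	WRITE_ONCE(*y, 1);	/* ...on both branches */
Nothing stops the compiler from emitting a single unconditional
store to y ahead of the branch, destroying the ordering between
the read of x and the write of y.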
(We must emphasize that this requirement applies only to committing
instructions. Reads following a conditional branch can be satisfied
before the branch is committed, and they often are.
Even if a read belongs to the arm of the branch that is ultimately not taken,
the CPU may speculatively start executing the read before it knows
which way the branch will go.
Although all architectures have barriers that can prevent speculative reads,
such as the isb and isync instructions on ARM and PowerPC,
respectively, the Linux kernel does not include any facility
specifically meant for this purpose.
If you really want to prevent a read in a conditional branch
from being satisfied speculatively, you can always use smp_rmb().)
The preceding requirement was about whether a CPU should execute
an instruction.
The next ordering requirement is about how an instruction should
be executed.
If an instruction has a
dependency
from a po-earlier read, then the instruction cannot commit until the
read does.
This is simply because the address or data the instruction will use isn't
irrevocable until the earlier read commits.
The next pair of requirements affect only writes.
A write cannot commit until all sources of address dependencies to a
memory-access instruction po-before the write have committed.
This is necessary because the po-earlier memory access might generate
an invalid-address exception, in which case the write should not be executed.
However, the CPU can't know whether the earlier access's target
address will turn out to be invalid until the target address is fully
determined, which means that all sources of an address dependency have
to be committed.
Furthermore, a write cannot commit until all po-earlier instructions
accessing the same address have committed.
This requirement is necessary to enforce the coherence rules.
If two writes to the same address committed and were sent to the memory
subsystem out of order,
the memory subsystem would put the po-earlier write later in the
address's coherence order, which would violate write-write coherence.
And if a write committed before a po-earlier read of the same address
was satisfied, the memory subsystem would use the value stored by the write
(or something even later in the coherence order) to satisfy the
read request, which would violate read-write coherence.
But since the po-earlier read may be restarted (and thus satisfied again)
at any time up until it commits,
this means the write must not commit until the read is committed.
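A minimal sketch of a litmus test (ours) illustrates the
write-write case: if the two stores below could commit out of
order, the memory subsystem would place the store of 2 before the
store of 1 in x's coherence order, making the listed final state
possible; write-write coherence forbids it:
 C C-CoWW-sketch

 {
 }

 P0(int *x)
 {
 	WRITE_ONCE(*x, 1);
 	WRITE_ONCE(*x, 2);
 }

 exists
 (x=1)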
The next three requirements are concerned with restarting reads.
A read instruction R must be restarted after:
- A po-earlier read R' is satisfied, where R' is
the source of an address dependency to R, or the source
of an address or data dependency to a write that was forwarded
to R.
- A po-earlier read R' of the same address is satisfied,
unless R read from the same write as R' or
was forwarded from a write that is po-after R'.
- A po-earlier write W to the same address is committed,
unless R was forwarded from W or
from a write that is po-after W.
It follows that R cannot commit until each of the R' or W
accesses mentioned here has committed (because until then,
R' might restart and thus be satisfied again, or W might commit,
requiring R to restart).
The reason for the first case is pretty obvious; the other two
are more obscure.
For the second case,
suppose R' is po-before R and they read from different
writes, W' and W respectively.
Assuming W was not forwarded to R, this means that
W was the most recent write in the coherence order to have
propagated to the CPU as of the time when R was satisfied.
Similarly, either R' was forwarded from W' or else
W' was the most recent write in the coherence order to have
propagated to the CPU as of the time when R' was satisfied.
But if R' was satisfied after R, then
W' must come after W in the coherence order—if
it came before then R' would have read from W instead
of W'.
(We can discount the possibility that W' was forwarded to
R'; if it had been then it or a write po-after R'
would have been forwarded to R.)
But W' being coherence-later than W
would be a violation of read-read coherence.
Thus R' must be satisfied before R, and the only
way to guarantee this is to require that R be restarted after
R' is satisfied.
For the third case, suppose W was the last write before
R (in program order) to access the same address,
and R was satisfied before W committed but
was not forwarded from it.
(This could happen if W's target address had not yet been
determined at the time R was satisfied.)
Then R would have either read from some other write W'
which had already propagated to the CPU, or else been forwarded from
some other write W' that was po-earlier than W.
Either way, when W did commit later on,
it would be assigned a position in the coherence order after W'.
This would violate write-read coherence.
Thus R must be satisfied at some time after W commits,
and the only way to guarantee this is to
require that R be restarted when W commits.
The last two ordering requirements reflect the operation of memory barriers.
That is, instructions in the barrier's pre-set must commit before the
barrier does, and operations in the post-set must commit after.
When we express these ordering requirements in the memory model,
it turns out that when a read instruction commits is relatively unimportant;
what really matters is when the read is satisfied for the last time.
Therefore we will say that a read executes when it is last
satisfied, whereas all other instructions execute when they commit.
In these terms, the ordering requirements take the following form.
Let A and B be instructions with A before
B in program order.
Then A must execute before B if any of the following hold:
- A is a conditional branch and B is a write
instruction controlled by A.
- There is a dependency from A to B.
- There is a dependency from A to a write that is
forwarded to B.
- B is a write and A is the source of an
address dependency to a memory access instruction between them.
- B is a write and A accesses the same address
as B.
- A and B are reads of the same address, and
B does not read from the same write as A and
is not forwarded from a write that is between them.
- A is a write and B is a read of the same address,
and B is not forwarded from A or from a write
that is between them.
- B is a barrier and A is in its pre-set.
- A is a barrier and B is in its post-set.
(Taken together, requirements 8 and 9 say that A must
execute before B whenever they are separated in program order
by a suitable barrier, such that A is in the barrier's pre-set
and B is in the barrier's post-set.)
Collectively these are called Preserved Program Order (PPO) requirements,
because they force the CPU to execute certain instructions in program order.
The PPO requirements were expressed almost verbatim in the
original version of the strong kernel memory model.
Together with the description of how the memory subsystem works,
they suffice to guarantee that the four coherence rules will always be obeyed.
The resulting model applied quite nicely to x86, Sparc, and PowerPC;
however, it was not accurate for some other architectures.
Quick Quiz 8:
Given all these constraints, how can weak-memory CPUs possibly expect to
attain any benefits of any sort compared to strong-memory CPUs?
Answer
The following litmus tests illustrate the ideas behind A- and B-cumulativity
and release sequences.
Strong Model Litmus Test #2
1 C C-wmb-is-not-A-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *x, int *y, int *z)
12 {
13 r1 = READ_ONCE(*x);
14 WRITE_ONCE(*z, r1);
15 smp_wmb();
16 WRITE_ONCE(*y, 1);
17 }
18
19 P2(int *x, int *y)
20 {
21 r2 = READ_ONCE(*y);
22 smp_rmb();
23 r3 = READ_ONCE(*x);
24 }
25
26 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
This test's exists clause can be satisfied.
Even though P0's write to x propagates to P1 before P1
reads x (as proved by the fact that r1=1 at the end),
and a data dependency forces the read to occur before z is
written, which happens before the smp_wmb() barrier commits,
the write to x is not in the barrier's pre-set because
smp_wmb() is not A-cumulative.
As a result, the barrier and the write to y are allowed to
propagate to P2 before the write to x does
(although not before the write to z).
In the end, this is just a fancy way of saying that smp_wmb()
doesn't order two writes if the first write was executed on a
different CPU from the one that executed the smp_wmb().
Strong Model Litmus Test #3
1 C C-release-is-A-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *x, int *y)
12 {
13 r1 = READ_ONCE(*x);
14 smp_store_release(y, 1);
15 }
16
17 P2(int *x, int *y)
18 {
19 r2 = READ_ONCE(*y);
20 smp_rmb();
21 r3 = READ_ONCE(*x);
22 }
23
24 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
By contrast, this test's exists clause cannot be satisfied.
Because smp_store_release() is A-cumulative and because P0's
write to x propagates to P1 before the smp_store_release()
commits, the write is in the release barrier's pre-set.
Consequently the new value of x must propagate to P2 before
the store to y can.
Since P2 is forced to read x after reading y
(by the smp_rmb()), and since it sees the new value of y,
it must also see the new value of x.
Strong Model Litmus Test #4
1 C C-wmb-is-B-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_wmb();
10 WRITE_ONCE(*y, 1);
11 }
12
13 P1(int *y, int *z)
14 {
15 r1 = READ_ONCE(*y);
16 WRITE_ONCE(*z, r1);
17 }
18
19 P2(int *x, int *z)
20 {
21 r2 = READ_ONCE(*z);
22 smp_rmb();
23 r3 = READ_ONCE(*x);
24 }
25
26 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
B-cumulativity refers to writes that occur after the barrier.
Even though we don't tend to think of
smp_wmb() as ordering
writes carried out by other CPUs, in every Linux-supported architecture
for which we know the details, it does, courtesy of the fact that
smp_wmb() is B-cumulative.
In this example, the write to x, the smp_wmb() barrier,
and the write to y must propagate from P0 to P1
in order.
The data dependency from P1's read of y to its write
of z forces the write to occur after the new value of y
has been seen, and hence after the smp_wmb() barrier has propagated
to P1.
As a result, the write to z is in the barrier's post-set,
so the barrier must propagate to P2 before the write can.
That's what being B-cumulative means.
Before P2 can read the new value of z, the barrier and
hence the new value of x must have propagated there.
Therefore P2's read of x must see the new value,
and so the exists clause cannot be satisfied.
Strong Model Litmus Test #5
1 C C-release-is-not-B-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 }
11
12 P1(int *y, int *z)
13 {
14 r1 = READ_ONCE(*y);
15 WRITE_ONCE(*z, r1);
16 }
17
18 P2(int *x, int *z)
19 {
20 r2 = READ_ONCE(*z);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
As before, the write to x, the smp_store_release()'s barrier,
and the write to y must propagate from P0 to P1 in order,
and so the write to z must occur after the barrier has
reached P1.
Nevertheless, because smp_store_release() is not B-cumulative,
the write to z isn't in the barrier's post-set,
and so the new value of z is allowed to propagate to P2
before either the barrier or the new value of x.
Consequently it is possible for P2 to read the new value of z
followed by the old value of x.
Despite what the previous
example shows, in this memory model
smp_store_release() is B-cumulative along pathways where
it is read by smp_load_acquire().
The following example illustrates this point.
Strong Model Litmus Test #6
1 C C-release-acquire-is-B-cumulative.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 }
11
12 P1(int *y, int *z)
13 {
14 r1 = smp_load_acquire(y);
15 WRITE_ONCE(*z, 1);
16 }
17
18 P2(int *x, int *z)
19 {
20 r2 = READ_ONCE(*z);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
In this litmus test, P0's smp_store_release() is read by
P1's smp_load_acquire().
As a result, the release barrier acts B-cumulatively and so P1's
write of z cannot propagate to P2 until the barrier has.
Hence it is not possible for P2 to read the new value of z
followed by the old value of x.
Note that this applies only along the pathway of the
smp_load_acquire().
This example:
Strong Model Litmus Test #7
1 C C-release-B-cumulative-only-on-acquire-path.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 }
11
12 P1(int *y, int *z)
13 {
14 r1 = READ_ONCE(*y);
15 WRITE_ONCE(*z, r1);
16 }
17
18 P2(int *x, int *z)
19 {
20 r2 = READ_ONCE(*z);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 P3(int *y)
26 {
27 r4 = smp_load_acquire(y);
28 }
29
30 exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0 /\ 3:r4=1)
is the same as the Strong Model Litmus Test #5 example above,
except that it has a fourth thread P3 which uses
smp_load_acquire() to read the value of y stored by P0.
However, this interaction does not cause the release barrier's
effect on P1 to be B-cumulative;
the exists clause is still allowed to succeed.
(The barrier's effect on P3 is B-cumulative, but the example
does not probe this fact.)
The following litmus test shows a non-trivial use of a release sequence.
Strong Model Litmus Test #8
1 C C-relseq.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 WRITE_ONCE(*y, 2);
11 }
12
13 P1(int *y)
14 {
15 r1 = xchg_relaxed(y, 3);
16 }
17
18 P2(int *x, int *y)
19 {
20 r2 = READ_ONCE(*y);
21 smp_rmb();
22 r3 = READ_ONCE(*x);
23 }
24
25 exists (1:r1=2 /\ 2:r2=3 /\ 2:r3=0)
The release sequence headed by P0's smp_store_release() to y
includes WRITE_ONCE(*y, 2), because that write is po-after the
store-release.
It also includes the atomic xchg_relaxed() operation in P1,
because that operation accesses y and reads from a write in the
release sequence (the WRITE_ONCE()).
Consequently P1's atomic write belongs to the release barrier's
post-set, and it cannot propagate to P2 before the barrier and
the write to x do.
Note that smp_store_release() barriers do not become B-cumulative
along paths where smp_load_acquire() reads from an arbitrary
member of the release sequence,
but only when the load-acquire reads directly from the store-release itself.
This is illustrated by the following example.
Strong Model Litmus Test #9
1 C C-relseq.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*x, 1);
9 smp_store_release(y, 1);
10 WRITE_ONCE(*y, 2);
11 }
12
13 P1(int *y)
14 {
15 r1 = xchg_relaxed(y, 3);
16 }
17
18 P2(int *y, int *z)
19 {
20 r2 = smp_load_acquire(y);
21 WRITE_ONCE(*z, 1);
22 }
23
24 P3(int *x, int *z)
25 {
26 r3 = READ_ONCE(*z);
27 smp_rmb();
28 r4 = READ_ONCE(*x);
29 }
30
31 exists (1:r1=2 /\ 2:r2=3 /\ 3:r3=1 /\ 3:r4=0)
Here the smp_load_acquire() in P2 reads from the
xchg_relaxed() in P1, which is part of the
smp_store_release()'s release sequence as before.
But because it does not read directly from the
smp_store_release() instruction, it does not cause the barrier
to act B-cumulatively.
Hence P2's write to z is allowed to propagate to P3 before
P0's release barrier or write to x.
The strong kernel model essentially consists of checks for six types
of forbidden cycles, related to cache coherence, atomic operations,
instruction execution times, write and barrier propagation,
arrival at the coherence point, and
the RCU guarantee.
The last one is concerned more with how RCU is implemented in the kernel
than with hardware instruction processing, so we will not discuss it here.
The model treats cache coherence
in exactly the same way as the Coherent-RMO model did,
by defining the coherence-order relation and including a
“coherence” check that requires the union of po-loc
and coherence-order to have no cycles.
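In cat-like form, the check might be written as follows (a sketch
using the names from this discussion rather than the actual file's
text):
let coherence = po-loc | coherence-order
acyclic coherence as coherence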
The treatment of atomic RMW operations is even simpler.
What makes these operations atomic is that no write from another CPU
is allowed to intervene (in the coherence order) between
the write that the RMW operation reads from and the RMW's own write.
The model defines a relation that links the read event of each RMW instruction
to the instruction's write event by way of such an intervening write,
and includes an “atomic” check that requires this relation
to be empty.
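In herd's notation this check can be sketched compactly.
A link in (fre ; coe) starting at an RMW's read passes through a
write, on some other CPU, that is coherence-after the write the
read took its value from and coherence-before the RMW's own write;
requiring the intersection with the rmw relation to be empty is
exactly the atomicity condition just described (a sketch; the
actual file may differ in detail):
empty rmw & (fre ; coe) as atomic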
Instruction execution ordering comes in several different forms.
For now we'll consider the easiest version, as defined for the original
PowerPC-oriented operational memory model.
Because we are designing a memory model, we are especially concerned
with ordering of memory accesses,
and we will ignore other kinds of instructions.
The conditions which require one memory access to execute after another
can be grouped in four general categories (listed along with the relations
the model uses to express them):
- A read that reads from a write on another CPU must execute after
that write (the rfe relation).
- An access in a non-read-dependency memory barrier's post-set
must execute after an access in the barrier's pre-set (the
ordering-fence relation).
- Any of the
Preserved Program Order
requirements listed earlier can force a CPU to execute
one access after another (the ppo relation).
- A barrier's write-propagation properties can justify an
argument by contradiction showing that under the right circumstances,
if two accesses are on the same CPU then the first could not
have executed after the second; it follows that the second must have
executed after the first (the obs relation).
The hb relation incorporates these requirements.
If there is a sequence of hb links from A to B,
it means that A must execute before B.
As mentioned
earlier,
any cycle in the times of execution of instructions is prohibited;
accordingly, the memory model contains a “happens-before” check
that requires the
hb relation not to have any cycles.
It may seem a little odd to group the ordering-fence
relation separately from the ppo relation, since they
are essentially the same sort of thing;
doing it this way seems to be historical practice.
Grouping the rfe relation separately from the others makes much
more sense, because it is the only way to directly relate two accesses
on different CPUs.
(Although a barrier's pre- and post-sets may be considered to contain
accesses from other CPUs, thanks to cumulativity, the ordering-fence
relation only affects accesses on the barrier's own CPU.
Cumulativity is handled separately.)
The ppo relation requires an in-depth discussion.
It is where the differences between the PowerPC, Alpha, and ARM architectures
show up most strongly.
We'll begin with an approach tailored to the PowerPC-oriented model
and then later discuss the changes required for Alpha and ARM.
In a nutshell, the memory model defines a relation for each of the types
of PPO links listed earlier, except the ones involving barriers:
- The builtin ctrl relation links a read access to any
memory access in a conditional branch that depends on the value
of the read.
The memory model computes (ctrl & (R*W)) to limit the
targets of these control dependencies to write accesses.
- The builtin addr and data relations link
a read access to any later accesses with an address or data dependency
from the read.
The dep relation is defined as the union of these two.
- The model defines the dep-rfi relation as the concatenation
of a dep link and an rfi link.
rfi links writes to reads in the same thread that read
from them.
These reads may have been satisfied by forwarding the write or by
going through the memory subsystem, but either way the effect is
the same: The read at the end of the rfi link must execute
after the read at the start of the dep link.
- The addrpo relation is defined as the concatenation of an
addr link and a po link.
The memory model computes (addrpo & (R*W)), which thus
connects reads carrying an address dependency to writes that are
po-after the target of the dependency.
- The builtin po-loc relation links a memory access to any
po-later access of the same address.
The memory model computes (po-loc & (M*W)) to limit the
targets of these links to writes.
- The rdw (“read different writes”) relation
is defined as (po-loc & (fre ; rfe)).
It thus links two reads of the same address in the same thread,
if they read from different writes and the po-later read
takes its value from a write in a different thread.
Thanks to the coherence rules, this is equivalent to saying that
the later read was not forwarded from a write between the two reads.
- The detour relation is defined as
(po-loc & (coe ; rfe)).
It thus links a write to a po-later read (of the same address
in the same thread) that reads from a write in a different thread.
This certainly implies that the read was not forwarded from the
write or from another write in between.
The case where the read does read from the write,
through the memory subsystem instead of by forwarding,
is left as an exercise for the reader.
The ppo relation in the PowerPC-based model is the union of
these relations.
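Collected into cat-like form, these definitions might look like
this sketch (herd supplies ctrl, addr, data, rfi, rfe, fre,
coe, po, and po-loc as builtins; the actual file differs in
detail):
let dep = addr | data
let dep-rfi = dep ; rfi
let addrpo = addr ; po
let rdw = po-loc & (fre ; rfe)
let detour = po-loc & (coe ; rfe)
let ppo = (ctrl & (R*W)) | dep | dep-rfi | (addrpo & (R*W)) |
    (po-loc & (M*W)) | rdw | detour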
Quick Quiz 9:
Following up on the exercise for the reader in the detour
relationship, what happens if the value from the write is forwarded
to that thread's later read?
Answer
The rmb, wmb, mb, sync,
acq-po, and po-relass relations are defined
to link accesses in each of the barriers' pre-sets to accesses in the
corresponding post-sets.
(po-relass handles both smp_store_release() and
rcu_assign_pointer(); the kernel implements the latter by means
of the former.)
The ordering-fence relation is the union of these.
It thus incorporates all but the read-dependency barriers.
Finally, the hb relation is taken to be the union of ppo,
ordering-fence, rfe, and one other relation,
called obs (short for “observation”).
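In cat-like pseudocode this assembly might be sketched as follows
(in the real model, as discussed below, hb and obs are defined
by mutual recursion):
let ordering-fence = rmb | wmb | mb | sync | acq-po | po-relass
let hb = ppo | ordering-fence | rfe | obs
acyclic hb as happens-before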
In order to explain obs, it is necessary to discuss first
the effects of the propagation-order barriers: synchronize_rcu(),
smp_mb(), smp_wmb(), and the various store-release
instructions, such as smp_store_release() and
rcu_assign_pointer().
This is where cumulativity comes into play.
An A-cumulative barrier's pre-set is closed under (rfe^-1).
That is, if R is a read in the pre-set of an A-cumulative barrier
and W ⟶ R is a link in rfe (where W
is a write), then W must also be in the barrier's pre-set.
The reason is simple.
Because it is in the pre-set, R must precede the barrier
in program order and must execute before the barrier does.
The fact that W is on another CPU and R reads from
W means that W must have propagated to the barrier's
CPU before R executed, hence before the barrier was committed.
By the definition of A-cumulativity, W must therefore be a member
of the barrier's pre-set.
Similarly, a B-cumulative barrier's post-set is closed under hb.
That is, if X is in the post-set of a B-cumulative barrier
and X ⟶ Y is a link in hb, then Y
must also be in the barrier's post-set.
The fact that X is in the post-set means that the barrier
propagates to X's CPU before X executes.
If the hb link from X to Y
is anything other than rfe, it says that Y executes
on the same CPU as X and after X does, hence after
the barrier propagates to that CPU.
If the link is an instance of rfe, it says that X
is a write and Y reads from X on a different CPU.
Since X is in the barrier's post-set,
the barrier must propagate to Y's CPU before X
does, hence before Y executes.
Either way, by the definition of B-cumulativity, Y must be a member
of the barrier's post-set.
(Note that being closed under hb is a transitive property.
In other words, if X ⟶ Y and Y ⟶ Z are
links in hb, and if X is in a B-cumulative barrier's
post-set, then Y and Z must both be in the
post-set as well.
By contrast, being closed under (rfe^-1) is not transitive.
It is not possible to have X ⟶ Y and Y ⟶ Z
both be links in rfe, because Y can't be
both a read event and a write event.)
Quick Quiz 10:
What about RMW (read-modify-write) instructions, such as xchg()
or atomic_inc()?
Don't they constitute both a read and a write?
Answer
(Incidentally, it is a subtle point worth mentioning that this property of
hb—that B-cumulative post-sets are closed under it—is
perhaps even more important than the original characterization of
hb in terms of one instruction executing before another.
When the instructions are on the same CPU, the two concepts amount to
the same thing.
But when the instructions are on different CPUs, the concepts are definitely
not the same.
In order to have an hb link from X to Y on another
CPU, it is not sufficient for X to execute before Y.
It is also necessary for every B-cumulative barrier that propagates to
X's CPU before X executes to also propagate to
Y's CPU before Y executes.
Guaranteeing this requires some sort of causal connection between the CPUs,
such as a write on X's CPU after X executes that is read by
Y's CPU before Y executes.
This helps explain why the only inter-CPU part of the definition of hb
is rfe.)
Putting these ideas together we can define the propbase relation,
which links accesses in a propagation-order barrier's pre-set to accesses
in its post-set, taking cumulativity into account.
For simplicity, we'll ignore release sequences
and the effect of load-acquire reading from store-release.
propbase then looks like this
(the actual definition of propbase in the strong kernel model
is considerably more complicated,
in part because it doesn't ignore these extra factors):
let AB-cum-propbase = rfe? ; strong-fence ; hb*
let A-cum-propbase = AB-cum-propbase | (rfe? ; po-relass)
let B-cum-propbase = AB-cum-propbase | (wmb ; hb* )
let propbase = A-cum-propbase | B-cum-propbase
Here strong-fence is the union of mb and sync
(the two types of strong barriers), and po-relass covers all
the store-release instructions.
These definitions express the fact that strong barriers are both
A- and B-cumulative, store-release barriers are A-cumulative,
and smp_wmb() is B-cumulative.
Thus AB-cum-propbase applies to barriers that are both
A-cumulative and B-cumulative, A-cum-propbase applies to
barriers that are A-cumulative, B-cum-propbase applies to
barriers that are B-cumulative, and propbase gives the
complete relation.
(The extra space between the ‘*’ and ‘)’ characters is
necessary because herd, following OCaml, reserves “*)”
for marking the end of a comment.)
Notice that every relation making up the definition of propbase
is also part of hb.
In fact, propbase is a sub-relation of hb.
However, it expresses more than the simple fact that one instruction
executes before another.
Whenever there is a propbase link from a memory access X
to another access Y, the operational model says that the following
somewhat unwieldy ordering property (which we can call the
“Propbase” property) holds:
If X is a write, let W = X; otherwise let W
be any write that propagates to X's CPU before X executes.
Then W propagates to Y's CPU before Y
executes. In addition, if Y is a write then W propagates
to any other CPU before Y propagates to that CPU, and likewise for the
coherence point.
Detailed justification of this property is left to the reader;
it depends on the fact that if a propagation-order barrier's pre-set
can contain a read then the barrier is A-cumulative.
(Load-acquire barriers don't count, because they are only execution-order
barriers, not propagation-order.)
Details aside, a notable aspect of the Propbase property is that it is
transitive.
In other words, if X and Y obey the Propbase property
and Y and Z do so too, then X and Z
also obey the property.
Verification of this fact is also left as an exercise for the reader.
Now suppose we have the following situation: A is a write,
X is a write on another CPU to the same address, coming after
A in the coherence order, Y is also a write connected
to X by a series of propbase links, and B
is a read that reads from Y.
We can write this symbolically as:
A ⟶coe X ⟶propbase+ Y ⟶rfe B

where the subscripts specify the relations indicated by the arrows.
(propbase+ means a sequence of one or more propbase links.)
Finally, suppose that A and B are on the same CPU.
Because the Propbase property is transitive, X and Y
must have the Propbase property.
Therefore X must propagate to B's (and A's) CPU
before Y does, and hence before B executes.
Is it possible for A to execute after B?
No, because if it did, the memory subsystem would be forced to assign
A a position after X in the coherence order,
since X had already propagated to A's CPU when
A executed.
But we are assuming that A comes before X in the
coherence order (the coe link between them).
We conclude that in this situation, A must execute before B.
The same conclusion would hold if A was a read and the link from
A to X was fre rather than coe.
In that case, the memory subsystem would satisfy the read A
from the coherence-latest write that had already propagated to A's
CPU, and if A executed after B then that write would be
X or something even later.
But the fre link from A to X says that
the write used to satisfy A was coherence-earlier than X,
and so A cannot execute after B.
This is the reasoning behind the short-obs relation,
which is defined as
let short-obs = ((coe|fre) ; propbase+ ; rfe) & int
where int (short for “internal”) links any two events
in the same thread.
The model also uses a more general relation, obs, defined as:
let obs = short-obs |
    ((hb* ; (coe|fre) ; propbase* ; B-cum-propbase ; rfe) & int)
The second part of the obs relation applies when the last
sequential propbase link involves a B-cumulative barrier.
To understand the significance of this, suppose we have:
A ⟶hb* U ⟶coe X ⟶propbase* Z ⟶B-cum-propbase Y ⟶rfe B

with A and B on the same CPU.
If A executed after B then by B-cumulativity, A would also
be in the final barrier's post-set.
Consequently so would U, and thus X and U would have the
Propbase property.
Just as before, this would contradict the fact that there is a
coe link from U to X.
Thus we can conclude that whenever there is an obs link from
A to B (which includes the case of short-obs),
A must execute before B.
And since the two events are on the same CPU, this means they
should be ordered by hb.
This is why the definition of hb incorporates obs,
as mentioned earlier.
(You may have noticed that the definitions of hb and obs
are mutually recursive; obs is a part of hb,
and the definition of obs involves B-cum-propbase,
which uses hb.
Fortunately herd has no trouble working with recursive definitions!)
Here is a simple example of short-obs, to help cement these ideas.
This is a litmus test we have seen before, the “message-passing”
pattern:
C-MP+o-mb-o+o-mb-o.litmus
1 C C-MP+o-mb-o+o-mb-o
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*y, 1);
9 smp_mb();
10 WRITE_ONCE(*x, 1);
11 }
12
13 P1(int *x, int *y)
14 {
15 int r1;
16 int r2;
17
18 r1 = READ_ONCE(*x);
19 smp_mb();
20 r2 = READ_ONCE(*y);
21 }
22
23 exists
24 (1:r1=1 /\ 1:r2=0)
If r1=1 and r2=0 at the end then there is a short-obs link
from the read of *y (line 20) to the read of *x (line 18):
Because r2 is 0, the read of *y does not see the write in
line 8, so they are linked by fre.
The smp_mb() in line 9 then provides a propbase link to the
write in line 10.
And that write is linked by rfe to the read in line 18,
because r1 is 1.
Together these constitute an instance of short-obs.
This implies that the condition in the “exists”
clause requires the read in line 20 to execute before the read in
line 18.
But the smp_mb() barrier between them forces the reads to execute
in program order.
We thus have a cycle in hb:
Line 18 ⟶ line 20 because of the memory barrier, and
line 20 ⟶ line 18 because of the short-obs link.
This cycle violates the “happens-before” check,
and so the memory model says the litmus test cannot succeed.
Quick Quiz 11:
But this short-obs link goes backward from line 20 to
line 18!
How can a backward link on a single CPU represent a
“happens-before” ordering relation???
Answer
The fact that obs plays a role in such a simple litmus test
underlines its importance.
It is the single relation in the memory model that accounts for the
effects of barriers on propagation of writes between CPUs.
Writes get sent to the coherence point as well as propagating among CPUs,
and if we create a relation that links writes by their order of arrival
at the coherence point, this relation must not contain a cycle.
In the Linux-kernel memory model, this relation is cpord
(for “coherence-point ordering”), and the
“propagation” check requires that cpord be acyclic.
Some parts of the definition of cpord are quite straightforward.
A write in a propagation-order barrier's pre-set must arrive at the
coherence point before a write in the barrier's post-set.
In addition, the fundamental property of the coherence point
says that for any two writes to the same address,
the write that is earlier in the coherence order for that address
must arrive at the coherence point first.
For these reasons, cpord contains the relations
propbase & (W*W) (that is, the propbase relation
restricted to links between two writes) and co.
Another fairly straightforward aspect concerns the write events in
atomic RMW instructions.
According to the operational memory model, such writes
arrive at the coherence point at the same time as they commit;
this is how the model provides atomicity.
As a result, any write which executes after an RMW write must also arrive
at the coherence point after it (since writes can't arrive at the
coherence point before they execute).
The atomic-hb relation expresses this idea.
It is defined as:
let atomic-hb = hb+ & ((RMW & W)*W)
and was included in the original definition of
cpord.
It had to be removed later on, however, when we found out that the
ARM architecture does
not require atomic RMW writes to reach
the coherence point when they commit.
The final part of cpord is not at all obvious.
At first glance, one wouldn't think that arrival times at the
coherence point would bear any connection to reads, since reads don't
get sent to the coherence point in the way that writes do.
Nevertheless, reads do play an important role in the definition of
cpord.
The connection has to do with the unique properties of strong barriers.
To explore these ideas, let's extend the meaning of cpord.
For writes, it will continue to refer to the time when the write arrives
at the coherence point.
For reads, we will let cpord simply refer to the time when the
read executes.
So now suppose that there is a propbase link from A
to B, where A and B can be either reads
or writes.
What can we say about the existence of a cpord link from A
to B?
The case where A is a read is trivial.
The propbase link says that A executes before B
(indeed, propbase is a sub-relation of hb),
and therefore there must be a cpord link from A to
B regardless of whether B is a read or a write.
If A and B are both writes then we know that A
is in the pre-set of the propbase's memory barrier and B
is in the post-set; hence A must arrive at the coherence point
before B and once again we have a cpord link.
The most interesting case is the remaining one, where A is a write
and B is a read.
In this case, although A is guaranteed to execute before B,
there is no guarantee that A will arrive at the coherence point
before B executes.
Thus, there need not be a cpord link from A to B.
But what if the propbase's memory barrier is a strong barrier
(such as smp_mb() or synchronize_rcu())?
Instructions in a strong barrier's post-set do not execute until after
the barrier has been acknowledged, and the barrier does not get
acknowledged until every write in its pre-set has propagated to every CPU
and to the coherence point—this is what makes the barrier
“strong”.
This means that B cannot execute until after A has
reached the coherence point, and therefore we do have a cpord
link from A to B.
A little thought shows that the same is true if A and B
are connected by a sequence of propbase links in which the last
propbase involves a strong memory barrier.
And we can even go one step further, by applying an argument
similar to the one underlying short-obs:
Suppose R is a read, A is a write on another CPU
to the same address, related to R by an fre link,
and B is a read or write connected to A by a sequence of
propbase links where the barrier in the last propbase
is a strong barrier.
In symbols:
R ⟶fre A ⟶propbase* ; strong-propbase B
where strong-propbase stands for a propbase relation containing a strong barrier.
Strong barriers are both A- and B-cumulative, so expanded out in full,
strong-propbase becomes
(rfe? ; strong-fence ; hb* ).
Unlike the situation with
short-obs, we do not need to assume that
R and
B are on the same CPU.
Nevertheless, we can still conclude that
R must execute before
B can execute or reach the coherence point, implying that there
is a
cpord link from
R to
B.
The reasoning is much like before.
By the Propbase property, A is in the final strong barrier's
pre-set, so it must propagate to every CPU before the barrier is acknowledged.
And since B is in the strong barrier's post-set, it cannot
execute or reach the coherence point until after the barrier is acknowledged.
If R executed later than B then A would already
have propagated to R's CPU, and so R would read from
A (or another write even later in the coherence order).
But this is not possible, because of the fre link from
R to A.
These ideas are expressed by the strong-prop relation, defined as:
let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb*
and included in the definition of
cpord.
The initial
fre link is optional to allow for the case mentioned
earlier, where a sequence of
propbase links ending in one containing
a strong memory barrier connects a write to a read.
The strong barrier at the end is crucial; without it we could not guarantee
the existence of a
cpord link.
Quick Quiz 12:
Readers who go to the trouble of reading the actual definition of
cpord in the Linux-kernel strong memory model will see that
it includes the co, propbase & (W*W), and strong-prop
terms mentioned earlier, but it does not include any terms corresponding to
the “trivial” case of a propbase or hb+ link
starting from a read.
Why not?
Answer
No other checks?
The checks we have introduced into the memory model above
(“coherence”,
“atomic”,
“happens-before”, and
“propagation”)
may seem like an eclectic collection.
Nevertheless, it can be shown that they suffice to duplicate nearly
all the results prescribed by the operational model.
The only shortcoming has to do with some subtle interactions between
B-cumulative memory barriers and the “propagation” check,
and demonstrating these interactions requires very unrealistic litmus tests
containing cycles that span at least four threads.
We will not examine these issues here.
In any case, it turns out that this small weakness is dwarfed by the compromises
the memory model is forced to accept in order to accommodate the
idiosyncrasies of the Alpha and ARM architectures (in addition to
those of PowerPC that are accommodated by our initial memory model),
as we shall see.
Alpha deviates from the operational model described above in one
very significant way: It uses a software-visible split data cache.
In terms of the memory model, this means that writes which propagate
to a CPU in one order might be perceived by that CPU in the opposite
order.
For example, suppose that P0 writes to both x and y,
and the writes propagate in that order to P1.
If x and y happen to be located in different cache lines
and the two cache lines are handled by different parts of P1's data cache,
it may happen that the part of the cache responsible for handling the
write to x is busy while another part of the cache is able to
handle the write to y right away.
P1's CPU would then see y's new value before seeing x's,
a result that would not be allowed by the memory model as described above.
Here's an example litmus test to illustrate the point.
Strong Model Litmus Test #10
1 C alpha-split-cache-example1
2 {
3 int u = 0;
4 int v = 0;
5 int *p = &u;
6 }
7
8 P0(int **p, int *v)
9 {
10 WRITE_ONCE(*v, 1);
11 smp_mb();
12 WRITE_ONCE(*p, v);
13 }
14
15 P1(int **p)
16 {
17 int *r1;
18 int r2;
19
20 r1 = READ_ONCE(*p);
21 r2 = READ_ONCE(*r1);
22 }
23
24 exists (1:r1=v /\ 1:r2=0);
The smp_mb() in P0 forces the write to v to propagate
to P1 before the write to p, and
the address dependency from P1's read of p to its read of
*r1 forces these reads to be executed in program order.
Nevertheless, the split-cache arrangement may cause P1 to see the
new value of p (i.e., &v) and then the
old value of v (i.e., 0).
This odd behavior can be observed on real Alpha hardware,
and it shows up when we run the litmus test through the model:
Outcome for Strong Model Litmus Test #10
1 Test alpha-split-cache-example1 Allowed
2 States 3
3 1:r1=u; 1:r2=0;
4 1:r1=v; 1:r2=0;
5 1:r1=v; 1:r2=1;
6 Ok
7 Witnesses
8 Positive: 1 Negative: 2
9 Condition exists (1:r1=v /\ 1:r2=0)
10 Observation alpha-split-cache-example1 Sometimes 1 2
11 Hash=b73c984509551a6a5ffe49d86c9a2d04
Inserting a call to smp_read_barrier_depends() (see line 21):
Strong Model Litmus Test #11
1 C alpha-split-cache-example2
2 {
3 int u = 0;
4 int v = 0;
5 int *p = &u;
6 }
7
8 P0(int **p, int *v)
9 {
10 WRITE_ONCE(*v, 1);
11 smp_mb();
12 WRITE_ONCE(*p, v);
13 }
14
15 P1(int **p)
16 {
17 int *r1;
18 int r2;
19
20 r1 = READ_ONCE(*p);
21 smp_read_barrier_depends();
22 r2 = READ_ONCE(*r1);
23 }
24
25 exists (1:r1=v /\ 1:r2=0);
prevents the unwanted result:
Outcome for Strong Model Litmus Test #11
1 Test alpha-split-cache-example2 Allowed
2 States 2
3 1:r1=u; 1:r2=0;
4 1:r1=v; 1:r2=1;
5 No
6 Witnesses
7 Positive: 0 Negative: 2
8 Condition exists (1:r1=v /\ 1:r2=0)
9 Observation alpha-split-cache-example2 Never 0 2
10 Hash=9dbdccdc417b3ece717775094d822a49
The
smp_read_barrier_depends() call forces P1 to wait until
all writes that have already propagated to its cache have been fully handled
and are available for reading.
Thus, the new value of
v is always visible to P1 whenever it
sees
&v in
p.
Quick Quiz 13:
Given how important split caches are for attaining full performance on
superscalar CPUs, why don't any non-Alpha architectures have split caches?
Answer
In order to accommodate Alpha's unique behavior, we modified the memory model
to include a delay between the time when a write propagates to a CPU
and the time when the memory subsystem can use that write to satisfy
a read request.
Furthermore, instead of assuming a simple split, the model allows
the cache to be completely fragmented, with an independent segment for
each memory address.
The model introduces the notion of a horizon time.
Given a read instruction R on processor Pn, horiz(R)
(the horizon time for R) is the most recent time h,
before or equal to the time when R executes,
such that a write to R's target address propagating to Pn
from another processor by time h would be fully handled by the cache
and available for satisfying R.
Writes that propagate to Pn after h are considered still to be
“below the horizon” when R executes,
and so are not visible and cannot be used for satisfying R.
(This restriction does not apply to writes executed by Pn itself;
each processor can see its own writes at any time.)
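As a concrete (and entirely invented) illustration, suppose a write to
x propagates to Pn at time 5, and Pn executes a read R of
x at time 8 with horiz(R) = 4.
The write is then still below R's horizon: it has propagated to Pn
but has not yet been fully handled by the cache, and so it cannot be
used to satisfy R.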
The memory model requires that once a write has propagated to Pn and been
fully handled by the cache, it is visible to all future reads.
It follows that for each Pn and each address A,
the horizon times of reads on Pn that target A must not decrease
over time (figuratively, a write cannot fall back below the horizon after
it has become visible).
If such things were allowed to happen, we could have a violation of the
read-read coherence rule.
The model also requires that every memory barrier other than
smp_wmb() behave like
smp_read_barrier_depends(), in that it forces the CPU to wait
for all writes that propagated to the CPU before the barrier committed
to become visible.
In other words, if Pn executes a memory barrier at time t then
any read instruction R in the barrier's post-set will have
horiz(R) ≥ t.
If the barrier is a strong one, the CPU is required to wait for all
writes that propagated to the CPU before the barrier was acknowledged
to become visible.
While the actual values of the horizon times during an execution are
unknowable, we can nevertheless reason about their values.
As mentioned above, we know that horiz(R) ≤ exec(R),
where exec(R) is the time when R executes.
And we know that if R and R' target the same address
and execute on the same CPU then exec(R) ≤ exec(R') implies
horiz(R) ≤ horiz(R').
More important is this key fact about horizon times:
If R reads from a write W
and W' is another write to the same address
that comes after W in the coherence order,
then W' must propagate to R's CPU after horiz(R).
Otherwise W' would have been available to satisfy R,
so R would have read from it instead of from W.
This conclusion remains true even if W is forwarded to R.
In that case, R executes before W does;
then the fact that W' comes after W in the coherence order
means that W' must propagate to W's (and R's) CPU
after W executes, hence after R executes
and thus after horiz(R).
Adding the concept of horizon times complicates the ordering of reads.
When we say that read A is ordered before read B,
we could now mean any of four things:
- exec(A) ≤ horiz(B)
(we call this the Alpha-strong-ppo relation);
- horiz(A) ≤ horiz(B)
(the Alpha-normal-ppo relation);
- exec(A) ≤ exec(B)
(the Alpha-weak-ppo relation);
or
- horiz(A) ≤ exec(B)
(the memory model doesn't use this).
These conditions are not mutually exclusive.
Alpha-normal-ppo and
Alpha-weak-ppo both contain
Alpha-strong-ppo, because
horiz(A) ≤ exec(A)
and
horiz(B) ≤ exec(B) respectively.
This classification can be extended to cover write instructions
by declaring a write's horizon time simply to be the same as its
execution time.
We now face a subtle question:
Which of these various PPO relations should be included in hb?
To answer this, think back to the Propbase property defined
earlier.
If W and R possess the Propbase property then we know
that W propagates to R's CPU before R executes.
But on Alpha, that doesn't mean very much.
We would really like to know that W propagates to R's CPU
before horiz(R), so that W is available to satisfy R.
Furthermore, we want to retain the principle that a B-cumulative barrier's
post-set is closed under hb.
In other words, if W and R have the Propbase property
(implying that W propagates to R's CPU before
horiz(R))
where the relevant memory barrier is B-cumulative,
and if R ⟶ R' is a link in hb
where R and R' are on the same CPU,
then W and R' should also have the Propbase property,
implying that W propagates to that CPU before horiz(R').
The only way to guarantee this is to require that
horiz(R) ≤ horiz(R') whenever there is an hb link
from R to R' on the same CPU.
Therefore, in order to make things work out the way we want,
we will redefine hb to include Alpha-normal-ppo,
not Alpha-weak-ppo.
However, this doesn't mean we can ignore Alpha-weak-ppo entirely.
Suppose there is an Alpha-weak-ppo link from A to B
followed by an Alpha-strong-ppo link from B to C.
Then we have exec(A) ≤ exec(B) ≤ horiz(C),
so effectively there is an Alpha-strong-ppo link from A
to C and thus we should have A ⟶ C in hb.
To see how this works out in practice,
let's consider each of the
program order requirements
mentioned earlier and determine how each one falls under this classification.
- A is a read, B is a write, and there is a
control dependency from A to B
(i.e., whether B gets executed depends on which leg
of some conditional branch is taken, and the branch's condition
depends on the value read by A).
Since A's value isn't known until A executes,
we have exec(A) ≤ exec(B) = horiz(B).
Thus there is an Alpha-strong-ppo link from A
to B.
- There is an address or data dependency from A to B.
Then B can't execute until A's value is known,
so exec(A) ≤ exec(B).
If B is a read then this yields only an
Alpha-weak-ppo link, whereas if B is a write
then as above it yields an Alpha-strong-ppo link.
- There is an address or data dependency from A to a write
that is forwarded to B.
The write can't be forwarded to B before A's
value is known, so again we have exec(A) ≤ exec(B)
and this yields an Alpha-weak-ppo link.
- B is a write and A is the source of an
address dependency to a memory access instruction between them.
Then B can't execute until it is known whether the
intervening memory access will generate an addressing exception,
which requires A's value to be known.
Thus exec(A) ≤ exec(B) = horiz(B), so this yields
an Alpha-strong-ppo link.
- B is a write and A accesses the same address
as B.
Then B can't execute until A does.
As before, since B is a write this yields an
Alpha-strong-ppo link.
- A and B read from different writes to the same
address, and B is not forwarded from a write that is
between them.
In this situation, the ordering requirement states that the CPU must
execute B after it executes A, so
exec(A) ≤ exec(B) and we have an Alpha-weak-ppo
link.
But the read-read coherence rule says that the write which B
reads from must come later in the coherence order
than the write which A reads from.
Then by the key fact about horizon times,
the write that B reads from must have propagated to the CPU
after horiz(A) and no later than horiz(B)
(otherwise it would not have been visible to B).
This means it must also be true that horiz(A) ≤ horiz(B),
and so in this case we also have an Alpha-normal-ppo link
from A to B.
- A is a write and B is a read of the same address,
and B is not forwarded from A or from a write
that is between them.
Since the write that B reads from must come later in the
coherence order than A, it must have propagated to the CPU
after A executed.
But since it propagated to the CPU at or before horiz(B),
we see that exec(A) ≤ horiz(B).
Thus this case yields an Alpha-strong-ppo link.
- There is a memory barrier with A in the pre-set and
B in the post-set.
Then A executes before the barrier does and B's
horizon time is later than the barrier execution time,
so exec(A) ≤ horiz(B) and again we have an
Alpha-strong-ppo link.
The same type of analysis can be applied to the other two relations
that make up hb, namely, rfe and obs.
If A is a write and B reads from A on another CPU
then A must propagate to B's CPU before horiz(B).
Thus exec(A) ≤ horiz(B), so rfe is analogous to
Alpha-strong-ppo (if we overlook the fact that PPO relations
only apply to instructions on the same CPU).
As for obs, consider first a short-obs scenario:
A ⟶fre X ⟶propbase Y
⟶rfe B
where
A and
B are on the same CPU.
We know that
Y propagates to this CPU before
horiz(B)
because of the
rfe link between them.
The
propbase link from
X to
Y implies that
X propagates to this CPU before
Y does.
But the
fre link from
A to
X says that
X comes later in the coherence order than the write which
A reads from.
So by the key fact about horizon times,
X propagates to
A's
CPU after
horiz(A), which means that
horiz(A) < horiz(B).
Thus
short-obs yields an
Alpha-normal-ppo link.
In a more general obs scenario we might have:
A ⟶hb* U ⟶fre X
⟶B-cum-propbase Y ⟶rfe B
with
A and
B on the same CPU.
If
horiz(B) ≤ horiz(A) then there would be an
Alpha-normal-ppo link from
B to
A,
and hence an
hb link.
Together with the
rfe from
Y to
B,
this would provide a sequence of
hb links connecting
Y all the way to
U.
Then by B-cumulativity,
U would be in the post-set of the
B-cum-propbase's memory barrier, implying that
X
would propagate to
U's CPU before
horiz(U).
But this would contradict the fact that
U reads from a write
that is earlier than
X in the coherence order, as expressed
by the
fre link from
U to
X.
Therefore we must have
horiz(A) ≤ horiz(B), meaning that
obs in general yields an
Alpha-normal-ppo link.
A similar analysis of
short-obs and
obs applies when
the link to
X is
coe rather than
fre.
To sum up, most PPO relations give rise to Alpha-strong-ppo links.
The exceptions are:
- obs yields an Alpha-normal-ppo link.
- addr between two reads and dep-rfi
yield Alpha-weak-ppo links.
- rdw yields a link that is in both Alpha-weak-ppo
and Alpha-normal-ppo, but not in Alpha-strong-ppo.
The
hb relation can contain arbitrary sequences of PPO links,
with one exception:
There must not be an
Alpha-weak-ppo link followed by an
Alpha-normal-ppo link.
For example, if we had
A ⟶ B in
Alpha-weak-ppo
and
B ⟶ C in
Alpha-normal-ppo,
the meaning would be that
exec(A) ≤ exec(B) and
horiz(B) ≤ horiz(C).
But since
exec(B) ≥ horiz(B),
we would not be able to deduce any ordering between
A and
C.
In fact, this is exactly what happens in
Strong Model Litmus Test #10
above.
In that litmus test there is a cycle consisting of a
short-obs
link followed by an
addr link between two reads.
As we saw, such cycles are allowed on Alpha.
This suggests redefining
hb somewhat along these lines:
let hb = Alpha-normal-ppo | (ppo* ; Alpha-strong-ppo) | rfe
where
ppo is the union of all the various PPO relations.
There is one additional wrinkle.
Suppose we have a sequence of Alpha-weak-ppo links with
a read-dependency barrier somewhere in the middle, such as this:
int **r1, **g;
int *r2;
int r3;
r1 = READ_ONCE(g); /* A */
smp_read_barrier_depends(); /* B */
r2 = READ_ONCE(*r1); /* C */
r3 = READ_ONCE(*r2); /* D */
Here there are address dependencies between reads
A and
C
and between reads
C and
D.
On a PowerPC system, this would already give an
hb* link from
A to
D.
On Alpha, all we can conclude is that
exec(A) ≤ exec(C) ≤ exec(D).
However, on Alpha the read-dependency memory barrier at
B
forces the horizon times for all po-later reads to be after the execution time
of the barrier, which in turn is after the execution time of
A.
This means
exec(A) ≤ exec(B) ≤ horiz(D),
which gives us an
Alpha-strong-ppo link from
A to
D
and therefore an
hb link.
Thus, in this peculiar situation we always have an
hb* link from
A to
D, even though the reasons for the link are different
on different types of system.
The memory model takes this into account by defining a new relation
(how else?) with the unwieldy name “rd-addr-dep-rfi”:
let rd-addr-dep-rfi = (addr | dep-rfi)+ & rd-dep-fence
This expression omits
rdw because of issues related to the
ARM memory model,
but otherwise it captures the idea of a sequence of
Alpha-weak-ppo
links with an intervening read-dependency barrier.
The definition of
Alpha-strong-ppo can then include
rd-addr-dep-rfi, giving the desired results.
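In the Cat file presented later, this shows up directly; quoting the relevant definitions:
let strong-ppo = rd-addr-dep-rfi | ordering-fence |
                 ((dep | ctrl | addrpo) & (R*W))
let Alpha-strong-ppo = strong-ppo | rd-rdw | detour |
                       (po-loc & ((M\OW\OR)*W))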
This extended memory model for Alpha reduces to the original
if we assume the horizon times for different memory addresses are always equal
(i.e., the cache is not segmented).
Under this assumption the extended model will allow the same set of behaviors
as the original,
which is reassuring; it means the model still applies as before
to architectures other than Alpha.
Unfortunately, the memory model as developed above is
not a very good fit for the ARM architecture.
The published memory models for ARMv8 differ in a number of important respects
from the model we have described so far.
Most of the differences involve the memory subsystem,
and most of the differences that affect the processor subsystem
are concerned with how it interacts with the memory subsystem.
The difference with perhaps the most widespread ramifications involves
how the memory subsystem responds to read requests.
Earlier we said that the response would be the value stored by the
coherence-latest write that has propagated to the CPU making the request,
because otherwise the read might observe a coherence-earlier value than
a po-earlier read of the same address did, violating the read-read
coherence rule.
But this requirement is stronger than necessary; all we really need
is that the response to a read request should be coherence-later than
(or the same as) any po-earlier read responses or committed writes
for the same address.
It doesn't have to be the very latest write available,
and on ARM it often isn't.
Furthermore, the ARM memory model does not include any feature
analogous to acknowledging a strong memory barrier.
Instructions following such a barrier can be executed as soon as
the barrier has been committed.
These two facts have some rather subtle effects on the ordering properties
of memory accesses.
For example, either one of them invalidates the reasoning we used
when analyzing
Strong Model Litmus Test #1
above.
The test remains forbidden even on ARM, but not for the reasons we gave.
Instead, the ARM memory model guarantees that
if a write W reaches the coherence point before a strong
(smp_mb()) barrier, then the response to any read that is
po-after the barrier and accesses the same address as W will be
the value stored by W or a coherence-later write.
(No similar guarantee is made by the PowerPC-based memory model
presented above,
which is an indication of how much ARM differs from PowerPC.)
This is enough to show that the “exists” clause in
Strong Model Litmus Test #1
will never be satisfied.
When the test is executed, one of the two memory barriers must reach
the coherence point before the other.
Suppose the barrier in P0 gets there first.
Since P0's write to x is in the barrier's pre-set,
it will reach the coherence point before the barrier does
and hence before P1's barrier does.
Thus P1's read of x, which is po-after the barrier,
is guaranteed to see P0's write (there aren't any coherence-later
writes to x in the test program),
and so r2 will end up equal to 1, not 0.
As before, the opposite case is symmetrical.
Another difference concerns the way CPUs execute writes.
On ARM, two writes to the same address are permitted to commit
out of program order.
Earlier we said that if this happened, it would cause the memory subsystem
to put the po-later write earlier in the coherence order,
thereby violating the write-write coherence rule.
ARM gets around this problem in a very straightforward way:
When a write W commits after a po-later write W'
to the same address, the CPU simply skips sending W
to the memory subsystem!
As a result, W never gets assigned an explicit location
in the coherence order (effectively, it ends up ordered immediately before
the next write, in program order, to the same address),
it never reaches the coherence point,
and it never gets propagated to any other CPUs.
We say that W has been obscured by W',
or more colloquially, erased.
Its effects don't disappear entirely, because the value stored by
W can still be forwarded to reads that lie between W
and W'.
But this is the next best thing;
from a system-wide standpoint, the end result is practically the same
as if W' had followed so closely on the heels of W that
W was overwritten before any other CPU had a chance to
read from it.
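Here is a minimal sketch of the situation (hypothetical code, not one of
this article's litmus tests), assuming x is initially zero:
WRITE_ONCE(*x, 1);  /* W: may commit after W' and so be obscured     */
r1 = READ_ONCE(*x); /* may be satisfied by forwarding from W: r1 == 1 */
WRITE_ONCE(*x, 2);  /* W': if it commits first, W is never sent to    */
                    /* the memory subsystem                           */
If W is obscured, other CPUs never observe the value 1; from their
standpoint, x changes directly from 0 to 2.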
There are certain circumstances in which a write W
cannot be obscured.
For example, if the CPU encounters a memory barrier that orders W
before some other write W' to the same address,
then W' cannot commit until after the barrier does,
and the barrier cannot commit until after W does,
so the writes cannot commit out of program order
and W' will not obscure W.
Also, the writes associated with store-release instructions
(smp_store_release() or rcu_assign_pointer())
or atomic RMW instructions are not allowed to be obscured.
More trivially, W will not be obscured if there are no
po-later writes to the same address in its process to obscure it.
Taking all of this into account will complicate the final memory model,
as you might imagine.
There are some other, less important, differences in the operation
of the CPU subsystem in the ARM model:
- When a store-release instruction is committed,
the CPU does not issue a barrier request followed by a write request;
instead it issues a write request that is specially marked
as being a store-release.
- Similarly, a load-acquire instruction gives rise to
a single read request that is specially marked as being a load-acquire.
- An smp_rmb() instruction does not act entirely within the CPU;
it causes the CPU to issue a barrier request to the memory subsystem.
- The rules for restarting a read instruction after a po-earlier
instruction has accessed the same address are slightly looser;
the read does not need to be restarted if it was issued after the
po-earlier instruction was.
- A write instruction that accesses the same address as a po-earlier read
may be committed before the read is committed, provided the read has
already been issued and the CPU knows that it will not be restarted.
To better understand the ARM model,
we must examine how the memory subsystem works in some detail.
It is more highly structured than the memory subsystem in the PowerPC model,
consisting of a hierarchical arrangement of buffers lying between
the CPUs at the top and the memory at the bottom.
There is a buffer immediately below each CPU; these feed down into
some buffers below them, and then buffers below those, and so on,
down to the lowermost buffer, which feeds into memory.
The coherence point is the place at the bottom of the lowest buffer.
For example, a four-processor system would have four buffers at
the topmost layer, then at the next layer there might be
a buffer below CPUs 0 and 1 and another below CPUs 2 and 3,
and a single buffer below all the CPUs
in the final layer, as shown in this figure (panel A):
Other arrangements of buffers (such as that in panel B) are possible,
provided they follow the general hierarchical arrangement:
buffers always feed down, never up;
a buffer can receive input from multiple buffers above it but can
provide output only to a single buffer below; and there is a single
lowermost buffer which all the others eventually lead to.
The memory models do not specify the buffer sizes or topology,
and in practice you cannot even rely on the topology remaining unchanged
over time, because the scheduler can migrate a process from one CPU to another,
effectively altering the arrangement of the buffers below that process.
(The essential difference between the Flowing and POP models is that
the Flowing model assumes a fixed buffer topology,
whereas the POP model does not keep explicit track of the buffers
and thus is compatible with any arrangement.
The POP model is more general, but the Flowing model is easier to
reason about.)
A key point of this design is that the memory subsystem does not provide
the response to a read request immediately.
Issuing a read request and receiving the response (which is then used
to satisfy the read) are two separate events, and the CPU is free
to work on other instructions in between.
When a CPU issues a write, barrier, or read request, the request
enters the CPU's buffer at the top, flows through the buffer and then
down into the top of the buffer below, and so on, eventually passing
out the bottom of the lowermost buffer, to memory.
Thus, the coherence order is simply the order in which write requests
flow down to memory.
When a write request reaches memory, the value in the request gets
stored at the write's target address.
When a barrier request reaches memory, its job is finished and it disappears.
And when a read request reaches memory, a response is generated using
the value held in memory at the read's target address.
However, the response to a read request may be generated before
the request reaches memory, while it is still flowing through a buffer.
If the request immediately below the read in the buffer
is a write to the same address,
the memory subsystem can respond to the read using the value stored
by that write.
(Exception: a load-acquire request is not allowed to be satisfied by
a store-release request while still in a buffer.
The only way for a load-acquire instruction to read the value written
by a store-release instruction is for the load-acquire request to flow
all the way down to memory.)
When this happens, the memory subsystem deletes the read request;
the memory subsystem never generates more than one response to a request.
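For example (an invented buffer snapshot), suppose a CPU's buffer contains,
from top to bottom, a read request for x and, immediately below it,
a write request storing 1 to x.
The memory subsystem may respond to the read with the value 1 and delete
the read request, which then never flows any further down.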
The flow of requests down through a buffer is not always
First-In-First-Out.
Subject to certain restrictions, a request is allowed to exchange
places with the request immediately below it (it passes
the lower request).
The complete list of reorder restrictions is rather elaborate;
among the most important ones are:
- A read or write request may not pass another read or write request
with the same target address.
- No barrier request may pass another barrier request.
- No request may pass an smp_mb() barrier or vice versa.
- An smp_wmb() barrier request may not pass a write request
from its own CPU, and it may not be passed by any write request.
- An smp_rmb() barrier request may not pass a read request
from its own CPU, and it may not be passed by a write or read request
from its own CPU.
- A store-release write request may not pass any write or read request.
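Note what these rules do not forbid: for instance, a read request may pass
a write request with a different target address (and vice versa) when no
barrier request intervenes.
This is one way the non-SC outcome of the store-buffering pattern can
arise on ARM when no barriers are used.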
A write is said to propagate to a CPU when its request flows into
a buffer below that CPU, because before that time there is no way for the CPU
to read from the write,
whereas afterward it is possible for a read
request issued by the CPU to be satisfied by the write request,
whether in a buffer or in memory.
This picture explains why a read might not be satisfied by
the coherence-latest write to have propagated to the read's CPU.
A read request R for the value of x, for example,
might not be satisfied until it reaches memory and obtains an old value,
even though a write request W containing a new value for x
may already have flowed down to a buffer below R's CPU.
Provided that W lies higher than R in the
chain of buffers leading from the CPU to memory,
R will be unable to read the value stored by W:
The write request can't pass the read request
because they have the same target address,
so it can never reach a buffer position immediately below R
and it cannot reach memory before R does.
Thus R will end up being satisfied by an earlier value of x
even though the coherence-later value in W had already
propagated to R's CPU by that time.
Now we can understand how the ARM memory model enforces
the strong-barrier guarantee mentioned above.
Suppose F is a strong fence and R is a read that is
po-after F.
Suppose also that W is a write to the same address as R,
and W reaches the coherence point before F does.
Then W must flow down to memory before F, and since
R cannot pass F, it cannot reach memory before W.
Thus, if R is satisfied from memory then it must read the value
stored by W or a coherence-later write.
But what if R is satisfied while it is still in a buffer?
Let W' be the write request that satisfies R.
Since it is immediately below R in the buffer at the time
that R is satisfied, it must also be above F.
And since W' cannot pass F, it must reach memory
after W, which means it must be coherence-later than W.
The case where R is not issued at all, but is forwarded from a
po-earlier write, is left to the reader.
(Hint: The write which is forwarded to R must also be po-after
F, as F cannot commit before a po-earlier memory access
and R cannot be satisfied until F has committed.)
Regardless, no matter how things work out, in the end R will
read from W or a coherence-later write, as guaranteed.
One other aspect of the storage subsystem needs mentioning.
In the PowerPC-based operational model, the write part of an atomic RMW
instruction reaches the coherence point at the time it commits.
In the ARM memory model this is not true; RMW writes flow down through
the buffers just like any other.
Instead, the architecture relies on an intricate system of interlocks
to prevent other write requests to the same address from flowing down
into the wrong buffer at the wrong time and thereby “sneaking”
in ahead of the RMW write, violating the atomic property.
As we saw with Alpha, the fact that reads are issued and satisfied
at different times leads to an ambiguity when we want to order read
instructions.
If we say that instruction A is ordered before B,
where one or both is a read, we could mean any of several combinations
of the instructions' issue and execution times, much as in the Alpha case,
where
issue(A) is the time when
A's read request is
issued to the memory subsystem, and
exec(A) is the time when
A is executed (which is the time when the read response is
received, if
A is a non-forwarded read).
TO BE CONTINUED...
Currently, no adjustments are needed for any other architectures.
This may change in the future as we become aware of the individual
requirements of other CPU families.
Quick Quiz 14:
Why weren't adjustments needed for PowerPC, given that it has a weak
memory model?
Answer
Quick Quiz 15:
Why weren't adjustments needed for Itanium, given that it allows
reads to the same variable to be reordered?
Answer
Quick Quiz 16:
But what if some new CPU had an even weaker memory model than
Alpha, ARM, and PowerPC?
Mightn't that invalidate a lot of Linux-kernel code?
Answer
The full
Bell
file for Alan Stern's strong model
is as follows:
strong-kernel.bell
1 "Linux kernel strong memory model"
2
3 (*
4 * Copyright (C) 2016 Alan Stern <stern@rowland.harvard.edu>,
5 * Andrea Parri <parri.andrea@gmail.com>
6 *
7 * This program is free software; you can redistribute it and/or modify
8 * it under the terms of the GNU General Public License as published by
9 * the Free Software Foundation; either version 2 of the License, or
10 * (at your option) any later version.
11 *
12 * This program is distributed in the hope that it will be useful,
13 * but WITHOUT ANY WARRANTY; without even the implied warranty of
14 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 * GNU General Public License for more details.
16 *
17 * You should have received a copy of the GNU General Public License
18 * along with this program; if not, you can access it online at
19 * http://www.gnu.org/licenses/gpl-2.0.html.
20 *)
21
22 enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) ||
23 'release (*smp_store_release*) ||
24 'acquire (*smp_load_acquire*) ||
25 'assign (*rcu_assign_pointer*) ||
26 'deref (*rcu_dereference*) ||
27 'lderef (*lockless_dereference*)
28 instructions R[{'once,'acquire,'deref,'lderef}]
29 instructions W[{'once,'release,'assign}]
30 instructions RMW[{'once,'acquire,'release}]
31
32 enum Barriers = 'wmb (*smp_wmb*) ||
33 'rmb (*smp_rmb*) ||
34 'mb (*smp_mb*) ||
35 'rb_dep (*smp_read_barrier_depends*) ||
36 'rcu_read_lock (*rcu_read_lock*) ||
37 'rcu_read_unlock (*rcu_read_unlock*) ||
38 'sync (*synchronize_rcu*)
39 instructions F[Barriers]
40
41 (* Treat 'release and 'assign identically; same for 'deref and 'lderef *)
42 let ReleaseAssign = Release | Assign
43 let XDeref = Deref | Lderef
44
45 let rmb = fencerel(Rmb) & (R*R)
46 let wmb = fencerel(Wmb) & (W*W)
47 let mb = fencerel(Mb)
48 let sync = (po & (_ * Sync)) ; (po?)
49
50 let rb-dep = fencerel(Rb_dep) & (R*R)
51 let acq-po = po & (Acquire*_)
52 let xderef-po = po & (XDeref*M)
53 let po-relass = po & (_*ReleaseAssign)
54
55 let rd-dep-fence = rb-dep | xderef-po
56 let strong-fence = mb | sync
57
58 (* Compute matching pairs of nested Rcu_read_lock and Rcu_read_unlock *)
59 let matched = let rec
60 unmatched-locks = Rcu_read_lock \ domain(matched)
61 and unmatched-unlocks = Rcu_read_unlock \ range(matched)
62 and unmatched = unmatched-locks | unmatched-unlocks
63 and unmatched-po = (unmatched * unmatched) & po
64 and unmatched-locks-to-unlocks = (unmatched-locks *
65 unmatched-unlocks) & po
66 and matched = matched | (unmatched-locks-to-unlocks \
67 (unmatched-po ; unmatched-po))
68 in matched
69
70 (* Validate nesting *)
71 flag ~empty Rcu_read_lock \ domain(matched) as unbalanced-rcu-locking
72 flag ~empty Rcu_read_unlock \ range(matched) as unbalanced-rcu-locking
73
74 (* Outermost level of nesting only *)
75 let crit = matched \ (po^-1 ; matched ; po^-1)
Taking this one piece at a time:
- Bell File: Memory Accesses.
- Bell File: Barriers.
-
Bell File: Relating Barriers and Memory Accesses.
-
Bell File: Relating One-Sided Barriers and Memory Accesses.
-
Bell File: Classes of Fences.
-
Bell File: RCU Read-Side Critical Sections.
The “"Linux kernel strong memory model"” is a name
that has no effect on the model's meaning.
The following portion of the Bell file defines the
types of memory accesses, which correspond to the Linux kernel's
READ_ONCE(),
WRITE_ONCE(),
ACCESS_ONCE(),
smp_store_release(),
smp_load_acquire(),
rcu_assign_pointer(),
rcu_dereference(), and
lockless_dereference() primitives:
22 enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) ||
23 'release (*smp_store_release*) ||
24 'acquire (*smp_load_acquire*) ||
25 'assign (*rcu_assign_pointer*) ||
26 'deref (*rcu_dereference*) ||
27 'lderef (*lockless_dereference*)
28 instructions R[{'once,'acquire,'deref,'lderef}]
29 instructions W[{'once,'release,'assign}]
30 instructions RMW[{'once,'acquire,'release}]
The “enum Accesses” statement defines the
types of memory references, corresponding to the C functions listed
in the comments.
These correspondences are defined in herd's
linux.def
macro file.
The “instructions R” statement identifies which of the above
types of memory references may be associated with a read instruction,
“instructions W” identifies which may be associated
with a write instruction, and “instructions RMW”
identifies which may be associated with a read-modify-write instruction.
For example, the association of 'acquire with the R
set of instructions corresponds to the Linux kernel's
smp_load_acquire() primitive.
Note well that the above code simply defines names for the Linux-kernel
memory-access primitives.
The herd tool also uses this code to check the instruction
annotations in the linux.def file,
for example, __load{acquire} is legal but __load{release}
is not.
Later code in both the Bell and Cat files will define their effect on
memory ordering.
The next portion of the Bell file defines the
types of barrier-like constructs, namely
smp_wmb(),
smp_rmb(),
smp_mb(),
smp_read_barrier_depends(),
rcu_read_lock(),
rcu_read_unlock(), and
synchronize_rcu().
32 enum Barriers = 'wmb (*smp_wmb*) ||
33 'rmb (*smp_rmb*) ||
34 'mb (*smp_mb*) ||
35 'rb_dep (*smp_read_barrier_depends*) ||
36 'rcu_read_lock (*rcu_read_lock*) ||
37 'rcu_read_unlock (*rcu_read_unlock*) ||
38 'sync (*synchronize_rcu*)
39 instructions F[Barriers]
40
41 (* Treat 'release and 'assign identically; same for 'deref and 'lderef *)
42 let ReleaseAssign = Release | Assign
43 let XDeref = Deref | Lderef
The “enum Barriers” defines the types of barriers
corresponding to the C functions listed in the comments
(as set up in the
linux.def
macro file).
The “instructions F[Barriers]” says that these
types may be used in various sorts of barrier instructions.
Quick Quiz 17:
Given that this is about memory barriers, why
“instructions F[Barriers]” instead of perhaps
“instructions B[Barriers]”?
Answer
The definition of ReleaseAssign allows the model to treat
the smp_store_release(), rcu_assign_pointer(),
and the write portion of xchg_release()
identically from a memory-ordering perspective, and the
definition of XDeref does likewise for
lockless_dereference() and rcu_dereference().
As with the memory accesses, the above code only defines names.
These barrier-like constructs' ordering properties will be defined by
later code in the Bell and Cat files.
The next portion of the Bell file defines the relation between a given
barrier-like construct and its process's surrounding memory accesses:
45 let rmb = fencerel(Rmb) & (R*R)
46 let wmb = fencerel(Wmb) & (W*W)
47 let mb = fencerel(Mb)
48 let sync = (po & (_ * Sync)) ; (po?)
The fencerel(S) function in herd's standard library returns
a relation containing
all pairs of events in which the first event precedes (in program order)
an event in the S set
(for example, an “Rmb” event
in the case of line 45 above)
and the second follows it, with all three events being in the same thread.
As an example, the following snippet:
r1 = READ_ONCE(*x);
smp_rmb();
r2 = READ_ONCE(*y);
smp_mb();
WRITE_ONCE(*z, r3);
would produce an “
rmb” relation containing only one link:
- r1 = READ_ONCE(*x) ⟶
r2 = READ_ONCE(*y)
and an “
mb” relation containing three links:
- r1 = READ_ONCE(*x) ⟶
WRITE_ONCE(*z, r3),
- smp_rmb() ⟶
WRITE_ONCE(*z, r3), and
- r2 = READ_ONCE(*y) ⟶
WRITE_ONCE(*z, r3).
The “
rmb” relation doesn't include the other possible
links because of the
“
& (R*R)” clause in its definition,
which intersects the full
fencerel(Rmb) relation
with the relation containing all pairs of reads (
R*R).
This is appropriate because
smp_rmb() orders only reads,
not writes.
(For database programming fans,
the “&” operator can be thought of
as doing a database full equijoin operation, so that the result
is only those elements that appear in both operands.
Similarly, the “*” operator can be thought of as a
database unconstrained join operation, in this case
providing all combinations of pairs of read events.
Later on, we will encounter operations that cannot be easily
represented by
SQL,
so we will shift to the nomenclature used
for mathematical sets.)
The “let wmb = fencerel(Wmb) & (W*W)” definition
acts similarly, but it extracts pairs of writes rather than reads, as required
for smp_wmb().
The “let mb = fencerel(Mb)” definition
keeps all events in the fencerel(Mb) relation,
as required for smp_mb().
(It even keeps events that don't correspond to memory accesses,
such as the smp_rmb() event in the above example,
although they are irrelevant here.)
Finally, the “let sync = (po & (_ * Sync)) ; (po?)”
definition uses a modified formula in place of
“fencerel(Sync)”.
It is different from the others in that it also includes pairs
where the second event is the synchronize_rcu() call
rather than something following it.
Otherwise it is like the definition of the mb relation.
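For example, in the snippet:
r1 = READ_ONCE(*x);
synchronize_rcu();
WRITE_ONCE(*y, 1);
the sync relation contains two links:
- r1 = READ_ONCE(*x) ⟶ synchronize_rcu(), and
- r1 = READ_ONCE(*x) ⟶ WRITE_ONCE(*y, 1).
The first link, ending at the synchronize_rcu() event itself,
is the one that a plain fencerel(Sync) relation would omit.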
Quick Quiz 18:
Why wouldn't “let sync = fencerel(Sync)”
work just as well as the modified definition?
Answer
This portion of the Bell file relates smp_rmb(),
smp_wmb(), smp_mb(), and synchronize_rcu()
to the surrounding code within a given process, but says nothing about
cross-process ordering properties, which will be defined in later
Bell and Cat code.
The next portion of the Bell file defines some relations involving
“one-sided” barriers
(smp_load_acquire(),
smp_store_release(),
rcu_assign_pointer(),
smp_read_barrier_depends(),
rcu_dereference(), and
lockless_dereference())
and their surrounding instructions:
50 let rb-dep = fencerel(Rb_dep) & (R*R)
51 let acq-po = po & (Acquire*_)
52 let xderef-po = po & (XDeref*M)
53 let po-relass = po & (_*ReleaseAssign)
The “acq-po” line defines the relation
appropriate for acquire operations, including smp_load_acquire()
and the read portion of xchg_acquire().
This is the intersection of the program order (po) relation with
the set of all pairs of events in which the first is an Acquire
and the second can be anything (the “_” wildcard).
The “po-relass” definition works quite similarly, but with
prior memory accesses rather than subsequent ones and with releases
(smp_store_release(), rcu_assign_pointer(), and
the write portion of xchg_release())
rather than acquires.
Consider the following example containing code fragments
running on two threads, where x, y, and
z are all initially zero:
Thread 0 Thread 1
-------- --------
WRITE_ONCE(*x, 1); r2 = smp_load_acquire(y);
r1 = READ_ONCE(*z); r3 = READ_ONCE(*x);
smp_store_release(y, 1); WRITE_ONCE(*z, 1);
This results in the following
po links:
- WRITE_ONCE(*x, 1) ⟶
r1 = READ_ONCE(*z),
- WRITE_ONCE(*x, 1) ⟶
smp_store_release(y, 1),
- r1 = READ_ONCE(*z) ⟶
smp_store_release(y, 1),
- r2 = smp_load_acquire(y) ⟶
r3 = READ_ONCE(*x),
- r2 = smp_load_acquire(y) ⟶
WRITE_ONCE(*z, 1), and
- r3 = READ_ONCE(*x) ⟶
WRITE_ONCE(*z, 1).
The first three links relate events in Thread 0 and
the last three relate events in Thread 1.
(The number of links in po is clearly quadratic
in the number of statements in a given thread, but that is OK because
several other things are exponential!
Knowing this, you can understand
why this sort of verification technique is unlikely to
handle all 20 million lines of the Linux kernel at one go.
Instead, these techniques should be applied to small but critical
segments of code.)
In this example, there is only one Acquire event:
“r2 = smp_load_acquire(y)”.
Intersecting po with the set of
all pairs of events in which the first is an Acquire
gives the acq-po relation:
- r2 = smp_load_acquire(y) ⟶
r3 = READ_ONCE(*x), and
- r2 = smp_load_acquire(y) ⟶
WRITE_ONCE(*z, 1).
This naturally lists all pairs of instructions whose execution order is
constrained by Thread 1's
smp_load_acquire().
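Similarly, intersecting po with the set of all pairs whose
second event is a ReleaseAssign gives the po-relass
relation, which for this example contains:
- WRITE_ONCE(*x, 1) ⟶
smp_store_release(y, 1), and
- r1 = READ_ONCE(*z) ⟶
smp_store_release(y, 1).
These are the pairs whose ordering is constrained by Thread 0's
smp_store_release().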
The “rb-dep” definition is the same as that of
“rmb” earlier,
except that it applies to smp_read_barrier_depends()
instead of smp_rmb().
The “xderef-po” definition is the same as that of
“acq-po”, but for lockless_dereference()
and rcu_dereference() instead of smp_load_acquire().
Note that the rb-dep and xderef-po relations do not
correspond exactly to ordering constraints,
because smp_read_barrier_depends(), rcu_dereference(),
and lockless_dereference() only
order pairs of accesses where the second is “dependent” on the first
(more precisely, where there is an address dependency between them);
this restriction is described in more detail later on.
Note also that this portion of the Bell file defined only the relationships
between these one-sided barriers and the surrounding code within a
given process.
Cross-process ordering properties are defined by later Bell and Cat code.
The next portion of the Bell file forms two groups of fences by strength:
55 let rd-dep-fence = rb-dep | xderef-po
56 let strong-fence = mb | sync
The members of the rd-dep-fence group
(smp_read_barrier_depends(),
rcu_dereference(), and
lockless_dereference())
cannot provide any ordering at all unless a dependency is also present.
In contrast, the members of the strong-fence group can enforce full
globally visible transitive ordering.
Pervasive use of the
members of the strong-fence family will result in agreement
on the order even of completely unrelated memory references.
In fact, as noted
earlier,
placing one of these strong fences between each pair of memory
references in each process will forbid all but SC executions.
On the other hand, stronger fences often incur larger performance penalties.
These groups will be used in the Cat file to organize the various
ordering requirements.
The final section of the Bell file is the most complex, due to the
fact that rcu_read_lock() and rcu_read_unlock()
must come in matching pairs within a given process and can be nested.
Therefore, the purpose of the following code is to find the outermost
pair of rcu_read_lock() and rcu_read_unlock() invocations
in a single nested set, and to differentiate correctly between any
unrelated nested sets in a given process.
58 (* Compute matching pairs of nested Rcu_read_lock and Rcu_read_unlock *)
59 let matched = let rec
60 unmatched-locks = Rcu_read_lock \ domain(matched)
61 and unmatched-unlocks = Rcu_read_unlock \ range(matched)
62 and unmatched = unmatched-locks | unmatched-unlocks
63 and unmatched-po = (unmatched * unmatched) & po
64 and unmatched-locks-to-unlocks = (unmatched-locks *
65 unmatched-unlocks) & po
66 and matched = matched | (unmatched-locks-to-unlocks \
67 (unmatched-po ; unmatched-po))
68 in matched
69
70 (* Validate nesting *)
71 flag ~empty Rcu_read_lock \ domain(matched) as unbalanced-rcu-locking
72 flag ~empty Rcu_read_unlock \ range(matched) as unbalanced-rcu-locking
73
74 (* Outermost level of nesting only *)
75 let crit = matched \ (po^-1 ; matched ; po^-1)
The “matched” relation is defined by the
mutually recursive set of definitions on lines 59-68.
The idea behind this code is to associate an unmatched
Rcu_read_lock event with a later unmatched Rcu_read_unlock event
whenever no unmatched events lie between them,
and to repeat this operation recursively until nothing more can be matched.
To that end, lines 59-62 form the sets of not-yet-matched
Rcu_read_lock and Rcu_read_unlock events and their union.
Line 63 then forms the relation of all pairs of these unmatched
events that occur in the same thread, in program order.
Lines 64-65 similarly form the relation of all such pairs
where the first member of the pair is an Rcu_read_lock event
and the second is an Rcu_read_unlock.
The interesting part is lines 66-67, which take
pairs of unmatched Rcu_read_lock and Rcu_read_unlock events
and add them to the “matched” relation,
but only if there are no unmatched events in between.
They do this by applying the
“\” (backslash) subtraction operator to remove from
the unmatched-locks-to-unlocks relation
any pairs having an
intervening unmatched Rcu_read_lock or Rcu_read_unlock.
The “;” operator sequences relations
(if relation x contains a⟶b
and relation y contains b⟶c
then (x ; y) will contain a⟶c).
In this case, you can see that
“unmatched-po ; unmatched-po”
contains all pairs a⟶c of unmatched events for which
a third unmatched event b lies between them in program order.
The only purpose of line 68 is to prevent the
unmatched-locks,
unmatched-unlocks,
unmatched,
unmatched-po, and
unmatched-locks-to-unlocks
definitions from leaking out to the surrounding context.
Grammatically speaking, the construction used here is a
let rec expression inside a let statement.
In fact, let or let rec expressions
are very much like GCC's statement expressions;
the statement in lines 59-68 is syntactically analogous to
“x = ({int x = u; if (x < v) x = v; x;})”.
Line 71 then checks whether there are any unmatched Rcu_read_lock events,
and line 72 does the same for unmatched Rcu_read_unlock events.
The “flag ~empty” statement flags the litmus test
as containing a semantic error if the specified set isn't empty,
and the “as ...” clause merely
provides a name to identify the particular failure mode.
Lastly, line 75 computes those matching pairs which lie
at the outermost level of nesting.
They are the important ones, because they delimit
RCU read-side critical sections.
It does this by subtracting from “matched”
all pairs which lie entirely between another matched pair.
The “^-1” inversion operator computes the
converse of a given relation; that is, it computes the collection
of all links a⟶b such that
b⟶a is in the given relation.
Thus, po^-1 contains all pairs of events in reverse
program order.
To see how “(po^-1 ; matched ; po^-1)” selects
inner matched pairs, consider the following example:
1 rcu_read_lock();
2 rcu_read_lock();
3 rcu_read_unlock();
4 rcu_read_unlock();
Starting at line 2, a “
po^-1” step takes us back to
line 1, a “
matched” step takes us to line 4,
and a second “
po^-1” takes us back to line 3.
Thus, this expression correctly identifies line 2 ⟶ line 3
as an inner matched pair.
You can easily see that this mechanism will remove from the
matched relation any
matched pairs that are nested within another matched pair.
We are now ready to proceed to the Cat file.
The full
Cat
file for Alan Stern's strong model is as follows:
strong-kernel.cat
1 "Linux kernel strong memory model"
2
3 (*
4 * Copyright (C) 2016 Alan Stern <stern@rowland.harvard.edu>,
5 * Andrea Parri <parri.andrea@gmail.com>
6 *
7 * This program is free software; you can redistribute it and/or modify
8 * it under the terms of the GNU General Public License as published by
9 * the Free Software Foundation; either version 2 of the License, or
10 * (at your option) any later version.
11 *
12 * This program is distributed in the hope that it will be useful,
13 * but WITHOUT ANY WARRANTY; without even the implied warranty of
14 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 * GNU General Public License for more details.
16 *
17 * You should have received a copy of the GNU General Public License
18 * along with this program; if not, you can access it online at
19 * http://www.gnu.org/licenses/gpl-2.0.html.
20 *)
21
22 include "cos-opt.cat"
23 include "lock.cat"
24
25 let com = rf | co | fr
26 let coherence-order = po-loc | com
27 acyclic coherence-order as coherence
28
29 empty rmw & (fre;coe) as atomic
30
31
32 let rf = rf | next-crit
33 let rfe = rf & ext
34
35 let exec-order-fence = rmb | acq-po | lk-po
36 let propagation-fence = strong-fence | wmb | po-relass | po-ul
37 let ordering-fence = propagation-fence | exec-order-fence
38
39 (* Determine the release sequences *)
40 let rel-seq = [ReleaseAssign] ; coi? ; (rf ; rmw)*
41 let po-rel-seq = po ; rel-seq
42
43 (* On Alpha, rd-dep-fence makes addr, dep-rfi, and rdw strong *)
44 let dep = addr | data
45 let dep-rfi = dep ; rfi
46 let rd-addr-dep-rfi = (addr | dep-rfi)+ & rd-dep-fence
47 let rdw = po-loc & (fre ; rfe)
48 let rd-rdw = rdw & rd-dep-fence
49 let po-loc-ww = po-loc & (W*W)
50 let detour = (po-loc & (coe ; rfe)) \ (po-loc-ww ; po-loc)
51 let addrpo = addr ; po
52
53 (* The set of writes that are bounded by the end of the thread
54 or by a fence before the next write to the same address *)
55 let BOUNDED-W = W \ domain(po-loc-ww \ ordering-fence)
56 (* The set of "non-obscurable" writes on ARM *)
57 let NOW = domain(rfe) | range(rmw) | ReleaseAssign |
58 BOUNDED-W | domain(detour)
59 (* The set of "obscurable" writes *)
60 let OW = W \ NOW
61 (* The set of reads which might be forwarded from obscurable writes *)
62 let OR = range(rfi & (OW*R))
63
64 let nco = co & (NOW*W)
65 let ncoe = nco & ext
66
67 let strong-ppo = rd-addr-dep-rfi | ordering-fence |
68 ((dep | ctrl | addrpo) & (R*W))
69 let Alpha-strong-ppo = strong-ppo | rd-rdw | detour |
70 (po-loc & ((M\OW\OR)*W))
71 let ARM-strong-ppo = strong-ppo | addr | dep-rfi
72 let ppo = Alpha-strong-ppo | ARM-strong-ppo | rdw
73
74 let rfe-ppo = strong-ppo | (ARM-strong-ppo ; ppo* ; Alpha-strong-ppo)
75 let po-relass-acq-hb = (po ; (rfe & (ReleaseAssign*Acquire)) ; rfe-ppo) |
76 (po-ul ; next-crit ; lk-po)
77
78 (* Release paired with Acquire is both A- and B-cumulative *)
79 let AB-cum-hb = strong-fence | po-relass-acq-hb
80 let A-cum-hb = AB-cum-hb | po-relass | po-rel-seq
81 let B-cum-hb = AB-cum-hb | wmb
82
83 let hb0 = (ppo* ; Alpha-strong-ppo) | (rfe ; rfe-ppo)
84 let propbase0 = propagation-fence | (rfe? ; A-cum-hb)
85
86 let rec B-cum-propbase = (B-cum-hb ; hb* ) |
87 (rfe? ; AB-cum-hb ; hb* )
88 and propbase = propbase0 | B-cum-propbase
89 and short-obs = ((ncoe|fre) ; propbase+ ; rfe) & int
90 and obs = short-obs |
91 ((hb* ; (ncoe|fre) ; propbase* ; B-cum-propbase ; rfe) & int)
92 and hb = hb0 | (obs ; rfe-ppo)
93
94 acyclic hb as happens-before
95 irreflexive (short-obs ; Alpha-strong-ppo) as observation
96
97
98 let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb* ; obs?
99 let prop = (propbase & (W*W)) | strong-prop
100 let cpord = nco | prop
101
102 acyclic cpord as propagation
103
104
105 (* Propagation between strong fences *)
106 let rcu-order = hb* ; obs? ; cpord* ; fre? ; propbase* ; rfe?
107
108 (* Chains that can prevent the RCU grace-period guarantee *)
109 let gp-link = sync ; rcu-order
110 let cs-link = po? ; crit^-1 ; po? ; rcu-order
111 let rcu-path0 = gp-link |
112 (gp-link ; cs-link) |
113 (cs-link ; gp-link)
114 let rec rcu-path = rcu-path0 |
115 (rcu-path ; rcu-path) |
116 (gp-link ; rcu-path ; cs-link) |
117 (cs-link ; rcu-path ; gp-link)
118
119 irreflexive rcu-path as rcu
Quick Quiz 19:
This strong model is insanely complex!!!
How can anyone be expected to understand it???
Answer
First, the “Linux kernel strong memory model”
is the title,
and the “include "cos-opt.cat"” and
“include "lock.cat"” lines pull in some
common definitions, similar to the C language's
“#include <stdio.h>”.
Again, taking the remainder of the file one piece at a time:
-
Cat File: SC Per Location and Atomics.
-
Cat File: More Classes of Fences.
-
Cat File: Release Sequences.
-
Cat File: Intra-Thread Ordering.
- Cat File: Obscured Writes.
-
Cat File: Preserved Program Order.
- Cat File: Cumulativity.
- Cat File: Happens-Before.
- Cat File: Coherence Points.
- Cat File: RCU.
The first section of the strong-kernel.cat file enforces
SC per location, which again means that all CPUs agree on the
order of reads and writes to any given single location, that is,
cache coherence,
as was provided by the
Coherent-RMO model.
This section also provides ordering constraints for RMW atomic
operations.
25 let com = rf | co | fr
26 let coherence-order = po-loc | com
27 acyclic coherence-order as coherence
28
29 empty rmw & (fre;coe) as atomic
The first three lines define cache coherence, as discussed
above,
ensuring that per-variable communications relations (rf, co,
and fr) are consistent with program order.
Line 29 enforces the
atomicity of RMW operations on a given variable.
Recall that the “rmw” relationship connects a
given RMW operation's read to its write.
Note also that “fre;coe” connects any read of a given
variable to some later write of that same variable, where at least
one intervening write was executed by some other thread.
If the initial read was a given RMW operation's read and the final
write was this same RMW operation's write, atomicity has been violated:
Some other thread's write appeared after the RMW's read but before its
write.
Therefore, the last line prohibits such violations by requiring
that the intersection of “rmw” and
“fre;coe” be empty.
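As an illustration, consider the following sketch (our own, and assuming that the model's front end accepts xchg(), which herd represents as a read event connected to a write event by rmw):
C C-atomicity-sketch

{
}

P0(int *x)
{
    int r1;

    r1 = xchg(x, 1);
}

P1(int *x)
{
    WRITE_ONCE(*x, 2);
}

exists
(0:r1=0 /\ x=1)
Here 0:r1=0 means that the RMW's read saw the initial value, giving an fre link to P1's write, and the final value x=1 means that P1's write is coherence-earlier than the RMW's write, giving a coe link back to the RMW's write. The intersection of rmw with fre;coe would then be non-empty, so the atomic check forbids this outcome.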
The next portion of the file defines classes of fences in addition
to those in
the Bell file.
Quick Quiz 20:
Why aren't these additional classes of fences in the
Bell file
where the other classes live?
Answer
32 let rf = rf | next-crit
33 let rfe = rf & ext
34
35 let exec-order-fence = rmb | acq-po | lk-po
36 let propagation-fence = strong-fence | wmb | po-relass | po-ul
37 let ordering-fence = propagation-fence | exec-order-fence
The rf and rfe relations are redefined to include
lock-based critical sections.
This might be pulled into herd's predefined relations at some point
in the future,
but the handling of locking is currently at the prototype stage.
The members of the exec-order-fence group
(smp_rmb(), smp_load_acquire(), and lock acquisition)
can be thought of as providing
ordering by restricting execution, for example, waiting for previous reads
to complete before executing subsequent instructions.
(In practice, hardware architects have all sorts of optimizations at
their disposal that provide the needed ordering without necessarily
actually waiting.)
The ordering properties of any member of the
rd-dep-fence
and
exec-order-fence groups do not propagate outside of that member's
process.
Such fences cannot provide global ordering except in situations involving only
causal reads-from (rf) links; any non-causal coherence or
from-read links (co or fr, respectively) require
a stronger type of barrier.
The members of the propagation-fence group include
strong-fence
(smp_mb() and synchronize_rcu())
in addition to smp_wmb(), smp_store_release(),
and rcu_assign_pointer().
These barriers all provide some form of
cumulativity,
and when the smp_store_release() and rcu_assign_pointer()
are paired with smp_load_acquire(), all of them provide
B-cumulativity.
Either way, as the name suggests, the effects of these barriers propagate
to other processes.
Finally, ordering-fence is simply the union of
exec-order-fence and propagation-fence,
that is, the set of fence-like things that do not rely on dependencies.
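The division of labor between the two groups can be seen in the classic message-passing pattern, sketched below (the test name is ours): smp_wmb() governs the order in which P0's writes propagate to other CPUs, while smp_rmb() restricts the execution order of P1's reads.
C C-MP-wmb-rmb-sketch

{
}

P0(int *x, int *y)
{
    WRITE_ONCE(*x, 1);
    smp_wmb();
    WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
    int r1;
    int r2;

    r1 = READ_ONCE(*y);
    smp_rmb();
    r2 = READ_ONCE(*x);
}

exists
(1:r1=1 /\ 1:r2=0)
In terms of relations defined later in the file, the stale-read outcome would make the concatenation of short-obs and Alpha-strong-ppo reflexive, so the observation check should forbid it; neither fence alone would suffice.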
The next portion of the file defines C11-style
release sequences
described
above:
39 (* Determine the release sequences *)
40 let rel-seq = [ReleaseAssign] ; coi? ; (rf ; rmw)*
41 let po-rel-seq = po ; rel-seq
The rel-seq relation links any release operation to any
later store to that same variable from that same thread, as well
as to any sequence of read-modify-write atomic operations to that
same variable, as long as that sequence is not interrupted by
a non-read-modify-write update.
The po-rel-seq relation then links events that are ordered
before a release operation to the events in the release sequence
headed by that operation.
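For instance, in the following fragment (our own, again assuming that xchg() is accepted as an atomic read-modify-write), rel-seq links the store-release to both of the subsequent stores:
/* P0 */
smp_store_release(x, 1);  /* ReleaseAssign heading the release sequence */
WRITE_ONCE(*x, 2);        /* same-thread later store: the optional coi step */

/* P1 */
r1 = xchg(x, 3);          /* an (rf ; rmw) step, if it reads from P0's second store */
A plain WRITE_ONCE(*x, 4) on P1, by contrast, would break the sequence, because it is neither a same-thread store nor a read-modify-write.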
The next portion of the file defines the intra-thread ordering
relationships described
earlier.
Here “intra-thread” means that the ordered accesses are within
the same thread.
Some of the relationships will reference other threads.
43 (* On Alpha, rd-dep-fence makes addr, dep-rfi, and rdw strong *)
44 let dep = addr | data
45 let dep-rfi = dep ; rfi
46 let rd-addr-dep-rfi = (addr | dep-rfi)+ & rd-dep-fence
47 let rdw = po-loc & (fre ; rfe)
48 let rd-rdw = rdw & rd-dep-fence
49 let po-loc-ww = po-loc & (W*W)
50 let detour = (po-loc & (coe ; rfe)) \ (po-loc-ww ; po-loc)
51 let addrpo = addr ; po
The addr and data
relations define address and data dependencies respectively, and
the dep relation is simply their union.
An address dependency occurs when a previously loaded value is used
to form the address of a subsequent load or store within the same thread.
A data dependency occurs when a previously loaded value is used to form
the value stored by a subsequent store within the same thread.
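For example, in the following fragment (the variable names are ours), the first load heads address dependencies to both a load and a store, while the second load heads a data dependency:
/* Illustrative fragment; assume x is int ** and y and p point to int. */
p = READ_ONCE(*x);    /* load heading the dependencies */
r1 = READ_ONCE(*p);   /* addr: the value loaded into p forms this load's address */
WRITE_ONCE(*p, 1);    /* addr: ...or this store's address */
WRITE_ONCE(*y, r1);   /* data: the value loaded into r1 forms this store's value */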
The dep-rfi relation extends dep with an
internal rf relation, which implies that the
dep must have been a read-to-write dependency.
The Linux kernel does not respect read-to-read address dependencies
unless: (1) The dependency is headed by rcu_dereference()
or lockless_dereference() or
(2) There is an smp_read_barrier_depends() between the
load heading the dependency and the dependent memory reference.
This requirement for a special operation helps to document the intent,
and also allows architectures to include any special instructions
required to enforce dependency ordering. (For example, DEC Alpha
requires a memory barrier if the dependent access is a read.)
The rd-addr-dep-rfi relation therefore contains only those sequences
of address dependencies (and dep-rfi relations)
that are headed by lockless_dereference()
or rcu_dereference() or that span an
smp_read_barrier_depends().
This relation excludes any read-to-read dependencies
that DEC Alpha won't provide ordering for.
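Concretely (this is a sketch of the rule rather than a litmus test, and pp, p, q, and s are our names), only the first two of the following read-to-read sequences are guaranteed to be ordered on all architectures, Alpha included:
p = rcu_dereference(*pp);    /* dependency head: dependent reads are ordered */
r1 = READ_ONCE(*p);          /* ordered everywhere, Alpha included */

q = READ_ONCE(*pp);
smp_read_barrier_depends();  /* explicit dependency barrier */
r2 = READ_ONCE(*q);          /* ordered: the dependency spans the barrier */

s = READ_ONCE(*pp);
r3 = READ_ONCE(*s);          /* NOT guaranteed ordered on DEC Alpha */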
The “rdw” relation
(for “Read Different Writes”)
links pairs of loads from the same address within a given thread,
where at least
one store to the same address from some other thread intervened
between the two loads.
The rd-rdw relation contains those rdw links
that are headed by lockless_dereference()
or rcu_dereference() or that span an
smp_read_barrier_depends().
The po-loc-ww relation links pairs of writes by the same thread
to the same variable.
This relation helps to define the detour relation, which
links a write that is overwritten by some other process's write
to a po-later read in the first process that returns the value
of that overwriting write, with no intervening write to that
same variable within the first process.
Finally, the addrpo relation
links a read heading an address dependency with any write
following the dependent access, in program order.
Quick Quiz 21:
Why would a write following an address dependency get any special
treatment?
After all, there does not appear to be any particular ordering relationship
in the general case.
Answer
As described in the
ARM
section, writes can be
obscured
under certain conditions.
The following section of the Cat file accounts for this:
53 (* The set of writes that are bounded by the end of the thread
54 or by a fence before the next write to the same address *)
55 let BOUNDED-W = W \ domain(po-loc-ww \ ordering-fence)
56 (* The set of "non-obscurable" writes on ARM *)
57 let NOW = domain(rfe) | range(rmw) | ReleaseAssign |
58 BOUNDED-W | domain(detour)
59 (* The set of "obscurable" writes *)
60 let OW = W \ NOW
61 (* The set of reads which might be forwarded from obscurable writes *)
62 let OR = range(rfi & (OW*R))
There are a number of reasons why a given write might not be
“obscurable”, and the first two sets (BOUNDED-W
and NOW) together form the full set of non-obscurable writes.
The BOUNDED-W set contains those writes that cannot be
obscured, either due to them being the last write to their
target address in the thread
(hence the subtraction of the domain of po-loc-ww),
or due to them being the last write to their target address before
a memory barrier
(hence the subtraction of the domain of ordering-fence).
The NOW set (for “Non-Obscurable Writes”)
then adds writes that are read by some
other process (domain(rfe)),
writes that are part of an atomic read-modify-write operation
(range(rmw)),
writes having release semantics (ReleaseAssign), and
writes that were overwritten by a write in some other process
before they could be forwarded to a later read that precedes any
further writes to the same address in the same process
(domain(detour)).
The OW set then forms the set of obscurable writes
by subtracting the set of non-obscurable writes from the set
of all writes.
Even though a write might be obscured, its value might nevertheless be
forwarded to later reads by that same thread, and this set of reads
is formed in OR.
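For a concrete (and hypothetical) example of an obscurable write, consider the following fragment:
/* Illustrative fragment; assume no other thread reads *x and that
   no fences intervene. */
WRITE_ONCE(*x, 1);   /* obscurable: no rfe, not an RMW or a release,
                        not fence- or thread-end-bounded, no detour */
r1 = READ_ONCE(*x);  /* may be forwarded from the obscurable write: in OR */
WRITE_ONCE(*x, 2);   /* later same-address write that can obscure the first */
On ARM, the first write can in effect be overwritten before other CPUs ever observe it, so the model must refrain from treating it as a source of strong ordering.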
The nco relation is that subset of the co
relation that links only from non-obscurable writes.
However, note that nco might contain links to
obscurable writes.
Finally, the ncoe relation is that subset of the co
relation that links from a non-obscurable write to some (possibly
obscurable) write to that same variable by some other process.
The next section of the file defines several flavors of
“preserved program order”, which is abbreviated as ppo.
These take into account special properties of both
Alpha and
ARM:
67 let strong-ppo = rd-addr-dep-rfi | ordering-fence |
68 ((dep | ctrl | addrpo) & (R*W))
69 let Alpha-strong-ppo = strong-ppo | rd-rdw | detour |
70 (po-loc & ((M\OW\OR)*W))
71 let ARM-strong-ppo = strong-ppo | addr | dep-rfi
72 let ppo = Alpha-strong-ppo | ARM-strong-ppo | rdw
73
74 let rfe-ppo = strong-ppo | (ARM-strong-ppo ; ppo* ; Alpha-strong-ppo)
75 let po-relass-acq-hb = (po ; (rfe & (ReleaseAssign*Acquire)) ; rfe-ppo) |
76 (po-ul ; next-crit ; lk-po)
The basic point of the ppo relations is to arrive at the set
of relations providing ordering within the context of a single process.
The strong-ppo relation includes a number of components, all
of which count as “strong” for both Alpha and ARM:
- Dependencies enforced by lockless_dereference(),
rcu_dereference(), or smp_read_barrier_depends()
(rd-addr-dep-rfi), possibly extended through internal
reads from dependent writes;
- Accesses separated by a sufficiently strong fence
(ordering-fence);
- Dependencies leading to writes
(((dep | ctrl | addrpo) & (R*W))).
Next are the aspects of ppo limited by ARM, in the
Alpha-strong-ppo relation.
This relation contains links that count as “strong” for
Alpha but either “weak” or “strong” for ARM.
It thus includes everything in strong-ppo, plus:
- Read-to-read dependencies involving only one variable
whose ordering is enforced by
lockless_dereference(), rcu_dereference(), or
smp_read_barrier_depends(), and which also have
an intervening write to this variable from some other process
(rd-rdw);
- A write followed by a read from the same variable within the
same process, with no intervening write to that same variable
within that same process, but where the read returns the value
written by some other process (detour);
- A pair of accesses to the same variable by the same process,
but excluding pairs where the first access is an obscurable
write or reads from an obscurable write
(po-loc & ((M\OW\OR)*W)).
Quick Quiz 22:
Why is this relation called “Alpha-strong-ppo”
if it talks about obscurable writes?
Obscurable writes are a feature of the ARM architecture, not Alpha!
What gives?
Answer
Next are the aspects of ppo limited by Alpha, in the
ARM-strong-ppo relation.
This relation contains links that count as “strong” for
ARM but either “weak” or “strong” for Alpha.
Thus it also includes everything in strong-ppo, plus:
- Address dependencies (addr) and
- Read-to-write dependencies of any sort where a later read
within that same thread reads from the dependent write
(dep-rfi).
This leads to ppo itself, which combines
Alpha-strong-ppo, ARM-strong-ppo, and
rdw (which is “strong” for neither architecture).
Recall that rdw links pairs of reads from the same variable
within the same process, where there was an intervening write to that
variable by some other process.
Quick Quiz 23:
But ARM-strong-ppo (and thus ppo) includes addr,
which links reads to later dependent reads, which DEC Alpha does not respect.
How is this supposed to work?
Answer
The rfe-ppo relation allows ppo relations to be chained
together (within the same process, of course).
Notably, it requires its first link to be “strong” for ARM
(as required for the first link following an rfe link)
and its last link to be “strong” for Alpha
(as required to prevent an Alpha-weak-ppo link from being
followed by an Alpha-normal-ppo link).
Finally, the po-relass-acq-hb relation links pairs of accesses
ordered by a single link of a release-acquire chain, which might be
a store-release read by a load-acquire or might be a lock release
followed by the corresponding lock acquisition.
The following relations define
cumulativity,
which is loosely related to the
concept of transitivity:
78 (* Release paired with Acquire is both A- and B-cumulative *)
79 let AB-cum-hb = strong-fence | po-relass-acq-hb
80 let A-cum-hb = AB-cum-hb | po-relass | po-rel-seq
81 let B-cum-hb = AB-cum-hb | wmb
Quick Quiz 24:
Why don't these relations match those shown in the
Execution ordering: write propagation and cumulativity section?
Answer
The AB-cum-hb relation includes strong fences and
release-acquire pairs, both of which come closest to providing
full transitivity.
The A-cum-hb relation includes AB-cum-hb, and also
links memory accesses to later
release operations in the same thread, and furthermore takes
release sequences into account.
Both of these release-related relations provide ordering with certain
prior accesses by other threads, hence A-cumulativity.
Finally, the B-cum-hb relation also includes AB-cum-hb,
and adds writes ordered by smp_wmb(), which provides ordering
with certain later accesses by other threads, hence B-cumulativity.
The next portion of the file combines the effects of dependencies,
barriers, other PPO relations, and indirect reasoning (
obs)
to arrive at a temporal
happens-before
(
hb) and a basic
propagation
(
propbase) relation.
83 let hb0 = (ppo* ; Alpha-strong-ppo) | (rfe ; rfe-ppo)
84 let propbase0 = propagation-fence | (rfe? ; A-cum-hb)
85
86 let rec B-cum-propbase = (B-cum-hb ; hb* ) |
87 (rfe? ; AB-cum-hb ; hb* )
88 and propbase = propbase0 | B-cum-propbase
89 and short-obs = ((ncoe|fre) ; propbase+ ; rfe) & int
90 and obs = short-obs |
91 ((hb* ; (ncoe|fre) ; propbase* ; B-cum-propbase ; rfe) & int)
92 and hb = hb0 | (obs ; rfe-ppo)
93
94 acyclic hb as happens-before
95 irreflexive (short-obs ; Alpha-strong-ppo) as observation
The hb relation has two base cases for its mutually
assured recursive definition, hb0 and propbase0.
The hb0 relation is either a sequence of intrathread
ppo relations ending with an Alpha-strong-ppo
or a read that returns a value from some other process's write
followed by an rfe-ppo relation (a sequence of intrathread
ppo relations starting with an ARM-strong-ppo and
ending with an Alpha-strong-ppo), which links to an access
that will remain ordered with respect to the initial write.
The propbase0 relation links two accesses where the first
is in a fence's pre-set and the second is in the fence's post-set,
taking A-cumulativity into account (by allowing an initial
rfe step if the fence is A-cumulative) but not B-cumulativity.
The general idea behind the hb relation is to build a series
of steps that provide temporal ordering.
The difficulty in doing so arises from the fact that B-cumulative links,
which form an important part of the hb relation via the
propbase and obs relations,
are themselves defined in terms of hb,
thus requiring a mutually recursive definition of all three.
Taking the components of the mutually assured recursion for hb
in turn:
- The B-cum-propbase relation defines the B-cumulative
parts of the propbase relation, in terms of a
B-cumulative fence (B-cum-hb) or a fence that is
both A- and B-cumulative (AB-cum-hb) with an optional
initial rfe, followed by an arbitrarily long sequence
of hb links.
In the first case the fence could be smp_wmb();
in the second it could be a strong fence (smp_mb() or
synchronize_rcu()) or a release-acquire pair.
- The propbase relationship combines propbase0
(its non-B-cumulative part) with B-cum-propbase.
- The short-obs relation starts with a non-obscurable write or a
read in one process, links via ncoe or fre
(respectively) to a write in another process, goes through at least one
propbase step, and then comes back to the original process
via an rfe link.
As the name implies, a short-obs link shows that
the initial write or read precedes the events in
the other process, whereas the final read
observes
those events and therefore happens after them (and hence after
the initial write or read).
- The obs relation extends short-obs with a variant
that allows an extra sequence of hb links
at the start of the chain, but in exchange requires that the
last propbase link in the chain be B-cumulative.
- Finally, the hb relation starts with hb0
and adds in the links generated by obs followed by
rfe-ppo.
The need for a trailing rfe-ppo is an acknowledgment
that the Alpha- and ARM-specific ordering requirements of
obs are much like those of the rfe that is
its final link (see the second part of the definition of
hb0 above).
The hb relation is then declared to be acyclic and the concatenation
of short-obs and Alpha-strong-ppo is declared
irreflexive.
Quick Quiz 25:
The hb relation seems to have a lot of moving parts, and
the choices seem a bit on the arbitrary side.
What gives?
Answer
Quick Quiz 26:
The definition of B-cum-propbase allows for either a B-cumulative
fence or an A- and B-cumulative fence preceded by an optional rfe.
But an A- and B-cumulative fence with no preceding rfe would
fall into both cases.
What's the story?
Answer
Even in weakly ordered systems, write and barrier propagation covers more
than just communication among CPUs.
It also involves sending data to memory, which requires passing through the
coherence point.
The corresponding ordering relations are described below.
98 let strong-prop = fre? ; propbase* ; rfe? ; strong-fence ; hb* ; obs?
99 let prop = (propbase & (W*W)) | strong-prop
100 let cpord = nco | prop
101
102 acyclic cpord as propagation
The
strong-prop relation
extends the ordering of a strong fence both forwards (hb* ; obs?)
and backwards (fre? ; propbase* ; rfe?).
The prop relation expresses the fact that a propbase
link between two writes forces the first to reach the coherence point
before the second, and combines this with strong-prop.
Then cpord combines the coherence ordering of non-obscurable writes
(nco) with the aforementioned prop, to obtain the
full coherence-point ordering relation.
Finally, cpord is required to be acyclic.
Quick Quiz 27:
The nco relation mentioned above requires the first write to be
non-obscurable, but not the second.
What sense is there in saying that write A reaches the coherence
point before write B if B is obscured?
Answer
The happens-before and coherence-points machinery can be complex, but
fortunately, many common use cases take simple paths through this
machinery.
For example:
Strong Model Litmus Test #12
1 C C-ISA2+o-rel+acq-rel+acq-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_store_release(b, 1);
10 }
11
12 P1(int *b, int *c)
13 {
14 int r1;
15
16 r1 = smp_load_acquire(b);
17 smp_store_release(c, 1);
18 }
19
20 P2(int *c, int *a)
21 {
22 int r2;
23 int r3;
24
25 r2 = smp_load_acquire(c);
26 r3 = READ_ONCE(*a);
27 }
28
29 exists
30 (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
After all three threads have completed, is the outcome shown
on line 30 possible?
Referring to the
Bell
and
Cat
files, we see that
line 8⟶9 is a member of po,
line 9⟶16 is a member of rfe and
furthermore is an example of (rfe & (ReleaseAssign*Acquire)),
and line 16⟶17 is a member of acq-po.
This, in turn, is a sub-relation of exec-order-fence,
of ordering-fence, of strong-ppo, and finally of
rfe-ppo.
But any member of the sequence
po;(rfe & (ReleaseAssign*Acquire));rfe-ppo
(that is, line 8⟶9⟶16⟶17)
is also a member of po-relass-acq-hb and thus of AB-cum-hb,
and also of B-cum-hb, and in turn of B-cum-propbase
and finally of propbase.
Line 26⟶8 is a member of fre,
and as noted in the paragraph above, line 8⟶17
is a member of propbase,
and line 17⟶25 is a member of rfe.
Therefore, line 26⟶25 is a member of short-obs
and hence obs.
Line 25⟶26 is a member of acq-po,
hence (like line 16⟶17 above) a member of rfe-ppo.
Putting these two together, line 26⟶26 is a member of hb,
constituting a length-one cycle.
This cycle is prohibited by an acyclic requirement,
so the exists clause of
Strong Model Litmus Test #12
cannot be satisfied.
This is confirmed by running the command:
herd7 -conf strong.cfg C-ISA2+o-rel+acq-rel+acq-o.litmus
This command produces the following output:
Outcome for Strong Model Litmus Test #12
1 Test C-ISA2+o-rel+acq-rel+acq-o Allowed
2 States 7
3 1:r1=0; 2:r2=0; 2:r3=0;
4 1:r1=0; 2:r2=0; 2:r3=1;
5 1:r1=0; 2:r2=1; 2:r3=0;
6 1:r1=0; 2:r2=1; 2:r3=1;
7 1:r1=1; 2:r2=0; 2:r3=0;
8 1:r1=1; 2:r2=0; 2:r3=1;
9 1:r1=1; 2:r2=1; 2:r3=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)
14 Observation C-ISA2+o-rel+acq-rel+acq-o Never 0 7
15 Hash=9762857b08e4db85dbbf52a7b43068e9
The “
Never 0 7” should be reassuring, given that
this cycle is analogous to a series of lock releases and acquires, which
had jolly well better be fully ordered!
Let's now look at a roughly similar example:
Strong Model Litmus Test #13
1 C C-W+WRC+o-rel+acq-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_store_release(b, 1);
10 }
11
12 P1(int *b, int *c)
13 {
14 int r1;
15 int r2;
16
17 r1 = smp_load_acquire(b);
18 r2 = READ_ONCE(*c);
19 }
20
21 P2(int *c, int *a)
22 {
23 int r3;
24
25 WRITE_ONCE(*c, 1);
26 smp_mb();
27 r3 = READ_ONCE(*a);
28 }
29
30 exists
31 (1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
After all three threads have completed, is the result
shown on line 31 possible?
A key point is that line 18⟶25 and
line 27⟶8 are both members of fre.
Furthermore, the only example of rfe is
line 9⟶17.
This means that there cannot be a cycle in hb, which
requires at least one rfe for each fre in
the cycle.
The same observation applies to
(short-obs ; Alpha-strong-ppo).
In addition, although cpord does not require an rfe
for each fre, it does require a strong-fence
for each fre (via strong-prop),
and there is only one smp_mb(),
which can cover only one of the two fre relations.
Can we somehow combine hb (for example, via propbase)
and cpord (via strong-prop) to pair an rfe
with one of the fre relations and a strong-fence
with the other?
The answer is “no” because the only relations feeding into
propbase that include fre must start and end within
the same process, and cannot have more fre relations than
rfe relations on that path.
This should not be too surprising, given the non-causal nature of
fre relations,
and can be confirmed by running the following command line:
herd7 -conf strong.cfg C-W+WRC+o-rel+acq-o+o-mb-o.litmus
This command produces the following output:
Outcome for Strong Model Litmus Test #13
1 Test C-W+WRC+o-rel+acq-o+o-mb-o Allowed
2 States 8
3 1:r1=0; 1:r2=0; 2:r3=0;
4 1:r1=0; 1:r2=0; 2:r3=1;
5 1:r1=0; 1:r2=1; 2:r3=0;
6 1:r1=0; 1:r2=1; 2:r3=1;
7 1:r1=1; 1:r2=0; 2:r3=0;
8 1:r1=1; 1:r2=0; 2:r3=1;
9 1:r1=1; 1:r2=1; 2:r3=0;
10 1:r1=1; 1:r2=1; 2:r3=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (1:r1=1 /\ 1:r2=0 /\ 2:r3=0)
15 Observation C-W+WRC+o-rel+acq-o+o-mb-o Sometimes 1 7
16 Hash=8e3c5d7d5d36f2b1484ff237e8d22f91
However, full barriers (smp_mb()) can be used to force the
Linux kernel to respect full non-causal ordering, as expressed by
the “cpord” relation.
To see this, consider the following store-buffering litmus test
(a repeat of an earlier example):
Strong Model Litmus Test #1
1 C C-SB+o-mb-o+o-mb-o.litmus
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 int r1;
9
10 WRITE_ONCE(*x, 1);
11 smp_mb();
12 r1 = READ_ONCE(*y);
13 }
14
15 P1(int *x, int *y)
16 {
17 int r2;
18
19 WRITE_ONCE(*y, 1);
20 smp_mb();
21 r2 = READ_ONCE(*x);
22 }
23
24 exists
25 (0:r1=0 /\ 1:r2=0)
Can the cyclic outcome called out in line 25 really happen?
Line 10⟶12 and 19⟶21 are members of mb and
thus of strong-fence,
while line 12⟶19 and 21⟶10 are members of fre.
This means that line 21⟶10⟶12 and
line 12⟶19⟶21 are members of strong-prop,
and thus of prop and cpord.
This pair of cpord relations forms a cycle, and cpord
is constrained to be acyclic.
Therefore, the exists clause cannot be satisfied.
This is confirmed by the following command:
herd7 -conf strong.cfg C-SB+o-mb-o+o-mb-o.litmus
This command produces the following output:
Outcome for Strong Model Litmus Test #1
1 Test C-SB+o-mb-o+o-mb-o Allowed
2 States 3
3 0:r1=0; 1:r2=1;
4 0:r1=1; 1:r2=0;
5 0:r1=1; 1:r2=1;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (0:r1=0 /\ 1:r2=0)
10 Observation C-SB+o-mb-o+o-mb-o Never 0 3
11 Hash=a61f698662bb72c2ed1755812580d385
Quick Quiz 28:
Is there an easy way to tell which definitions have effect for a
given litmus test?
Answer
Quick Quiz 29:
Why does cpord prohibit a cycle containing two
fre links when hb does not?
They are both acyclic, after all!
Answer
The previous section showed how smp_mb() can restore
sequential consistency.
However, as Jade Alglave has noted, synchronize_rcu() is stronger
still, and therefore requires even more Cat-file code.
The final portion of the Cat file therefore covers
RCU.
Quick Quiz 30:
Say what???
How can anything possibly be stronger than sequential consistency???
Answer
RCU's fragment of the Cat file is as follows:
105 (* Propagation between strong fences *)
106 let rcu-order = hb* ; obs? ; cpord* ; fre? ; propbase* ; rfe?
107
108 (* Chains that can prevent the RCU grace-period guarantee *)
109 let gp-link = sync ; rcu-order
110 let cs-link = po? ; crit^-1 ; po? ; rcu-order
111 let rcu-path0 = gp-link |
112 (gp-link ; cs-link) |
113 (cs-link ; gp-link)
114 let rec rcu-path = rcu-path0 |
115 (rcu-path ; rcu-path) |
116 (gp-link ; rcu-path ; cs-link) |
117 (cs-link ; rcu-path ; gp-link)
118
119 irreflexive rcu-path as rcu
Quick Quiz 31:
Why the special-purpose Cat code for RCU?
After all, given that there are RCU implementations, why not just translate a
representative implementation into the corresponding set of memory
accesses and memory barriers?
Answer
The rcu-order relation interfaces the RCU model to the rest
of the memory model, and defines what can order RCU read-side
critical sections and RCU grace periods with respect to each other.
This relation can be roughly thought of as an arbitrarily long
sequence of events providing B-cumulativity and then A-cumulativity,
with intervening events providing strong ordering.
Quick Quiz 32:
But every element of rcu-order is optional, which means
it contains links directly connecting each event to itself.
How can that be right?
Answer
The gp-link relation is an RCU grace period followed
by some sequence of events that provide sufficient ordering,
and the cs-link relation is an RCU read-side critical section
followed by some sequence of events that provide sufficient ordering.
Note that the cs-link relation
allows any access po-before the end of an RCU read-side critical section
to be used as evidence that a grace period
is ordered before the critical section, and it allows any access
po-after the start of a critical section to be used as evidence that
a grace period is ordered after the critical section.
The importance of this is shown by the following litmus test:
Strong Model Litmus Test #14
1 C C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 synchronize_rcu();
12 WRITE_ONCE(*b, 1);
13 }
14
15 P1(int *b, int *c)
16 {
17 int r2;
18
19 rcu_read_lock();
20 r2 = READ_ONCE(*b);
21 WRITE_ONCE(*c, 1);
22 rcu_read_unlock();
23 }
24
25 P2(int *c, int *d)
26 {
27 int r3;
28
29 r3 = READ_ONCE(*c);
30 rcu_read_lock();
31 // do_something_else();
32 rcu_read_unlock();
33 WRITE_ONCE(*d, 1);
34 }
35
36 P3(int *d, int *a)
37 {
38 int r4;
39
40 r4 = READ_ONCE(*d);
41 synchronize_rcu();
42 WRITE_ONCE(*a, 1);
43 }
44
45 exists
46 (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1)
The normal usage of cs-link is illustrated by
P1().
The cs-link instance could start at line 20,
take a po step to the rcu_read_unlock() on
line 22, take a crit^-1 step back to the matching
rcu_read_lock() on line 19,
and finally take a po step to line 21.
This implements the rule: “If any part of an RCU read-side
critical section follows anything after a given RCU grace period,
then the entirety of that critical section follows anything preceding
that grace period”, where the preceding grace period is the
one in P0().
The more expansive usage is illustrated by P2().
The cs-link instance could start at line 29,
take a po step to the
rcu_read_unlock() on line 32, then take a
crit^-1 step back to the matching rcu_read_lock()
on line 30, and finally take a po step to
line 33.
This allows cs-link (in conjunction with
rcu-order) to link the access on
line 21 of P1() with the access on line 40
of P3().
Without this more expansive definition of cs-link,
the questionable outcome
“0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1”
would be permitted, which
it is not, as can be seen by running:
herd7 -conf strong.cfg C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o.litmus
This gives the reassuring output:
Outcome for Strong Model Litmus Test #14
1 Test C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o Allowed
2 States 15
3 0:r1=0; 1:r2=0; 2:r3=0; 3:r4=0;
4 0:r1=0; 1:r2=0; 2:r3=0; 3:r4=1;
5 0:r1=0; 1:r2=0; 2:r3=1; 3:r4=0;
6 0:r1=0; 1:r2=0; 2:r3=1; 3:r4=1;
7 0:r1=0; 1:r2=1; 2:r3=0; 3:r4=0;
8 0:r1=0; 1:r2=1; 2:r3=0; 3:r4=1;
9 0:r1=0; 1:r2=1; 2:r3=1; 3:r4=0;
10 0:r1=0; 1:r2=1; 2:r3=1; 3:r4=1;
11 0:r1=1; 1:r2=0; 2:r3=0; 3:r4=0;
12 0:r1=1; 1:r2=0; 2:r3=0; 3:r4=1;
13 0:r1=1; 1:r2=0; 2:r3=1; 3:r4=0;
14 0:r1=1; 1:r2=0; 2:r3=1; 3:r4=1;
15 0:r1=1; 1:r2=1; 2:r3=0; 3:r4=0;
16 0:r1=1; 1:r2=1; 2:r3=0; 3:r4=1;
17 0:r1=1; 1:r2=1; 2:r3=1; 3:r4=0;
18 No
19 Witnesses
20 Positive: 0 Negative: 15
21 Condition exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 3:r4=1)
22 Observation C-LB+o-sync-o+rl-o-o-rul+o-rl-rul-o+o-sync-o Never 0 15
23 Hash=c792a4c620a9d5244c0bee80da2a90fa
In short, if anything within or preceding a given RCU read-side critical
section follows anything after a given RCU grace period, then
that entire RCU read-side critical section must follow
anything preceding the grace period, and analogously for the opposite direction.
The rcu-path0
(RCU path base case) relation defines the three basic ways that
RCU provides ordering:
- A single synchronize_rcu() invocation, which
in theory may be substituted for smp_mb().
(In practice, good luck with instances of smp_mb()
in preempt-disabled regions of code, to say nothing of the
disastrous degradation of performance.)
- A synchronize_rcu() that is ordered before an
RCU read-side critical section.
This commonly used case guarantees that if some RCU read-side
critical section extends beyond the end of a grace period,
then all of that RCU read-side critical section happens after
anything preceding that grace period.
For example, if any part of a critical section might happen
after a kfree(), all of that critical section will
happen after the corresponding list_del_rcu().
This case groups the RCU grace period in P0()
and the RCU read-side critical section in P1()
in the litmus test above.
- An RCU read-side critical section that is ordered before a
synchronize_rcu().
This commonly used case guarantees that if some RCU read-side
critical section extends before the beginning of a grace period,
then all of that RCU read-side critical section happens before
anything following that grace period.
For example, if any part of a critical section might happen
before the list_del_rcu(), all of that critical section will
happen before the corresponding kfree().
This case groups the RCU read-side critical section in
P2() and RCU grace period in P3()
in the litmus test above.
The recursive definition of rcu-path builds on the
rcu-path0 base case.
Then line 115's (rcu-path;rcu-path) states that
if any two sequences of RCU grace periods and read-side critical sections
provide ordering, then the concatenation of those two sequences also
provides ordering, and applies to the P0()-P1()
and P2()-P3() groups in the example above,
thus guaranteeing that the questionable cyclic outcome is forbidden.
On the other hand, line 116's (gp-link;rcu-path;cs-link) states
that if some sequence of RCU grace periods and read-side critical sections
provides ordering, then ordering is still provided when that sequence
is preceded by synchronize_rcu() and followed by an RCU
read-side critical section.
Finally, line 117's (cs-link;rcu-path;gp-link) states
that if some sequence of RCU grace periods and read-side critical sections
provides ordering, then ordering is still provided when that sequence
is preceded by an RCU read-side critical section and followed by
synchronize_rcu().
The irreflexive statement prohibits rcu-path from looping
back on itself; in other words, this statement requires rcu-path
to provide ordering.
Another way of thinking of rcu-path is as a counter
and comparison, implemented recursively.
If there are at least as many calls to synchronize_rcu()
as there are RCU read-side critical sections in a given sequence,
that sequence will fit the definition of rcu-path and
ordering is guaranteed, otherwise not.
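A minimal restatement of this counting rule in C (purely illustrative; it is not part of the Cat file, and it elides how the gp-link and cs-link steps would be extracted from a candidate cycle):
#include <stdbool.h>

/* Does a candidate cycle with the given numbers of gp-link and
 * cs-link steps satisfy the rcu-path definition? */
static bool rcu_path_orders(int gp_links, int cs_links)
{
    /* Every rcu-path instance contains at least one grace period
     * and at least as many grace periods as critical sections. */
    return gp_links >= 1 && gp_links >= cs_links;
}
In Strong Model Litmus Test #14 above, for example, two grace periods face two critical sections, so ordering is guaranteed.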
Let's use this machinery to analyze the prototypical RCU-deferred-free
scenario:
Strong Model Litmus Test #15
1 C C-LB+rl-deref-o-rul+o-sync-o.litmus
2
3 {
4 a=x;
5 }
6
7 P0(int **a)
8 {
9 int *r1;
10 int r2;
11
12 rcu_read_lock();
13 r1 = rcu_dereference(*a);
14 r2 = READ_ONCE(*r1);
15 rcu_read_unlock();
16 }
17
18 P1(int **a, int *x, int *y)
19 {
20 WRITE_ONCE(*a, y);
21 synchronize_rcu();
22 WRITE_ONCE(*x, 1); /* Emulate kfree(). */
23 }
24
25 exists
26 (0:r1=x /\ 0:r2=1)
The variable a initially points to the variable x,
which is initially zero.
The P1() function sets variable a to reference
the variable y (also initially zero), then sets the value
of x to 1 to emulate the effects of kfree().
Any RCU reader accessing and dereferencing a should therefore
see the value zero, so that the outcome 0:r2=1 should
be forbidden.
In other words, we would expect the cycle
20⟶22⟶14⟶15⟶12⟶13⟶20
to be forbidden.
Let's check!
Line 12⟶15 is a
crit link, while
line 20⟶22 is a sync link.
If the cycle is allowed,
line 13⟶20 forms an fre link
and line 22⟶14 forms an rfe link.
This means that lines 13⟶20 and lines 22⟶14 are also
rcu-order links,
and as a result, the series 20⟶22⟶14 is an instance of
gp-link.
Given that lines 14⟶15 and 12⟶13 are
po links,
the series 14⟶15⟶12⟶13⟶20 is a
cs-link instance.
We therefore have an instance of gp-link
followed by an instance of cs-link (and vice versa if you like), so
that the sequence
20⟶22⟶14⟶15⟶12⟶13⟶20
falls under the rcu-path0
definition, which means that this same sequence is also an instance of
rcu-path.
Because it ends where it starts, on line 20, it is reflexive,
and thus forbidden.
The following command confirms this:
herd7 -conf strong.cfg C-LB+rl-deref-o-rul+o-sync-o.litmus
This command produces the following output:
Outcome for Strong Model Litmus Test #15
1 Test C-LB+rl-deref-o-rul+o-sync-o Allowed
2 States 2
3 0:r1=x; 0:r2=0;
4 0:r1=y; 0:r2=0;
5 No
6 Witnesses
7 Positive: 0 Negative: 2
8 Condition exists (0:r1=x /\ 0:r2=1)
9 Observation C-LB+rl-deref-o-rul+o-sync-o Never 0 2
10 Hash=4cac9d9e7ffa84096d8869e1ab199f09
Therefore, the RCU read-side critical section in P0()
cannot see the emulated kfree() following P1()'s
grace period, which should be some comfort to users of RCU.
Quick Quiz 33:
Why does
Strong Model Litmus Test #15's
exists clause specify 0:r1=x?
Isn't the second clause (0:r2=1) forbidden in and of itself?
Answer
But suppose we add another RCU read-side critical section to the mix,
in the following somewhat inane but hopefully instructive example:
Strong Model Litmus Test #16
1 C C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu.litmus
2
3 {
4 a=x;
5 }
6
7 P0(int **a)
8 {
9 int *r1;
10 int r2;
11
12 rcu_read_lock();
13 r1 = rcu_dereference(*a);
14 r2 = READ_ONCE(*r1);
15 rcu_read_unlock();
16 }
17
18 P1(int **a, int *y, int *z)
19 {
20 WRITE_ONCE(*a, y);
21 synchronize_rcu();
22 WRITE_ONCE(*z, 1);
23 }
24
25 P2(int *x, int *z)
26 {
27 int r3;
28
29 rcu_read_lock();
30 r3 = READ_ONCE(*z);
31 WRITE_ONCE(*x, 1); /* Emulate kfree(). */
32 rcu_read_unlock();
33 }
34
35 exists
36 (0:r1=x /\ 0:r2=1 /\ 2:r3=1)
Can the outcome shown in the “exists” clause happen now?
Lines 12⟶15 and 29⟶32 are
crit links, while
line 20⟶22 is a sync link.
Lines 22⟶30 and 31⟶14 are rfe links
and line 13⟶20 is an fre link,
which means that all three are also
rcu-order links.
As a result, the sequence 20⟶22⟶30 is a
gp-link instance.
Given that lines 14⟶15 and 12⟶13 are
po links,
the sequence
14⟶15⟶12⟶13⟶20 is a cs-link instance.
Similarly, because lines 30⟶32 and 29⟶31 are
po links,
the sequence 30⟶32⟶29⟶31⟶14
is also an instance of cs-link.
We therefore have one cs-link followed by a gp-link,
which in turn is followed by another cs-link.
The cs-link sequence
14⟶15⟶12⟶13⟶20
can combine with the
gp-link sequence 20⟶22⟶30
to form an rcu-path0 instance
14⟶15⟶12⟶13⟶20⟶22⟶30.
However, there is no way to add on the remaining cs-link
sequence 30⟶32⟶29⟶31⟶14,
so the cycle described in the “exists” clause can in fact happen.
This is confirmed by the command:
herd7 -conf strong.cfg C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu.litmus
This command produces the following output:
Outcome for Strong Model Litmus Test #16
1 Test C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu Allowed
2 States 6
3 0:r1=x; 0:r2=0; 2:r3=0;
4 0:r1=x; 0:r2=0; 2:r3=1;
5 0:r1=x; 0:r2=1; 2:r3=0;
6 0:r1=x; 0:r2=1; 2:r3=1;
7 0:r1=y; 0:r2=0; 2:r3=0;
8 0:r1=y; 0:r2=0; 2:r3=1;
9 Ok
10 Witnesses
11 Positive: 1 Negative: 5
12 Condition exists (0:r1=x /\ 0:r2=1 /\ 2:r3=1)
13 Observation C-LB+rl-deref-o-rul+o-sync-o+rl-o-o-rlu Sometimes 1 5
14 Hash=b591d622245952a2fc8eaad233203817
This should be no surprise, given that we have more RCU read-side
critical sections than grace periods in the cycle.
This situation underscores the need to avoid doing inane things with RCU.
However, one nice thing about a memory model incorporating
RCU is that such inanity can now be detected, at least when it is
confined to relatively small code fragments.
Acknowledgments
We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew,
Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their
patient explanations of their respective systems' memory models.
We are indebted to Peter Sewell, Susmit Sarkar, and their groups
for their seminal work formalizing many of these same memory models.
We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for
their help making this human-readable.
We are also grateful to Michelle Rankin and Jim Wasko for their support
of this effort.
This work represents the views of the authors and does not necessarily
represent the views of University College London, INRIA Paris,
Scuola Superiore Sant'Anna, Harvard University, or IBM Corporation.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or
service marks of others.
Quick Quiz 1:
But couldn't a CPU designer create a memory subsystem that did
allow writes to be taken back?
Answer:
Maybe someday they will.
However, such a CPU would still need to provide ordering,
and if it provided ordering similar to current CPU families,
it is quite possible that our current models would simply consider
the portion of the memory that allows prior writes to be taken back
to be part of the CPU rather than part of the memory subsystem.
Back to Quick Quiz 1.
Quick Quiz 2:
Why can't CPU designers use speculation to hide the slowness of
strong barriers?
Answer:
They can and they do.
However, they must take care when doing so, because the hardware absolutely
must respect the barriers' semantics.
This requires complex circuitry to detect cases where speculation must
be squashed, and can impose additional delays when rerunning the
speculated instructions.
So nothing comes for free, but speculation is indeed a powerful tool
that is heavily used by high-performance CPUs.
Back to Quick Quiz 2.
Quick Quiz 3:
Isn't this single coherence point a huge bottleneck on large systems?
Answer:
Not necessarily.
Although the model is presented in terms of a single coherence point,
the only real requirement is that there be a single coherence point
for a given variable at a given point in time.
This requirement is satisfied by the cache line holding that variable,
which can move around as needed.
In addition, this means that the system can enjoy the scalability of
a coherence point per cache line, which should suffice even for the
very largest of systems.
But what if all the CPUs are writing to a single variable?
This does not sound like particularly enlightened software design, but
a scalable hardware implementation simply relies on the store buffers.
Given that there is one store buffer per CPU (or at least per core),
this should not limit scalability,
give or take constraints imposed by memory ordering.
Back to Quick Quiz 3.
Quick Quiz 4:
But how could the system possibly prevent some other write on some other
CPU from taking place between the time the RMW's read and write execute?
Is there some Big System Lock implemented in hardware that will totally
destroy scalability???
Answer:
It can't, but there is no need for a Big System Lock.
After all, it is not time that matters, but rather coherence order.
To see this, consider the following time-ordered sequence of events:
- CPU 0 executes the read portion of its RMW operation to variable x.
- CPU 1 writes to x.
- CPU 0 executes the write portion of its RMW operation.
The system has the option of placing CPU 1's write to x after
CPU 0's write to x in coherence order, which avoids the need for tight
(and thus expensive) coordination between the CPUs, while still preserving
the atomic nature of the RMW operation.
Another option is for the memory subsystem to cause CPU 0's write
to fail in this circumstance.
On architectures that do this, RMW operations have to be implemented as a loop
in software, where the CPU goes back to the initial read
if the write portion is not successful.
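Such architectures typically provide load-linked/store-conditional instructions, so the software loop looks something like the following sketch (the function names are generic placeholders, not a real kernel API):
do {
    old = load_linked(x);              /* read portion of the RMW */
    new = old + 1;                     /* modify */
} while (!store_conditional(x, new));  /* write portion: fails, forcing
                                          a retry, if another write
                                          intervened */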
Back to Quick Quiz 4.
Quick Quiz 5:
The terms “A-cumulativity” and “B-cumulativity”
aren't particularly mnemonic, are they?
Answer:
They are perfectly mnemonic.
The ‘A’ stands for the French word avant, which
translates to “before”, and it is the A-cumulative barriers'
pre-sets that are modified by A-cumulativity.
The ‘B’ stands for the Swahili word baada, which
translates to “after”, and it is the B-cumulative barriers'
post-sets that are modified by B-cumulativity.
Or you could just remember that ‘A’ precedes ‘B’
in the Latin alphabet.
Back to Quick Quiz 5.
Quick Quiz 6:
By symmetry, shouldn't a B-cumulative barrier's post-set include all writes
that propagate to the barrier's CPU after the barrier is committed?
Answer:
Symmetry is in the eye of the beholder.
Beholders who care about hardware performance and scalability
(or, for that matter, clean semantics) will
prefer the definition in the list over the definition in this Quick Quiz's
question.
Working out why is left as an exercise for the reader.
[Hint: What should happen if a write on one CPU commits before
a barrier on another CPU, but the write doesn't propagate to the other CPU
until after the barrier has committed?]
Back to Quick Quiz 6.
Quick Quiz 7:
Given the steadily increasing number of transistors, why couldn't a CPU
analyze code to detect at least some classes of “x == x”
comparisons?
Answer:
Several existing CPU architectures guarantee that even trivial conditionals
will provide ordering, so they have no reason to perform such an analysis.
However, a new CPU family might well carry out such analyses.
If this happened, this CPU family would probably need to provide an option
to suppress trivial-conditional detection in order to correctly execute
pre-existing code.
But if this hardware optimization came into being and proved sufficiently
valuable, it might be necessary to adjust accordingly.
Back to Quick Quiz 7.
Quick Quiz 8:
Given all these constraints, how can weak-memory CPUs possibly expect to
attain any benefits of any sort compared to strong-memory CPUs?
Answer:
This argument between proponents of strong and weak memory ordering has
been going on for some decades, so we do not expect to be able to settle
it here.
Nevertheless, even with all these constraints there's still plenty of
wiggle room.
And don't forget, strong-memory CPUs are also subject to all these
restrictions.
Back to Quick Quiz 8.
Quick Quiz 9:
Following up on the exercise for the reader in the detour
relationship, what happens if the value from the write is forwarded
to that thread's later read?
Answer:
Then there will be no detour link from the write to the read,
since the rfe term in detour requires the read to
obtain its value from a write in a different thread.
As a result, the read is not forced to execute after the write.
Back to Quick Quiz 9.
Quick Quiz 10:
What about RMW (read-modify-write) instructions, such as xchg()
or atomic_inc()?
Don't they constitute both a read and a write?
Answer:
Yes, RMW instructions do constitute both a read and a write.
But herd represents such instructions
internally as a read event followed by a separate write event.
(And that is also how they are treated by the hardware.)
No single event is ever both a read and a write.
Back to Quick Quiz 10.
Quick Quiz 11:
But this short-obs link goes backward from line 20 to
line 18!
How can a backward link on a single CPU represent a
“happens-before” ordering relation???
Answer:
It's true that unlike the other intra-CPU relations that make up hb,
short-obs and obs can create links that go backward
in program order.
There's nothing sinister going on here;
it's merely a reflection of the fact that modern CPUs can and do
execute instructions out of order.
Back to Quick Quiz 11.
Quick Quiz 12:
Readers who go to the trouble of reading the actual definition of
cpord in the Linux-kernel strong memory model will see that
it includes the co, propbase & (W*W), and strong-prop
terms mentioned earlier, but it does not include any terms corresponding to
the “trivial” case of a propbase or hb+ link
starting from a read.
Why not?
Answer:
It turns out that the “trivial” terms aren't necessary.
Including them in the definition of cpord wouldn't hurt,
but it wouldn't change the model's predictions in any way.
For any cycle in cpord involving these terms, there is a
cycle already forbidden by the model without the terms.
As the simplest case, suppose each link in the cycle is an instance of
hb+ & (R*M) (i.e., a sequence of hb links starting
from a read).
Then the entire cycle is itself a cycle in hb, and so is forbidden
by the “happens-before” check.
For the more general case, suppose there is a cycle in which an instance
of hb+ & (R*M) follows one of the other terms making up cpord.
Since that other term must end in a read, it cannot be an instance of
co or propbase & (W*W); hence it must be an instance
of strong-prop.
But the definition of strong-prop ends in hb*, so
strong-prop followed by a sequence of hb links is still
an instance of strong-prop.
Thus the two terms can be combined into a single strong-prop term.
In this way, all the instances of hb+ & (R*M) in the cycle
can be absorbed into the strong-prop terms,
leaving a forbidden cycle containing none of the “trivial”
terms at all.
Back to Quick Quiz 12.
Quick Quiz 13:
Given how important split caches are for attaining full performance on
superscalar CPUs, why don't any non-Alpha architectures have split caches?
Answer:
Other architectures do have split caches.
However, these other architectures also have additional circuitry
that preserves read-to-read dependencies among accesses to different
banks of their split caches.
Back to Quick Quiz 13.
Quick Quiz 14:
Why weren't adjustments needed for PowerPC, given that it has a weak
memory model?
Answer:
The reason that no adjustments were needed for PowerPC was that we
started with PPCMEM's Power memory model.
Alternatively, one could argue that all necessary adjustments for
PowerPC were made at the very beginning of this effort.
For those keeping score, the fact that PowerPC's release-acquire chains
do not provide full ordering does cause significant heartburn in some
circles.
Back to Quick Quiz 14.
Quick Quiz 15:
Why weren't adjustments needed for Itanium, given that it allows
reads to the same variable to be reordered?
Answer:
Because READ_ONCE() and WRITE_ONCE() use volatile
accesses, which compile to ld,acq and st,rel
instructions, respectively.
These instructions provide the single-variable-SC guarantee needed
by the Linux kernel.
Back to Quick Quiz 15.
Quick Quiz 16:
But what if some new CPU had an even weaker memory model than
Alpha, ARM, and PowerPC?
Mightn't that invalidate a lot of Linux-kernel code?
Answer:
In theory, it might.
In practice, we expect that the Linux kernel community would be
highly motivated to include memory-barrier instructions in the
new CPU's arch-specific code
so as to minimize (or perhaps even eliminate) core Linux-kernel modifications.
We also hope that this prospect will encourage future CPU designers and
architects to avoid the need for excessive numbers of memory-barrier
instructions.
Back to Quick Quiz 16.
Quick Quiz 17:
Given that this is about memory barriers, why
“instructions F[Barriers]” instead of perhaps
“instructions B[Barriers]”?
Answer:
“Memory barriers” are also sometimes called
“memory fences”.
This can be confusing, but both terms are used so we might
as well get used to it.
Besides, the “B” instruction class
was already reserved for Branches.
Back to Quick Quiz 17.
Quick Quiz 18:
Why wouldn't “let sync = fencerel(Sync)”
work just as well as the modified definition?
Answer:
The modified definition is necessary because the model needs to recognize that
code like:
WRITE_ONCE(*x, 1);
synchronize_rcu();
synchronize_rcu();
r2 = READ_ONCE(*y);
will insert two grace periods between the memory accesses, not just one.
With the modified definition, there is a “sync”
pair linking the
WRITE_ONCE() to the first synchronize_rcu() as well as
a pair linking that event to the READ_ONCE(),
so it is possible to pass from the write to the read via two links.
With the “let sync = fencerel(Sync)” definition,
there would be no link from the WRITE_ONCE() to the
first synchronize_rcu().
Consequently there would be a path from the write to the read
involving one link, but no path involving two.
Back to Quick Quiz 18.
Quick Quiz 19:
This strong model is insanely complex!!!
How can anyone be expected to understand it???
Answer:
Given that this model is set up to be as strong as reasonably possible given
the rather wide variety of processor architectures that the Linux kernel runs
on, it is actually surprisingly simple.
Furthermore, this model has a tool that goes with it, which is more
than can be said of memory-barriers.txt.
Nevertheless, it is quite possible that this model should be carefully
weakened, if it turns out that doing so simplifies the model
without invalidating any use cases.
A simpler but weaker model can be found
here.
Back to Quick Quiz 19.
Quick Quiz 20:
Why aren't these additional classes of fences in the
Bell file
where the other classes live?
Answer:
Because exec-order-fence, propagation-fence, and
ordering-fence aren't needed by the
weak model,
which shares the
Bell file.
Back to Quick Quiz 20.
Quick Quiz 21:
Why would a write following an address dependency get any special
treatment?
After all, there does not appear to be any particular ordering relationship
in the general case.
Answer:
It turns out that writes following an
address-dependency pair are guaranteed not to be reordered before
the load heading up the dependency pair, as can be seen from this
load-buffering litmus test
and its output (note the
“Never” on the last line) and from this
message-passing litmus test
and its output.
Why would PowerPC and other architectures provide such ordering
from a load to an unrelated store?
Because until the load completes, the CPU can't tell whether or not
the store is unrelated.
If the load ends up causing its dependent access to target
the same address that is used by the “unrelated”
store, then the accesses are no longer unrelated and the CPU must
provide ordering between them.
Since the CPU can't know what ordering requirements there might be until
the load completes, all later writes must wait for the load.
There's a second, more subtle reason.
Until the load completes, the CPU can't tell whether the dependent access
will cause an addressing exception.
If an exception does occur then later stores should not be executed,
even if they are unrelated.
But what about loads?
Don't they have the same ordering requirements?
Indeed they do, but the CPU can safely speculate such loads, squashing the
speculation if it later learns that there was an unexpected address
collision or an exception.
For more information on this dependency/ordering corner case, please see
section 10.5 of
A Tutorial Introduction to the ARM and POWER Relaxed Memory Models.
Other sections cover many other interesting corner cases.
Back to Quick Quiz 21.
Quick Quiz 22:
Why is this relation called “Alpha-strong-ppo”
if it talks about obscurable writes?
Obscurable writes are a feature of the ARM architecture, not Alpha!
What gives?
Answer:
Left to itself, DEC Alpha would provide ordering in cases where ARM
obscures reads and writes, which means that the model must be very careful
not to provide ordering in these cases.
After all, the model must not forbid outcomes that the ARM architecture
could allow.
Therefore, the
(po-loc & ((M\OW\OR)*W)) term limits Alpha's po-loc
strong ordering to links from non-obscurable reads and writes.
In short, even though these relations provide strong ordering on Alpha,
it is necessary to weaken our model a little bit in
order to properly model ARM's obscurable accesses.
Back to Quick Quiz 22.
Quick Quiz 23:
But ARM-strong-ppo (and thus ppo) includes addr,
which links reads to later dependent reads, which DEC Alpha does not respect.
How is this supposed to work?
Answer:
To say that Alpha does not respect read-read dependencies is an
oversimplification.
Such dependencies yield weak ordering on Alpha,
in the sense that the second read is forced to execute after the first,
although the second read's horizon time may precede the first read's.
Thus addr is appropriate for the ARM-strong-ppo
relation, which contains PPO links that are “strong” on ARM
but may be “weak” on Alpha.
Back to Quick Quiz 23.
Quick Quiz 24:
Why don't these relations match those shown in the
Execution ordering: write propagation and cumulativity section?
Answer:
That section contains a simplified explanation of the original
PowerPC-based model.
It does not include complexities related to release sequences or the
mutually recursive definitions of hb and obs, and it
does not include changes made later to accommodate Alpha and ARM.
For these reasons, the Cat code shown in that section is different from
the current model.
Back to Quick Quiz 24.
Quick Quiz 25:
The hb relation seems to have a lot of moving parts, and
the choices seem a bit on the arbitrary side.
What gives?
Answer:
This is the strong model, which intentionally trades away simplicity
to get added strength.
The added strength implies added complexity because the strong model
is necessarily limited by the various weaknesses of the hardware that
the Linux kernel runs on.
A simpler but weaker model can be found
here,
and is described
here.
Back to Quick Quiz 25.
Quick Quiz 26:
The definition of B-cum-propbase allows for either a B-cumulative
fence or an A- and B-cumulative fence preceded by an optional rfe.
But an A- and B-cumulative fence with no preceding rfe would
fall into both cases.
What's the story?
Answer:
The model is redundant here.
There's nothing wrong with allowing the two cases to overlap,
and nobody has gone to the trouble of removing the overlap.
In this case it could be done quite easily by changing the rfe?
to rfe.
Perhaps a later revision of the Cat file will do this.
Back to Quick Quiz 26.
Quick Quiz 27:
The nco relation mentioned above requires the first write to be
non-obscurable, but not the second.
What sense is there in saying that write A reaches the coherence
point before write B if B is obscured?
Answer:
This is another example of redundancy in the model.
Allowing an nco link from A to B to exist
is harmless.
After all, if B is obscured then there must be a po-later
non-obscurable write C to the same address which obscures it,
meaning that there is also an nco link from A to C.
If you think about what links can exist in a cpord cycle,
you'll soon see that any link starting from B can be replaced
with an equivalent one starting from C.
(Hint: Given that B is obscured by C, there cannot be any
propagation barriers between the two writes.)
Consequently, if there is a cpord cycle involving an nco
link from A to B, then there is also a cycle involving
a link from A to C.
Thus allowing nco links to an obscurable write won't create a
cycle if one didn't already exist involving only non-obscurable writes,
so it doesn't change the predictions of the model.
(Exercise: Demonstrate that allowing links to start from an obscurable
write, i.e., replacing nco with co, would change
the model's predictions for some litmus tests.)
Back to Quick Quiz 27.
Quick Quiz 28:
Is there an easy way to tell which definitions have an effect on a
given litmus test?
Answer:
One very straightforward approach is to edit the .cat and .bell files
to remove “acyclic” or
“irreflexive” statements.
For example, for the above store-buffering litmus test, removing
the “acyclic cpord as propagation” statement allows
the cyclic outcome.
Alternatively, you can pass the
“-skipcheck propagation” command-line argument to
herd7.
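For example, assuming the store-buffering test lives in a file named
C-SB.litmus (a hypothetical name), and that the usual arguments selecting
the strong model's Bell and Cat files are also supplied, an invocation
along the following lines reports the outcomes reachable when the
propagation check is ignored:
herd7 -skipcheck propagation C-SB.litmus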
However, editing the .bell and .cat files to omit different elements
can be an extremely educational activity.
Back to Quick Quiz 28.
Quick Quiz 29:
Why does cpord prohibit a cycle containing two
fre links when hb does not?
They are both acyclic, after all!
Answer:
The difference is that hb requires that any
path including an fre link begin and
end at the same thread.
Therefore, no matter how you string hb
links together, they cannot prohibit a cycle that goes
through two fre links before returning
to the original thread, and thus cannot prohibit the store-buffering
litmus test.
In contrast, the strong-prop relation included in the
cpord relation has no
same-thread restriction, which means that cpord
can forbid a cycle containing more than one fre
link.
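For concreteness, here is a sketch of the classic store-buffering
pattern, reconstructed here with a hypothetical name; the earlier
section's test may differ in minor details:
C C-SB+o-mb-o+o-mb-o.litmus

{
}

P0(int *a, int *b)
{
	int r1;

	WRITE_ONCE(*a, 1);
	smp_mb();
	r1 = READ_ONCE(*b);
}

P1(int *a, int *b)
{
	int r2;

	WRITE_ONCE(*b, 1);
	smp_mb();
	r2 = READ_ONCE(*a);
}

exists
(0:r1=0 /\ 1:r2=0)
The cyclic outcome requires each CPU's read to miss the other CPU's
write, which yields exactly the two fre links at issue: one from
P0()'s read of b to P1()'s write of b, and another from P1()'s read
of a to P0()'s write of a.
Because strong-prop has no same-thread restriction, cpord can string
these links together with the fences and forbid the cycle.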
Back to Quick Quiz 29.
Quick Quiz 30:
Say what???
How can anything possibly be stronger than sequential consistency???
Answer:
Easily.
Consider mutual exclusion as a simple example.
As a more pertinent example,
recall the store-buffering example from the previous section,
in which smp_mb() prevented any executions that were not
simple interleavings; in other words, it prohibited the cyclic outcome
“0:r1=0 /\ 1:r2=0”.
If we replace the first smp_mb() with synchronize_rcu(),
replace the second smp_mb() with an RCU read-side
critical section, and reverse P1()'s memory references,
we get the following:
Strong Model Litmus Test #17
1 C C-LB+o-sync-o+rl-o-o-rul.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 synchronize_rcu();
12 WRITE_ONCE(*b, 1);
13 }
14
15 P1(int *b, int *a)
16 {
17 int r2;
18
19 rcu_read_lock();
20 r2 = READ_ONCE(*b);
21 WRITE_ONCE(*a, 1);
22 rcu_read_unlock();
23 }
24
25 exists
26 (0:r1=1 /\ 1:r2=1)
Outcome for Strong Model Litmus Test #17
1 Test C-LB+o-sync-o+rl-o-o-rul Allowed
2 States 3
3 0:r1=0; 1:r2=0;
4 0:r1=0; 1:r2=1;
5 0:r1=1; 1:r2=0;
6 No
7 Witnesses
8 Positive: 0 Negative: 3
9 Condition exists (0:r1=1 /\ 1:r2=1)
10 Observation C-LB+o-sync-o+rl-o-o-rul Never 0 3
11 Hash=5caf2c7dacd825160dc693180fdb58aa
It turns out that synchronize_rcu() is so strong that it
is able to forbid the cyclic outcome “0:r1=1 /\ 1:r2=1”
even though P1() places no ordering constraints whatsoever
on its two memory references.
Now that is strong ordering!
There is of course no free lunch.
On systems having more than one CPU, the overhead of
synchronize_rcu() is orders of magnitude greater than that of
smp_mb().
You get what you pay for!
Back to Quick Quiz 30.
Quick Quiz 31:
Why the special-purpose Cat code for RCU?
After all, given that there are RCU implementations, why not just translate a
representative implementation into the corresponding set of memory
accesses and memory barriers?
Answer:
Because the goal of the Linux-kernel memory model's RCU is not to
emulate some specific
RCU implementation, but rather to closely approximate what might be called
platonic
RCU, thereby providing precise semantics without unnecessarily constraining
implementations.
All known concrete RCU implementations provide stronger semantics than
the Linux-kernel memory model's RCU (let alone
platonic RCU), but different implementations are stronger in different ways.
For example, SRCU uses read-side memory barriers, while
Tree RCU has elaborate and extremely strong update-side ordering.
The following litmus test is a case in point:
Strong Model Litmus Test #18
1 C C-rcu-relacq1.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *x, int *y)
12 {
13 	int r0;
14
15 	rcu_read_lock();
16 	r0 = READ_ONCE(*x);
17 	rcu_read_unlock();
18 	smp_wmb();
19 	WRITE_ONCE(*y, 1);
20 }
21
22 P2(int *x, int *y)
23 {
24 	int r0;
25 	int r1;
26
27 	r0 = READ_ONCE(*y);
28 	smp_rmb();
29 	r1 = READ_ONCE(*x);
30 }
31
32 exists
33 (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
Because there are no RCU grace periods, the RCU read-side critical sections
have no effect on ordering, which means that the cycle in this litmus
test is allowed:
Outcome for Strong Model Litmus Test #18
1 Test C-rcu-relacq1 Allowed
2 States 8
3 1:r0=0; 2:r0=0; 2:r1=0;
4 1:r0=0; 2:r0=0; 2:r1=1;
5 1:r0=0; 2:r0=1; 2:r1=0;
6 1:r0=0; 2:r0=1; 2:r1=1;
7 1:r0=1; 2:r0=0; 2:r1=0;
8 1:r0=1; 2:r0=0; 2:r1=1;
9 1:r0=1; 2:r0=1; 2:r1=0;
10 1:r0=1; 2:r0=1; 2:r1=1;
11 Ok
12 Witnesses
13 Positive: 1 Negative: 7
14 Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
15 Observation C-rcu-relacq1 Sometimes 1 7
16 Hash=9878b5f38ed2ce07a4954babadec09e3
But suppose we translate the RCU primitives to normal accesses using
Alan Stern's release-acquire transformation script,
which adds a release store to
a csend01 variable at the end of the RCU read-side critical section:
Strong Model Litmus Test #19
1 C C-rcu-relacq1-relacq.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 1);
9 }
10
11 P1(int *x, int *y, int *csend01)
12 {
13 	int r0;
14
15 	r0 = READ_ONCE(*x);
16 	smp_store_release(csend01, 1);
17 	smp_wmb();
18 	WRITE_ONCE(*y, 1);
19 }
20
21 P2(int *x, int *y)
22 {
23 	int r0;
24 	int r1;
25
26 	r0 = READ_ONCE(*y);
27 	smp_rmb();
28 	r1 = READ_ONCE(*x);
29 }
30
31 exists
32 (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
This transformation is correct for most RCU-related
litmus tests, but incorrectly prohibits the cycle in
the translated version of
Strong Model Litmus Test #18:
Outcome for Strong Model Litmus Test #19
1 Test C-rcu-relacq1-relacq Allowed
2 States 7
3 1:r0=0; 2:r0=0; 2:r1=0;
4 1:r0=0; 2:r0=0; 2:r1=1;
5 1:r0=0; 2:r0=1; 2:r1=0;
6 1:r0=0; 2:r0=1; 2:r1=1;
7 1:r0=1; 2:r0=0; 2:r1=0;
8 1:r0=1; 2:r0=0; 2:r1=1;
9 1:r0=1; 2:r0=1; 2:r1=1;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
14 Observation C-rcu-relacq1-relacq Never 0 7
15 Hash=947aeac1dc0f87ac9796edfcadc8bb4a
As far as we know, it is not possible to produce a transformation from
RCU primitives to normal memory accesses that exactly implements
the Linux kernel memory model's RCU.
There is a transformation that works in all known cases, but:
(1) It requires “ghost” memory accesses that are
unaffected by normal memory barriers,
(2) It requires “ghost-buster” memory barriers that
order all accesses, including ghost accesses, and
(3) We don't have a proof that it exactly matches the Linux
kernel memory model's RCU.
In addition, model performance is quite important.
After all, developers need answers in seconds or minutes, not days or weeks.
The scalability of the Linux-kernel memory model's RCU code and of the
release-acquire transformation is shown in the following table,
with each cell linking to the corresponding litmus test:
Given these performance results, we hypothesize that the vast majority
of developers would prefer to use the Linux kernel memory model's RCU, which
provides reasonable response time even for unreasonable 14-process
litmus tests with seven grace periods and seven critical sections.
Even a more-unreasonable 18-process litmus test with nine grace
periods and nine critical sections completes in about an hour.
In contrast, the release-acquire transformation requires several days of CPU time,
even for an eminently reasonable litmus test with only six processes.
However, the above performance results were obtained with herd's default
behavior, in which it computes and lists all possible final states.
Many developers would instead be interested only in whether or not the
final state indicated by the “exists” clause is
reachable.
The “-speedcheck fast” command-line argument tells
herd to check only for that final state, which results in the
large speedups shown in the following table:
Here, the Linux-kernel memory model's RCU completes in reasonable time all
the way out to the 18-process monster, which takes a little more than two
minutes.
Although the release-acquire RCU runs much faster with
“-speedcheck fast” than without, it still takes
more than two hours for the six-process litmus test (three grace periods
and three critical sections), which is not at all suitable for interactive use.
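For reference, the invocation is simply the extra flag on the usual
command line, for example using Strong Model Litmus Test #18 from above
(and again omitting whatever arguments select the model's Bell and Cat
files):
herd7 -speedcheck fast C-rcu-relacq1.litmus
This reports only whether the “exists” clause can be
satisfied, rather than enumerating every reachable final state.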
One problem is that release-acquire RCU has twice as many variables
as does the Linux-kernel memory model's RCU; a second problem is that
release-acquire RCU has (N+1) times as many reads; and a final problem
is that release-acquire RCU's litmus tests have a complex
“exists”
clause, while the Linux-kernel memory model's RCU
litmus tests all have simple disjunctions.
Roughly speaking, the overhead due to the first problem increases as the
factorial of the number of variables, while that due to the latter two
increases as two to the power of the number of reads.
All of these considerations motivated us to include RCU directly in
the Cat model, rather than relying on scripted translations of RCU
primitives to memory accesses and memory barriers.
Back to Quick Quiz 31.
Quick Quiz 32:
But every element of rcu-order is optional, which means
it contains links directly connecting each event to itself.
How can that be right?
Answer:
It is not just right, but absolutely necessary.
This permits a pair of consecutive grace periods to do the right thing.
For example, consider the following litmus test, where, as usual,
a, b, and c are initially all zero:
Strong Model Litmus Test #20
1 C C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 int r1;
9
10 r1 = READ_ONCE(*a);
11 synchronize_rcu();
12 synchronize_rcu();
13 WRITE_ONCE(*b, 1);
14 }
15
16 P1(int *b, int *c)
17 {
18 int r2;
19
20 rcu_read_lock();
21 r2 = READ_ONCE(*b);
22 WRITE_ONCE(*c, 1);
23 rcu_read_unlock();
24 }
25
26 P2(int *c, int *a)
27 {
28 int r3;
29
30 rcu_read_lock();
31 r3 = READ_ONCE(*c);
32 WRITE_ONCE(*a, 1);
33 rcu_read_unlock();
34 }
35
36 exists
37 (0:r1=1 /\ 1:r2=1 /\ 2:r3=1)
If rcu-order did not permit a link from line 11 to itself,
the pair of synchronize_rcu() invocations on lines 11 and 12
would not be serialized, but would instead effectively merge into a
single synchronize_rcu().
Thus, the possibility of an rcu-order link connecting an event to itself
is absolutely required in order to forbid the undesirable outcome
“0:r1=1 /\ 1:r2=1 /\ 2:r3=1”:
Outcome for Strong Model Litmus Test #20
1 Test C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul Allowed
2 States 7
3 0:r1=0; 1:r2=0; 2:r3=0;
4 0:r1=0; 1:r2=0; 2:r3=1;
5 0:r1=0; 1:r2=1; 2:r3=0;
6 0:r1=0; 1:r2=1; 2:r3=1;
7 0:r1=1; 1:r2=0; 2:r3=0;
8 0:r1=1; 1:r2=0; 2:r3=1;
9 0:r1=1; 1:r2=1; 2:r3=0;
10 No
11 Witnesses
12 Positive: 0 Negative: 7
13 Condition exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1)
14 Observation C-LB+o-sync-sync-o+rl-o-o-rul+rl-o-o-rul Never 0 7
15 Hash=44ee0f607659a74ea40149d1ca3d80f5
Back to Quick Quiz 32.
Quick Quiz 33:
Why does
Strong Model Litmus Test #15's
exists clause specify 0:r1=x?
Isn't the second clause (0:r2=1) forbidden in and of itself?
Answer:
Try it and see what happens!
Back to Quick Quiz 33.