* [RFC] tools/memory-model: Rule out OOTA
@ 2025-01-06 21:40 Jonas Oberhauser
2025-01-07 10:06 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-06 21:40 UTC (permalink / raw)
To: paulmck
Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon,
Jonas Oberhauser
The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
example shared on this list a few years ago:
P0(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*x);
	smp_rmb();
	if (r1 == 1) {
		*a = *b;
	}
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*y);
	smp_rmb();
	if (r1 == 1) {
		*b = *a;
	}
	smp_wmb();
	WRITE_ONCE(*x, 1);
}
exists b=42
The root cause is an interplay between plain accesses and rw-fences, i.e.,
smp_rmb() and smp_wmb(): while smp_rmb() and smp_wmb() provide sufficient
ordering for plain accesses to rule out data races, in the current
formalization they do not generally order the plain accesses themselves,
allowing, e.g., the load and stores to *b to proceed in any order even if P1
reads from P0; and in particular, the marked accesses around those plain
accesses are also not ordered, which causes this OOTA.
In this patch, we choose the rather conservative approach of forcing only the
order of these marked accesses: specifically, when a marked read r is
separated from a plain read r' by an smp_rmb() (or r' has an address
dependency on r, or r' is r itself), a write w' depends on r', and w' is
either plain and separated from a subsequent write w by an smp_wmb(), or is
w itself, then r precedes w in ppo.
Furthermore, we do not force any order in cases where the plain read r' or
the plain write w' could be elided, due to a store with the same address
either being before the read (which would allow r' to be replaced by the
value of that store) or after the write (which would allow w' to be dropped
as a dead store).
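For illustration, a minimal sketch (hypothetical code, not from the patch) of
the first exclusion: a store to the same location precedes the plain read
with no compiler barrier in between, so the read may be satisfied by store
forwarding and the new rule does not order the marked read before the later
marked writes:

	r1 = READ_ONCE(*x);
	smp_rmb();
	*b = 2;		/* same-location store, no barrier below */
	r2 = *b;	/* the compiler may simply use r2 = 2 here */
	WRITE_ONCE(*a, r2);
	smp_wmb();
	WRITE_ONCE(*y, 1);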
Even though this patch is conservative in this sense, it ensures general
OOTA-freedom, more specifically, that any execution with no data race will not
have any cycles of
(ctrl | addr | data) ; rfe
This definition of OOTA is much weaker than more standard definitions, such as
requiring that there are no cycles of
ctrl | addr | data | rf
These definitions work well for syntactic dependencies (hardware models) but not
semantic dependencies (language models, like LKMM).
We first discuss why the more standard definition does not work well for
language models like LKMM. For example, consider
r1 = *a;
*b = 1;
if (*a == 1)
*b = 1;
*c = *b;
In the execution where r1 == 1, there is a control dependency from
the load of *a to the second store to *b, from which the load to *b reads,
and the store to *c has a data dependency on this load from *b. Nevertheless
there is no semantic dependency from the load of *a to the store to *c; the
compiler could easily replace the last line with *c = 1 and move this line to
the top as follows:
*c = 1;
r1 = *a;
*b = 1;
Since there is no order imposed by this sequence of syntactic dependencies
and reads, syntactic dependencies can not by themselves form an acyclic
relation.
In turn, there are some sequences of syntactic dependencies and reads that do
form semantic dependencies, such as
r1 = *a;
*b = 2;
if (*a == 1)
*b = 1;
*c = *b;
Here we would consider that the store to *c has a semantic data dependency on
the read from *a, given that depending on the result of that read, we store
either the value 1 or 2 to *c.
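For instance (a hedged sketch, reading the condition as a test of the loaded
value, as clarified later in the thread), a compiler may collapse the two
stores to *b, but the value stored to *c still depends on the load of *a:

	r1 = *a;
	*b = (r1 == 1) ? 1 : 2;
	*c = *b;	/* or directly: *c = (r1 == 1) ? 1 : 2; */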
Unfortunately, herd7 is currently limited to syntactic dependencies and can
not distinguish these two programs. As a result, while our patch is intended
to provide ordering for cases resembling the second program (but not the
first), with the dependencies considered by the current version of herd7, we
do not get such an ordering.
There are two more caveats of this patch. The first is that the absence of
subsequent writes to the location of a write w (until the next compiler
barrier) does not by itself imply that the write w can not be elided, e.g.,
if the location of w does not stay live until the next barrier (as pointed
out by Alan Stern a while back).
Unfortunately we can not currently express this in herd7's syntax.
In fact, to avoid OOTA, it would be sufficient to provide order only in
cases where w is read-from by another thread. But that is a rather unnatural
formulation.
The last caveat is that while we have done a formal proof that this
patch excludes OOTA (in all data-race-free executions), we did so with a
different formalization of the compiler barrier (formalized in this patch as
w_barrier). I suspect that it may be possible to almost completely switch
over from w_barrier to the normal definition of barrier, modulo the fact that
a marked write together with po-loc is also a compiler barrier.
But I currently do not have time to investigate this deeply, and I thought
maybe there are already some comments on the main parts of the patch. The
epsilons and deltas should be resolvable.
Have fun,
jonas
Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
---
tools/memory-model/linux-kernel.cat | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index d7e7bf13c831..180cab56729e 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -71,6 +71,10 @@ let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
Rcu-lock | Rcu-unlock | Srcu-lock | Srcu-unlock) |
(po ; [Release]) | ([Acquire] ; po)
+let w_barrier = po? ; [F | Marked] ; po?
+let rmb-plain = [R4rmb] ; po ; [Rmb] ; (po \ (po ; [W] ; (po-loc \ w_barrier))) ; [R4rmb & Plain]
+let plain-wmb = [W & Plain] ; (po \ ((po-loc \ w_barrier) ; po ; [W] ; po)) ; [Wmb] ; po ; [W]
+
(**********************************)
(* Fundamental coherence ordering *)
(**********************************)
@@ -90,7 +94,7 @@ empty rmw & (fre ; coe) as atomic
let dep = addr | data
let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
-let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
+let to-w = ((addr | rmb-plain)? ; rwdep ; plain-wmb?) | (overwrite & int) | addr ; [Plain] ; wmb
let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
--
2.34.1
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-06 21:40 [RFC] tools/memory-model: Rule out OOTA Jonas Oberhauser
@ 2025-01-07 10:06 ` Peter Zijlstra
2025-01-07 11:02 ` Jonas Oberhauser
2025-01-07 15:46 ` Jonas Oberhauser
` (2 subsequent siblings)
3 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2025-01-07 10:06 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, stern, parri.andrea, will, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> We first discuss why the more standard definition does not work well for
> language models like LKMM. For example, consider
>
> r1 = *a;
> *b = 1;
> if (*a == 1)
if (r1 == 1)
?
> *b = 1;
> *c = *b;
>
> In the execution where r1 == 1, there is a control dependency from
> the load of *a to the second store to *b, from which the load to *b reads,
> and the store to *c has a data dependency on this load from *b. Nevertheless
> there is no semantic dependency from the load of *a to the store to *c; the
> compiler could easily replace the last line with *c = 1 and move this line to
> the top as follows:
>
> *c = 1;
> r1 = *a;
> *b = 1;
>
> Since there is no order imposed by this sequence of syntactic dependencies
> and reads, syntactic dependencies can not by themselves form an acyclic
> relation.
>
> In turn, there are some sequences of syntactic dependencies and reads that do
> form semantic dependencies, such as
>
> r1 = *a;
> *b = 2;
> if (*a == 1)
r1 again?
> *b = 1;
> *c = *b;
>
> Here we would consider that the store to *c has a semantic data dependency on
> the read from *a, given that depending on the result of that read, we store
> either the value 1 or 2 to *c.
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-07 10:06 ` Peter Zijlstra
@ 2025-01-07 11:02 ` Jonas Oberhauser
0 siblings, 0 replies; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-07 11:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: paulmck, stern, parri.andrea, will, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/7/2025 at 11:06 AM, Peter Zijlstra wrote:
> On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
>
>> We first discuss why the more standard definition does not work well for
>> language models like LKMM. For example, consider
>>
>> r1 = *a;
>> *b = 1;
>> if (*a == 1)
>
> if (r1 == 1)
>
> ?
>
>> *b = 1;
>> *c = *b;
>>
>> In the execution where r1 == 1, there is a control dependency from
>> the load of *a to the second store to *b, from which the load to *b reads,
>> and the store to *c has a data dependency on this load from *b. Nevertheless
>> there is no semantic dependency from the load of *a to the store to *c; the
>> compiler could easily replace the last line with *c = 1 and move this line to
>> the top as follows:
>>
>> *c = 1;
>> r1 = *a;
>> *b = 1;
>>
>> Since there is no order imposed by this sequence of syntactic dependencies
>> and reads, syntactic dependencies can not by themselves form an acyclic
>> relation.
>>
>> In turn, there are some sequences of syntactic dependencies and reads that do
>> form semantic dependencies, such as
>>
>> r1 = *a;
>> *b = 2;
>> if (*a == 1)
>
> r1 again?
>
>> *b = 1;
>> *c = *b;
>>
>> Here we would consider that the store to *c has a semantic data dependency on
>> the read from *a, given that depending on the result of that read, we store
>> either the value 1 or 2 to *c.
Yes on both counts, thanks!
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-06 21:40 [RFC] tools/memory-model: Rule out OOTA Jonas Oberhauser
2025-01-07 10:06 ` Peter Zijlstra
@ 2025-01-07 15:46 ` Jonas Oberhauser
2025-01-07 16:09 ` Alan Stern
2025-07-23 0:43 ` Paul E. McKenney
3 siblings, 0 replies; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-07 15:46 UTC (permalink / raw)
To: paulmck
Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/6/2025 at 10:40 PM, Jonas Oberhauser wrote:
>
> Signed-off-by: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> ---
> tools/memory-model/linux-kernel.cat | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index d7e7bf13c831..180cab56729e 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -71,6 +71,10 @@ let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
> Rcu-lock | Rcu-unlock | Srcu-lock | Srcu-unlock) |
> (po ; [Release]) | ([Acquire] ; po)
>
> +let w_barrier = po? ; [F | Marked] ; po?
> +let rmb-plain = [R4rmb] ; po ; [Rmb] ; (po \ (po ; [W] ; (po-loc \ w_barrier))) ; [R4rmb & Plain]
> +let plain-wmb = [W & Plain] ; (po \ ((po-loc \ w_barrier) ; po ; [W] ; po)) ; [Wmb] ; po ; [W]
> +
> (**********************************)
> (* Fundamental coherence ordering *)
> (**********************************)
> @@ -90,7 +94,7 @@ empty rmw & (fre ; coe) as atomic
> let dep = addr | data
> let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> -let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> +let to-w = ((addr | rmb-plain)? ; rwdep ; plain-wmb?) | (overwrite & int) | addr ; [Plain] ; wmb
> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
>
I will also try to give an intuitive :) :( :) reasoning for why this
patch rules out OOTA.
If we look at a dep ; rfe cycle
dep ; rfe ; dep ; rfe ; ...
then because of the absence of data races, each rfe is more or less a
w-post-bounded ; rfe ; r-pre-bounded
edge.
If we rotate the cycle around, we turn
dep ; w-post-bounded ; rfe ; r-pre-bounded ; dep ; w-post-bounded ;
rfe ; r-pre-bounded ; ...
into
rfe ; (r-pre-bounded ; dep ; w-post-bounded) ; rfe ; (r-pre-bounded
; dep ; w-post-bounded) ; rfe ; (r-pre-bounded ; ...
and ideally, each of these (r-pre-bounded ; dep ; w-post-bounded) would
imply happens-before, since then the cycle would be
rfe ; hb+; rfe; hb+ ; ...
which is acyclic.
However, we do not get hb+ in general, in particular if the bounding is
due to rmb/wmb. For all other cases, it is relatively easy to see that
we get hb+, e.g., if the bound is due to an smp_mb().
Luckily, in our specific case, we can get hb+ even for cases where
rmb/wmb bound these accesses, because the accesses related by the dep
edges are known to be reading or read from externally.
Such an external interaction would be impossible if there were another
store to the same location between such an access and the next
w_barrier: because of the absence of data races and the lack of
w_barriers that would allow synchronization with the outside world, the
external event could not "occur between" the access and such a store.
As a result, all pre-bounds caused by a rmb must have the form
[R4rmb] ; po ; [Rmb] ; (po \ (po ; [W] ; (po-loc \ w_barrier))) ; [R4rmb
& Plain]
and similar for post-bounds caused by wmb, which means the corresponding
r-pre-bounded ; dep ; w-post-bounded
edges must be
rmb-plain ; dep ; plain-wmb
which is in ppo and thus also hb.
Hope that helps clarify the idea...
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-06 21:40 [RFC] tools/memory-model: Rule out OOTA Jonas Oberhauser
2025-01-07 10:06 ` Peter Zijlstra
2025-01-07 15:46 ` Jonas Oberhauser
@ 2025-01-07 16:09 ` Alan Stern
2025-01-07 18:47 ` Paul E. McKenney
2025-01-08 17:33 ` Jonas Oberhauser
2025-07-23 0:43 ` Paul E. McKenney
3 siblings, 2 replies; 59+ messages in thread
From: Alan Stern @ 2025-01-07 16:09 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> example shared on this list a few years ago:
>
> P0(int *a, int *b, int *x, int *y) {
> int r1;
>
> r1 = READ_ONCE(*x);
> smp_rmb();
> if (r1 == 1) {
> *a = *b;
> }
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *a, int *b, int *x, int *y) {
> int r1;
>
> r1 = READ_ONCE(*y);
> smp_rmb();
> if (r1 == 1) {
> *b = *a;
> }
> smp_wmb();
> WRITE_ONCE(*x, 1);
> }
>
> exists b=42
>
> The root cause is an interplay between plain accesses and rw-fences, i.e.,
> smp_rmb() and smp_wmb(): while smp_rmb() and smp_wmb() provide sufficient
> ordering for plain accesses to rule out data races, in the current
> formalization they do not generally order the plain accesses themselves,
> allowing, e.g., the load and stores to *b to proceed in any order even if P1
> reads from P0; and in particular, the marked accesses around those plain
> accesses are also not ordered, which causes this OOTA.
That's right. The memory model deliberately tries to avoid placing
restrictions on plain accesses, whenever it can.
In the example above, for instance, I think it's more interesting to ask
exists 0:r1=1 /\ 1:r1=1
than to concentrate on a and b.
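For reference, a sketch of that variant as a complete litmus test (the name
is made up; only the exists clause differs from the test quoted above):

C oota-r1

{}

P0(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*x);
	smp_rmb();
	if (r1 == 1) {
		*a = *b;
	}
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*y);
	smp_rmb();
	if (r1 == 1) {
		*b = *a;
	}
	smp_wmb();
	WRITE_ONCE(*x, 1);
}

exists (0:r1=1 /\ 1:r1=1)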
OOTA is a very difficult subject. It can be approached only by making
the memory model take all sorts of compiler optimizations into account,
and doing this for all possible optimizations is not feasible.
(For example, in a presentation to the C++ working group last year, Paul
and I didn't try to show how to extend the C++ memory model to exclude
OOTA [other than by fiat, as it does now]. Instead we argued that with
the existing memory model, no reasonable compiler would ever produce an
executable that could exhibit OOTA and so the memory model didn't need
to be changed.)
> In this patch, we choose the rather conservative approach of forcing only the
> order of these marked accesses: specifically, when a marked read r is
> separated from a plain read r' by an smp_rmb() (or r' has an address
> dependency on r, or r' is r itself), a write w' depends on r', and w' is
> either plain and separated from a subsequent write w by an smp_wmb(), or is
> w itself, then r precedes w in ppo.
Is this really valid? In the example above, if there were no other
references to a or b in the rest of the program, the compiler could
eliminate them entirely. (Whether the result could count as OOTA is
open to question, but that's not the point.) Is it not possible that a
compiler might find other ways to defeat your intentions?
In any case, my feeling is that memory models for higher languages
(i.e., anything above the assembler level) should not try very hard to
address the question of OOTA. And for LKMM, OOTA involving _plain_
accesses is doubly out of bounds.
Your proposed change seems to add a significant complication to the
memory model for a not-very-clear benefit.
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-07 16:09 ` Alan Stern
@ 2025-01-07 18:47 ` Paul E. McKenney
2025-01-08 17:39 ` Jonas Oberhauser
2025-01-08 17:33 ` Jonas Oberhauser
1 sibling, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-07 18:47 UTC (permalink / raw)
To: Alan Stern
Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
> On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> > The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> > example shared on this list a few years ago:
> >
> > P0(int *a, int *b, int *x, int *y) {
> > int r1;
> >
> > r1 = READ_ONCE(*x);
> > smp_rmb();
> > if (r1 == 1) {
> > *a = *b;
> > }
> > smp_wmb();
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *a, int *b, int *x, int *y) {
> > int r1;
> >
> > r1 = READ_ONCE(*y);
> > smp_rmb();
> > if (r1 == 1) {
> > *b = *a;
> > }
> > smp_wmb();
> > WRITE_ONCE(*x, 1);
> > }
> >
> > exists b=42
> >
> > The root cause is an interplay between plain accesses and rw-fences, i.e.,
> > smp_rmb() and smp_wmb(): while smp_rmb() and smp_wmb() provide sufficient
> > ordering for plain accesses to rule out data races, in the current
> > formalization they do not generally order the plain accesses themselves,
> > allowing, e.g., the load and stores to *b to proceed in any order even if P1
> > reads from P0; and in particular, the marked accesses around those plain
> > accesses are also not ordered, which causes this OOTA.
>
> That's right. The memory model deliberately tries to avoid placing
> restrictions on plain accesses, whenever it can.
>
> In the example above, for instance, I think it's more interesting to ask
>
> exists 0:r1=1 /\ 1:r1=1
>
> than to concentrate on a and b.
>
> OOTA is a very difficult subject. It can be approached only by making
> the memory model take all sorts of compiler optimizations into account,
> and doing this for all possible optimizations is not feasible.
Mark Batty and his students believe otherwise, but I am content to let
them make that argument. As in I agree with you rather than them. At
least unless and until they make their argument. ;-)
> (For example, in a presentation to the C++ working group last year, Paul
> and I didn't try to show how to extend the C++ memory model to exclude
> OOTA [other than by fiat, as it does now]. Instead we argued that with
> the existing memory model, no reasonable compiler would ever produce an
> executable that could exhibit OOTA and so the memory model didn't need
> to be changed.)
Furthermore, the LKMM design choice was that if a given litmus test was
flagged as having a data race, anything might happen, including OOTA.
In case there is interest, that presentation may be found here:
https://drive.google.com/file/d/1ZeJlUJfH90S2uf2wRvNXQvM4jNVSlZI8/view?usp=sharing
The most recent version of the working paper may be found here:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3064r2.pdf
Thanx, Paul
> > In this patch, we choose the rather conservative approach of forcing only the
> > order of these marked accesses: specifically, when a marked read r is
> > separated from a plain read r' by an smp_rmb() (or r' has an address
> > dependency on r, or r' is r itself), a write w' depends on r', and w' is
> > either plain and separated from a subsequent write w by an smp_wmb(), or is
> > w itself, then r precedes w in ppo.
>
> Is this really valid? In the example above, if there were no other
> references to a or b in the rest of the program, the compiler could
> eliminate them entirely. (Whether the result could count as OOTA is
> open to question, but that's not the point.) Is it not possible that a
> compiler might find other ways to defeat your intentions?
>
> In any case, my feeling is that memory models for higher languages
> (i.e., anything above the assembler level) should not try very hard to
> address the question of OOTA. And for LKMM, OOTA involving _plain_
> accesses is doubly out of bounds.
>
> Your proposed change seems to add a significant complication to the
> memory model for a not-very-clear benefit.
>
> Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-07 16:09 ` Alan Stern
2025-01-07 18:47 ` Paul E. McKenney
@ 2025-01-08 17:33 ` Jonas Oberhauser
2025-01-08 18:47 ` Alan Stern
1 sibling, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-08 17:33 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/7/2025 at 5:09 PM, Alan Stern wrote:
> On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
>> The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
>> example shared on this list a few years ago:
>>
>> P0(int *a, int *b, int *x, int *y) {
>> int r1;
>>
>> r1 = READ_ONCE(*x);
>> smp_rmb();
>> if (r1 == 1) {
>> *a = *b;
>> }
>> smp_wmb();
>> WRITE_ONCE(*y, 1);
>> }
>>
>> P1(int *a, int *b, int *x, int *y) {
>> int r1;
>>
>> r1 = READ_ONCE(*y);
>> smp_rmb();
>> if (r1 == 1) {
>> *b = *a;
>> }
>> smp_wmb();
>> WRITE_ONCE(*x, 1);
>> }
>>
>> exists b=42
>>
>> The root cause is an interplay between plain accesses and rw-fences, i.e.,
>> smp_rmb() and smp_wmb(): while smp_rmb() and smp_wmb() provide sufficient
>> ordering for plain accesses to rule out data races, in the current
>> formalization they do not generally order the plain accesses themselves,
>> allowing, e.g., the load and stores to *b to proceed in any order even if P1
>> reads from P0; and in particular, the marked accesses around those plain
>> accesses are also not ordered, which causes this OOTA.
>
> That's right. The memory model deliberately tries to avoid placing
> restrictions on plain accesses, whenever it can.
>
> In the example above, for instance, I think it's more interesting to ask
>
> exists 0:r1=1 /\ 1:r1=1
>
> than to concentrate on a and b.
Yes, and of course there's a relationship between the two anomalies.
My proposed patch solves OOTA indirectly, by forbidding the marked
accesses to *x and *y from being reordered in case r1 == 1, and thereby,
forbidding exactly the outcome where both have r1 == 1.
> OOTA is a very difficult subject.
No doubt.
> It can be approached only by making
> the memory model take all sorts of compiler optimizations into account,
> and doing this for all possible optimizations is not feasible.
I think there are two parts to this. One is the correct definition of
semantic (or compiler-preserved) dependencies. This is a really hard
problem that can indeed only be solved by looking at all ways the
compiler can optimize things. I'm not trying to solve that (and can't
address that issue in a cat file anyways).
The second part is to see which accesses could participate in an OOTA.
I think this is a lot simpler.
> (For example, in a presentation to the C++ working group last year, Paul
> and I didn't try to show how to extend the C++ memory model to exclude
> OOTA [other than by fiat, as it does now]. Instead we argued that with
> the existing memory model, no reasonable compiler would ever produce an
> executable that could exhibit OOTA and so the memory model didn't need
> to be changed.)
I think a model like C++ should exclude it exactly by fiat, by
formalizing OOTA-freedom as an axiom, like
acyclic ( (data|ctrl|addr) ; rfe )
This axiom is to some extent unsatisfying because it does not explain
*how* the OOTA is avoided, and in particular, it does not forbid any
specific kind of behavior that would lead to OOTA, just the complete
combination of them.
So for example, the following would be allowed:
P0:
r0 = y.load_explicit(memory_order_relaxed);
x.store_explicit(r0,memory_order_relaxed);
P1:
r1 = x; // reads 1
atomic_thread_fence();
y = 1;
which some people could interpret as the accesses of P0 being
executed out of order (although there is no such concept in C++),
suggesting that if P0's accesses are allowed to be executed out of
order, then so should P2's:
P2:
r2 = x.load_explicit(memory_order_relaxed);
y.store_explicit(r2,memory_order_relaxed);
Of course if both P0 and P2 are "executing out of order" at the same
time, one would have OOTA, but this "global behavior" would be forbidden
by the axiom.
But that is already how C++ works: a release access in C++ does not have
any ordering by itself either. It is just the combination of release +
acquire on other threads that somehow establishes synchronization.
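For illustration only, a minimal C11 sketch of that pairing (variable names
made up): the release store on its own provides no ordering another thread
can rely on; it is the acquire load reading from it that establishes the
synchronizes-with edge:

	#include <stdatomic.h>

	atomic_int flag;
	int data;

	void producer(void)
	{
		data = 42;	/* plain write, published below */
		atomic_store_explicit(&flag, 1, memory_order_release);
	}

	void consumer(void)
	{
		if (atomic_load_explicit(&flag, memory_order_acquire) == 1) {
			/* synchronizes-with the release store read from,
			   so this plain read is guaranteed to see 42 */
			int r = data;
			(void)r;
		}
	}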
LKMM works differently though, by providing "local" ordering rules,
through ppo. We can argue about a ppo locally even without knowing the
code of any other threads, let alone whether their accesses have acquire
or release semantics. So we can address OOTA in a "local" manner as well.
And it has the advantage of having compiler barriers around a lot of
things, which makes reasoning a lot more feasible.
>> In this patch, we choose the rather conservative approach of forcing only the
>> order of these marked accesses: specifically, when a marked read r is
>> separated from a plain read r' by an smp_rmb() (or r' has an address
>> dependency on r, or r' is r itself), a write w' depends on r', and w' is
>> either plain and separated from a subsequent write w by an smp_wmb(), or is
>> w itself, then r precedes w in ppo.
>
> Is this really valid? In the example above, if there were no other
> references to a or b in the rest of the program, the compiler could
> eliminate them entirely.
In the example above, this is not possible, because the address of a/b
have escaped the function and are not deallocated before synchronization
happens.
Therefore the compiler must assume that a/b are accessed inside the
compiler barrier.
> (Whether the result could count as OOTA is
> open to question, but that's not the point.) Is it not possible that a
> compiler might find other ways to defeat your intentions?
The main point (which I already mentioned in the previous e-mail) is if
the object is deallocated without synchronization (or never escapes the
function in the first place).
And indeed, any such case renders the added rule unsound. It is in a
sense unrelated to OOTA; cases where the load/store can be elided are
never OOTA.
Of course that is outside the current scope of what herd7 needs to deal
with or can express, because deallocation isn't a thing in herd7.
Nevertheless, trying to specify inside cat when an access is "live" --
relevant enough that the compiler will keep it around -- is hard and
tedious (it's the main source of complication in the patch).
A much better way would be to add a base set of "live loads and stores"
Live, which are the loads and stores that the compiler must consider to
be live. Just like addr, ctrl, etc., we don't have to specify these in
cat, and can rather rely on herd7 to determine them correctly.
If an access interacts with an access of another thread (by reading from
it or being read from it), it must be live.
Then we could formulate the rule as
+let to-w = (overwrite & int) | (addr | rmb ; [Live])? ; rwdep ; ([Live]
; wmb)?
(the latter case being a generalization of the current `addr ; [Plain] ;
wmb` and `rwdep` cases of to-w, assuming we restrict it to Live accesses - it
it is otherwise also unsound:
int a[2] = {0};
int r1 = READ_ONCE(*x);
a[r1] = 0; // compiler will just remove this
smp_wmb();
WRITE_ONCE(*y, 1);
)
The formulation in the patch is just based on a complicated and close
but imperfect approximation of Live.
> In any case, my feeling is that memory models for higher languages
> (i.e., anything above the assembler level) should not try very hard to
> address the question of OOTA. And for LKMM, OOTA involving _plain_
> accesses is doubly out of bounds.
>
> Your proposed change seems to add a significant complication to the
> memory model for a not-very-clear benefit.
Even if we ignore OOTA for the moment, it is not inconceivable to have
some cases that use a combination of plain accesses with
dependencies/rmb/wmb, such as in an RCU case.
That's probably the reason the current addr ; [Plain] ; wmb case exists.
It's not clear that it covers all cases though.
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-07 18:47 ` Paul E. McKenney
@ 2025-01-08 17:39 ` Jonas Oberhauser
2025-01-08 18:09 ` Paul E. McKenney
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-08 17:39 UTC (permalink / raw)
To: paulmck, Alan Stern
Cc: parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/7/2025 at 7:47 PM, Paul E. McKenney wrote:
> On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
>
>> (For example, in a presentation to the C++ working group last year, Paul
>> and I didn't try to show how to extend the C++ memory model to exclude
>> OOTA [other than by fiat, as it does now]. Instead we argued that with
>> the existing memory model, no reasonable compiler would ever produce an
>> executable that could exhibit OOTA and so the memory model didn't need
>> to be changed.)
>
> Furthermore, the LKMM design choice was that if a given litmus test was
> flagged as having a data race, anything might happen, including OOTA.
Note that there is no data race in this litmus test.
There is a race condition on plain accesses according to LKMM,
but LKMM also says that this is *not* a data race.
The patch removes the (actually non-existent) race condition by saying
that a critical section that is protected from having a data race with
address dependency or rmb/wmb (which LKMM already says works for
avoiding data races), is in fact also ordered and therefore has no race
condition either.
As a side effect :), this happens to fix OOTA in general in LKMM.
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-08 17:39 ` Jonas Oberhauser
@ 2025-01-08 18:09 ` Paul E. McKenney
2025-01-08 19:17 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-08 18:09 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jan 08, 2025 at 06:39:12PM +0100, Jonas Oberhauser wrote:
>
>
> On 1/7/2025 at 7:47 PM, Paul E. McKenney wrote:
> > On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
> >
> > > (For example, in a presentation to the C++ working group last year, Paul
> > > and I didn't try to show how to extend the C++ memory model to exclude
> > > OOTA [other than by fiat, as it does now]. Instead we argued that with
> > > the existing memory model, no reasonable compiler would ever produce an
> > > executable that could exhibit OOTA and so the memory model didn't need
> > > to be changed.)
> >
> > Furthermore, the LKMM design choice was that if a given litmus test was
> > flagged as having a data race, anything might happen, including OOTA.
>
> Note that there is no data race in this litmus test.
> There is a race condition on plain accesses according to LKMM,
> but LKMM also says that this is *not* a data race.
>
> The patch removes the (actually non-existent) race condition by saying that
> a critical section that is protected from having a data race with address
> dependency or rmb/wmb (which LKMM already says works for avoiding data
> races), is in fact also ordered and therefore has no race condition either.
>
> As a side effect :), this happens to fix OOTA in general in LKMM.
Fair point, no data race is flagged.
On the other hand, Documentation/memory-barriers.txt says the following:
------------------------------------------------------------------------
However, stores are not speculated. This means that ordering -is- provided
for load-store control dependencies, as in the following example:
q = READ_ONCE(a);
if (q) {
WRITE_ONCE(b, 1);
}
Control dependencies pair normally with other types of barriers.
That said, please note that neither READ_ONCE() nor WRITE_ONCE()
are optional! Without the READ_ONCE(), the compiler might combine the
load from 'a' with other loads from 'a'. Without the WRITE_ONCE(),
the compiler might combine the store to 'b' with other stores to 'b'.
Either can result in highly counterintuitive effects on ordering.
------------------------------------------------------------------------
If I change the two plain assignments to use WRITE_ONCE() as required
by memory-barriers.txt, OOTA is avoided:
------------------------------------------------------------------------
C jonas

{}

P0(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*x);
	smp_rmb();
	if (r1 == 1) {
		WRITE_ONCE(*a, *b);
	}
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*y);
	smp_rmb();
	if (r1 == 1) {
		WRITE_ONCE(*b, *a);
	}
	smp_wmb();
	WRITE_ONCE(*x, 1);
}

exists b=42
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg /tmp/jonas.litmus
Test jonas Allowed
States 1
[b]=0;
No
Witnesses
Positive: 0 Negative: 3
Condition exists ([b]=42)
Observation jonas Never 0 3
Time jonas 0.01
Hash=39c0c230bd221b2f54fc88be6771372a
------------------------------------------------------------------------
If LKMM is to allow plain assignments in this case, we need to also update
memory-barriers.txt. I am reluctant to do this because the community
needs to trust plain C-language assignments less rather than more,
especially given that compilers are continuing to become more aggressive.
Yes, in your example, the "if" and the two explicit barriers should
prevent compilers from being too clever, but these sorts of things are
more fragile than one might think given future code changes.
Thoughts?
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-08 17:33 ` Jonas Oberhauser
@ 2025-01-08 18:47 ` Alan Stern
2025-01-08 19:22 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-08 18:47 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jan 08, 2025 at 06:33:05PM +0100, Jonas Oberhauser wrote:
>
>
On 1/7/2025 at 5:09 PM, Alan Stern wrote:
> > Is this really valid? In the example above, if there were no other
> > references to a or b in the rest of the program, the compiler could
> > eliminate them entirely.
>
> In the example above, this is not possible, because the address of a/b have
> escaped the function and are not deallocated before synchronization happens.
> Therefore the compiler must assume that a/b are accessed inside the compiler
> barrier.
I'm not quite sure what you mean by that, but if the compiler has access
to the entire program and can do a global analysis then it can recognize
cases where accesses that may seem to be live aren't really.
However, I admit this objection doesn't really apply to Linux kernel
programming.
> > (Whether the result could count as OOTA is
> > open to question, but that's not the point.) Is it not possible that a
> > compiler might find other ways to defeat your intentions?
>
> The main point (which I already mentioned in the previous e-mail) is if the
> object is deallocated without synchronization (or never escapes the function
> in the first place).
>
> And indeed, any such case renders the added rule unsound. It is in a sense
> unrelated to OOTA; cases where the load/store can be elided are never OOTA.
That is a matter of definition. In our paper, Paul and I described
instances of OOTA in which all the accesses have been optimized away as
"trivial".
> Of course that is outside the current scope of what herd7 needs to deal with
> or can express, because deallocation isn't a thing in herd7.
>
> Nevertheless, trying to specify inside cat when an access is "live" --
> relevant enough that the compiler will keep it around -- is hard and tedious
> (it's the main source of complication in the patch).
>
> A much better way would be to add a base set of "live loads and stores"
> Live, which are the loads and stores that the compiler must consider to be
> live. Just like addr, ctrl, etc., we don't have to specify these in cat, and
> can rather rely on herd7 to determine them correctly.
I agree that determining which accesses are live (in the sense that the
compiler cannot optimize them out of existence) accounts for a large
part of the difficulty of analyzing plain accesses in general, and OOTA
in particular.
> If an access interacts with an access of another thread (by reading from it
> or being read from it), it must be live.
This is the sort of approximation I'm a little uncomfortable with. It
would be better to say that a store which is read from by a live load
must be live. I don't see why a load which reads from a live store has
to be live.
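A tiny sketch of that asymmetry (hypothetical code, synchronization elided
for brevity): the store may well be live because some live load in another
thread reads from it, yet a particular load reading from it can still be
dead:

	/* thread 1 */
	WRITE_ONCE(*a, 1);	/* live: read by a live load elsewhere */

	/* thread 2 */
	r1 = *a;		/* may read from that live store, but if r1 is
				   never used again, this load itself is dead
				   and the compiler can elide it */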
> Then we could formulate the rule as
>
> +let to-w = (overwrite & int) | (addr | rmb ; [Live])? ; rwdep ; ([Live] ;
> wmb)?
>
> (the latter case being a generalization of the current `addr ; [Plain] ;
> wmb` and `rwdep` cases of to-w, assuming we restrict it to Live accesses - it
> is otherwise also unsound:
>
> int a[2] = {0};
> int r1 = READ_ONCE(*x);
> a[r1] = 0; // compiler will just remove this
> smp_wmb();
> WRITE_ONCE(*y, 1);
>
Yes, and we recognize that this part of the rule is on shaky ground.
> )
>
> The formulation in the patch is just based on a complicated and close but
> imperfect approximation of Live.
Maybe you can reformulate the patch to make this more explicit.
In any case, it seems that any approximation we can make to Live will be
subject to various sorts of errors.
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-08 18:09 ` Paul E. McKenney
@ 2025-01-08 19:17 ` Jonas Oberhauser
2025-01-09 17:54 ` Paul E. McKenney
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-08 19:17 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/8/2025 at 7:09 PM, Paul E. McKenney wrote:
> On Wed, Jan 08, 2025 at 06:39:12PM +0100, Jonas Oberhauser wrote:
>>
>>
>> On 1/7/2025 at 7:47 PM, Paul E. McKenney wrote:
>>> On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
>>>
>>>> (For example, in a presentation to the C++ working group last year, Paul
>>>> and I didn't try to show how to extend the C++ memory model to exclude
>>>> OOTA [other than by fiat, as it does now]. Instead we argued that with
>>>> the existing memory model, no reasonable compiler would ever produce an
>>>> executable that could exhibit OOTA and so the memory model didn't need
>>>> to be changed.)
>>>
>>> Furthermore, the LKMM design choice was that if a given litmus test was
>>> flagged as having a data race, anything might happen, including OOTA.
>>
>> Note that there is no data race in this litmus test.
>> There is a race condition on plain accesses according to LKMM,
>> but LKMM also says that this is *not* a data race.
>>
>> The patch removes the (actually non-existent) race condition by saying that
>> a critical section that is protected from having a data race with address
>> dependency or rmb/wmb (which LKMM already says works for avoiding data
>> races), is in fact also ordered and therefore has no race condition either.
>>
>> As a side effect :), this happens to fix OOTA in general in LKMM.
>
> Fair point, no data race is flagged.
>
> On the other hand, Documentation/memory-barriers.txt says the following:
>
> ------------------------------------------------------------------------
>
> However, stores are not speculated. This means that ordering -is- provided
> for load-store control dependencies, as in the following example:
>
> q = READ_ONCE(a);
> if (q) {
> WRITE_ONCE(b, 1);
> }
>
> Control dependencies pair normally with other types of barriers.
> That said, please note that neither READ_ONCE() nor WRITE_ONCE()
> are optional! Without the READ_ONCE(), the compiler might combine the
> load from 'a' with other loads from 'a'. Without the WRITE_ONCE(),
> the compiler might combine the store to 'b' with other stores to 'b'.
> Either can result in highly counterintuitive effects on ordering.
>
> ------------------------------------------------------------------------
>
> If I change the two plain assignments to use WRITE_ONCE() as required
> by memory-barriers.txt, OOTA is avoided:
I think this direction of inquiry is a bit misleading. There need not be
any speculative store at all:
P0(int *a, int *b, int *x, int *y) {
	int r1;
	int r2 = 0;

	r1 = READ_ONCE(*x);
	smp_rmb();
	if (r1 == 1) {
		r2 = *b;
	}
	WRITE_ONCE(*a, r2);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
	int r1;
	int r2 = 0;

	r1 = READ_ONCE(*y);
	smp_rmb();
	if (r1 == 1) {
		r2 = *a;
	}
	WRITE_ONCE(*b, r2);
	smp_wmb();
	WRITE_ONCE(*x, 1);
}
The reason that the WRITE_ONCE helps in the speculative store case is
that both its ctrl dependency and the wmb provide ordering, which
together create ordering between *x and *y.
I should point out that a version of herd7 that respects semantic
dependencies (instead of syntactic only) might solve this case, by
figuring out that the WRITE_ONCE to *a resp. *b depends on the first
READ_ONCE.
Here's another funny example:
P0(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*x);
	smp_rmb();
	int r2 = READ_ONCE(*b);
	if (r1 == 1) {
		r2 = *b;
	}
	WRITE_ONCE(*a, r2);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *a, int *b, int *x, int *y) {
	int r1;

	r1 = READ_ONCE(*y);
	smp_rmb();
	int r2 = READ_ONCE(*a);
	if (r1 == 1) {
		r2 = *a;
	}
	WRITE_ONCE(*b, r2);
	smp_wmb();
	WRITE_ONCE(*x, 1);
}
exists (0:r1=1 /\ 1:r1=1)
Is there still a semantic dependency from the inner load to the store to
*a resp. *b, especially since the outer load from *b resp. *a is reading
from the same store as the inner one? The compiler is definitely allowed
to eliminate the inner load, which *also removes the OOTA*.
Please do look at the OOTA graph generated by herd7 for this one, it
looks quite amazing.
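For what it's worth, a hedged sketch of what P0 could become after the
elimination described above (the conditional plain reload of *b is folded
into the earlier READ_ONCE):

	r1 = READ_ONCE(*x);
	smp_rmb();
	int r2 = READ_ONCE(*b);
	/* "if (r1 == 1) r2 = *b;" dropped: r2 already holds a value
	   read from *b and nothing intervenes before the reload */
	WRITE_ONCE(*a, r2);
	smp_wmb();
	WRITE_ONCE(*y, 1);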
> If LKMM is to allow plain assignments in this case, we need to also update
> memory-barriers.txt.
But I am not suggesting to allow the plain assignment *by itself*.
In particular, my patch does not enforce any happens-before order
between the READ_ONCE(*x) and the plain assignment to *b.
It only provides order between READ_ONCE(*x) and WRITE_ONCE(*y,...),
through dependencies in the plain critical section.
Which must be 1) properly guarded (e.g., by rmb/wmb) and 2) live.
Because of this, I don't know if the text needs much updating, although
one could add text along the lines of "in the rare case where
compilers do guarantee that a load and dependent store (including plain)
will be emitted in some form, one can use rmb and wmb to ensure the
order of surrounding marked accesses".
> I am reluctant to do this because the community
> needs to trust plain C-language assignments less rather than more,
> especially given that compilers are continuing to become more aggressive.
Yes, I agree.
> Yes, in your example, the "if" and the two explicit barriers should
> prevent compilers from being too clever, but these sorts of things are
> more fragile than one might think given future code changes.
>
> Thoughts?
We certainly need to be very careful about how to formalize what the
compiler is allowed to do and what it is not. And even more careful
about how to communicate this.
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-08 18:47 ` Alan Stern
@ 2025-01-08 19:22 ` Jonas Oberhauser
2025-01-09 16:17 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-08 19:22 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/8/2025 at 7:47 PM, Alan Stern wrote:
> On Wed, Jan 08, 2025 at 06:33:05PM +0100, Jonas Oberhauser wrote:
>>
>>
>> On 1/7/2025 at 5:09 PM, Alan Stern wrote:
>>> Is this really valid? In the example above, if there were no other
>>> references to a or b in the rest of the program, the compiler could
>>> eliminate them entirely.
>>
>> In the example above, this is not possible, because the address of a/b have
>> escaped the function and are not deallocated before synchronization happens.
>> Therefore the compiler must assume that a/b are accessed inside the compiler
>> barrier.
>
> I'm not quite sure what you mean by that, but if the compiler has access
> to the entire program and can do a global analysis then it can recognize
> cases where accesses that may seem to be live aren't really.
Even for trivial enough cases where the compiler has access to all the
source, compiler barriers should be opaque to the compiler.
Since it is opaque,
*a = 1;
compiler_barrier();
might as well be
*a = 1;
*d = *a; // *d is in device memory
and so in my opinion the compiler needs to ensure that the value of *a
right before the compiler barrier is 1.
Of course, only if the address of *a could be possibly legally known to
the opaque code in the compiler barrier.
>
> However, I admit this objection doesn't really apply to Linux kernel
> programming.
>
>>> (Whether the result could count as OOTA is
>>> open to question, but that's not the point.) Is it not possible that a
>>> compiler might find other ways to defeat your intentions?
>>
>> The main point (which I already mentioned in the previous e-mail) is if the
>> object is deallocated without synchronization (or never escapes the function
>> in the first place).
>>
>> And indeed, any such case renders the added rule unsound. It is in a sense
>> unrelated to OOTA; cases where the load/store can be elided are never OOTA.
>
> That is a matter of definition. In our paper, Paul and I described
> instances of OOTA in which all the accesses have been optimized away as
> "trivial".
Yes, by OOTA I mean a rwdep;rfe cycle.
In the absence of data races, such a cycle can't be optimized away
because it is created with volatile/compiler-barrier-protected accesses.
>> If an access interacts with an access of another thread (by reading from it
>> or being read from it), it must be live.
>
> This is the sort of approximation I'm a little uncomfortable with. It
> would be better to say that a store which is read from by a live load
> must be live. I don't see why a load which reads from a live store has
> to be live.
You are right, and I was careless.
All we need is that a store that is read externally by a live load is
live, and that a load that reads from an external store and has its
value semantically depended-on by a live store is live.
>> The formulation in the patch is just based on a complicated and close but
>> imperfect approximation of Live.
>
> Maybe you can reformulate the patch to make this more explicit.
It would look something like this:
Live = R & rng(po \ po ; [W] ; (po-loc \ w_barrier)) | W & dom(po \
((po-loc \ w_barrier) ; [W] ; po))
let to-w = (overwrite & int) | (addr | rmb ; [Live])? ; rwdep ; ([Live]
; wmb)?
> In any case, it seems that any approximation we can make to Live will be
> subject to various sorts of errors.
Probably (this is certainly true for trying to approximate dependencies,
for example), but what I know for certain is that the approximations of
Live inside cat get more ugly the more precise they become. In the above
definition of Live I have not included that the address must escape, nor
that it must not be freed.
A non-local definition that suffices for OOTA would be so:
Live = R & rng(rfe) & dom(rwdep ; rfe) | W & dom(rfe)
It seems the ideal solution is to let Live be defined by the tools,
which should keep up with or exceed the analysis done by state-of-art
compilers.
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-08 19:22 ` Jonas Oberhauser
@ 2025-01-09 16:17 ` Alan Stern
2025-01-09 16:44 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-09 16:17 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jan 08, 2025 at 08:22:07PM +0100, Jonas Oberhauser wrote:
>
>
> On 1/8/2025 at 7:47 PM, Alan Stern wrote:
> > On Wed, Jan 08, 2025 at 06:33:05PM +0100, Jonas Oberhauser wrote:
> > >
> > >
> > > On 1/7/2025 at 5:09 PM, Alan Stern wrote:
> > > > Is this really valid? In the example above, if there were no other
> > > > references to a or b in the rest of the program, the compiler could
> > > > eliminate them entirely.
> > >
> > > In the example above, this is not possible, because the address of a/b have
> > > escaped the function and are not deallocated before synchronization happens.
> > > Therefore the compiler must assume that a/b are accessed inside the compiler
> > > barrier.
> >
> > I'm not quite sure what you mean by that, but if the compiler has access
> > to the entire program and can do a global analysis then it can recognize
> > cases where accesses that may seem to be live aren't really.
>
> Even for trivial enough cases where the compiler has access to all the
> source, compiler barriers should be opaque to the compiler.
>
> Since it is opaque,
>
> *a = 1;
> compiler_barrier();
>
> might as well be
>
> *a = 1;
> *d = *a; // *d is in device memory
>
> and so in my opinion the compiler needs to ensure that the value of *a right
> before the compiler barrier is 1.
>
> Of course, only if the address of *a could be possibly legally known to the
> opaque code in the compiler barrier.
What do you mean by "opaque code in the compiler barrier"? The
compiler_barrier() instruction doesn't generate any code at all; it
merely directs the compiler not to carry any knowledge about values
stored in memory from one side of the barrier to the other.
Note that it does _not_ necessarily prevent the compiler from carrying
knowledge that a memory location is unused from one side of the barrier
to the other.
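(For reference, the kernel's barrier() is essentially an empty asm statement
with a "memory" clobber, roughly:

	#define barrier() __asm__ __volatile__("" : : : "memory")

so it emits no instructions and only constrains the compiler as described
above.)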
> > However, I admit this objection doesn't really apply to Linux kernel
> > programming.
> >
> > > > (Whether the result could count as OOTA is
> > > > open to question, but that's not the point.) Is it not possible that a
> > > > compiler might find other ways to defeat your intentions?
> > >
> > > The main point (which I already mentioned in the previous e-mail) is if the
> > > object is deallocated without synchronization (or never escapes the function
> > > in the first place).
> > >
> > > And indeed, any such case renders the added rule unsound. It is in a sense
> > > unrelated to OOTA; cases where the load/store can be elided are never OOTA.
> >
> > That is a matter of definition. In our paper, Paul and I described
> > instances of OOTA in which all the accesses have been optimized away as
> > "trivial".
>
> Yes, by OOTA I mean a rwdep;rfe cycle.
>
> In the absence of data races, such a cycle can't be optimized away because
> it is created with volatile/compiler-barrier-protected accesses.
That wasn't true in the C++ context of the paper Paul and I worked on.
Of course, C++ is not our current context here.
What I was trying to get at above is that compiler-barrier protection
does not necessarily guarantee that non-volatile accesses can't be
optimized away. (However, it's probably safe for us to make such an
assumption here.)
> It would look something like this:
>
> Live = R & rng(po \ po ; [W] ; (po-loc \ w_barrier)) | W & dom(po \ ((po-loc
> \ w_barrier) ; [W] ; po))
>
> let to-w = (overwrite & int) | (addr | rmb ; [Live])? ; rwdep ; ([Live] ;
> wmb)?
>
>
> > In any case, it seems that any approximation we can make to Live will be
> > subject to various sorts of errors.
>
> Probably (this is certainly true for trying to approximate dependencies, for
> example), but what I know for certain is that the approximations of Live
> inside cat get more ugly the more precise they become. In the above
> definition of Live I have not included that the address must escape, nor
> that it must not be freed.
>
> A non-local definition that suffices for OOTA would be so:
>
> Live = R & rng(rfe) & dom(rwdep ; rfe) | W & dom(rfe)
I could live with this (although I would prefer to have more parentheses
-- IMO it's mistake-prone to rely on the relative precedence of | and &).
Especially if the to-w definition above were rewritten in a way that
would be a little easier to parse and understand.
> It seems the ideal solution is to let Live be defined by the tools, which
> should keep up with or exceed the analysis done by state-of-art compilers.
I don't think it works that way in practice. :-)
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 16:17 ` Alan Stern
@ 2025-01-09 16:44 ` Jonas Oberhauser
2025-01-09 19:27 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-09 16:44 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/9/2025 at 5:17 PM, Alan Stern wrote:
> On Wed, Jan 08, 2025 at 08:22:07PM +0100, Jonas Oberhauser wrote:
>>
>>
>> On 1/8/2025 at 7:47 PM, Alan Stern wrote:
>>> On Wed, Jan 08, 2025 at 06:33:05PM +0100, Jonas Oberhauser wrote:
>>>>
>>>>
>>>> On 1/7/2025 at 5:09 PM, Alan Stern wrote:
>>>>> Is this really valid? In the example above, if there were no other
>>>>> references to a or b in the rest of the program, the compiler could
>>>>> eliminate them entirely.
>>>>
>>>> In the example above, this is not possible, because the address of a/b have
>>>> escaped the function and are not deallocated before synchronization happens.
>>>> Therefore the compiler must assume that a/b are accessed inside the compiler
>>>> barrier.
>>>
>>> I'm not quite sure what you mean by that, but if the compiler has access
>>> to the entire program and can do a global analysis then it can recognize
>>> cases where accesses that may seem to be live aren't really.
>>
>> Even for trivial enough cases where the compiler has access to all the
>> source, compiler barriers should be opaque to the compiler.
>>
>> Since it is opaque,
>>
>> *a = 1;
>> compiler_barrier();
>>
>> might as well be
>>
>> *a = 1;
>> *d = *a; // *d is in device memory
>>
>> and so in my opinion the compiler needs to ensure that the value of *a right
>> before the compiler barrier is 1.
>>
>> Of course, only if the address of *a could be possibly legally known to the
>> opaque code in the compiler barrier.
>
> What do you mean by "opaque code in the compiler barrier"? The
> compiler_barrier() instruction doesn't generate any code at all; it
> merely directs the compiler not to carry any knowledge about values
> stored in memory from one side of the barrier to the other.
What I mean by "opaque" is that the compiler does not analyze the code
inside the compiler barrier, so it must treat it as a black box that
could manipulate memory arbitrarily within the limitation that it can
not guess the address of memory.
So for example, in
int a = 1;
barrier();
a = 2;
//...
the compiler does not know how the code inside barrier() accesses
memory, including volatile memory.
But it knows that it can not access `a`, because the address of `a` has
never escaped before the barrier().
So it can change this to:
barrier();
int a = 2;
// ...
But if we let the address of `a` escape, for example with some external
function foo(int*):
int a;
foo(&a);
a = 1;
barrier();
a = 2;
// ...
Then the compiler has to assume that the code of foo and barrier might
be something like this:
foo(p) { SPECIAL_VARIABLE = p; }
barrier() { TURN_THE_BREAKS_ON = *SPECIAL_VARIABLE; }
and it must make sure that the value of `a` before barrier() is 1.
That is at least my understanding.
In fact, even if `a` is unused after a=2, the compiler can only
eliminate `a` in the former case, but in the latter case, still needs to
ensure that the value of `a` before barrier() is 1 (but it can eliminate
a=2).
>
> Note that it does _not_ necessarily prevent the compiler from carrying
> knowledge that a memory location is unused from one side of the barrier
> to the other.
Yes, or even merging/moving assignments to the memory location across a
barrier(), as in the example above.
>>> However, I admit this objection doesn't really apply to Linux kernel
>>> programming.
>>>
>>>>> (Whether the result could count as OOTA is
>>>>> open to question, but that's not the point.) Is it not possible that a
>>>>> compiler might find other ways to defeat your intentions?
>>>>
>>>> The main point (which I already mentioned in the previous e-mail) is if the
>>>> object is deallocated without synchronization (or never escapes the function
>>>> in the first place).
>>>>
>>>> And indeed, any such case renders the added rule unsound. It is in a sense
>>>> unrelated to OOTA; cases where the load/store can be elided are never OOTA.
>>>
>>> That is a matter of definition. In our paper, Paul and I described
>>> instances of OOTA in which all the accesses have been optimized away as
>>> "trivial".
>>
>> Yes, by OOTA I mean a rwdep;rfe cycle.
>>
>> In the absence of data races, such a cycle can't be optimized away because
>> it is created with volatile/compiler-barrier-protected accesses.
>
> That wasn't true in the C++ context of the paper Paul and I worked on.
> Of course, C++ is not our current context here.
Yes, you are completely correct. In C++ (or pure C), where data races
are prevented by compiler/language-builtins rather than with
compiler-barriers/volatiles, all my assumptions break.
That is because the compiler absolutely knows that an atomic_store(&x)
does not access any memory location other than x, so it can do a lot
more "harmful" optimizations.
That's why I said such a language model should just exclude global OOTA
by fiat.
I have to read your paper again (I think I read it a few months ago) to
understand if the trivial OOTA would make even that vague axiom unsound
(my intuition says that if the OOTA is never observed by influencing the
side-effect, then forbidding OOTA makes no difference to the set of
"observable behaviors" of a C++ program even there is a trivial OOTA,
and if the OOTA has visible side-effects, then it is acceptable for the
compiler not to do the "optimization" that turns it into a trivial OOTA
and choose some other optimization instead, so we can as well forbid the
compiler from doing it).
> What I was trying to get at above is that compiler-barrier protection
> does not necessarily guarantee that non-volatile accesses can't be
> optimized away. (However, it's probably safe for us to make such an
> assumption here.)
Yes, I agree.
>> It seems the ideal solution is to let Live be defined by the tools, which
>> should keep up with or exceed the analysis done by state-of-art compilers.
>
> I don't think it works that way in practice. :-)
Yeah... maybe not :(
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-08 19:17 ` Jonas Oberhauser
@ 2025-01-09 17:54 ` Paul E. McKenney
2025-01-09 18:35 ` Jonas Oberhauser
2025-01-09 20:37 ` Peter Zijlstra
0 siblings, 2 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-09 17:54 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/8/2025 um 7:09 PM schrieb Paul E. McKenney:
> > On Wed, Jan 08, 2025 at 06:39:12PM +0100, Jonas Oberhauser wrote:
> > >
> > >
> > > Am 1/7/2025 um 7:47 PM schrieb Paul E. McKenney:
> > > > On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
> > > >
> > > > > (For example, in a presentation to the C++ working group last year, Paul
> > > > > and I didn't try to show how to extend the C++ memory model to exclude
> > > > > OOTA [other than by fiat, as it does now]. Instead we argued that with
> > > > > the existing memory model, no reasonable compiler would ever produce an
> > > > > executable that could exhibit OOTA and so the memory model didn't need
> > > > > to be changed.)
> > > >
> > > > Furthermore, the LKMM design choice was that if a given litmus test was
> > > > flagged as having a data race, anything might happen, including OOTA.
> > >
> > > Note that there is no data race in this litmus test.
> > > There is a race condition on plain accesses according to LKMM,
> > > but LKMM also says that this is *not* a data race.
> > >
> > > The patch removes the (actually non-existant) race condition by saying that
> > > a critical section that is protected from having a data race with address
> > > dependency or rmb/wmb (which LKMM already says works for avoiding data
> > > races), is in fact also ordered and therefore has no race condition either.
> > >
> > > As a side effect :), this happens to fix OOTA in general in LKMM.
> >
> > Fair point, no data race is flagged.
> >
> > On the other hand, Documentation/memory-barriers.txt says the following:
> >
> > ------------------------------------------------------------------------
> >
> > However, stores are not speculated. This means that ordering -is- provided
> > for load-store control dependencies, as in the following example:
> >
> > q = READ_ONCE(a);
> > if (q) {
> > WRITE_ONCE(b, 1);
> > }
> >
> > Control dependencies pair normally with other types of barriers.
> > That said, please note that neither READ_ONCE() nor WRITE_ONCE()
> > are optional! Without the READ_ONCE(), the compiler might combine the
> > load from 'a' with other loads from 'a'. Without the WRITE_ONCE(),
> > the compiler might combine the store to 'b' with other stores to 'b'.
> > Either can result in highly counterintuitive effects on ordering.
> >
> > ------------------------------------------------------------------------
> >
> > If I change the two plain assignments to use WRITE_ONCE() as required
> > by memory-barriers.txt, OOTA is avoided:
>
>
> I think this direction of inquiry is a bit misleading. There need not be any
> speculative store at all:
>
>
>
> P0(int *a, int *b, int *x, int *y) {
> int r1;
> int r2 = 0;
> r1 = READ_ONCE(*x);
> smp_rmb();
> if (r1 == 1) {
> r2 = *b;
> }
> WRITE_ONCE(*a, r2);
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *a, int *b, int *x, int *y) {
> int r1;
>
> int r2 = 0;
>
> r1 = READ_ONCE(*y);
> smp_rmb();
> if (r1 == 1) {
> r2 = *a;
> }
> WRITE_ONCE(*b, r2);
> smp_wmb();
> WRITE_ONCE(*x, 1);
> }
>
>
> The reason that the WRITE_ONCE helps in the speculative store case is that
> both its ctrl dependency and the wmb provide ordering, which together
> creates ordering between *x and *y.
Ah, and that is because LKMM does not enforce control dependencies past
the end of the "if" statement. Cute!
But memory-barriers.txt requires that the WRITE_ONCE() be within the
"if" statement for control dependencies to exist, so LKMM is in agreement
with memory-barriers.txt in this case. So again, if we change this,
we need to also change memory-barriers.txt.
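To illustrate the distinction with a sketch (made-up variable names, not one
of the tests above):

q = READ_ONCE(*x);
if (q) {
        WRITE_ONCE(*a, 1);      /* inside the body: ctrl dependency from the READ_ONCE() */
}
WRITE_ONCE(*y, 1);              /* past the closing brace: no ctrl dependency in LKMM */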
> I should point out that a version of herd7 that respects semantic
> dependencies (instead of syntactic only) might solve this case, by figuring
> out that the WRITE_ONCE to *a resp. *b depends on the first READ_ONCE.
>
> Here's another funny example:
>
>
> P0(int *a, int *b, int *x, int *y) {
> int r1;
>
> r1 = READ_ONCE(*x);
> smp_rmb();
> int r2 = READ_ONCE(*b);
> if (r1 == 1) {
> r2 = *b;
> }
> WRITE_ONCE(*a, r2);
> smp_wmb();
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *a, int *b, int *x, int *y) {
> int r1;
>
> r1 = READ_ONCE(*y);
> smp_rmb();
> int r2 = READ_ONCE(*a);
> if (r1 == 1) {
> r2 = *a;
> }
> WRITE_ONCE(*b, r2);
> smp_wmb();
> WRITE_ONCE(*x, 1);
> }
>
> exists (0:r1=1 /\ 1:r1=1)
>
> Is there still a semantic dependency from the inner load to the store to *a
> resp. *b, especially since the outer load from *b resp. *a is reading from
> the same store as the inner one? The compiler is definitely allowed to
> eliminate the inner load, which *also removes the OOTA*.
Also cute. And also the WRITE_ONCE() outside of the "if" statement.
> Please do look at the OOTA graph generated by herd7 for this one, it looks
> quite amazing.
Given the way this morning is going, I must take your word for it...
> > If LKMM is to allow plain assignments in this case, we need to also update
> > memory-barriers.txt.
>
> But I am not suggesting to allow the plain assignment *by itself*.
> In particular, my patch does not enforce any happens-before order between
> the READ_ONCE(*x) and the plain assignment to *b.
> It only provides order between READ_ONCE(*x) and WRITE_ONCE(*y,...), through
> dependencies in the plain critical section.
>
> Which must be 1) properly guarded (e.g., by rmb/wmb) and 2) live.
>
> Because of this, I don't know if the text needs much updating, although one
> could add a text in the direction that "in the rare case where compilers do
> guarantee that a load and dependent store (including plain) will be emitted
> in some form, one can use rmb and wmb to ensure the order of surrounding
> marked accesses".
If we want to respect something containing a control dependency to a
WRITE_ONCE() not in the body of the "if" statement, we need to make some
change to memory-barriers.txt.
> > I am reluctant to do this because the community needs to trust plain
> > C-language assignments less rather than more,
> > especially given that compilers are continuing to become more aggressive.
>
> Yes, I agree.
Whew!!! ;-)
> > Yes, in your example, the "if" and the two explicit barriers should
> > prevent compilers from being too clever, but these sorts of things are
> > more fragile than one might think given future code changes.
> >
> > Thoughts?
>
> We certainly need to be very careful about how to formalize what the
> compiler is allowed of doing and what it is not. And even more careful about
> how to communicate this.
No argument here!
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 17:54 ` Paul E. McKenney
@ 2025-01-09 18:35 ` Jonas Oberhauser
2025-01-10 14:54 ` Paul E. McKenney
2025-01-09 20:37 ` Peter Zijlstra
1 sibling, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-09 18:35 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
> On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
>>
>>
>> Am 1/8/2025 um 7:09 PM schrieb Paul E. McKenney:
>>> On Wed, Jan 08, 2025 at 06:39:12PM +0100, Jonas Oberhauser wrote:
>>>>
>>>>
>>>> Am 1/7/2025 um 7:47 PM schrieb Paul E. McKenney:
>>>>> On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
>>>>>
>>>> The patch removes the (actually non-existant) race condition by saying that
>>>> a critical section that is protected from having a data race with address
>>>> dependency or rmb/wmb (which LKMM already says works for avoiding data
>>>> races), is in fact also ordered and therefore has no race condition either.
>>>>
>>>> As a side effect :), this happens to fix OOTA in general in LKMM.
>>>
>>> Fair point, no data race is flagged.
>>>
>>> On the other hand, Documentation/memory-barriers.txt says the following:
>>>
>>> ------------------------------------------------------------------------
>>>
>>> However, stores are not speculated. This means that ordering -is- provided
>>> for load-store control dependencies, as in the following example:
>>>
>>> q = READ_ONCE(a);
>>> if (q) {
>>> WRITE_ONCE(b, 1);
>>> }
>>>
>>> Control dependencies pair normally with other types of barriers.
>>> That said, please note that neither READ_ONCE() nor WRITE_ONCE()
>>> are optional! Without the READ_ONCE(), the compiler might combine the
>>> load from 'a' with other loads from 'a'. Without the WRITE_ONCE(),
>>> the compiler might combine the store to 'b' with other stores to 'b'.
>>> Either can result in highly counterintuitive effects on ordering.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> If I change the two plain assignments to use WRITE_ONCE() as required
>>> by memory-barriers.txt, OOTA is avoided:
>>
>>
>> I think this direction of inquiry is a bit misleading. There need not be any
>> speculative store at all:
>>
>>
>>
>> P0(int *a, int *b, int *x, int *y) {
>> int r1;
>> int r2 = 0;
>> r1 = READ_ONCE(*x);
>> smp_rmb();
>> if (r1 == 1) {
>> r2 = *b;
>> }
>> WRITE_ONCE(*a, r2);
>> smp_wmb();
>> WRITE_ONCE(*y, 1);
>> }
>>
>> P1(int *a, int *b, int *x, int *y) {
>> int r1;
>>
>> int r2 = 0;
>>
>> r1 = READ_ONCE(*y);
>> smp_rmb();
>> if (r1 == 1) {
>> r2 = *a;
>> }
>> WRITE_ONCE(*b, r2);
>> smp_wmb();
>> WRITE_ONCE(*x, 1);
>> }
>>
>>
>> The reason that the WRITE_ONCE helps in the speculative store case is that
>> both its ctrl dependency and the wmb provide ordering, which together
>> creates ordering between *x and *y.
>
> Ah, and that is because LKMM does not enforce control dependencies past
> the end of the "if" statement. Cute!
>
> But memory-barriers.txt requires that the WRITE_ONCE() be within the
> "if" statement for control dependencies to exist, so LKMM is in agreement
> with memory-barriers.txt in this case. So again, if we change this,
> we need to also change memory-barriers.txt.
> [...]
> If we want to respect something containing a control dependency to a
> WRITE_ONCE() not in the body of the "if" statement, we need to make some
> change to memory-barriers.txt.
I'm not sure what you denote by *this* in "if we change this", but just
to clarify, I am not claiming that there is a (semantic) control
dependency to WRITE_ONCE(*b, r2) in this example.
There is however a data dependency from r2 = *a to WRITE_ONCE, and I
would say that there is a semantic data (not control) dependency from r1
= READ_ONCE(*y) to WRITE_ONCE(*b, r2), too: depending on the value read
from *y, the value stored to *b will be different. The latter would be
enough to avoid OOTA according to the mainline LKMM, but currently this
semantic dependency is not detected by herd7.
I currently can not come up with an example where there would be a
(semantic) control dependency from a load to a store that is not in the
arm of an if statement (or a loop / switch of some form with the branch
depending on the load).
I think the control dependency is just a red herring. It is only there
to avoid the data race.
In a hypothetical LKMM where reading in a race is not a data race unless
the data is used (*1), this would also work:
unsigned int r1;
unsigned int r2 = 0;
r1 = READ_ONCE(*x);
smp_rmb();
r2 = *b;
WRITE_ONCE(*a, (~r1 + 1) & r2);
smp_wmb();
WRITE_ONCE(*y, 1);
Here, in the case r1 == 0, the value of r2 is not used, so there is a race
but there would not be a data race in the hypothetical LKMM.
This example would also have OOTA under such a hypothetical LKMM, but
not with my patch, because in the case where r1 == 1,
READ_ONCE(*x) is separated by an rmb from the load from *b,
upon which the store to *a depends,
which itself is separated by a wmb from the WRITE_ONCE(*y,1),
and this would ensure that READ_ONCE(*x) and WRITE_ONCE(*y,1) can not be
reordered with each other anymore.
(*1 = such a definition is not absurd! One needs to allow such races to
make sequence locks and other similar data structures well-defined.)
I currently don't know another way than the if-statement to avoid the
data race in the program(*2) in the current LKMM, so that's why I rely
on it, but at least conceptually it is orthogonal to the problem.
(*2=we can avoid the data race flag in herd by using filter, and only
generating the graphs where r1==1 and there is no data race. But that is
cheating -- the program is not valid under mainline LKMM.)
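(Concretely, the filter I mean would be roughly the following line in front
of the exists clause -- I am writing the syntax from memory, so treat it as
a sketch:

filter (0:r1=1 /\ 1:r1=1)
exists (0:r1=1 /\ 1:r1=1)

which restricts herd7 to the executions where both r1 are 1 -- exactly the
cheating part.)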
>> Please do look at the OOTA graph generated by herd7 for this one, it looks
>> quite amazing.
>
> Given the way this morning is going, I must take your word for it...
That sounds awful :(
Technical issues?
With any luck, you can test it on arm's herd7 web interface at
https://developer.arm.com/herd7 (just don't be like me and type all the
code first and then change the drop-down selector to Linux - that will
reset the code window...)
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 16:44 ` Jonas Oberhauser
@ 2025-01-09 19:27 ` Alan Stern
2025-01-09 20:09 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-09 19:27 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 09, 2025 at 05:44:54PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/9/2025 um 5:17 PM schrieb Alan Stern:
> > What do you mean by "opaque code in the compiler barrier"? The
> > compiler_barrier() instruction doesn't generate any code at all; it
> > merely directs the compiler not to carry any knowledge about values
> > stored in memory from one side of the barrier to the other.
>
> What I mean by "opaque" is that the compiler does not analyze the code
> inside the compiler barrier, so it must treat it as a black box that could
> manipulate memory arbitrarily within the limitation that it can not guess
> the address of memory.
Okay, I see what you're getting at. The way you express it is a little
confusing, because in fact there is NO code inside the compiler barrier
(although the compiler doesn't know that) -- the barrier() macro expands
to an empty assembler instruction, along with an annotation telling the
compiler that this instruction may affect the contents of memory in
unknown and unpredictable ways.
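(Concretely, my understanding is that it boils down to something like

#define barrier() __asm__ __volatile__("" : : : "memory")

modulo the exact kernel header it lives in: an empty asm template plus a
"memory" clobber.)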
> So for example, in
>
> int a = 1;
> barrier();
> a = 2;
> //...
>
> the compiler does not know how the code inside barrier() accesses memory,
> including volatile memory.
I would say rather that the compiler does not know that the values
stored in memory are the same before and after the barrier(). Even the
values of local variables whose addresses have not been exported.
> But it knows that it can not access `a`, because the address of `a` has
> never escaped before the barrier().
I don't think this is right. barrier is (or can be) a macro, not a
function call with its own scope. As such, it has -- in principle --
the ability to export the address of a.
Question: Can the compiler assume that no other threads access a between
the two stores, on the grounds that this would be a data race? I'd
guess that it can't make that assumption, but it would be nice to know
for sure.
> So it can change this to:
>
> barrier();
> int a = 2;
> // ...
>
> But if we let the address of `a` escape, for example with some external
> function foo(int*):
>
> int a;
> foo(&a);
> a = 1;
> barrier();
> a = 2;
> // ...
>
> Then the compiler has to assume that the code of foo and barrier might be
> something like this:
>
> foo(p) { SPECIAL_VARIABLE = p; }
> barrier() { TURN_THE_BREAKS_ON = *SPECIAL_VARIABLE; }
I think you're giving the compiler too much credit. The one thing the
compiler is allowed to assume is that the code, as written, does not
contain a data race or other undefined behavior.
> and it must make sure that the value of `a` before barrier() is 1.
>
> That is at least my understanding.
This is the sort of question that memory-barriers.txt should answer.
It's closely related to the question I mentioned above.
> In fact, even if `a` is unused after a=2, the compiler can only eliminate
> `a` in the former case, but in the latter case, still needs to ensure that
> the value of `a` before barrier() is 1 (but it can eliminate a=2).
And what if a were a global shared variable instead of a local one? The
compiler is still allowed to do weird optimizations on it, since the
accesses aren't volatile. The barrier() merely prevents the compiler
from using its knowledge that a is supposed to contain 1 before the
barrier to influence its decisions about how to optimize the code
following the barrier.
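(A sketch of the sort of thing I mean -- not a claim about any particular
compiler:

int g;          /* global, non-volatile */
g = 1;
barrier();
g = 2;          /* plain store: the compiler may still tear it, merge it
                   with other plain stores to g, or invent intermediate
                   stores to g, because nothing here is volatile */

The barrier() only limits what knowledge about values crosses it.)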
> > That wasn't true in the C++ context of the paper Paul and I worked on.
> > Of course, C++ is not our current context here.
>
> Yes, you are completely correct. In C++ (or pure C), where data races are
> prevented by compiler/language-builtins rather than with
> compiler-barriers/volatiles, all my assumptions break.
>
> That is because the compiler absolutely knows that an atomic_store(&x) does
> not access any memory location other than x, so it can do a lot more
> "harmful" optimizations.
>
> That's why I said such a language model should just exclude global OOTA by
> fiat.
One problem with doing this is that there is no widely agreed-upon
formal definition of OOTA. A cycle in (rwdep ; rfe) isn't the answer
because rwdep does not encapsulate the notion of semantic dependency.
> I have to read your paper again (I think I read it a few months ago) to
> understand if the trivial OOTA would make even that vague axiom unsound
> (my intuition says that if the OOTA is never observed by influencing the
> side-effect, then forbidding OOTA makes no difference to the set of
> "observable behaviors" of a C++ program even there is a trivial OOTA, and if
> the OOTA has visible side-effects, then it is acceptable for the compiler
> not to do the "optimization" that turns it into a trivial OOTA and choose
> some other optimization instead, so we can as well forbid the compiler from
> doing it).
If an instance of OOTA is never observed, does it exist?
In the paper, I speculated that if a physical execution of a program
matches an abstract execution containing such a non-observed OOTA cycle,
then it also matches another abstract execution in which the cycle
doesn't exist. I don't know how to prove this conjecture, though.
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 19:27 ` Alan Stern
@ 2025-01-09 20:09 ` Jonas Oberhauser
2025-01-10 3:12 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-09 20:09 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/9/2025 um 8:27 PM schrieb Alan Stern:
> On Thu, Jan 09, 2025 at 05:44:54PM +0100, Jonas Oberhauser wrote:
>>
>>
>> Am 1/9/2025 um 5:17 PM schrieb Alan Stern:
>>> What do you mean by "opaque code in the compiler barrier"? The
>>> compiler_barrier() instruction doesn't generate any code at all; it
>>> merely directs the compiler not to carry any knowledge about values
>>> stored in memory from one side of the barrier to the other.
>>
>> What I mean by "opaque" is that the compiler does not analyze the code
>> inside the compiler barrier, so it must treat it as a black box that could
>> manipulate memory arbitrarily within the limitation that it can not guess
>> the address of memory.
>
> Okay, I see what you're getting at. The way you express it is a little
> confusing, because in fact there is NO code inside the compiler barrier
> (although the compiler doesn't know that) -- the barrier() macro expands
> to an empty assembler instruction, along with an annotation telling the
> compiler that this instruction may affect the contents of memory in
> unknown and unpredictable ways.
I am happy to learn a better way to express this.
>
>> So for example, in
>>
>> int a = 1;
>> barrier();
>> a = 2;
>> //...
>>
>> the compiler does not know how the code inside barrier() accesses memory,
>> including volatile memory.
>
> I would say rather that the compiler does not know that the values
> stored in memory are the same before and after the barrier().
>
> Even the
> values of local variables whose addresses have not been exported.
No, this is not true. I used to think so too until a short while ago.
But if you look at the output of gcc -O3 you will see that it does
happily remove `a` in this function.
>
>> But it knows that it can not access `a`, because the address of `a` has
>> never escaped before the barrier().
>
> I don't think this is right. barrier is (or can be) a macro, not a
> function call with its own scope. As such, it has -- in principle --
> the ability to export the address of a.
See above. Please test it for yourself, but for your convenience here is
the code
extern void foo(int *p);
int opt_a() {
int a;
a = 1;
__asm__ volatile ("" ::: "memory");
a = 2;
}
int opt_b() {
int a;
foo(&a);
a = 1;
__asm__ volatile ("" ::: "memory");
a = 2;
}
and corresponding asm:
opt_a:
bx lr
(the whole function body got deleted!)
opt_b:
push {lr}
sub sp, sp, #12
// calling foo(&a)
add r0, sp, #4
bl foo
// a = 1 -- to make sure a==1 before the barrier()
movs r3, #1
str r3, [sp, #4]
// [empty code from the barrier()]
// a = 2 -- totally deleted
// [return]
add sp, sp, #12
ldr pc, [sp], #4
If you wanted to avoid this, you would need to expose the address of `a`
to the asm block using one of its clobber arguments (but I'm not
familiar with the syntax to produce a working example on the spot).
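(Something along these lines should do it -- strictly speaking an input
operand rather than a clobber, and I have not tested this exact spelling:

int a;
a = 1;
/* &a escapes into the asm, which also clobbers memory, so the compiler
   must assume the asm may read or write a */
__asm__ volatile ("" : : "r" (&a) : "memory");
a = 2;

With that, the a = 1 before the asm should no longer be removable.)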
>
> Question: Can the compiler assume that no other threads access a between
> the two stores, on the grounds that this would be a data race? I'd
> guess that it can't make that assumption, but it would be nice to know
> for sure.
It can not make the assumption if &a has escaped. In that case,
barrier() could be so:
barrier(){
store_release(OTHER_THREAD_PLEASE_MODIFY,&a);
while (! load_acquire(OTHER_THREAD_IS_DONE));
}
with another thread doing
while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
(*OTHER_THREAD_PLEASE_MODIFY)++;
store_release(OTHER_THREAD_IS_DONE, 1);
>>
>> But if we let the address of `a` escape, for example with some external
>> function foo(int*):
>>
>> int a;
>> foo(&a);
>> a = 1;
>> barrier();
>> a = 2;
>> // ...
>>
>> Then the compiler has to assume that the code of foo and barrier might be
>> something like this:
>>
>> foo(p) { SPECIAL_VARIABLE = p; }
>> barrier() { TURN_THE_BREAKS_ON = *SPECIAL_VARIABLE; }
>
> I think you're giving the compiler too much credit. The one thing the
> compiler is allowed to assume is that the code, as written, does not
> contain a data race or other undefined behavior.
Apologies, the way I used "assume" is misleading.
I should have said that the compiler has to ensure that, even if the code
of foo() and barrier() were as above, the behavior of the code it
generates is the same (w.r.t. observable side effects) as if the program
were executed by the abstract machine. Or I should have said that it can
*not* assume that the functions are *not* as defined above.
Which means that TURN_THE_BREAKS_ON would need to be assigned 1.
The only way the compiler can achieve that guarantee (while treating
barrier as a black box) is to make sure that the value of `a` before
barrier() is 1.
>>
>> That is at least my understanding.
>
> This is the sort of question that memory-barriers.txt should answer.
> It's closely related to the question I mentioned above.
>
>> In fact, even if `a` is unused after a=2, the compiler can only eliminate
>> `a` in the former case, but in the latter case, still needs to ensure that
>> the value of `a` before barrier() is 1 (but it can eliminate a=2).
>
> And what if a were a global shared variable instead of a local one? The
> compiler is still allowed to do weird optimizations on it, since the
> accesses aren't volatile.
I'm not sure if `a` being global is enough for the compiler to consider
`a`'s address as having escaped to the inline asm memory, especially if
`a` has static lifetime.
If it must be considered escaped, then it can do weird optimizations on
it, but not across the barrier().
That is because inside the barrier(), we could be reading the value and
store into a volatile field like TURN_THE_BREAKS_ON.
In that case, the value right before the barrier needs to be equal to
the value in the abstract machine.
>
>>> That wasn't true in the C++ context of the paper Paul and I worked on.
>>> Of course, C++ is not our current context here.
>>
>> Yes, you are completely correct. In C++ (or pure C), where data races are
>> prevented by compiler/language-builtins rather than with
>> compiler-barriers/volatiles, all my assumptions break.
>>
>> That is because the compiler absolutely knows that an atomic_store(&x) does
>> not access any memory location other than x, so it can do a lot more
>> "harmful" optimizations.
>>
>> That's why I said such a language model should just exclude global OOTA by
>> fiat.
>
> One problem with doing this is that there is no widely agreed-upon
> formal definition of OOTA. A cycle in (rwdep ; rfe) isn't the answer
> because rwdep does not encapsulate the notion of semantic dependency.
rwdep does not encapsulate any specific notion.
It is herd7 which decides which dependency edges to add to the graphs it
generates.
If herd7 would generate edges for semantic dependencies instead of for
its version of syntactic dependencies, then rwdep is the answer.
Given that we can not define dep inside the cat model, one may as well
define it as rwdep;rfe with the intended meaning of the dependencies
being the semantic ones; then it is an inaccuracy of herd7 that it does
not provide the proper dependencies.
Anyways LKMM should not care about syntactic dependencies, e.g.
if (READ_ONCE(*a)) {
WRITE_ONCE(*b,1);
} else {
WRITE_ONCE(*b,1);
}
has no semantic dependency and gcc does not guarantee the order between
these two accesses, even though herd7 does give us a dependency edge.
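E.g., nothing stops the compiler from compiling it roughly as (a sketch,
not actual compiler output):

READ_ONCE(*a);          /* the volatile load itself must still happen */
WRITE_ONCE(*b, 1);      /* but the store is now unconditional, so nothing
                           orders it after the load */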
>> I have to read your paper again (I think I read it a few months ago) to
>> understand if the trivial OOTA would make even that vague axiom unsound
>> (my intuition says that if the OOTA is never observed by influencing the
>> side-effect, then forbidding OOTA makes no difference to the set of
>> "observable behaviors" of a C++ program even there is a trivial OOTA, and if
>> the OOTA has visible side-effects, then it is acceptable for the compiler
>> not to do the "optimization" that turns it into a trivial OOTA and choose
>> some other optimization instead, so we can as well forbid the compiler from
>> doing it).
>
> If an instance of OOTA is never observed, does it exist?
:) :) :)
> In the paper, I speculated that if a physical execution of a program
> matches an abstract execution containing such a non-observed OOTA cycle,
> then it also matches another abstract execution in which the cycle
> doesn't exist. I don't know how to prove this conjecture, though.
Yes, that also makes sense.
Note that this speculation does not hold in the current LKMM though: in
the litmus test I shared in the opening e-mail, the outcome 0:r1=1 /\
1:r1=1 is only possible with an OOTA (even though the values from the
OOTA are never used anywhere).
With C++'s non-local model I wouldn't be totally surprised if there were
similar examples in C++, but given that its ordering definition is a lot
more straightforward than LKMM in that it doesn't have all these cases
of different barriers like wmb and rmb and corner cases like Noreturn
etc., my intuition says that there aren't any.
But I am not going to think deeply about it for the time being.
Best wishes
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 17:54 ` Paul E. McKenney
2025-01-09 18:35 ` Jonas Oberhauser
@ 2025-01-09 20:37 ` Peter Zijlstra
2025-01-09 21:13 ` Paul E. McKenney
1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2025-01-09 20:37 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Jonas Oberhauser, Alan Stern, parri.andrea, will, boqun.feng,
npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
urezki, quic_neeraju, frederic, linux-kernel, lkmm,
hernan.poncedeleon
On Thu, Jan 09, 2025 at 09:54:28AM -0800, Paul E. McKenney wrote:
> > P0(int *a, int *b, int *x, int *y) {
> > int r1;
> > int r2 = 0;
> > r1 = READ_ONCE(*x);
> > smp_rmb();
> > if (r1 == 1) {
> > r2 = *b;
> > }
> > WRITE_ONCE(*a, r2);
> > smp_wmb();
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *a, int *b, int *x, int *y) {
> > int r1;
> >
> > int r2 = 0;
> >
> > r1 = READ_ONCE(*y);
> > smp_rmb();
> > if (r1 == 1) {
> > r2 = *a;
> > }
> > WRITE_ONCE(*b, r2);
> > smp_wmb();
> > WRITE_ONCE(*x, 1);
> > }
> >
> >
> > The reason that the WRITE_ONCE helps in the speculative store case is that
> > both its ctrl dependency and the wmb provide ordering, which together
> > creates ordering between *x and *y.
>
> Ah, and that is because LKMM does not enforce control dependencies past
> the end of the "if" statement. Cute!
I think the reason we hesitated on that was CMOV and similar conditional
instructions. If the body of the branch is a CMOV, then there is no
conditionality on the common path after the body.
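Something like (a sketch of the idea, not output of any particular
compiler):

r1 = READ_ONCE(*x);
r2 = (r1 == 1) ? *b : r2;       /* body if-converted to a cmov/csel */
WRITE_ONCE(*a, r2);             /* no branch left, so nothing after the
                                   "if" is conditional on the load */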
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 20:37 ` Peter Zijlstra
@ 2025-01-09 21:13 ` Paul E. McKenney
0 siblings, 0 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-09 21:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jonas Oberhauser, Alan Stern, parri.andrea, will, boqun.feng,
npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
urezki, quic_neeraju, frederic, linux-kernel, lkmm,
hernan.poncedeleon
On Thu, Jan 09, 2025 at 09:37:08PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2025 at 09:54:28AM -0800, Paul E. McKenney wrote:
> > > P0(int *a, int *b, int *x, int *y) {
> > > int r1;
> > > int r2 = 0;
> > > r1 = READ_ONCE(*x);
> > > smp_rmb();
> > > if (r1 == 1) {
> > > r2 = *b;
> > > }
> > > WRITE_ONCE(*a, r2);
> > > smp_wmb();
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *a, int *b, int *x, int *y) {
> > > int r1;
> > >
> > > int r2 = 0;
> > >
> > > r1 = READ_ONCE(*y);
> > > smp_rmb();
> > > if (r1 == 1) {
> > > r2 = *a;
> > > }
> > > WRITE_ONCE(*b, r2);
> > > smp_wmb();
> > > WRITE_ONCE(*x, 1);
> > > }
> > >
> > >
> > > The reason that the WRITE_ONCE helps in the speculative store case is that
> > > both its ctrl dependency and the wmb provide ordering, which together
> > > creates ordering between *x and *y.
> >
> > Ah, and that is because LKMM does not enforce control dependencies past
> > the end of the "if" statement. Cute!
>
> I think the reason we hesitated on that was CMOV and similar conditional
> instructions. If the body of the branch is a CMOV, then there no
> conditionality on the common path after the body.
That does match my recollection.
In addition, in some cases the compiler can move memory references
following the body of the "if" to precede that "if", and then CPU
memory-reference reordering can do the rest.
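For example (a sketch, not one of the litmus tests above):

r1 = READ_ONCE(*x);
if (r1)
        *a = 1;
*b = 2;         /* plain store with no dependency on r1: the compiler may
                   hoist it above the "if", and the CPU can then reorder
                   it with the load */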
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 20:09 ` Jonas Oberhauser
@ 2025-01-10 3:12 ` Alan Stern
2025-01-10 12:21 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-10 3:12 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:
> > > So for example, in
> > >
> > > int a = 1;
> > > barrier();
> > > a = 2;
> > > //...
> > >
> > > the compiler does not know how the code inside barrier() accesses memory,
> > > including volatile memory.
> >
> > I would say rather that the compiler does not know that the values
> > stored in memory are the same before and after the barrier().
> >
> > Even the
> > values of local variables whose addresses have not been exported.
>
> No, this is not true. I used to think so too until a short while ago.
>
> But if you look at the output of gcc -o3 you will see that it does happily
> remove `a` in this function.
Isn't that consistent with what I said?
> > > But it knows that it can not access `a`, because the address of `a` has
> > > never escaped before the barrier().
> >
> > I don't think this is right. barrier is (or can be) a macro, not a
> > function call with its own scope. As such, it has -- in principle --
> > the ability to export the address of a.
>
> See above. Please test it for yourself, but for your convenience here is the
> code
Oh, I believe that gcc does what you say. I'm just not sure your
explanation is entirely accurate.
> If you wanted to avoid this, you would need to expose the address of `a` to
> the asm block using one of its clobber arguments (but I'm not familiar with
> the syntax to produce a working example on the spot).
>
> >
> > Question: Can the compiler assume that no other threads access a between
> > the two stores, on the grounds that this would be a data race? I'd
> > guess that it can't make that assumption, but it would be nice to know
> > for sure.
>
> It can not make the assumption if &a has escaped. In that case, barrier()
> could be so:
>
> barrier(){
> store_release(OTHER_THREAD_PLEASE_MODIFY,&a);
>
> while (! load_acquire(OTHER_THREAD_IS_DONE));
> }
>
> with another thread doing
>
> while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
> *OTHER_THREAD_PLEASE_MODIFY ++;
> store_release(OTHER_THREAD_IS_DONE, 1);
Bear in mind that there's a difference between what a compiler _can do_
and what gcc _currently does_.
> > I think you're giving the compiler too much credit. The one thing the
> > compiler is allowed to assume is that the code, as written, does not
> > contain a data race or other undefined behavior.
>
> Apologies, the way I used "assume" is misleading.
> I should have said that the compiler has to ensure that even if the code of
> foo() and barrier() were so, that the behavior of the code it generates is
> the same (w.r.t. observable side effects) as if the program were executed by
> the abstract machine. Or I should have said that it can *not* assume that
> the functions are *not* as defined above.
>
> Which means that TURN_THE_BREAKS_ON would need to be assigned 1.
> The only way the compiler can achieve that guarantee (while treating barrier
> as a black box) is to make sure that the value of `a` before barrier() is 1.
Who says the compiler has to treat barrier() as a black box? As far as
I know, gcc makes no such guarantee.
> > > That's why I said such a language model should just exclude global OOTA by
> > > fiat.
> >
> > One problem with doing this is that there is no widely agreed-upon
> > formal definition of OOTA. A cycle in (rwdep ; rfe) isn't the answer
> > because rwdep does not encapsulate the notion of semantic dependency.
>
> rwdep does not encapsulate any specific notion.
> It is herd7 which decides which dependency edges to add to the graphs it
> generates.
Sorry, that's what I meant: rwdep plus the decisions that herd7 makes
about which edges are dependencies.
> If herd7 would generate edges for semantic dependencies instead of for its
> version of syntactic dependencies, then rwdep is the answer.
That statement is meaningless (or at least, impossible to implement)
because there is no widely agreed-upon formal definition for semantic
dependency.
> Given that we can not define dep inside the cat model, one may as well
> define it as rwdep;rfe with the intended meaning of the dependencies being
> the semantic ones; then it is an inaccuracy of herd7 that it does not
> provide the proper dependencies.
Perhaps so. LKMM does include other features which the compiler can
defeat if the programmer isn't sufficiently careful. Still, I suspect
that changing the memory model solely with the goal of eliminating OOTA
may not be a good idea.
> Anyways LKMM should not care about syntactic dependencies, e.g.
>
>
> if (READ_ONCE(*a)) {
> WRITE_ONCE(*b,1);
> } else {
> WRITE_ONCE(*b,1);
> }
>
> has no semantic dependency and gcc does not guarantee the order between
> these two accesses, even though herd7 does give us a dependency edge.
Like I said.
> > > I have to read your paper again (I think I read it a few months ago) to
> > > understand if the trivial OOTA would make even that vague axiom unsound
> > > (my intuition says that if the OOTA is never observed by influencing the
> > > side-effect, then forbidding OOTA makes no difference to the set of
> > > "observable behaviors" of a C++ program even there is a trivial OOTA, and if
> > > the OOTA has visible side-effects, then it is acceptable for the compiler
> > > not to do the "optimization" that turns it into a trivial OOTA and choose
> > > some other optimization instead, so we can as well forbid the compiler from
> > > doing it).
> >
> > If an instance of OOTA is never observed, does it exist?
>
> :) :) :)
>
> > In the paper, I speculated that if a physical execution of a program
> > matches an abstract execution containing such a non-observed OOTA cycle,
> > then it also matches another abstract execution in which the cycle
> > doesn't exist. I don't know how to prove this conjecture, though.
>
> Yes, that also makes sense.
>
> Note that this speculation does not hold in the current LKMM though. In the
> Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\
> 1:r1=1 is only possible with an OOTA (even though the values from the OOTA
> are never used anywhere).
If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
at least, in the sense that I intended. (The situation mentioned in the
paper is better described as an execution where the compiler has elided
all the accesses in the OOTA cycle.)
> With C++'s non-local model I wouldn't be totally surprised if there were
> similar examples in C++, but given that its ordering definition is a lot
> more straightforward than LKMM in that it doesn't have all these cases of
> different barriers like wmb and rmb and corner cases like Noreturn etc., my
> intuition says that there aren't any.
I'll have to give this some thought.
Alan
> But I am not going to think deeply about it for the time being.
>
> Best wishes
> jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-10 3:12 ` Alan Stern
@ 2025-01-10 12:21 ` Jonas Oberhauser
2025-01-10 21:51 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-10 12:21 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/10/2025 um 4:12 AM schrieb Alan Stern:
> On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:
>
>>>> So for example, in
>>>>
>>>> int a = 1;
>>>> barrier();
>>>> a = 2;
>>>> //...
>>>>
>>>> the compiler does not know how the code inside barrier() accesses memory,
>>>> including volatile memory.
>>>
>>> I would say rather that the compiler does not know that the values
>>> stored in memory are the same before and after the barrier().
>>>
>>> Even the
>>> values of local variables whose addresses have not been exported.
>>
>> No, this is not true. I used to think so too until a short while ago.
>>
>> But if you look at the output of gcc -o3 you will see that it does happily
>> remove `a` in this function.
>
> Isn't that consistent with what I said?
Ok, after careful reading, I think there are two assumptions you are making
that I believe are not true; but my example is only inconsistent with
exactly one of them being not true, not with both of them being not true:
1) the barrier only tells the compiler that it may *change* the value of
memory locations. I believe it also tells the compiler that it may
*read* the value of memory locations.
2) the barrier also talks about the values of local variables whose
addresses have not been exported. I do not believe this is the case.
The second example I put (where a=1 is still emitted) shows your
assumption 1) is inconsistent with what gcc currently does.
For what gcc guarantees, the manual says: "add memory to the list of
clobbered registers. This will cause GCC to not keep memory values
cached in registers across the assembler instruction", i.e., it needs to
flush the value from the register to actual memory.
I believe this too is not consistent with your assumption 1): if the
barrier only modified memory and did not read it, there would be no need
to flush the value to memory. It would suffice to ensure that the value
is not assumed to be the same after the barrier.
With your assumption 1) discharged, the fact that `a=1` still can be
removed from before the barrier should show that this guarantee does not
hold for all memory locations (only for those that could legally be
accessed in the barrier, which are those whose address has been exported).
>>> Question: Can the compiler assume that no other threads access a between
>>> the two stores, on the grounds that this would be a data race? I'd
>>> guess that it can't make that assumption, but it would be nice to know
>>> for sure.
>>
>> It can not make the assumption if &a has escaped. In that case, barrier()
>> could be so:
>>
>> barrier(){
>> store_release(OTHER_THREAD_PLEASE_MODIFY,&a);
>>
>> while (! load_acquire(OTHER_THREAD_IS_DONE));
>> }
>>
>> with another thread doing
>>
>> while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
>> *OTHER_THREAD_PLEASE_MODIFY ++;
>> store_release(OTHER_THREAD_IS_DONE, 1);
>
> Bear in mind that there's a difference between what a compiler _can do_
> and what gcc _currently does_.
Of course. But I am not sure if your comment addresses my comment here,
or was intended for another section.
>
>>> I think you're giving the compiler too much credit. The one thing the
>>> compiler is allowed to assume is that the code, as written, does not
>>> contain a data race or other undefined behavior.
>>
>> Apologies, the way I used "assume" is misleading.
>> I should have said that the compiler has to ensure that even if the code of
>> foo() and barrier() were so, that the behavior of the code it generates is
>> the same (w.r.t. observable side effects) as if the program were executed by
>> the abstract machine. Or I should have said that it can *not* assume that
>> the functions are *not* as defined above.
>>
>> Which means that TURN_THE_BREAKS_ON would need to be assigned 1.
>> The only way the compiler can achieve that guarantee (while treating barrier
>> as a black box) is to make sure that the value of `a` before barrier() is 1.
>
> Who says the compiler has to treat barrier() as a black box? As far as
> I know, gcc makes no such guarantee.
Note that if it didn't, then barrier() would not work: since the asm
instruction is empty, the compiler could easily figure that out and just
delete it.
But maybe I should also be more precise about what I mean by black box,
namely, that
1) the asm block has significant side effects,
2) which include reading and storing to arbitrary (legally known) memory
locations in an arbitrary order & control flow
Both of these imply that the compiler can not assume that it does not
execute some logic equivalent to what I put above.
>> If herd7 would generate edges for semantic dependencies instead of for its
>> version of syntactic dependencies, then rwdep is the answer.
>
> That statement is meaningless (or at least, impossible to implement)
> because there is no widely agreed-upon formal definition for semantic
> dependency.
Yes, which also means that a 100% correct end-to-end solution (herd +
cat + ... ?) is currently not implementable.
But we can still break the problem into two halves, one which is 100%
solved inside the cat file, and one which is the responsibility of herd7
and currently not solved (nor 100% satisfactorily solvable).
The advantage being that if we read the cat file as a mathematical
definition, we can at least on paper argue 100% correctly about code, for
the cases where we either can figure out on paper what the semantic
dependencies are, or where we at least just say "with relation to
current compilers, we know what the semantically preserved dependencies
are", even if herd7 or other tools lag behind in one or both.
After all, herd7 is just a (useful) automation tool for reasoning about
LKMM, which has its limitations (scalability, a specific definition of
dependencies, limited C subset...).
>> Given that we can not define dep inside the cat model, one may as well
>> define it as rwdep;rfe with the intended meaning of the dependencies being
>> the semantic ones; then it is an inaccuracy of herd7 that it does not
>> provide the proper dependencies.
>
> Perhaps so. LKMM does include other features which the compiler can
> defeat if the programmer isn't sufficiently careful.
How many of these are due to herd7's limitations vs. in the cat file?
>>> In the paper, I speculated that if a physical execution of a program
>>> matches an abstract execution containing such a non-observed OOTA cycle,
>>> then it also matches another abstract execution in which the cycle
>>> doesn't exist. I don't know how to prove this conjecture, though.
>>
>> Yes, that also makes sense.
>>
>> Note that this speculation does not hold in the current LKMM though. In the
>> Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\
>> 1:r1=1 is only possible with an OOTA (even though the values from the OOTA
>> are never used anywhere).
>
> If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
> there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
> at least, in the sense that I intended. (The situation mentioned in the
> paper is better described as an execution where the compiler has elided
> all the accesses in the OOTA cycle.)
I'm not sure that sense makes a lot of sense to me.
But it does make the proof of your claim totally trivial. If there is no
other OOTA-free execution with the same observable behavior, then it is
proof that the OOTA happened, so the OOTA was observed.
So by contraposition any non-observed OOTA has an OOTA-free execution
with the same observable behavior.
The sense in which I would define observed is more along the lines of
"there is an observable side effect (such as a store to a volatile
location) which has a semantic dependency on a load that reads from one
of the stores in the OOTA cycle".
>
>> With C++'s non-local model I wouldn't be totally surprised if there were
>> similar examples in C++, but given that its ordering definition is a lot
>> more straightforward than LKMM in that it doesn't have all these cases of
>> different barriers like wmb and rmb and corner cases like Noreturn etc., my
>> intuition says that there aren't any.
>
> I'll have to give this some thought.
If you do want to prove the claim for the stricter definition of
observable OOTA, I would think about looking at the first read r in each
thread that reads from an OOTA store w, and see if at least one of them
could read from another store.
If that is not possible, then my intuition would be that there would be
some happens-before relation blocking you from reading from an earlier
store than w, in particular, w ->hb r.
If it is not possible on any thread, then you get a bunch of these hb
edges much in parallel to the OOTA cycle.
Perhaps you can turn that into an hb cycle.
This last step doesn't work in LKMM because the hb may be caused by
rmb/wmb, which does not extend over the plain accesses in the OOTA cycle
to the bounding store to make a bigger hb cycle.
But in C++, if you have w ->hb r on each OOTA rfe edge, then you also
have r ->hb w' along the OOTA dep edge, and you get an hb cycle.
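Spelled out for the two-thread shape (a sketch, with w0/w1 the OOTA stores
and r1/r0 the loads reading from them):

w0 ->rfe r1 ->dep w1 ->rfe r0 ->dep w0   (the OOTA cycle)
w0 ->hb r1 and w1 ->hb r0                (one hb edge per rfe edge, as above)
r1 ->hb w1 and r0 ->hb w0                (in C++, dep is in sb, and sb in hb)
=> w0 ->hb r1 ->hb w1 ->hb r0 ->hb w0    (an hb cycle)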
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-09 18:35 ` Jonas Oberhauser
@ 2025-01-10 14:54 ` Paul E. McKenney
2025-01-10 16:21 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-10 14:54 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
> Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
> > On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
> > >
> > >
> > > Am 1/8/2025 um 7:09 PM schrieb Paul E. McKenney:
> > > > On Wed, Jan 08, 2025 at 06:39:12PM +0100, Jonas Oberhauser wrote:
> > > > >
> > > > >
> > > > > Am 1/7/2025 um 7:47 PM schrieb Paul E. McKenney:
> > > > > > On Tue, Jan 07, 2025 at 11:09:55AM -0500, Alan Stern wrote:
> > > > > >
> > > > > The patch removes the (actually non-existant) race condition by saying that
> > > > > a critical section that is protected from having a data race with address
> > > > > dependency or rmb/wmb (which LKMM already says works for avoiding data
> > > > > races), is in fact also ordered and therefore has no race condition either.
> > > > >
> > > > > As a side effect :), this happens to fix OOTA in general in LKMM.
> > > >
> > > > Fair point, no data race is flagged.
> > > >
> > > > On the other hand, Documentation/memory-barriers.txt says the following:
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > However, stores are not speculated. This means that ordering -is- provided
> > > > for load-store control dependencies, as in the following example:
> > > >
> > > > q = READ_ONCE(a);
> > > > if (q) {
> > > > WRITE_ONCE(b, 1);
> > > > }
> > > >
> > > > Control dependencies pair normally with other types of barriers.
> > > > That said, please note that neither READ_ONCE() nor WRITE_ONCE()
> > > > are optional! Without the READ_ONCE(), the compiler might combine the
> > > > load from 'a' with other loads from 'a'. Without the WRITE_ONCE(),
> > > > the compiler might combine the store to 'b' with other stores to 'b'.
> > > > Either can result in highly counterintuitive effects on ordering.
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > If I change the two plain assignments to use WRITE_ONCE() as required
> > > > by memory-barriers.txt, OOTA is avoided:
> > >
> > >
> > > I think this direction of inquiry is a bit misleading. There need not be any
> > > speculative store at all:
> > >
> > >
> > >
> > > P0(int *a, int *b, int *x, int *y) {
> > > int r1;
> > > int r2 = 0;
> > > r1 = READ_ONCE(*x);
> > > smp_rmb();
> > > if (r1 == 1) {
> > > r2 = *b;
> > > }
> > > WRITE_ONCE(*a, r2);
> > > smp_wmb();
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *a, int *b, int *x, int *y) {
> > > int r1;
> > >
> > > int r2 = 0;
> > >
> > > r1 = READ_ONCE(*y);
> > > smp_rmb();
> > > if (r1 == 1) {
> > > r2 = *a;
> > > }
> > > WRITE_ONCE(*b, r2);
> > > smp_wmb();
> > > WRITE_ONCE(*x, 1);
> > > }
> > >
> > >
> > > The reason that the WRITE_ONCE helps in the speculative store case is that
> > > both its ctrl dependency and the wmb provide ordering, which together
> > > create ordering between *x and *y.
> >
> > Ah, and that is because LKMM does not enforce control dependencies past
> > the end of the "if" statement. Cute!
> >
> > But memory-barriers.txt requires that the WRITE_ONCE() be within the
> > "if" statement for control dependencies to exist, so LKMM is in agreement
> > with memory-barriers.txt in this case. So again, if we change this,
> > we need to also change memory-barriers.txt.
> > [...]
> > If we want to respect something containing a control dependency to a
> > WRITE_ONCE() not in the body of the "if" statement, we need to make some
> > change to memory-barriers.txt.
>
> I'm not sure what you denote by *this* in "if we change this", but just to
> clarify, I am not thinking of claiming that there were a (semantic) control
> dependency to WRITE_ONCE(*b, r2) in this example.
>
> There is however a data dependency from r2 = *a to WRITE_ONCE, and I would
> say that there is a semantic data (not control) dependency from r1 =
> READ_ONCE(*y) to WRITE_ONCE(*b, r2), too: depending on the value read from
> *y, the value stored to *b will be different. The latter would be enough to
> avoid OOTA according to the mainline LKMM, but currently this semantic
> dependency is not detected by herd7.
According to LKMM, address and data dependencies must be headed by
rcu_dereference() or similar. See Documentation/RCU/rcu_dereference.rst.
Therefore, there is nothing to chain the control dependency with.
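For example, this is the sort of dependency chain the documentation has in
mind (a minimal sketch; gp, struct foo, and ->val are illustrative names,
with gp assumed to be an RCU-protected global pointer):
	struct foo *p;
	int r1;
	rcu_read_lock();
	p = rcu_dereference(gp);	/* marked load heading the chain */
	r1 = READ_ONCE(p->val);		/* address-dependent load, ordered after it */
	rcu_read_unlock();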
> I currently can not come up with an example where there would be a
> (semantic) control dependency from a load to a store that is not in the arm
> of an if statement (or a loop / switch of some form with the branch
> depending on the load).
>
> I think the control dependency is just a red herring. It is only there to
> avoid the data race.
Well, that red herring needs to have a companion fish to swim with in
order to enforce ordering, and I am not seeing that companion.
Or am I (yet again!) missing something subtle here?
> In a hypothetical LKMM where reading in a race is not a data race unless the
> data is used (*1), this would also work:
You lost me on the "(*1)", which might mean that I am misunderstanding
your text and examples below.
> unsigned int r1;
> unsigned int r2 = 0;
> r1 = READ_ONCE(*x);
> smp_rmb();
> r2 = *b;
This load from *b does not head any sort of dependency per LKMM, as noted
in rcu_dereference.rst. As that document states, there are too many games
that compilers are permitted to play with plain C-language loads.
> WRITE_ONCE(*a, (~r1 + 1) & r2);
> smp_wmb();
> WRITE_ONCE(*y, 1);
>
>
> Here in case r1 == 0, the value of r2 is not used, so there is a race but
> there would not be data race in the hypothetical LKMM.
That plain C-language load from b, if concurrent with any sort of store to
b, really is a data race. Sure, a compiler that can prove that r1==0 at
the WRITE_ONCE() to a might optimize that load away, but the C-language
definition of data race still applies.
Ah, I finally see that (*1) is a footnote.
> This example would also have OOTA under such a hypothetical LKMM, but not
> with my patch, because in the case where r1 == 1,
> READ_ONCE(*x) is separated by an rmb from the load from *b,
> upon which the store to *a depends,
> which itself is separated by a wmb from the WRITE_ONCE(*y,1),
> and this would ensure that READ_ONCE(*x) and WRITE_ONCE(*y,1) can not be
> reordered with each other anymore.
>
>
> (*1= such a definition is not absurd! One needs to allow such races to make
> sequence locks and other similar data structures well-defined.)
I am currently not at all comfortable with the thought of allowing
plain C-language loads to head any sort of dependency. I really did put
that restriction into both memory-barriers.txt and rcu_dereference.rst
intentionally. There is the old saying "Discipline = freedom", and
therefore compilers' lack of discipline surrounding plain C-language
loads implies a lack of freedom. ;-)
> I currently don't know another way than the if-statement to avoid the data
> race in the program(*2) in the current LKMM, so that's why I rely on it, but
> at least conceptually it is orthogonal to the problem.
>
> (*2=we can avoid the data race flag in herd by using filter, and only
> generating the graphs where r1==1 and there is no data race. But that is
> cheating -- the program is not valid under mainline LKMM.)
Such cheats can be valid in cases where that is how you tell herd7 about
some restriction that it cannot be told otherwise, but in this case, I
agree that this cheat is unhelpful.
> > > Please do look at the OOTA graph generated by herd7 for this one, it looks
> > > quite amazing.
> >
> > Given the way this morning is going, I must take your word for it...
>
> That sounds awful :(
> Technical issues?
Nothing awful, just catching up from tracking down a Linux-kernel RCU
bug that fought well [1] and from the holidays.
> With any luck, you can test it on arm's herd7 web interface at
> https://developer.arm.com/herd7 (just don't be like me and type all the code
> first and then change the drop-down selector to Linux - that will reset the
> code window...)
Ah, I have been running them locally, and didn't have time to chase down
the herd7 arguments.
Which reminds me... We should decide which of these examples should be
added to the github litmus archive, perhaps to illustrate the fact that
plain C-language loads do not head dependency chains. Thoughts?
Thanx, Paul
[1] https://people.kernel.org/paulmck/hunting-a-tree03-heisenbug
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-10 14:54 ` Paul E. McKenney
@ 2025-01-10 16:21 ` Jonas Oberhauser
2025-01-13 22:04 ` Paul E. McKenney
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-10 16:21 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
> On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
>> Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
>>> On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
>>>>
>>>>
>>>> Am 1/8/2025 um 7:09 PM schrieb Paul E. McKenney:
>>>>> If I change the two plain assignments to use WRITE_ONCE() as required
>>>>> by memory-barriers.txt, OOTA is avoided:
>>>>
>>>>
>>>> I think this direction of inquiry is a bit misleading. There need not be any
>>>> speculative store at all:
>>>>
>>>>
>>>>
>>>> P0(int *a, int *b, int *x, int *y) {
>>>> int r1;
>>>> int r2 = 0;
>>>> r1 = READ_ONCE(*x);
>>>> smp_rmb();
>>>> if (r1 == 1) {
>>>> r2 = *b;
>>>> }
>>>> WRITE_ONCE(*a, r2);
>>>> smp_wmb();
>>>> WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *a, int *b, int *x, int *y) {
>>>> int r1;
>>>>
>>>> int r2 = 0;
>>>>
>>>> r1 = READ_ONCE(*y);
>>>> smp_rmb();
>>>> if (r1 == 1) {
>>>> r2 = *a;
>>>> }
>>>> WRITE_ONCE(*b, r2);
>>>> smp_wmb();
>>>> WRITE_ONCE(*x, 1);
>>>> }
>>>>
>>>>
>>> If we want to respect something containing a control dependency to a
>>> WRITE_ONCE() not in the body of the "if" statement, we need to make some
>>> change to memory-barriers.txt.
>>
>> I'm not sure what you denote by *this* in "if we change this", but just to
>> clarify, I am not thinking of claiming that there were a (semantic) control
>> dependency to WRITE_ONCE(*b, r2) in this example.
>>
>> There is however a data dependency from r2 = *a to WRITE_ONCE, and I would
>> say that there is a semantic data (not control) dependency from r1 =
>> READ_ONCE(*y) to WRITE_ONCE(*b, r2), too: depending on the value read from
>> *y, the value stored to *b will be different. The latter would be enough to
>> avoid OOTA according to the mainline LKMM, but currently this semantic
>> dependency is not detected by herd7.
>
> According to LKMM, address and data dependencies must be headed by
> rcu_dereference() or similar. See Documentation/RCU/rcu_dereference.rst.
>
> Therefore, there is nothing to chain the control dependency with.
Note that herd7 does generate dependencies. And speaking informally,
there clearly is a semantic dependency.
Both the original formalization of LKMM and my patch do say that a plain
load at the head of a dependency chain does not provide any dependency
ordering, i.e.,
[Plain & R] ; dep
is never part of hb, both in LKMM and in my patch.
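That is, in a fragment like the following (illustrative only), neither LKMM
nor my patch derives any ordering from the dependency, because the chain is
headed by a plain load:
	r1 = *x;		/* plain load heading the chain */
	WRITE_ONCE(*y, r1);	/* data-dependent marked store: no ordering */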
By the way, if your concern is the dependency *starting* from the plain
load, then we can look at examples where the dependency starts from a
marked load:
r1 = READ_ONCE(*x);
smp_rmb();
if (r1 == 1) {
r2 = READ_ONCE(*a);
}
*b = 1;
smp_wmb();
WRITE_ONCE(*y,1);
This is more or less analogous to the addr ; [Plain] ; wmb case you
already have.
>> I currently can not come up with an example where there would be a
>> (semantic) control dependency from a load to a store that is not in the arm
>> of an if statement (or a loop / switch of some form with the branch
>> depending on the load).
>>
>> I think the control dependency is just a red herring. It is only there to
>> avoid the data race.
>
> Well, that red herring needs to have a companion fish to swim with in
> order to enforce ordering, and I am not seeing that companion.
>
> Or am I (yet again!) missing something subtle here?
It makes more sense to think about how people do message passing (or
seqlock), which might look something like this:
[READ_ONCE]
rmb
[plain read]
and
[plain write]
wmb
[WRITE_ONCE]
Clearly LKMM says that there is some sort of order (not quite
happens-before order though) between the READ_ONCE and the plain read,
and between the plain write and the WRITE_ONCE. This order is clearly
defined in the data race definition, in r-pre-bounded and w-post-bounded.
Now consider
[READ_ONCE]
rmb
[plain read]
// some code that creates order between the plain accesses
[plain write]
wmb
[WRITE_ONCE]
where for some specific reason we can discern that the compiler can not
fully eliminate (or move across the barrier) this specific plain read,
the plain write, or the ordering between the two.
In this case, is there order between the READ_ONCE and the WRITE_ONCE,
or not? Of course, we know current LKMM says no. I would say that in
those very specific cases, we do have ordering.
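As a concrete sketch (variable names are illustrative only), the two halves
together are just the usual message-passing idiom:
P0(int *data, int *flag) {
	int r0 = 0;
	int r1 = READ_ONCE(*flag);	/* marked read */

	smp_rmb();
	if (r1 == 1)
		r0 = *data;	/* plain read, ordered after the READ_ONCE by the rmb */
}
P1(int *data, int *flag) {
	*data = 42;		/* plain write */
	smp_wmb();
	WRITE_ONCE(*flag, 1);	/* marked write, ordered after the plain write by the wmb */
}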
>> In a hypothetical LKMM where reading in a race is not a data race unless the
>> data is used (*1), this would also work:
>
> You lost me on the "(*1)", which might mean that I am misunderstanding
> your text and examples below.
This was meant to be a footnote :D
>> unsigned int r1;
>> unsigned int r2 = 0;
>> r1 = READ_ONCE(*x);
>> smp_rmb();
>> r2 = *b;
>
> This load from *b does not head any sort of dependency per LKMM, as noted
> in rcu_dereference.rst. As that document states, there are too many games
> that compilers are permitted to play with plain C-language loads.
>
>> WRITE_ONCE(*a, (~r1 + 1) & r2);
>> smp_wmb();
>> WRITE_ONCE(*y, 1);
>>
>>
>> Here in case r1 == 0, the value of r2 is not used, so there is a race but
>> there would not be data race in the hypothetical LKMM.
>
> That plain C-language load from b, if concurrent with any sort of store to
> b, really is a data race. Sure, a compiler that can prove that r1==0 at
> the WRITE_ONCE() to a might optimize that load away, but the C-language
> definition of data race still applies.
It is a data race according to C, but so are all races on WRITE_ONCE and
READ_ONCE, so we already do not actually care what C says.
What we care about is what the compiler says (and does).
The reality is that no matter what kind of crazy optimizations the
compiler does to the load and to the concurrent store, all that would
happen is that the load "returns" some insane value. But that insane
value is not used by the remainder of the computation.
I think the right way to think about it is that a race between a read
and a write gives the read an indeterminate value, and a race between
two writes produces undefined behavior. I vaguely recall that this is
even guaranteed by LLVM.
That is why sequence locks work, after all. In our internal memory model
we have relaxed the definition accordingly and there are a bunch of
internally used data structures that can only be verified because of the
relaxation.
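For illustration, here is a hand-rolled sequence-lock sketch (deliberately
simplified, single writer assumed; this is not the kernel's seqlock_t API):
unsigned int seq;
int data;

void writer(int v)
{
	WRITE_ONCE(seq, seq + 1);	/* odd: writer active */
	smp_wmb();
	data = v;			/* plain store, may race with readers */
	smp_wmb();
	WRITE_ONCE(seq, seq + 1);	/* even: writer done */
}

int reader(void)
{
	unsigned int s;
	int v;

	do {
		s = READ_ONCE(seq);
		smp_rmb();
		v = data;		/* racy plain load */
		smp_rmb();
	} while ((s & 1) || READ_ONCE(seq) != s);
	return v;
}
If a writer ran concurrently, v may hold garbage, but the retry check
discards it; the value is only used in executions where no race happened,
which is exactly the relaxation described above.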
>
> I am currently not at all comfortable with the thought of allowing
> plain C-language loads to head any sort of dependency. I really did put
> that restriction into both memory-barriers.txt and rcu_dereference.rst
> intentionally. There is the old saying "Discipline = freedom", and
> therefore compilers' lack of discipline surrounding plain C-language
> loads implies a lack of freedom. ;-)
Yes, I understand your concern (or more generally, the concern of
letting plain accesses play a role in ordering).
Obviously, allowing arbitrary plain loads to invoke some kind of
ordering because of a dependency is plain (heh) wrong.
There are two kinds of potential problems:
- the load or its dependent store may not exist in that location at all
- the dependency may not really exist
The second case is a problem also with marked accesses, and should be
handled by herd7 only giving us actual semantic dependencies (whatever
those are). It can not be solved in cat. Either way it is a limitation
that herd7 (and also other tools) currently has and we already live with.
So the new problem we deal with is to somehow restrict the rule to loads
and dependent stores that the compiler for whatever reason will not
fully eliminate.
This problem too can not be solved completely inside cat. We can give an
approximation, as discussed with Alan (stating that a store would not be
elided if it is read by another thread, and a read would not be elided
if it reads from another thread and a store that won't be elided depends
on it).
This approximation is also limited, e.g., if the addresses of the plain
loads and stores have not yet escaped the function, but at least this
scenario is currently impossible in herd7 (unlike the fake dependency
scenario).
In my mind it would again be better to offload the correct selection of
"compiler-un(re)movable plain loads and stores" to the tools. That may
again not solve the problem fully, but it at least would mean that any
changes to address the imprecision wouldn't need to go through the
kernel tree, and IMHO it is easier to say that the LKMM in the cat files
is the model, and that the interpretation of the model has some limitations.
> We should decide which of these examples should be
> added to the github litmus archive, perhaps to illustrate the fact that
> plain C-language loads do not head dependency chains. Thoughts?
I'm not sure that is a good idea, given that running the herd7 tool on
the litmus test will clearly show a dependency chain headed by plain
loads in the visual graphs (with `doshow rwdep`).
Maybe it makes more sense to say in the docs that they may head
syntactic or semantic dependency chains, but because of the common case
that the compiler may cruelly optimize things, LKMM does not guarantee
ordering based on the dependency chains headed by plain loads. That
would be consistent with the tooling.
> [1] https://people.kernel.org/paulmck/hunting-a-tree03-heisenbug
Fun. Thanks :) Duplication and Devious both start with D.
Best wishes
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-10 12:21 ` Jonas Oberhauser
@ 2025-01-10 21:51 ` Alan Stern
2025-01-11 12:46 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-10 21:51 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Fri, Jan 10, 2025 at 01:21:32PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/10/2025 um 4:12 AM schrieb Alan Stern:
> > On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:
> >
> > > > > So for example, in
> > > > >
> > > > > int a = 1;
> > > > > barrier();
> > > > > a = 2;
> > > > > //...
> > > > >
> > > > > the compiler does not know how the code inside barrier() accesses memory,
> > > > > including volatile memory.
> > > >
> > > > I would say rather that the compiler does not know that the values
> > > > stored in memory are the same before and after the barrier().
> > > >
> > > > Even the
> > > > values of local variables whose addresses have not been exported.
> > >
> > > No, this is not true. I used to think so too until a short while ago.
> > >
> > > But if you look at the output of gcc -O3 you will see that it does happily
> > > remove `a` in this function.
> >
> > Isn't that consistent with what I said?
>
>
> Ok, after careful reading, I think there are two assumptions you have that I
> think are not true, but my example is only inconsistent with exactly one of
> them being not true, not with both of them being not true:
>
> 1) the barrier only tells the compiler that it may *change* the value of
> memory locations. I believe it also tells the compiler that it may *read*
> the value of memory locations.
> 2) the barrier also talks about the values of local variables whose
> addresses have not been exported. I do not believe this is the case.
I checked the GCC manual. You are right about 1); the compiler is
required to guarantee that the contents of memory before the barrier are
fully up-to-date (no dirty values remaining in registers or
temporaries).
2) isn't so clear. If a local variable's address is never computed then
the compiler might put the variable in a register, in which case the
barrier would not clobber it. On the other hand, if the variable's
address is computed somewhere (even if it isn't exported) then the
variable can't be kept in a register and so it is subject to the
barrier's effects.
The manual says:
Using the ‘"memory"’ clobber effectively forms a read/write
memory barrier for the compiler.
Not a word about what happens if a variable has the "register" storage
class, for example, and so might never be stored in memory at all. But
this leaves the programmer with _no_ way to specify a memory barrier for
such variables!
Of course, the fact that these variables cannot be exposed to outside
code does mitigate the problem...
> For what gcc guarantees, the manual says: "add memory to the list of
> clobbered registers. This will cause GCC to not keep memory values cached in
> registers across the assembler instruction", i.e., it needs to flush the
> value from the register to actual memory.
Yes, GCC will write back dirty values from registers. But not because
the cached values will become invalid (in fact, the cached values might
not even be used after the barrier). Rather, because the compiler is
required to assume that the assembler code in the barrier might access
arbitrary memory locations -- even if that code is empty.
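For reference, the kernel's compiler barrier is, roughly, exactly such an
empty asm with a "memory" clobber:
#define barrier() __asm__ __volatile__("" : : : "memory")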
> > > > Question: Can the compiler assume that no other threads access a between
> > > > the two stores, on the grounds that this would be a data race? I'd
> > > > guess that it can't make that assumption, but it would be nice to know
> > > > for sure.
> > >
> > > It can not make the assumption if &a has escaped. In that case, barrier()
> > > could be so:
> > >
> > > barrier(){
> > > store_release(OTHER_THREAD_PLEASE_MODIFY,&a);
> > >
> > > while (! load_acquire(OTHER_THREAD_IS_DONE));
> > > }
> > >
> > > with another thread doing
> > >
> > > while (! load_acquire(OTHER_THREAD_PLEASE_MODIFY)) yield();
> > > (*OTHER_THREAD_PLEASE_MODIFY)++;
> > > store_release(OTHER_THREAD_IS_DONE, 1);
Okay, yes, the compiler can't know whether the assembler code will do
this. But as far as I know, there is no specification about whether
inline assembler can synchronize with code in another thread (in the
sense used by the C++ memory model) and therefore create a
happens-before link. Language specifications tend to ignore issues
like inline assembler.
Does this give the compiler license to believe no such link can exist
and therefore accesses to these non-atomic variables by another thread
concurrent with the barrier would be data races?
In the end maybe this doesn't matter.
> > > If herd7 would generate edges for semantic dependencies instead of for its
> > > version of syntactic dependencies, then rwdep is the answer.
> >
> > That statement is meaningless (or at least, impossible to implement)
> > because there is no widely agreed-upon formal definition for semantic
> > dependency.
>
> Yes, which also means that a 100% correct end-to-end solution (herd + cat +
> ... ?) is currently not implementable.
>
> But we can still break the problem into two halves, one which is 100% solved
> inside the cat file, and one which is the responsibility of herd7 and
> currently not solved (or 100% satisfactorily solvable).
I believe that Luc and the other people involved with herd7 take the
opposite point of view: herd7 is intended to do the "easy" analysis
involving only straightforward code parsing, leaving the "hard"
conceptual parts to the user-supplied .cat file.
> The advantage being that if we read the cat file as a mathematical
> definition, we can at least on paper argue 100% correctly about code for the
> cases where we either can figure out on paper what the semantic dependencies
> are, or where we at least just say "with relation to current compilers, we
> know what the semantically preserved dependencies are", even if herd7 or
> other tools lags behind in one or both.
>
> After all, herd7 is just a (useful) automation tool for reasoning about
> LKMM, which has its limitations (scalability, a specific definition of
> dependencies, limited C subset...).
I still think we should not attempt any sort of formalization of
semantic dependency here.
> > > Given that we can not define dep inside the cat model, one may as well
> > > define it as rwdep;rfe with the intended meaning of the dependencies being
> > > the semantic ones; then it is an inaccuracy of herd7 that it does not
> > > provide the proper dependencies.
> >
> > Perhaps so. LKMM does include other features which the compiler can
> > defeat if the programmer isn't sufficiently careful.
>
> How many of these are due to herd7's limitations vs. in the cat file?
Important limitations are present in both.
> > > > In the paper, I speculated that if a physical execution of a program
> > > > matches an abstract execution containing such a non-observed OOTA cycle,
> > > > then it also matches another abstract execution in which the cycle
> > > > doesn't exist. I don't know how to prove this conjecture, though.
> > >
> > > Yes, that also makes sense.
> > >
> > > Note that this speculation does not hold in the current LKMM though. In the
> > > Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\
> > > 1:r1=1 is only possible with an OOTA (even though the values from the OOTA
> > > are never used anywhere).
> >
> > If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
> > there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
> > at least, in the sense that I intended. (The situation mentioned in the
> > paper is better described as an execution where the compiler has elided
> > all the accesses in the OOTA cycle.)
>
> I'm not sure that sense makes a lot of sense to me.
Here's an example illustrating what I had in mind. Imagine that all the
accesses here are C++-style relaxed atomic (i.e., not volatile and also
not subject to data races):
P0(int *a, int *b) {
int r0 = *a;
*b = r0;
*b = 2;
// r0 not used again
}
P1(int *a, int *b) {
int r1 = *b;
*a = r1;
*a = 2;
// r1 not used again
}
The compiler could eliminate the r0 and r1 accesses entirely, leaving
just:
P0(int *a, int *b) {
*b = 2;
}
P1(int *a, int *b) {
*a = 2;
}
An execution of the corresponding machine code would then be compatible
with an abstract execution of the source code in which both r0 and r1
get set to 42 (OOTA). But it would also be compatible with an abstract
execution in which both r0 and r1 are 0, so it doesn't make sense to say
that the hardware execution is, or might be, an instance of OOTA.
> But it does make the proof of your claim totally trivial. If there is no
> other OOTA-free execution with the same observable behavior, then it is
> proof that the OOTA happened, so the OOTA was observed.
> So by contraposition any non-observed OOTA has an OOTA-free execution with
> the same observable behavior.
>
> The sense in which I would define observed is more along the lines of "there
> is an observable side effect (such as store to volatile location) which has
> a semantic dependency on a load that reads from one of the stores in the
> OOTA cycle".
Yes, I can see that from your proposed definition of Live.
I'm afraid we've wandered off the point of this email thread, however...
Getting back to the original point, why don't you rewrite your patch as
discussed earlier and describe it as an attempt to add ordering for
important situations involving plain accesses that the LKMM currently
does not handle? In other words, leave out as far as possible any
mention of OOTA or semantic dependency.
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-10 21:51 ` Alan Stern
@ 2025-01-11 12:46 ` Jonas Oberhauser
2025-01-11 21:19 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-11 12:46 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/10/2025 um 10:51 PM schrieb Alan Stern:
> On Fri, Jan 10, 2025 at 01:21:32PM +0100, Jonas Oberhauser wrote:
>>
>>
>> Am 1/10/2025 um 4:12 AM schrieb Alan Stern:
>>> On Thu, Jan 09, 2025 at 09:09:00PM +0100, Jonas Oberhauser wrote:
>>>
>>>>>> So for example, in
>>>>>>
>>>>>> int a = 1;
>>>>>> barrier();
>>>>>> a = 2;
>>>>>> //...
>>>>>>
>>>>>> the compiler does not know how the code inside barrier() accesses memory,
>>>>>> including volatile memory.
>>>>>
>>>>> I would say rather that the compiler does not know that the values
>>>>> stored in memory are the same before and after the barrier().
>>>>>
>>>>> Even the
>>>>> values of local variables whose addresses have not been exported.
>>>>
>>>> No, this is not true. I used to think so too until a short while ago.
>>>>
>>>> But if you look at the output of gcc -O3 you will see that it does happily
>>>> remove `a` in this function.
>>>
>>> Isn't that consistent with what I said?
>>
>>
>> Ok, after careful reading, I think there are two assumptions you have that I
>> think are not true, but my example is only inconsistent with exactly one of
>> them being not true, not with both of them being not true:
>>
>> [...]
>>
>> 2) the barrier also talks about the values of local variables whose
>> addresses have not been exported. I do not believe this is the case.
>
> 2) isn't so clear. If a local variable's address is never computed then
> the compiler might put the variable in a register, in which case the
> barrier would not clobber it. On the other hand, if the variable's
> address is computed somewhere (even if it isn't exported) then the
> variable can't be kept in a register and so it is subject to the
> barrier's effects.
I understood "its address is exported" to mean that enough information
has been exported to legally compute its address.
Btw, I just remembered a discussion about provenance in C & C++ which is
also very related to this, where the compiler moved a (I think
non-atomic) access across a release fence because it "knew" that the
address of the non-atomic was not exported.
I can't find that discussion now.
>> For what gcc guarantees, the manual says: "add memory to the list of
>> clobbered registers. This will cause GCC to not keep memory values cached in
>> registers across the assembler instruction", i.e., it needs to flush the
>> value from the register to actual memory.
>
> Yes, GCC will write back dirty values from registers. But not because
> the cached values will become invalid (in fact, the cached values might
> not even be used after the barrier). Rather, because the compiler is
> required to assume that the assembler code in the barrier might access
> arbitrary memory locations -- even if that code is empty.
Yes :)
>>>> If herd7 would generate edges for semantic dependencies instead of for its
>>>> version of syntactic dependencies, then rwdep is the answer.
>>>
>>> That statement is meaningless (or at least, impossible to implement)
>>> because there is no widely agreed-upon formal definition for semantic
>>> dependency.
>>
>> Yes, which also means that a 100% correct end-to-end solution (herd + cat +
>> ... ?) is currently not implementable.
>>
>> But we can still break the problem into two halves, one which is 100% solved
>> inside the cat file, and one which is the responsibility of herd7 and
>> currently not solved (or 100% satisfactorily solvable).
>
> I believe that Luc and the other people involved with herd7 take the
> opposite point of view: herd7 is intended to do the "easy" analysis
> involving only straightforward code parsing, leaving the "hard"
> conceptual parts to the user-supplied .cat file.
I can understand the attractiveness of that point of view, but there is
no way to define "semantic dependencies" at all or "live access" 100%
accurately in cat, since it requires a lot of syntactic information that
is not present at that level.
But there is in herd7 (at least for some specific definition of
"semantically dependent").
>> The advantage being that if we read the cat file as a mathematical
>> definition, we can at least on paper argue 100% correctly about code for the
>> cases where we either can figure out on paper what the semantic dependencies
>> are, or where we at least just say "with relation to current compilers, we
>> know what the semantically preserved dependencies are", even if herd7 or
>> other tools lags behind in one or both.
>>
>> After all, herd7 is just a (useful) automation tool for reasoning about
>> LKMM, which has its limitations (scalability, a specific definition of
>> dependencies, limited C subset...).
>
> I still think we should not attempt any sort of formalization of
> semantic dependency here.
I 100% agree and apologies if I ever gave that impression.
What I want is to change the interpretation of ctrl,data,addr in LKMM
from saying "it is intended to be a syntactic dependency, which causes
LKMM to be inaccurate" to "it is intended to be a semantic dependency,
but because there is no formal defn. and tooling we *use* syntactic
dependencies, which causes the current implementations to be
inaccurate", without formally defining what a semantic dependency is.
E.g., in "A WARNING", I would change
-------------
The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect; and under some circumstances it is possible for the
compiler to undermine the memory model.
-------------
into something like
-------------
The current tooling around LKMM does not model semantic dependencies,
and instead uses syntactic dependencies to specify the protections
provided by READ_ONCE(), WRITE_ONCE(), and others. The compiler can
undermine these syntactic dependencies under some circumstances.
As a consequence, the tooling may write checks that LKMM and the
compiler can not cash.
-------------
etc.
This is under my assumption that if we had let's say gcc's "semantic
dependencies" or an under-approximation of it (by that I mean allow less
things to be dependent than gcc can see), that these cases would be
resolved, in the sense that gcc can not undermine [R & Marked] ; gcc-dep
where gcc-dep is the dependencies detected by gcc.
But I would be interested to see any cases where this assumption is not
true.
Note that this still results in a fully formal definition of LKMM,
because just like now, addr, ctrl, and data are simply uninterpreted
relations. We don't need to formalize their meaning at that level.
>>>> Given that we can not define dep inside the cat model, one may as well
>>>> define it as rwdep;rfe with the intended meaning of the dependencies being
>>>> the semantic ones; then it is an inaccuracy of herd7 that it does not
>>>> provide the proper dependencies.
>>>
>>> Perhaps so. LKMM does include other features which the compiler can
>>> defeat if the programmer isn't sufficiently careful.
>>
>> How many of these are due to herd7's limitations vs. in the cat file?
>
> Important limitations are present in both.
I am genuinely asking. Do we have a list of the limitations?
Maybe it would be good to collect it in the "A WARNING" section of
explanation.txt if it doesn't exist elsewhere.
>>>>> In the paper, I speculated that if a physical execution of a program
>>>>> matches an abstract execution containing such a non-observed OOTA cycle,
>>>>> then it also matches another abstract execution in which the cycle
>>>>> doesn't exist. I don't know how to prove this conjecture, though.
>>>>
>>>> Yes, that also makes sense.
>>>>
>>>> Note that this speculation does not hold in the current LKMM though. In the
>>>> Litmus test I shared in the opening e-mail, where the outcome 0:r1=1 /\
>>>> 1:r1=1 is only possible with an OOTA (even though the values from the OOTA
>>>> are never used anywhere).
>>>
>>> If the fact that the outcome 0:r1=1 /\ 1:r1=1 has occurred is proof that
>>> there was OOTA, then the OOTA cycle _is_ observed, albeit indirectly --
>>> at least, in the sense that I intended. (The situation mentioned in the
>>> paper is better described as an execution where the compiler has elided
>>> all the accesses in the OOTA cycle.)
>>
>> I'm not sure that sense makes a lot of sense to me.
>
> Here's an example illustrating what I had in mind. Imagine that all the
> accesses here are C++-style relaxed atomic (i.e., not volatile and also
> not subject to data races):
>
> P0(int *a, int *b) {
> int r0 = *a;
> *b = r0;
> *b = 2;
> // r0 not used again
> }
>
> P1(int *a, int *b) {
> int r1 = *b;
> *a = r1;
> *a = 2;
> // r1 not used again
> }
>
> The compiler could eliminate the r0 and r1 accesses entirely, leaving
> just:
>
> P0(int *a, int *b) {
> *b = 2;
> }
>
> P1(int *a, int *b) {
> *a = 2;
> }
>
> An execution of the corresponding machine code would then be compatible
> with an abstract execution of the source code in which both r0 and r1
> get set to 42 (OOTA). But it would also be compatible with an abstract
> execution in which both r0 and r1 are 0, so it doesn't make sense to say
> that the hardware execution is, or might be, an instance of OOTA.
Yes, but this does not require the definition you expressed before. This
is already not observable according to the definition that there is no
read R from an OOTA-cycle store, where some observable side effect
semantically depends on R.
What I was trying to say is that your definition is almost a
no-true-scotsman (sorry Paul) definition: every program with an OOTA
execution where you could potentially not find an "equivalent" execution
without OOTA, is simply labeled a "no-true-unobserved OOTA".
>> But it does make the proof of your claim totally trivial. If there is no
>> other OOTA-free execution with the same observable behavior, then it is
>> proof that the OOTA happened, so the OOTA was observed.
>> So by contraposition any non-observed OOTA has an OOTA-free execution with
>> the same observable behavior.
>>
>> The sense in which I would define observed is more along the lines of "there
>> is an observable side effect (such as store to volatile location) which has
>> a semantic dependency on a load that reads from one of the stores in the
>> OOTA cycle".
>
> Yes, I can see that from your proposed definition of Live.
> I'm afraid we've wandered off the point of this email thread, however...
Maybe, but it is still an important and interesting discussion.
I am also open to continuing it on another channel though.
> Getting back to the original point, why don't you rewrite your patch as
> discussed earlier and describe it as an attempt to add ordering for
> important situations involving plain accesses that the LKMM currently
> does not handle? In other words, leave out as far as possible any
> mention of OOTA or semantic dependency.
I will try, but I have some other things to do first, so it may take a
while.
Have fun,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-11 12:46 ` Jonas Oberhauser
@ 2025-01-11 21:19 ` Alan Stern
2025-01-12 15:55 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-11 21:19 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Sat, Jan 11, 2025 at 01:46:21PM +0100, Jonas Oberhauser wrote:
> What I want is to change the interpretation of ctrl,data,addr in LKMM from
> saying "it is intended to be a syntactic dependency, which causes LKMM to be
> inaccurate" to "it is intended to be a semantic dependency, but because
> there is no formal defn. and tooling we *use* syntactic dependencies, which
> causes the current implementations to be inaccurate", without formally
> defining what a semantic dependency is.
I guess you could take that point of view. We have never tried to make
it explicit.
> E.g., in "A WARNING", I would change
> -------------
> The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
> not perfect; and under some circumstances it is possible for the
> compiler to undermine the memory model.
> -------------
> into something like
> -------------
> The current tooling around LKMM does not model semantic dependencies, and
> instead uses syntactic dependencies to specify the protections provided by
> READ_ONCE(), WRITE_ONCE(), and others. The compiler can undermine these
> syntactic dependencies under some circumstances.
> As a consequence, the tooling may write checks that LKMM and the compiler
> can not cash.
> -------------
>
> etc.
I doubt the ordinary reader would appreciate the difference.
How would you feel about changing the existing text this way instead?
... it is possible for the compiler to undermine the memory
model (because the LKMM and the software tools associated with
it do not understand the somewhat vague concept of "semantic
dependency" -- see below).
> This is under my assumption that if we had let's say gcc's "semantic
> dependencies" or an under-approximation of it (by that I mean allow less
> things to be dependent than gcc can see), that these cases would be
> resolved, in the sense that gcc can not undermine [R & Marked] ; gcc-dep
> where gcc-dep is the dependencies detected by gcc.
That seems circular. Basically, you're saying the gcc will not break
any dependencies that gcc classifies as not-breakable!
> But I would be interested to see any cases where this assumption is not
> true.
>
>
> Note that this still results in a fully formal definition of LKMM, because
> just like now, addr, ctrl, and data are simply uninterpreted relations. We
> don't need to formalize their meaning at that level.
> > > > Perhaps so. LKMM does include other features which the compiler can
> > > > defeat if the programmer isn't sufficiently careful.
> > >
> > > How many of these are due to herd7's limitations vs. in the cat file?
> >
> > Important limitations are present in both.
>
> I am genuinely asking. Do we have a list of the limitations?
> Maybe it would be good to collect it in the "A WARNING" section of
> explanation.txt if it doesn't exist elsewhere.
There are a few listed already at various spots in explanation.txt --
search for "undermine". And yes, many or most of these limitations do
arise from LKMM's failure to recognize when a dependency isn't semantic.
Maybe some are also related to undefined behavior, which LKMM is not
aware of.
There is one other weakness I know of, however -- something totally
different. It's an instance in which the formal model in the .cat file
fails to capture the intent of the informal operational model.
As I recall, it goes like this: The operational model asserts that an
A-cumulative fence (like a release fence) on a CPU ensures ordering for
all stores that propagate to that CPU before the fence is executed. But
the .cat code says only that this ordering applies to stores which the
CPU reads from before the fence is executed. I believe you can make up
litmus tests where you can prove that a store must have propagated to
the CPU before an A-cumulative fence occurs, even though the CPU doesn't
read from that store; for such examples the LKMM may accept executions
that shouldn't be allowed.
There may even be an instance of this somewhere in the various litmus
test archives; I don't remember.
> > An execution of the corresponding machine code would then be compatible
> > with an abstract execution of the source code in which both r0 and r1
> > get set to 42 (OOTA). But it would also be compatible with an abstract
> > execution in which both r0 and r1 are 0, so it doesn't make sense to say
> > that the hardware execution is, or might be, an instance of OOTA.
>
> Yes, but this does not require the definition you expressed before. This is
> already not observable according to the definition that there is no read R
> from an OOTA-cycle store, where some observable side effect semantically
> depends on R.
Yes. This was not meant to be an exact analogy. It was merely an
observation of a concept closely related to something you said.
> What I was trying to say is that your definition is almost a
> no-true-scotsman (sorry Paul) definition:
(I would use the term "tautology" -- although I don't believe it was a
tautology in its original context, which referred specifically to cases
where the compiler had removed all the accesses in an OOTA cycle.)
> every program with an OOTA
> execution where you could potentially not find an "equivalent" execution
> without OOTA, is simply labeled a "no-true-unobserved OOTA".
>
> > > But it does make the proof of your claim totally trivial. If there is no
> > > other OOTA-free execution with the same observable behavior, then it is
> > > proof that the OOTA happened, so the OOTA was observed.
> > > So by contraposition any non-observed OOTA has an OOTA-free execution with
> > > the same observable behavior.
> > >
> > > The sense in which I would define observed is more along the lines of "there
> > > is an observable side effect (such as store to volatile location) which has
> > > a semantic dependency on a load that reads from one of the stores in the
> > > OOTA cycle".
Agreed. But maybe this indicates that your sense is too weak a
criterion, that an indirect observation should count just as much as a
direct one.
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-11 21:19 ` Alan Stern
@ 2025-01-12 15:55 ` Jonas Oberhauser
2025-01-13 19:43 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-12 15:55 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/11/2025 um 10:19 PM schrieb Alan Stern:
> On Sat, Jan 11, 2025 at 01:46:21PM +0100, Jonas Oberhauser wrote:
>>
>> This is under my assumption that if we had let's say gcc's "semantic
>> dependencies" or an under-approximation of it (by that I mean allow less
>> things to be dependent than gcc can see), that these cases would be
>> resolved, in the sense that gcc can not undermine [R & Marked] ; gcc-dep
>> where gcc-dep is the dependencies detected by gcc.
>
> That seems circular. Basically, you're saying the gcc will not break
> any dependencies that gcc classifies as not-breakable!
Maybe my formulation is not exactly what I meant to express.
I am thinking of examples like this,
r1 = READ_ONCE(*x);
*b = r1;
~~>
r1 = READ_ONCE(*x);
if (*b != r1) {
*b = r1;
}
Here there is clearly a dependency to a store, but gcc might turn it
into an independent load (in case *b == r1).
Just because gcc admits that there is a dependency, does not necessarily
mean that it will not still undermine the ordering "bestowed upon" that
dependency by a memory model in some creative way.
The cases that I could think of all still worked for very specific
architecture-specific reasons (e.g., x86 has CMOV but all loads provide
acquire-ordering, and arm does not have flag-conditional str, etc.)
Or perhaps there is no dependency in case *b == r1. I am not sure.
Another thought that pops up here is that when I last worked on
formalizing dependencies, I could not define dependencies as being
between one load and one store, a dependency might be between a set of
loads and one store. I would have to look up the exact reason, but I
think it was because sometimes you need to change more than one value to
influence the result, e.g., a && b where both a and b are 0 - just
changing one will not make a difference.
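A minimal sketch of that a && b situation (names are illustrative):
	r1 = READ_ONCE(*a);		/* reads 0 */
	r2 = READ_ONCE(*b);		/* reads 0 */
	WRITE_ONCE(*c, r1 && r2);
	/* Flipping only r1 (or only r2) to 1 still stores 0, so by the
	 * "change one input" criterion neither load alone carries the
	 * dependency; only the pair of loads does. */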
All of these complications make me wonder whether even a relational
notion of semantic dependency is good enough.
>>>> Alan Stern:
>>>>> Perhaps so. LKMM does include other features which the compiler can
>>>>> defeat if the programmer isn't sufficiently careful.
>>>>
>>>> How many of these are due to herd7's limitations vs. in the cat file?
>>>
>>> Important limitations are present in both.
>>
>> I am genuinely asking. Do we have a list of the limitations?
>> Maybe it would be good to collect it in the "A WARNING" section of
>> explanation.txt if it doesn't exist elsewhere.
>
> There are a few listed already at various spots in explanation.txt --
> search for "undermine". And yes, many or most of these limitations do
> arise from LKMM's failure to recognize when a dependency isn't semantic.
> Maybe some are also related to undefined behavior, which LKMM is not
> aware of.
Thanks. Another point mentioned in that document is the total order of
po, whereas C has a more relaxed notion of sequenced-before; this could
even affect volatile accesses, e.g. in f(READ_ONCE(*a), g()), where g()
calls an rmb and then READ_ONCE(*b), and it is not clear whether there
should be a ppo from reading *a to *b in some executions or not.
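Spelled out (f() and g() are hypothetical helpers):
static int g(int *b)
{
	smp_rmb();
	return READ_ONCE(*b);
}

	/* in the caller: */
	f(READ_ONCE(*a), g(b));
C leaves the two argument evaluations unsequenced, so whether the rmb in
g() sits between the read of *a and the read of *b depends on which
evaluation order the compiler picks, while po in LKMM is a total order
per thread.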
>
> There is one other weakness I know of, however -- something totally
> different. It's an instance in which the formal model in the .cat file
> fails to capture the intent of the informal operational model.
>
> As I recall, it goes like this: The operational model asserts that an
> A-cumulative fence (like a release fence) on a CPU ensures ordering for
> all stores that propagate to that CPU before the fence is executed. But
> the .cat code says only that this ordering applies to stores which the
> CPU reads from before the fence is executed. I believe you can make up
> litmus tests where you can prove that a store must have propagated to
> the CPU before an A-cumulative fence occurs, even though the CPU doesn't
> read from that store; for such examples the LKMM may accept executions
> that shouldn't be allowed.
>
> There may even be an instance of this somewhere in the various litmus
> test archives; I don't remember.
Ah, I think you explained this case to me before. It is the one where
prop & int covers some but not all cases, right? (compared to the cat
PowerPC model, which does not even have prop & int)
For now I am mostly worried about cases where LKMM promises too much,
rather than too little. The latter case arises naturally as a trade-off
between complexity of the model, algorithmic complexity, and what
guarantees are actually needed from the model by real code.
By the way, I am currently investigating a formulation of LKMM where
there is a separate "propagation order" per thread, prop-order[t], which
relates `a` to `b` events iff `a` is "observed" to propagate to t before
`b`. Observation can also include co and fr, not just rf, which might be
sufficient to cover those cases. I have a hand-written proof sketch that
an order ORD induced by prop-order[t], xb, and some other orders is
acyclic, and a Coq proof that executing an operational model in any
linearization of ORD is permitted (e.g., does not propagate a store
before it is executed, or in violation of co) and has the same rf as the
axiomatic execution.
So if the proof sketch works out, this might indicate that with such a
per-thread propagation order, one can eliminate those cases.
But to make the definition work, I had to make xb use prop-order[t]
instead of prop in some cases, and the definitions of xb and
prop-order[t] are mutually recursive, so it's not a very digestible
definition.
So I do not recommend replacing LKMM with such a model even if it works,
but it is useful for investigating some boundary conditions.
>>> An execution of the corresponding machine code would then be compatible
>>> with an abstract execution of the source code in which both r0 and r1
>>> get set to 42 (OOTA). But it would also be compatible with an abstract
>>> execution in which both r0 and r1 are 0, so it doesn't make sense to say
>>> that the hardware execution is, or might be, an instance of OOTA.
>>
>> Yes, but this does not require the definition you expressed before. This is
>> already not observable according to the definition that there is no read R
>> from an OOTA-cycle store, where some observable side effect semantically
>> depends on R.
>>
>> What I was trying to say is that your definition is almost a
>> no-true-scotsman (sorry Paul) definition:
>
> (I would use the term "tautology" -- although I don't believe it was a
> tautology in its original context, which referred specifically to cases
> where the compiler had removed all the accesses in an OOTA cycle.)
>
>> every program with an OOTA
>> execution where you could potentially not find an "equivalent" execution
>> without OOTA, is simply labeled a "no-true-unobserved OOTA".
>>
>>>> But it does make the proof of your claim totally trivial. If there is no
>>>> other OOTA-free execution with the same observable behavior, then it is
>>>> proof that the OOTA happened, so the OOTA was observed.
>>>> So by contraposition any non-observed OOTA has an OOTA-free execution with
>>>> the same observable behavior.
>>>>
>>>> The sense in which I would define observed is more along the lines of "there
>>>> is an observable side effect (such as store to volatile location) which has
>>>> a semantic dependency on a load that reads from one of the stores in the
>>>> OOTA cycle".
>
> Agreed. But maybe this indicates that your sense is too weak a
> criterion, that an indirect observation should count just as much as a
> direct one.
I don't follow this conclusion.
I think there are two relevant claims here:
1) compilers do not introduce "observed OOTA"
2) For every execution graph with not-observed OOTA, there is another
execution with the same observable side effects that does not have OOTA.
While it may be much easier to prove 2) with a more relaxed notion of
observed OOTA, 1) sounds much harder. How does the compiler know that
there is no indirect way to observe the OOTA?
E.g., in the LKMM example, ignoring the compiler barriers, it might be
possible for a compiler to deduce that the plain accesses are never used
and delete them, resulting in an OOTA that is observed under the more
relaxed setting, violating claim 1).
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-12 15:55 ` Jonas Oberhauser
@ 2025-01-13 19:43 ` Alan Stern
0 siblings, 0 replies; 59+ messages in thread
From: Alan Stern @ 2025-01-13 19:43 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Sun, Jan 12, 2025 at 04:55:07PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/11/2025 um 10:19 PM schrieb Alan Stern:
> > On Sat, Jan 11, 2025 at 01:46:21PM +0100, Jonas Oberhauser wrote:
> >>
> > > This is under my assumption that if we had let's say gcc's "semantic
> > > dependencies" or an under-approximation of it (by that I mean allow less
> > > things to be dependent than gcc can see), that these cases would be
> > > resolved, in the sense that gcc can not undermine [R & Marked] ; gcc-dep
> > > where gcc-dep is the dependencies detected by gcc.
> >
> > That seems circular. Basically, you're saying the gcc will not break
> > any dependencies that gcc classifies as not-breakable!
>
> Maybe my formulation is not exactly what I meant to express.
> I am thinking of examples like this,
>
> r1 = READ_ONCE(*x);
> *b = r1;
>
> ~~>
>
> r1 = READ_ONCE(*x);
> if (*b != r1) {
> *b = r1;
> }
>
> Here there is clearly a dependency to a store, but gcc might turn it into an
> independent load (in case *b == r1).
>
> Just because gcc admits that there is a dependency, does not necessarily
> mean that it will not still undermine the ordering "bestowed upon" that
> dependency by a memory model in some creative way.
Yes; that's why the LKMM considers only ordering involving marked
accesses, unless absolutely necessary.
> The cases that I could think of all still worked for very specific
> architecture-specific reasons (e.g., x86 has CMOV but all loads provide
> acquire-ordering, and arm does not have flag-conditional str, etc.)
>
> Or perhaps there is no dependency in case *b == r1. I am not sure.
>
> Another thought that pops up here is that when I last worked on formalizing
> dependencies, I could not define dependencies as being between one load and
> one store, a dependency might be between a set of loads and one store. I
> would have to look up the exact reason, but I think it was because sometimes
> you need to change more than one value to influence the result, e.g., a && b
> where both a and b are 0 - just changing one will not make a difference.
Paul and I mentioned exactly this issue in our C++ presentation. It's
in the paper; I had to reformulate the definition of OOTA to take it
into account.
> All of these complications make me wonder whether even a relational notion
> of semantic dependency is good enough.
For handling OOTA, probably not.
> > > I am genuinely asking. Do we have a list of the limitations?
> > > Maybe it would be good to collect it in the "A WARNING" section of
> > > explanation.txt if it doesn't exist elsewhere.
> >
> > There are a few listed already at various spots in explanation.txt --
> > search for "undermine". And yes, many or most of these limitations do
> > arise from LKMM's failure to recognize when a dependency isn't semantic.
> > Maybe some are also related to undefined behavior, which LKMM is not
> > aware of.
>
> Thanks. Another point mentioned in that document is the total order of po,
> whereas C has a more relaxed notion of sequenced-before; this could even
> affect volatile accesses, e.g. in f(READ_ONCE(*a), g()), where g() calls an
> rmb and then READ_ONCE(*b), and it is not clear whether there should be a
> ppo from reading *a to *b in some executions or not.
Hah, yes. Of course, this isn't a problem for the litmus tests that
herd7 will accept, because herd7 doesn't understand user-defined
functions.
Even so, we can run across odd things like this:
r1 = READ_ONCE(*x);
if (r1)
smp_mb();
WRITE_ONCE(*y, 1);
Here the WRITE_ONCE has to be ordered after the READ_ONCE, even when *x
is 0, although the memory model isn't aware of that fact.
> For now I am mostly worried about cases where LKMM promises too much, rather
> than too little. The latter case arises naturally as a trade-off between
> complexity of the model, algorithmic complexity, and what guarantees are
> actually needed from the model by real code.
Agreed, promising too much is a worse problem than promising too little.
> By the way, I am currently investigating a formulation of LKMM where there
> is a separate "propagation order" per thread, prop-order[t], which relates
> `a` to `b` events iff `a` is "observed" to propagate to t before `b`.
I guess this would depend on what you mean by "observed". Are there any
useful cases of this where b isn't executed by t? All I can think of is
that when the right sort of memory barrier separates a from b, then for
all t, a propagates to t before b does -- and this does not depend on t.
> Observation can also include co and fr, not just rf, which might be
> sufficient to cover those cases. I have a hand-written proof sketch that an
> order ORD induced by prop-order[t], xb, and some other orders is acyclic,
> and a Coq proof that executing an operational model in any linearization of
> ORD is permitted (e.g., does not propagate a store before it is executed, or
> in violation of co) and has the same rf as the axiomatic execution.
>
> So if the proof sketch works out, this might indicate that with such a
> per-thread propagation order, one can eliminate those cases.
>
> But to make the definition work, I had to make xb use prop-order[t] instead
> of prop in some cases, and the definitions of xb and prop-order[t] are
> mutually recursive, so it's not a very digestible definition.
These mutually recursive definitions are the bane of memory models. In
fact, that's what drove Will Deacon to impose other-multicopy atomicity
on the ARM64 memory model -- doing so allowed him to avoid mutual
recursion.
> I think there are two relevant claims here:
> 1) compilers do not introduce "observed OOTA"
> 2) For every execution graph with not-observed OOTA, there is another
> execution with the same observable side effects that does not have OOTA.
>
> While it may be much easier to prove 2) with a more relaxed notion of
> observed OOTA, 1) sounds much harder. How does the compiler know that there
> is no indirect way to observe the OOTA?
>
> E.g., in the LKMM example, ignoring the compiler barriers, it might be
> possible for a compiler to deduce that the plain accesses are never used and
> delete them, resulting in an OOTA that is observed under the more relaxed
> setting, violating claim 1).
A main point of the paper Paul and I wrote for the C++ working group was
that the _only_ way OOTA can occur in real-world programs is if all the
accesses have been removed by the compiler (assuming the compiler obeys
some minimal restrictions).
We did not consider the issue of whether such instances of OOTA could be
considered to be "observed". As I understand it, they would not satisfy
your proposed notion of "observed".
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-10 16:21 ` Jonas Oberhauser
@ 2025-01-13 22:04 ` Paul E. McKenney
2025-01-16 18:40 ` Paul E. McKenney
2025-01-16 19:08 ` Jonas Oberhauser
0 siblings, 2 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-13 22:04 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
> > On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
> > > Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
> > > > On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
> > > > >
> > > > >
> > > > > Am 1/8/2025 um 7:09 PM schrieb Paul E. McKenney:
> > > > > > If I change the two plain assignments to use WRITE_ONCE() as required
> > > > > > by memory-barriers.txt, OOTA is avoided:
> > > > >
> > > > >
> > > > > I think this direction of inquiry is a bit misleading. There need not be any
> > > > > speculative store at all:
> > > > >
> > > > >
> > > > >
> > > > > P0(int *a, int *b, int *x, int *y) {
> > > > > int r1;
> > > > > int r2 = 0;
> > > > > r1 = READ_ONCE(*x);
> > > > > smp_rmb();
> > > > > if (r1 == 1) {
> > > > > r2 = *b;
> > > > > }
> > > > > WRITE_ONCE(*a, r2);
> > > > > smp_wmb();
> > > > > WRITE_ONCE(*y, 1);
> > > > > }
> > > > >
> > > > > P1(int *a, int *b, int *x, int *y) {
> > > > > int r1;
> > > > >
> > > > > int r2 = 0;
> > > > >
> > > > > r1 = READ_ONCE(*y);
> > > > > smp_rmb();
> > > > > if (r1 == 1) {
> > > > > r2 = *a;
> > > > > }
> > > > > WRITE_ONCE(*b, r2);
> > > > > smp_wmb();
> > > > > WRITE_ONCE(*x, 1);
> > > > > }
> > > > >
> > > > >
> > > > If we want to respect something containing a control dependency to a
> > > > WRITE_ONCE() not in the body of the "if" statement, we need to make some
> > > > change to memory-barriers.txt.
> > >
> > > I'm not sure what you denote by *this* in "if we change this", but just to
> > > clarify, I am not thinking of claiming that there were a (semantic) control
> > > dependency to WRITE_ONCE(*b, r2) in this example.
> > >
> > > There is however a data dependency from r2 = *a to WRITE_ONCE, and I would
> > > say that there is a semantic data (not control) dependency from r1 =
> > > READ_ONCE(*y) to WRITE_ONCE(*b, r2), too: depending on the value read from
> > > *y, the value stored to *b will be different. The latter would be enough to
> > > avoid OOTA according to the mainline LKMM, but currently this semantic
> > > dependency is not detected by herd7.
> >
> > According to LKMM, address and data dependencies must be headed by
> > rcu_dereference() or similar. See Documentation/RCU/rcu_dereference.rst.
> >
> > Therefore, there is nothing to chain the control dependency with.
>
> Note that herd7 does generate dependencies. And speaking informally, there
> clearly is a semantic dependency.
>
> Both the original formalization of LKMM and my patch do say that a plain
> load at the head of a dependency chain does not provide any dependency
> ordering, i.e.,
>
> [Plain & R] ; dep
>
> is never part of hb, both in LKMM and in my patch.
Agreed, LKMM does filter the underlying herd7 dependencies.
> By the way, if your concern is the dependency *starting* from the plain
> load, then we can look at examples where the dependency starts from a marked
> load:
>
>
> r1 = READ_ONCE(*x);
> smp_rmb();
> if (r1 == 1) {
> r2 = READ_ONCE(*a);
> }
> *b = 1;
> smp_wmb();
> WRITE_ONCE(*y,1);
>
> This is more or less analogous to the case of the addr ; [Plain] ; wmb case
> you already have.
This is probably a failure of imagination on my part, but I am not
seeing how to create another thread that interacts with that store to
"b" without resulting in a data race.
Ignoring that, I am not seeing much in the way of LKMM dependencies
in that code.
> > > I currently can not come up with an example where there would be a
> > > (semantic) control dependency from a load to a store that is not in the arm
> > > of an if statement (or a loop / switch of some form with the branch
> > > depending on the load).
> > >
> > > I think the control dependency is just a red herring. It is only there to
> > > avoid the data race.
> >
> > Well, that red herring needs to have a companion fish to swim with in
> > order to enforce ordering, and I am not seeing that companion.
> >
> > Or am I (yet again!) missing something subtle here?
>
> It makes more sense to think about how people do message passing (or
> seqlock), which might look something like this:
>
> [READ_ONCE]
> rmb
> [plain read]
>
> and
>
> [plain write]
> wmb
> [WRITE_ONCE]
>
>
> Clearly LKMM says that there is some sort of order (not quite happens-before
> order though) between the READ_ONCE and the plain read, and between the
> plain write and the WRITE_ONCE. This order is clearly defined in the data
> race definition, in r-pre-bounded and w-post-bounded.
>
> Now consider
>
> [READ_ONCE]
> rmb
> [plain read]
> // some code that creates order between the plain accesses
> [plain write]
> wmb
> [WRITE_ONCE]
>
> where for some specific reason we can discern that the compiler can not
> fully eliminate/move across the barrier either this specific plain read, nor
> the plain write, nor the ordering between the two.
>
> In this case, is there order between the READ_ONCE and the WRITE_ONCE, or
> not? Of course, we know current LKMM says no. I would say that in those very
> specific cases, we do have ordering.
Agreed, for LKMM to deal with seqlock, the read-side critical section
would need to use READ_ONCE(), which is a bit unnatural. The C++
standards committee has been discussing this for some time, as that
memory model also gives data race in that case.
But it might be better to directly model seqlock than to try to make
LKMM deal with the underlying atomic operations.
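For concreteness, the seqcount pattern being discussed looks roughly like
the following (a simplified sketch rather than the actual kernel
implementation; the struct and variable names are made up):

	struct sdata {
		unsigned int seq;
		int data1, data2;
	};

	static void writer(struct sdata *s, int v1, int v2)
	{
		WRITE_ONCE(s->seq, s->seq + 1);	/* counter goes odd: writer active */
		smp_wmb();
		s->data1 = v1;			/* plain writes in the write-side CS */
		s->data2 = v2;
		smp_wmb();
		WRITE_ONCE(s->seq, s->seq + 1);	/* counter goes even: writer done */
	}

	static void reader(struct sdata *s, int *r1, int *r2)
	{
		unsigned int seq;

		do {
			seq = READ_ONCE(s->seq);
			smp_rmb();
			*r1 = s->data1;		/* plain reads inside the read-side CS */
			*r2 = s->data2;
			smp_rmb();
		} while ((seq & 1) || READ_ONCE(s->seq) != seq);
	}

The plain reads in the retry loop are exactly the [READ_ONCE]; rmb;
[plain read] shape above, and they can race with the writer's plain
writes, which is the data race mentioned above.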
> > > In a hypothetical LKMM where reading in a race is not a data race unless the
> > > data is used (*1), this would also work:
> >
> > You lost me on the "(*1)", which might mean that I am misunderstanding
> > your text and examples below.
>
> This was meant to be a footnote :D
My first thought was indirecting through a value-1 pointer without the
needed casts. ;-)
> > > unsigned int r1;
> > > unsigned int r2 = 0;
> > > r1 = READ_ONCE(*x);
> > > smp_rmb();
> > > r2 = *b;
> >
> > This load from *b does not head any sort of dependency per LKMM, as noted
> > in rcu_dereference.rst. As that document states, there are too many games
> > that compilers are permitted to play with plain C-language loads.
> >
> > > WRITE_ONCE(*a, (~r1 + 1) & r2);
> > > smp_wmb();
> > > WRITE_ONCE(*y, 1);
> > >
> > >
> > > Here in case r1 == 0, the value of r2 is not used, so there is a race but
> > > there would not be data race in the hypothetical LKMM.
> >
> > That plain C-language load from b, if concurrent with any sort of store to
> > b, really is a data race. Sure, a compiler that can prove that r1==0 at
> > the WRITE_ONCE() to a might optimize that load away, but the C-language
> > definition of data race still applies.
>
> It is a data race according to C, but so are all races on WRITE_ONCE and
> READ_ONCE, so we already do not actually care what C says.
The situation with WRITE_ONCE() and READ_ONCE() is debatable.
The volatile accesses are defined more by folklore than standardese,
and they do seriously constrain the compiler. Which is why many people
don't like volatile much.
> What we care about is what the compiler says (and does).
>
> The reality is that no matter what kind of crazy optimizations the compiler
> does to the load and to the concurrent store, all that would happen is that
> the load "returns" some insane value. But that insane value is not used by
> the remainder of the computation.
Yes, but only as long as you properly constrain what the code in that
read-side critical section is permitted to do.
> I think the right way to think about it is that a race between a read and a
> write gives the read an indeterminate value, and a race between two writes
> produces undefined behavior. I vaguely recall that this is even guaranteed
> by LLVM.
>
> That is why sequence locks work, after all. In our internal memory model we
> have relaxed the definition accordingly and there are a bunch of internally
> used datastructures that can only be verified because of the relaxation.
Again, it is likely best to model sequence locking directly.
> > I am currently not at all comfortable with the thought of allowing
> > plain C-language loads to head any sort of dependency. I really did put
> > that restriction into both memory-barriers.txt and rcu_dereference.rst
> > intentionally. There is the old saying "Discipline = freedom", and
> > therefore compilers' lack of discipline surrounding plain C-language
> > loads implies a lack of freedom. ;-)
>
> Yes, I understand your concern (or more generally, the concern of letting
> plain accesses play a role in ordering).
> Obviously, allowing arbitrary plain loads to invoke some kind of ordering
> because of a dependency is plain (heh) wrong.
> There are two kinds of potential problems:
> - the load or its dependent store may not exist in that location at all
> - the dependency may not really exist
>
> The second case is a problem also with marked accesses, and should be
> handled by herd7 only giving us actual semantic dependencies (whatever those
> are). It can not be solved in cat. Either way it is a limitation that herd7
> (and also other tools) currently has and we already live with.
This is easy to say. Please see this paper (which Alan also referred to)
for some of the challenges:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3064r2.pdf
Mark Batty and his group have identified more gotchas. (See the citation
in the above paper.)
> So the new problem we deal with is to somehow restrict the rule to loads and
> dependent stores that the compiler for whatever reason will not fully
> eliminate.
>
> This problem too can not be solved completely inside cat. We can give an
> approximation, as discussed with Alan (stating that a store would not be
> elided if it is read by another thread, and a read would not be elided if it
> reads from another thread and a store that won't be elided depends on it).
>
> This approximation is also limited, e.g., if the addresses of the plain
> loads and stores have not yet escaped the function, but at least this
> scenario is currently impossible in herd7 (unlike the fake dependency
> scenario).
>
> In my mind it would again be better to offload the correct selection of
> "compiler-un(re)movable plain loads and stores" to the tools. That may again
> not solve the problem fully, but it at least would mean that any changes to
> address the imprecision wouldn't need to go through the kernel tree, and
> IMHO it is easier to say LKMM in the cat files is the model, and the
> interpretation of the model has some limitations.
Which we currently do via marked accesses. ;-)
> > We should decide which of these examples should be
> > added to the github litmus archive, perhaps to illustrate the fact that
> > plain C-language loads do not head dependency chains. Thoughts?
>
> I'm not sure that is a good idea, given that running the herd7 tool on the
> litmus test will clearly show a dependency chain headed by plain loads in
> the visual graphs (with `doshow rwdep`).
>
> Maybe it makes more sense to say in the docs that they may head syntactic or
> semantic dependency chains, but because of the common case that the compiler
> may cruelly optimize things, LKMM does not guarantee ordering based on the
> dependency chains headed by plain loads. That would be consistent with the
> tooling.
Well, LKMM and herd7 have different notions of what constitutes a
dependency, and the LKMM notion is most relevant here. (And herd7
needs its broader definition so as to handle a wide variety of
memory models.)
> > [1] https://people.kernel.org/paulmck/hunting-a-tree03-heisenbug
>
> Fun. Thanks :) Duplication and Devious both start with D.
In reference to the performance and scalability consequences of
eliminating that duplication, so does Devine! ;-)
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-13 22:04 ` Paul E. McKenney
@ 2025-01-16 18:40 ` Paul E. McKenney
2025-01-16 19:13 ` Jonas Oberhauser
2025-01-16 19:28 ` Jonas Oberhauser
2025-01-16 19:08 ` Jonas Oberhauser
1 sibling, 2 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-16 18:40 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Mon, Jan 13, 2025 at 02:04:58PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:
> > Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
> > > On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
> > > > Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
> > > > > On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
[ . . . ]
> > > > I currently can not come up with an example where there would be a
> > > > (semantic) control dependency from a load to a store that is not in the arm
> > > > of an if statement (or a loop / switch of some form with the branch
> > > > depending on the load).
> > > >
> > > > I think the control dependency is just a red herring. It is only there to
> > > > avoid the data race.
> > >
> > > Well, that red herring needs to have a companion fish to swim with in
> > > order to enforce ordering, and I am not seeing that companion.
> > >
> > > Or am I (yet again!) missing something subtle here?
> >
> > It makes more sense to think about how people do message passing (or
> > seqlock), which might look something like this:
> >
> > [READ_ONCE]
> > rmb
> > [plain read]
> >
> > and
> >
> > [plain write]
> > wmb
> > [WRITE_ONCE]
> >
> >
> > Clearly LKMM says that there is some sort of order (not quite happens-before
> > order though) between the READ_ONCE and the plain read, and between the
> > plain write and the WRITE_ONCE. This order is clearly defined in the data
> > race definition, in r-pre-bounded and w-post-bounded.
> >
> > Now consider
> >
> > [READ_ONCE]
> > rmb
> > [plain read]
> > // some code that creates order between the plain accesses
> > [plain write]
> > wmb
> > [WRITE_ONCE]
> >
> > where for some specific reason we can discern that the compiler can not
> > fully eliminate/move across the barrier either this specific plain read, nor
> > the plain write, nor the ordering between the two.
> >
> > In this case, is there order between the READ_ONCE and the WRITE_ONCE, or
> > not? Of course, we know current LKMM says no. I would say that in those very
> > specific cases, we do have ordering.
>
> Agreed, for LKMM to deal with seqlock, the read-side critical section
> would need to use READ_ONCE(), which is a bit unnatural. The C++
> standards committee has been discussing this for some time, as that
> memory model also gives data race in that case.
>
> But it might be better to directly model seqlock than to try to make
> LKMM deal with the underlying atomic operations.
Maybe I should give an example approach, perhaps inspiring a better
approach.
o Model reader-writer locks in LKMM, including a relaxed
write-lock-held primitive.
o Model sequence locks in terms of reader-writer locks:
o The seqlock writer maps to write lock.
o The seqlock reader maps to read lock, but with a
write-lock-held check. If the write lock is held at
that point, the seqlock tells the caller to retry.
Please note that the point is simply to exercise
the failure path.
o If a value is read in the seqlock reader and used
across a "you need to retry" indication, that
flags a seqlock data race.
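One literal reading of that mapping, as a sketch in pseudo-kernel C (the
primitives model_read_lock(), model_read_unlock(), model_write_lock(),
model_write_unlock() and model_write_lock_held() are invented names for
the hypothetical LKMM reader-writer-lock model, not existing kernel or
LKMM API):

	static int shared_data;

	static bool seqlock_reader(struct model_rwlock *l, int *snapshot)
	{
		bool retry;

		model_read_lock(l);
		retry = model_write_lock_held(l);	/* relaxed write-lock-held check */
		*snapshot = shared_data;		/* reads in the read-side section */
		model_read_unlock(l);

		/*
		 * On failure the caller must discard *snapshot and retry;
		 * using the value anyway is what would flag a seqlock
		 * data race under the rules above.
		 */
		return !retry;
	}

	static void seqlock_writer(struct model_rwlock *l, int v)
	{
		model_write_lock(l);			/* seqlock writer == write lock */
		shared_data = v;
		model_write_unlock(l);
	}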
But is there a better way?
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-13 22:04 ` Paul E. McKenney
2025-01-16 18:40 ` Paul E. McKenney
@ 2025-01-16 19:08 ` Jonas Oberhauser
2025-01-16 23:02 ` Alan Stern
1 sibling, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-16 19:08 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/13/2025 um 11:04 PM schrieb Paul E. McKenney:
> On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:
>>
>>
>> Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
>>> On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
>>>> Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
>>>>> On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
>>>>>>
>>>>>>
>>>>>> Am 1/8/2025 um 7:09 PM schrieb Paul E. McKenney:
>>>>>>> If I change the two plain assignments to use WRITE_ONCE() as required
>>>>>>> by memory-barriers.txt, OOTA is avoided:
>>>>>>
>>>>>>
>>>>>> I think this direction of inquiry is a bit misleading. There need not be any
>>>>>> speculative store at all:
>>>>>>
>>>>>>
>>>>>>
>>>>>> P0(int *a, int *b, int *x, int *y) {
>>>>>> int r1;
>>>>>> int r2 = 0;
>>>>>> r1 = READ_ONCE(*x);
>>>>>> smp_rmb();
>>>>>> if (r1 == 1) {
>>>>>> r2 = *b;
>>>>>> }
>>>>>> WRITE_ONCE(*a, r2);
>>>>>> smp_wmb();
>>>>>> WRITE_ONCE(*y, 1);
>>>>>> }
>>>>>>
>>>>>> P1(int *a, int *b, int *x, int *y) {
>>>>>> int r1;
>>>>>>
>>>>>> int r2 = 0;
>>>>>>
>>>>>> r1 = READ_ONCE(*y);
>>>>>> smp_rmb();
>>>>>> if (r1 == 1) {
>>>>>> r2 = *a;
>>>>>> }
>>>>>> WRITE_ONCE(*b, r2);
>>>>>> smp_wmb();
>>>>>> WRITE_ONCE(*x, 1);
>>>>>> }
>>>>>>
>>>>>>
>>>>> If we want to respect something containing a control dependency to a
>>>>> WRITE_ONCE() not in the body of the "if" statement, we need to make some
>>>>> change to memory-barriers.txt.
>>>>
>>>> I'm not sure what you denote by *this* in "if we change this", but just to
>>>> clarify, I am not thinking of claiming that there were a (semantic) control
>>>> dependency to WRITE_ONCE(*b, r2) in this example.
>>>>
>>>> There is however a data dependency from r2 = *a to WRITE_ONCE, and I would
>>>> say that there is a semantic data (not control) dependency from r1 =
>>>> READ_ONCE(*y) to WRITE_ONCE(*b, r2), too: depending on the value read from
>>>> *y, the value stored to *b will be different. The latter would be enough to
>>>> avoid OOTA according to the mainline LKMM, but currently this semantic
>>>> dependency is not detected by herd7.
>>>
>>> According to LKMM, address and data dependencies must be headed by
>>> rcu_dereference() or similar. See Documentation/RCU/rcu_dereference.rst.
>>>
>>> Therefore, there is nothing to chain the control dependency with.
>>
>> Note that herd7 does generate dependencies. And speaking informally, there
>> clearly is a semantic dependency.
>>
>> Both the original formalization of LKMM and my patch do say that a plain
>> load at the head of a dependency chain does not provide any dependency
>> ordering, i.e.,
>>
>> [Plain & R] ; dep
>>
>> is never part of hb, both in LKMM and in my patch.
>
> Agreed, LKMM does filter the underlying herd7 dependencies.
>
>> By the way, if your concern is the dependency *starting* from the plain
>> load, then we can look at examples where the dependency starts from a marked
>> load:
>>
>>
>> r1 = READ_ONCE(*x);
>> smp_rmb();
>> if (r1 == 1) {
>> r2 = READ_ONCE(*a);
>> }
>> *b = 1;
>> smp_wmb();
>> WRITE_ONCE(*y,1);
>>
>> This is more or less analogous to the case of the addr ; [Plain] ; wmb case
>> you already have.
>
> This is probably a failure of imagination on my part, but I am not
> seeing how to create another thread that interacts with that store to
> "b" without resulting in a data race.
The other thread is the same, just with x/y and a/b flipped around:
r1 = READ_ONCE(*y);
smp_rmb();
if (r1 == 1) {
r2 = READ_ONCE(*b);
}
*a = 1;
smp_wmb();
WRITE_ONCE(*x,1);
There is no data race, because the read from *b can only happen if we
also read from the store to *y, and the wmb 'prevents the reordering' of
the assignment to *b and the write to *y, while the rmb does the same
for the load from *y and the load from *b.
> Ignoring that, I am not seeing much in the way of LKMM dependencies
> in that code.
Ah, I meant to write
*b = r2
and
*a = r2
My mistake.
Note that if we just change it like this:
r1 = READ_ONCE(*x);
smp_rmb();
if (r1 == 1) {
r2 = READ_ONCE(*a);
}
*r2 = a;
smp_wmb();
WRITE_ONCE(*y,1);
then we have an addr dependency and the OOTA becomes forbidden (unless
you know how to initialize *a and *b to valid addresses, you may need to
add something like `if (r2 == 0) r2 = a` to run this in herd7).
It still has the same anomaly that triggers the OOTA though, where both
threads can read r1==1. This anomaly goes away with my patch.
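Spelled out as a two-thread test, the modified example might look
something like the following (the pointer types, the stored value, the
scratch locations z0/z1 and the fallback initialization are guesses
added to make the sketch self-contained; they are not part of the
original fragment, and the test is not tuned for herd7's initialization
syntax):

	P0(int **a, int *x, int *y, int *z0)
	{
		int r1;
		int *r2 = z0;		/* fallback, per the "if (r2 == 0) r2 = a" remark */

		r1 = READ_ONCE(*x);
		smp_rmb();
		if (r1 == 1)
			r2 = READ_ONCE(*a);
		*r2 = 1;		/* plain store; its address depends on r2 */
		smp_wmb();
		WRITE_ONCE(*y, 1);
	}

	P1(int **b, int *x, int *y, int *z1)
	{
		int r1;
		int *r2 = z1;

		r1 = READ_ONCE(*y);
		smp_rmb();
		if (r1 == 1)
			r2 = READ_ONCE(*b);
		*r2 = 1;
		smp_wmb();
		WRITE_ONCE(*x, 1);
	}

	exists (0:r1=1 /\ 1:r1=1)

The exists clause is the "both threads read r1==1" anomaly referred to
above.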
>> That is why sequence locks work, after all. In our internal memory model we
>> have relaxed the definition accordingly and there are a bunch of internally
>> used datastructures that can only be verified because of the relaxation.
>
> Again, it is likely best to model sequence locking directly.
You can take that point of view. For our model, like I said, we have
taken the other point of view.
The main advantage is that it helps us prove that our sequence lock and
a few other speculative algorithms we have developed are correct.
>>> I am currently not at all comfortable with the thought of allowing
>>> plain C-language loads to head any sort of dependency. I really did put
>>> that restriction into both memory-barriers.txt and rcu_dereference.rst
>>> intentionally. There is the old saying "Discipline = freedom", and
>>> therefore compilers' lack of discipline surrounding plain C-language
>>> loads implies a lack of freedom. ;-)
>>
>> Yes, I understand your concern (or more generally, the concern of letting
>> plain accesses play a role in ordering).
>> Obviously, allowing arbitrary plain loads to invoke some kind of ordering
>> because of a dependency is plain (heh) wrong.
>> There are two kinds of potential problems:
>> - the load or its dependent store may not exist in that location at all
>> - the dependency may not really exist
>>
>> The second case is a problem also with marked accesses, and should be
>> handled by herd7 only giving us actual semantic dependencies (whatever those
>> are). It can not be solved in cat. Either way it is a limitation that herd7
>> (and also other tools) currently has and we already live with.
>
> This is easy to say. Please see this paper (which Alan also referred to)
> for some of the challenges:
>
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3064r2.pdf
>
> Mark Batty and his group have identified more gotchas. (See the citation
> in the above paper.)
I'm well aware that an absolute definition of semantic dependencies is
not easy to give.
The point is, it's not a problem that can, in any way, be really solved
inside the cat model.
So if it should be solved (instead of approximated in ways that can be
easily undermined), it needs to be handled by the tooling around the cat
model.
Btw, I'm re-reading that paper and here are some comments:
- I agree with the conclusion that C++ should not try to solve OOTA
other than declaring that they do not exist
- the definition of OOTA as sdep | rfe is not the right one. You really
need sdep ; rfe, because you can have multiple subsequent sdep links
that together are not a dependency, e.g.,
int * x[] = { &a, &a };
int i = b;
*x[i] = 1;
here the semantic address dependency from loading b to loading x[i] and
the semantic address dependency from loading x[i] to storing to *x[i]
do not together form a semantic dependency anymore, because *x[i] is
always a. So this whole code can just become a=1, and with mirrored code
you can get a=b=1, which is an sdep | rfe cycle.
So compilers can absolutely generate an observed OOTA for your
definition of OOTA!
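To make the mirrored version explicit (the array names xa/xb are
invented for this sketch):

	/* Thread 1 */
	int *xa[] = { &a, &a };
	int i = b;		/* syntactically depends on b */
	*xa[i] = 1;		/* but both entries are &a, so this is just a = 1 */

	/* Thread 2 */
	int *xb[] = { &b, &b };
	int j = a;
	*xb[j] = 1;		/* likewise just b = 1 */

Once each store has been collapsed to a constant address, the outcome
a == b == 1 is perfectly ordinary, even though it closes a cycle of
per-edge semantic dependencies and rfe edges.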
As I explained in the original e-mail, you should instead rely on
semantic dependency to cover cases of sdep ; sdep where there is in fact
still a dependency, and define OOTA as a cycle in sdep ; rfe.
- I did not like the phrasing that a dependency is a function of one
execution, especially in contrast to source code. For a fixed execution,
there are no dependencies because all values are fixed. It is the other
executions - where you could have read another value - that create the
dependencies.
Perhaps it is better to say that it is not a *local* function of source
code, i.e., just because the same source code has a dependency in
context C does not mean that it has a dependency in context C'.
In fact I would say a major gotcha of dependencies is that they are not
a function of even the set of all permissible (according to the
standard) executions.
That is because the compiler does not have to make every permissible
execution possible, only at least one.
If the compiler knows that among all executions that it actually makes
possible, some value is always 0 - even if there is a permissible
execution in which it is not 0 - it can still replace that value with 0.
E.g.
T1 { x = 1; x = 0; }
T2 { T[x] = 1; }
~~ merge -- makes executions where T[x] reads x==1 impossible ~>
T1' { x = 1; x = 0; T[x] = 1; }
~~ replace ~>
T1' { x = 1; x = 0; T[0] = 1; // no more dependency :( }
which defeats any semantic dependency definition that is a function of a
single execution.
Of course you avoid this by making it really a function of the compiler
+ program *and* the execution, and looking at all the other possible
(not just permissible) executions.
- On the one hand, that definition makes a lot of sense. On the other
hand, at least without the atomics=volatile restriction it would have
the downside that a compiler which generates just a single execution for
your program can say that there are no dependencies whatsoever and
generate all kinds of "out of thin air" behaviors.
I am not sure if that gets really resolved by the volatile restrictions
you put, but either way those seem far stronger than what one would want.
I would say that the approach with volatile is overzealous because it
tries to create a "local" order solution to the problem that only
requires a "global" ordering solution. Since not every semantic
dependency needs to provide order in C++ -- only the cycle of
dependencies -- it is totally ok to add too many semantic dependency
edges to a program, even those that are not going to be exactly
maintained by every compiler, as long as we can ensure that globally, no
dependency cycle occurs.
So for example, if we merge x = y || y = x, the merge can turn it into
x=y=x or y=x=y (and then into an empty program), but not into a cyclic
dependency. So even though one side of the dependency may be violated,
for sake of OOTA, we could still label both sides as dependent.
For LKMM the problem is of course much easier because you have volatiles
and compiler barriers. Again you could maybe add incorrect semantic
dependencies between accesses, as long as only the really preserved ones
will imply ordering.
So I'm not convinced that for either of the two cases you need to do a
compiler-specific definition of dependencies.
BTW, for what it's worth, Dat3M in a sense uses the clang dependencies -
it first allows the compiler to do its optimizations, and then verifies
the llvm-ir (with a more hardware-like dependency definition).
I think something like that can be a good practical solution with fewer
problems than other attempts to approximate the solution.
>> So the new problem we deal with is to somehow restrict the rule to loads and
>> dependent stores that the compiler for whatever reason will not fully
>> eliminate.
>>
>> This problem too can not be solved completely inside cat. We can give an
>> approximation, as discussed with Alan (stating that a store would not be
>> elided if it is read by another thread, and a read would not be elided if it
>> reads from another thread and a store that won't be elided depends on it).
>>
>> This approximation is also limited, e.g., if the addresses of the plain
>> loads and stores have not yet escaped the function, but at least this
>> scenario is currently impossible in herd7 (unlike the fake dependency
>> scenario).
>>
>> In my mind it would again be better to offload the correct selection of
>> "compiler-un(re)movable plain loads and stores" to the tools. That may again
>> not solve the problem fully, but it at least would mean that any changes to
>> address the imprecision wouldn't need to go through the kernel tree, and
>> IMHO it is easier to say LKMM in the cat files is the model, and the
>> interpretation of the model has some limitations.
>
> Which we currently do via marked accesses. ;-)
Hey, programmers are not tools : )
>>> We should decide which of these examples should be
>>> added to the github litmus archive, perhaps to illustrate the fact that
>>> plain C-language loads do not head dependency chains. Thoughts?
>>
>> I'm not sure that is a good idea, given that running the herd7 tool on the
>> litmus test will clearly show a dependency chain headed by plain loads in
>> the visual graphs (with `doshow rwdep`).
>>
>> Maybe it makes more sense to say in the docs that they may head syntactic or
>> semantic dependency chains, but because of the common case that the compiler
>> may cruelly optimize things, LKMM does not guarantee ordering based on the
>> dependency chains headed by plain loads. That would be consistent with the
>> tooling.
>
> Well, LKMM and herd7 have different notions of what constitutes a
> dependency, and the LKMM notion is most relevant here. (And herd7
> needs its broader definition so as to handle a wide variety of
> memory models.)
I would say that herd7's definition handles a wide variety of hardware
memory models, but it has limitations for language level models. Perhaps
it would be good for herd7 to support multiple dependency models as well.
Have fun,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 18:40 ` Paul E. McKenney
@ 2025-01-16 19:13 ` Jonas Oberhauser
2025-01-16 19:31 ` Paul E. McKenney
2025-01-16 19:28 ` Jonas Oberhauser
1 sibling, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-16 19:13 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/16/2025 um 7:40 PM schrieb Paul E. McKenney:
> On Mon, Jan 13, 2025 at 02:04:58PM -0800, Paul E. McKenney wrote:
>> On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:
>>> Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
>>>> On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
>>>>> Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
>>>>>> On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
>
> [ . . . ]
>
>>>>> I currently can not come up with an example where there would be a
>>>>> (semantic) control dependency from a load to a store that is not in the arm
>>>>> of an if statement (or a loop / switch of some form with the branch
>>>>> depending on the load).
>>>>>
>>>>> I think the control dependency is just a red herring. It is only there to
>>>>> avoid the data race.
>>>>
>>>> Well, that red herring needs to have a companion fish to swim with in
>>>> order to enforce ordering, and I am not seeing that companion.
>>>>
>>>> Or am I (yet again!) missing something subtle here?
>>>
>>> It makes more sense to think about how people do message passing (or
>>> seqlock), which might look something like this:
>>>
>>> [READ_ONCE]
>>> rmb
>>> [plain read]
>>>
>>> and
>>>
>>> [plain write]
>>> wmb
>>> [WRITE_ONCE]
>>>
>>>
>>> Clearly LKMM says that there is some sort of order (not quite happens-before
>>> order though) between the READ_ONCE and the plain read, and between the
>>> plain write and the WRITE_ONCE. This order is clearly defined in the data
>>> race definition, in r-pre-bounded and w-post-bounded.
>>>
>>> Now consider
>>>
>>> [READ_ONCE]
>>> rmb
>>> [plain read]
>>> // some code that creates order between the plain accesses
>>> [plain write]
>>> wmb
>>> [WRITE_ONCE]
>>>
>>> where for some specific reason we can discern that the compiler can not
>>> fully eliminate/move across the barrier either this specific plain read, nor
>>> the plain write, nor the ordering between the two.
>>>
>>> In this case, is there order between the READ_ONCE and the WRITE_ONCE, or
>>> not? Of course, we know current LKMM says no. I would say that in those very
>>> specific cases, we do have ordering.
>>
>> Agreed, for LKMM to deal with seqlock, the read-side critical section
>> would need to use READ_ONCE(), which is a bit unnatural. The C++
>> standards committee has been discussing this for some time, as that
>> memory model also gives data race in that case.
>>
>> But it might be better to directly model seqlock than to try to make
>> LKMM deal with the underlying atomic operations.
>
> Maybe I should give an example approach, perhaps inspiring a better
> approach.
>
> o Model reader-writer locks in LKMM, including a relaxed
> write-lock-held primitive.
>
> o Model sequence locks in terms of reader-writer locks:
>
> o The seqlock writer maps to write lock.
>
> o The seqlock reader maps to read lock, but with a
> write-lock-held check. If the write lock is held at
> that point, the seqlock tells the caller to retry.
>
> Please note that the point is simply to exercise
> the failure path.
>
> o If a value is read in the seqlock reader and used
> across a "you need to retry" indication, that
> flags a seqlock data race.
>
> But is there a better way?
>
You need to be careful with those hb edges. The reader critical section
does not have to happen-before the writer critical section, as it would
with an actual read-write lock.
I think the solution would have to be along the lines of changing the
definition of r-post-bounded.
The read_enter() function reads from a write_exit() and establishes hb.
The read_exit() function also reads from a write_exit(); if that is the
same write_exit() the matching read_enter() read from, it returns
success, otherwise failure.
Reads po-before a successful read_exit() are bounded with regards to
subsequent write_enter() on the same lock.
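As a rough usage sketch of that proposal (read_enter(), read_exit(), the
cookie passing and struct model_seq are invented names and details, not
existing LKMM primitives):

	static int seq_read(struct model_seq *lock)
	{
		unsigned int cookie;
		int r1;

		do {
			cookie = read_enter(lock);	/* reads from some write_exit();
							   establishes hb */
			r1 = lock->data;		/* plain reads po-before read_exit() */
		} while (!read_exit(lock, cookie));	/* success iff it reads from the same
							   write_exit() as read_enter() did */
		return r1;
	}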
Best wishes,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 18:40 ` Paul E. McKenney
2025-01-16 19:13 ` Jonas Oberhauser
@ 2025-01-16 19:28 ` Jonas Oberhauser
2025-01-16 19:39 ` Paul E. McKenney
1 sibling, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-16 19:28 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/16/2025 um 7:40 PM schrieb Paul E. McKenney:
> o If a value is read in the seqlock reader and used
> across a "you need to retry" indication, that
> flags a seqlock data race.
This too is insufficient: you also need to prevent dereferencing or
having a control dependency inside the seqlock reader. Otherwise you
could dereference a torn pointer and...
At this point your definition of data race becomes pretty much the same
as we have.
https://github.com/open-s4c/libvsync/blob/main/vmm/vmm.cat#L150
(also this rule should only concern reads that are actually "data-racy"
- if the read is synchronized by some other writes, then you can read &
use it just fine across the seqlock data race)
I also noticed that in my previous e-mail I had overlooked the reads
inside the CS in the failure case, but you are of course right, there
needs to be some mechanism to prevent them from being data racy unless
abused.
But I am not sure how to formalize that in a way that is simpler than
just re-defining data races in general, without adding some special
support to herd7 for it.
What do you think?
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 19:13 ` Jonas Oberhauser
@ 2025-01-16 19:31 ` Paul E. McKenney
2025-01-16 20:21 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-16 19:31 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 16, 2025 at 08:13:28PM +0100, Jonas Oberhauser wrote:
> Am 1/16/2025 um 7:40 PM schrieb Paul E. McKenney:
> > On Mon, Jan 13, 2025 at 02:04:58PM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:
> > > > Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
> > > > > On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
> > > > > > Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
> > > > > > > On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
> >
> > [ . . . ]
> >
> > > > > > I currently can not come up with an example where there would be a
> > > > > > (semantic) control dependency from a load to a store that is not in the arm
> > > > > > of an if statement (or a loop / switch of some form with the branch
> > > > > > depending on the load).
> > > > > >
> > > > > > I think the control dependency is just a red herring. It is only there to
> > > > > > avoid the data race.
> > > > >
> > > > > Well, that red herring needs to have a companion fish to swim with in
> > > > > order to enforce ordering, and I am not seeing that companion.
> > > > >
> > > > > Or am I (yet again!) missing something subtle here?
> > > >
> > > > It makes more sense to think about how people do message passing (or
> > > > seqlock), which might look something like this:
> > > >
> > > > [READ_ONCE]
> > > > rmb
> > > > [plain read]
> > > >
> > > > and
> > > >
> > > > [plain write]
> > > > wmb
> > > > [WRITE_ONCE]
> > > >
> > > >
> > > > Clearly LKMM says that there is some sort of order (not quite happens-before
> > > > order though) between the READ_ONCE and the plain read, and between the
> > > > plain write and the WRITE_ONCE. This order is clearly defined in the data
> > > > race definition, in r-pre-bounded and w-post-bounded.
> > > >
> > > > Now consider
> > > >
> > > > [READ_ONCE]
> > > > rmb
> > > > [plain read]
> > > > // some code that creates order between the plain accesses
> > > > [plain write]
> > > > wmb
> > > > [WRITE_ONCE]
> > > >
> > > > where for some specific reason we can discern that the compiler can not
> > > > fully eliminate/move across the barrier either this specific plain read, nor
> > > > the plain write, nor the ordering between the two.
> > > >
> > > > In this case, is there order between the READ_ONCE and the WRITE_ONCE, or
> > > > not? Of course, we know current LKMM says no. I would say that in those very
> > > > specific cases, we do have ordering.
> > >
> > > Agreed, for LKMM to deal with seqlock, the read-side critical section
> > > would need to use READ_ONCE(), which is a bit unnatural. The C++
> > > standards committee has been discussing this for some time, as that
> > > memory model also gives data race in that case.
> > >
> > > But it might be better to directly model seqlock than to try to make
> > > LKMM deal with the underlying atomic operations.
> >
> > Maybe I should give an example approach, perhaps inspiring a better
> > approach.
> >
> > o Model reader-writer locks in LKMM, including a relaxed
> > write-lock-held primitive.
> >
> > o Model sequence locks in terms of reader-writer locks:
> >
> > o The seqlock writer maps to write lock.
> >
> > o The seqlock reader maps to read lock, but with a
> > write-lock-held check. If the write lock is held at
> > that point, the seqlock tells the caller to retry.
> >
> > Please note that the point is simply to exercise
> > the failure path.
> >
> > o If a value is read in the seqlock reader and used
> > across a "you need to retry" indication, that
> > flags a seqlock data race.
> >
> > But is there a better way?
> >
>
> You need to be careful with those hb edges. The reader critical section does
> not have to happen-before the writer critical section, as it would with an
> actual read-write lock.
True, the reader critical section only needs each of its loads to have
an fr link to all of the stores in the writer critical section.
On the other hand, the Linux-kernel implementation of seqcount does use
smp_rmb() and smp_wmb(), which does provide the message-passing pattern,
and thus hb, correct?
> I think the solution would have to be along the lines of changing the
> definition of r-post-bounded.
> The read_enter() function reads from a write_exit() and establishes hb.
> The read_exit() function also reads from a write_exit(); if that is the
> same write_exit() the matching read_enter() read from, it returns
> success, otherwise failure.
>
> Reads po-before a successful read_exit() are bounded with regards to
> subsequent write_enter() on the same lock.
Or are you trying to model the effects of the data races?
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 19:28 ` Jonas Oberhauser
@ 2025-01-16 19:39 ` Paul E. McKenney
2025-01-17 12:08 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-01-16 19:39 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 16, 2025 at 08:28:06PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/16/2025 um 7:40 PM schrieb Paul E. McKenney:
> > o If a value is read in the seqlock reader and used
> > across a "you need to retry" indication, that
> > flags a seqlock data race.
>
>
> This too is insufficient: you also need to prevent dereferencing or having
> a control dependency inside the seqlock reader. Otherwise you could
> dereference a torn pointer and...
True, but isn't that prohibition separable from the underlying
implementation?
> At this point your definition of data race becomes pretty much the same as
> we have.
>
> https://github.com/open-s4c/libvsync/blob/main/vmm/vmm.cat#L150
>
>
> (also this rule should only concern reads that are actually "data-racy" - if
> the read is synchronized by some other writes, then you can read & use it
> just fine across the seqlock data race)
Perhaps LKMM should adopt this or something similar, but what do others
think?
> I also noticed that in my previous e-mail I had overlooked the reads inside
> the CS in the failure case, but you are of course right, there needs to be
> some mechanism to prevent them from being data racy unless abused.
>
> But I am not sure how to formalize that in a way that is simpler than just
> re-defining data races in general, without adding some special support to
> herd7 for it.
>
> What do you think?
I was thinking in terms of identifying reads in critical sections (sort
of like LKMM does for RCU read-side critical sections), then identifying
any dependencies from those reads that cross a failed reader boundary.
If that set is non-empty, flag it.
But I clearly cannot claim to have thought this through. ;-)
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 19:31 ` Paul E. McKenney
@ 2025-01-16 20:21 ` Jonas Oberhauser
0 siblings, 0 replies; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-16 20:21 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/16/2025 um 8:31 PM schrieb Paul E. McKenney:
> On Thu, Jan 16, 2025 at 08:13:28PM +0100, Jonas Oberhauser wrote:
>> Am 1/16/2025 um 7:40 PM schrieb Paul E. McKenney:
>>> On Mon, Jan 13, 2025 at 02:04:58PM -0800, Paul E. McKenney wrote:
>>>> On Fri, Jan 10, 2025 at 05:21:59PM +0100, Jonas Oberhauser wrote:
>>>>> Am 1/10/2025 um 3:54 PM schrieb Paul E. McKenney:
>>>>>> On Thu, Jan 09, 2025 at 07:35:19PM +0100, Jonas Oberhauser wrote:
>>>>>>> Am 1/9/2025 um 6:54 PM schrieb Paul E. McKenney:
>>>>>>>> On Wed, Jan 08, 2025 at 08:17:51PM +0100, Jonas Oberhauser wrote:
>>>
>>> [ . . . ]
>>>
>>>>>>> I currently can not come up with an example where there would be a
>>>>>>> (semantic) control dependency from a load to a store that is not in the arm
>>>>>>> of an if statement (or a loop / switch of some form with the branch
>>>>>>> depending on the load).
>>>>>>>
>>>>>>> I think the control dependency is just a red herring. It is only there to
>>>>>>> avoid the data race.
>>>>>>
>>>>>> Well, that red herring needs to have a companion fish to swim with in
>>>>>> order to enforce ordering, and I am not seeing that companion.
>>>>>>
>>>>>> Or am I (yet again!) missing something subtle here?
>>>>>
>>>>> It makes more sense to think about how people do message passing (or
>>>>> seqlock), which might look something like this:
>>>>>
>>>>> [READ_ONCE]
>>>>> rmb
>>>>> [plain read]
>>>>>
>>>>> and
>>>>>
>>>>> [plain write]
>>>>> wmb
>>>>> [WRITE_ONCE]
>>>>>
>>>>>
>>>>> Clearly LKMM says that there is some sort of order (not quite happens-before
>>>>> order though) between the READ_ONCE and the plain read, and between the
>>>>> plain write and the WRITE_ONCE. This order is clearly defined in the data
>>>>> race definition, in r-pre-bounded and w-post-bounded.
>>>>>
>>>>> Now consider
>>>>>
>>>>> [READ_ONCE]
>>>>> rmb
>>>>> [plain read]
>>>>> // some code that creates order between the plain accesses
>>>>> [plain write]
>>>>> wmb
>>>>> [WRITE_ONCE]
>>>>>
>>>>> where for some specific reason we can discern that the compiler can not
>>>>> fully eliminate/move across the barrier either this specific plain read, nor
>>>>> the plain write, nor the ordering between the two.
>>>>>
>>>>> In this case, is there order between the READ_ONCE and the WRITE_ONCE, or
>>>>> not? Of course, we know current LKMM says no. I would say that in those very
>>>>> specific cases, we do have ordering.
>>>>
>>>> Agreed, for LKMM to deal with seqlock, the read-side critical section
>>>> would need to use READ_ONCE(), which is a bit unnatural. The C++
>>>> standards committee has been discussing this for some time, as that
>>>> memory model also gives data race in that case.
>>>>
>>>> But it might be better to directly model seqlock than to try to make
>>>> LKMM deal with the underlying atomic operations.
>>>
>>> Maybe I should give an example approach, perhaps inspiring a better
>>> approach.
>>>
>>> o Model reader-writer locks in LKMM, including a relaxed
>>> write-lock-held primitive.
>>>
>>> o Model sequence locks in terms of reader-writer locks:
>>>
>>> o The seqlock writer maps to write lock.
>>>
>>> o The seqlock reader maps to read lock, but with a
>>> write-lock-held check. If the write lock is held at
>>> that point, the seqlock tells the caller to retry.
>>>
>>> Please note that the point is simply to exercise
>>> the failure path.
>>>
>>> o If a value is read in the seqlock reader and used
>>> across a "you need to retry" indication, that
>>> flags a seqlock data race.
>>>
>>> But is there a better way?
>>>
>>
>> You need to be careful with those hb edges. The reader critical section does
>> not have to happen-before the writer critical section, as would with an
>> actual read-write lock.
>
> True, the reader critical section only needs each of its loads to have
> an fr link to all of the stores in the writer critical section.
Yes. And importantly, a properly-synchronized fr link!
Meaning that it does not constitute a data race.
Actually, you only need the proper synchronization; the fr link then
follows from the race-coherence axioms.
> On the other hand, the Linux-kernel implementation of seqcount does use
> smp_rmb() and smp_wmb(), which does provide the message-passing pattern,
> and thus hb, correct?
Only in one direction, from the last writer CS to the reader CS. But not
from the reader CS to the next writer CS :(
The message is passed, but from a data race POV, you are reading
something while new data is coming in (from the next write). This
fr-race could only be prevented by
rw-xbstar = fence | (r-post-bounded ; xbstar ; w-pre-bounded)
but there is no xbstar :(
Well, one thought is that one could declare an xbstar edge from the read_exit to
the write_enter, but just not add any release semantics to read_exit()
(like a rw_read_unlock would have). But that sounds really scary because
that xbstar definitely does not exist in the implementation, so my 9pm
brain is doubtful that this is correct.
>> I think the solution would have to be along changing the definition of
>> r-post-bounded.
>> The read_enter() function reads from a write_exit() and establishes hb.
>> The read_exit() function also reads from a write_exit(), if the same as the
>> matching read_enter(), then it returns success, otherwise failure.
>>
>> Reads po-before a successful read_exit() are bounded with regards to
>> subsequent write_enter() on the same lock.
>
> Or are you trying to model the effects of the data races?
No, I had even overlooked the "real data race" in the failure case xP
I was modelling the properly-synchronized part of the fr, without
introducing hb/xbstar from the reader to the next writer - but only for
the success case. The failure case also needs some way to avoid the data
race (unless it abuses the value). Perhaps by relaxing the definition of
data race as we did in VMM, so that the read-side CS doesn't become UB
until you do something forbidden with the value.
My suggestion for avoiding the data race in the success case was to extend the
notion of rw-xbstar above in some way that provides the necessary
protection, e.g. by drawing an "r-post-bounded"-like edge to the
write_enter(), which would then "w-pre-bound" (or something similar) the
write-side CS.
Have fun,
jonas
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 19:08 ` Jonas Oberhauser
@ 2025-01-16 23:02 ` Alan Stern
2025-01-17 8:34 ` Hernan Ponce de Leon
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Alan Stern @ 2025-01-16 23:02 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
> I'm well aware that an absolute definition of semantic dependencies is not
> easy to give.
In fact it's undecidable. No tool is ever going to be able to detect
semantic dependencies perfectly.
> Btw, I'm re-reading that paper and here are some comments:
> - I agree with the conclusion that C++ should not try to solve OOTA other
> than declaring that they do not exist
> - the definition of OOTA as sdep | rfe is not the right one. You really need
> sdep ; rfe, because you can have multiple subsequent sdep links that
> together are not a dependency, e.g.,
>
> int * x[] = { &a, &a };
> int i = b;
> *x[i] = 1;
>
> here the semantic address dependency from loading b to loading x[i] and the
> semantic address dependency from loading x[i] to storing to *x[i] do not
> together form a semantic dependency anymore, because *x[i] is always a. So
> this whole code can just become a=1, and with mirrored code you can get
> a=b=1, which is an sdep | rfe cycle.
We regard sdep as extending from loads (or sets of loads) to stores.
(Perhaps the paper doesn't state this explicitly -- it should.) So an
sdep ; sdep sequence is not possible.
Nevertheless, it's arguable that in your example there is no semantic
dependency from the load of b to the load of x[i]. Given the code
shown, a compiler could replace the load of x[i] with the constant value
&a, yielding simply *(&a) = 1.
There are other examples which do make the point, however. For example:
int a = 1, b = 1, c = 2;
int *x[] = {&a, &b, &c};
int r1 = i;
if (r1 > 1)
r1 = 1;
int *r2 = x[r1];
int r3 = *r2;
q = r3;
Here there is a semantic dependency (if you accept dependencies to
loads) from the load of i to the load of x[r1] and from the load of *r2
to the store to q. But overall the value stored to q is always 1, so it
doesn't depend on the load from i.
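In other words, a compiler could legitimately compile the sequence down
to something like the following sketch, eliminating the pointer chasing
entirely:

	int r1 = i;		/* the load of i may stay, but nothing depends on it */
	q = 1;			/* x[r1] points at a or b, and *r2 always yields 1 */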
> - I did not like the phrasing that a dependency is a function of one
> execution, especially in contrast to source code. For a fixed execution,
> there are no dependencies because all values are fixed. It is the other
> executions - where you could have read another value - that create the
> dependencies.
>
> Perhaps it is better to say that it is not a *local* function of source
> code, i.e., just because the same source code has a dependency in context C
> does not mean that it has a dependency in context C'.
Of course, what we meant is that whether or not there is a semantic
dependency depends on both the source code and the execution.
> In fact I would say a major gotcha of dependencies is that they are not a
> function of even the set of all permissible (according to the standard)
> executions.
> That is because the compiler does not have to make every permissible
> execution possible, only at least one.
The paper includes examples of this, I believe.
> If the compiler knows that among all executions that it actually makes
> possible, some value is always 0 - even if there is a permissible execution
> in which it is not 0 - it can still replace that value with 0. E.g.
>
> T1 { x = 1; x = 0; }
> T2 { T[x] = 1; }
>
> ~~ merge -- makes executions where T[x] reads x==1 impossible ~>
>
> T1' { x = 1; x = 0; T[x] = 1; }
>
> ~~ replace ~>
>
> T1' { x = 1; x = 0; T[0] = 1; // no more dependency :( }
>
> which defeats any semantic dependency definition that is a function of a
> single execution.
Our wording could be improved. We simply want to make the point that
the same source code may have a semantic dependency in one execution but
not in another. I guess we should avoid calling it a "function" of the
execution, to avoid implying that the execution alone is what matters.
> Of course you avoid this by making it really a function of the compiler +
> program *and* the execution, and looking at all the other possible (not just
> permissible) executions.
>
> - On the one hand, that definition makes a lot of sense. On the other hand,
> at least without the atomics=volatile restriction it would have the downside
> that a compiler which generates just a single execution for your program can
> say that there are no dependencies whatsoever and generate all kinds of "out
> of thin air" behaviors.
That is so. But a compiler which examines only a single thread at a
time cannot afford to generate just a single execution, because it
cannot know what values the loads will obtain when the full program
runs.
> I am not sure if that gets really resolved by the volatile restrictions you
> put, but either way those seem far stronger than what one would want.
It does get resolved, because treating atomics as volatile also means
that the compiler cannot afford to generate just a single execution.
Again, because it cannot know what values the loads will obtain at
runtime, since volatile loads can yield any value in a non-benign
environment.
> I would say that the approach with volatile is overzealous because it tries
> to create a "local" order solution to the problem that only requires a
> "global" ordering solution. Since not every semantic dependency needs to
> provide order in C++ -- only the cycle of dependencies -- it is totally ok
> to add too many semantic dependency edges to a program, even those that are
> not going to be exactly maintained by every compiler, as long as we can
> ensure that globally, no dependency cycle occurs.
But then how would you characterize semantic dependencies, if you want
to allow the definition to include some dependencies that aren't
semantic but not so many that you ever create a cycle? This sounds like
an even worse problem than we started with!
> So for example, if we merge x = y || y = x, the merge can turn it into
> x=y=x or y=x=y (and then into an empty program), but not into a cyclic
> dependency. So even though one side of the dependency may be violated, for
> sake of OOTA, we could still label both sides as dependent.
They _are_ both semantically dependent (in the original parallel
version, I mean). I don't see what merging has to do with it.
> For LKMM the problem is of course much easier because you have volatiles and
> compiler barriers. Again you could maybe add incorrect semantic dependencies
> between accesses, as long as only the really preserved ones will imply
> ordering.
I'm not particularly concerned about OOTA or semantic dependencies in
LKMM.
> So I'm not convinced that for either of the two cases you need to do a
> compiler-specific definition of dependencies.
For the C++ case, I cannot think of any other way to approach the
problem. Nor do I see anything wrong with a compiler-specific
definition, given how nebulous the whole idea is in the first place.
> BTW, for what it's worth, Dat3M in a sense uses the clang dependencies - it
> first allows the compiler to do its optimizations, and then verifies the
> llvm-ir (with a more hardware-like dependency definition).
What do you mean by "verifies"? What property of the llvm-ir does it
verify?
> I think something like that can be a good practical solution with fewer
> problems than other attempts to approximate the solution.
I do not like the idea of tying the definition of OOTA (which needs to
apply to every implementation) to a particular clang compiler.
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 23:02 ` Alan Stern
@ 2025-01-17 8:34 ` Hernan Ponce de Leon
2025-01-17 11:29 ` Jonas Oberhauser
2025-01-17 15:52 ` Alan Stern
2 siblings, 0 replies; 59+ messages in thread
From: Hernan Ponce de Leon @ 2025-01-17 8:34 UTC (permalink / raw)
To: Alan Stern, Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm
On 1/17/2025 12:02 AM, Alan Stern wrote:
> On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
>> I'm well aware that an absolute definition of semantic dependencies is not
>> easy to give.
>
> In fact it's undecidable. No tool is ever going to be able to detect
> semantic dependencies perfectly.
>
>> Btw, I'm re-reading that paper and here are some comments:
>> - I agree with the conclusion that C++ should not try to solve OOTA other
>> than declaring that they do not exist
>> - the definition of OOTA as sdep | rfe is not the right one. You really need
>> sdep ; rfe, because you can have multiple subsequent sdep links that
>> together are not a dependency, e.g.,
>>
>> int * x[] = { &a, &a };
>> int i = b;
>> *x[i] = 1;
>>
>> here the semantic address dependency from loading b to loading x[i] and the
>> semantic address dependency from loading x[i] and storing to *x[i] do not
>> together form a semantic dependency anymore, because *x[i] is always a. So
>> this whole code can just become a=1, and with mirrored code you can get
>> a=b=1, which is an sdep | rfe cycle.
>
> We regard sdep as extending from loads (or sets of loads) to stores.
> (Perhaps the paper doesn't state this explicitly -- it should.) So an
> sdep ; sdep sequence is not possible.
>
> Nevertheless, it's arguable that in your example there is no semantic
> dependency from the load of b to the load of x[i]. Given the code
> shown, a compiler could replace the load of x[i] with the constant value
> &a, yielding simply *(&a) = 1.
>
> There are other examples which do make the point, however. For example:
>
> int a = 1, b = 1, c = 2;
> int *x[] = {&a, &b, &c};
>
> int r1 = i;
> if (r1 > 1)
> r1 = 1;
> int *r2 = x[r1];
> int r3 = *r2;
> q = r3;
>
> Here there is a semantic dependency (if you accept dependencies to
> loads) from the load of i to the load of x[r1] and from the load of *r2
> to the store to q. But overall the value stored to q is always 1, so it
> doesn't depend on the load from i.
>
>> - I did not like the phrasing that a dependency is a function of one
>> execution, especially in contrast to source code. For a fixed execution,
>> there are no dependencies because all values are fixed. It is the other
>> executions - where you could have read another value - that create the
>> dependencies.
>>
>> Perhaps it is better to say that it is not a *local* function of source
>> code, i.e., just because the same source code has a dependency in context C
>> does not mean that it has a dependency in context C'.
>
> Of course, what we meant is that whether or not there is a semantic
> dependency depends on both the source code and the execution.
>
>> In fact I would say a major gotcha of dependencies is that they are not a
>> function of even the set of all permissible (according to the standard)
>> executions.
>> That is because the compiler does not have to make every permissible
>> execution possible, only at least one.
>
> The paper includes examples of this, I believe.
>
>> If the compiler knows that among all executions that it actually makes
>> possible, some value is always 0 - even if there is a permissible execution
>> in which it is not 0 - it can still replace that value with 0. E.g.
>>
>> T1 { x = 1; x = 0; }
>> T2 { T[x] = 1; }
>>
>> ~~ merge -- makes executions where T[x] reads x==1 impossible ~>
>>
>> T1' { x = 1; x = 0; T[x] = 1; }
>>
>> ~~ replace ~>
>>
>> T1' { x = 1; x = 0; T[0] = 1; // no more dependency :( }
>>
>> which defeats any semantic dependency definition that is a function of a
>> single execution.
>
> Our wording could be improved. We simply want to make the point that
> the same source code may have a semantic dependency in one execution but
> not in another. I guess we should avoid calling it a "function" of the
> execution, to avoid implying that the execution alone is what matters.
>
>> Of course you avoid this by making it really a function of the compiler +
>> program *and* the execution, and looking at all the other possible (not just
>> permissible) executions.
>>
>> - On the one hand, that definition makes a lot of sense. On the other hand,
>> at least without the atomics=volatile restriction it would have the downside
>> that a compiler which generates just a single execution for your program can
>> say that there are no dependencies whatsoever and generate all kinds of "out
>> of thin air" behaviors.
>
> That is so. But a compiler which examines only a single thread at a
> time cannot afford to generate just a single execution, because it
> cannot know what values the loads will obtain when the full program
> runs.
>
>> I am not sure if that gets really resolved by the volatile restrictions you
>> put, but either way those seem far stronger than what one would want.
>
> It does get resolved, because treating atomics as volatile also means
> that the compiler cannot afford to generate just a single execution.
> Again, because it cannot know what values the loads will obtain at
> runtime, since volatile loads can yield any value in a non-benign
> environment.
>
>> I would say that the approach with volatile is overzealous because it tries
>> to create a "local" order solution to the problem that only requires a
>> "global" ordering solution. Since not every semantic dependency needs to
>> provide order in C++ -- only the cycle of dependencies -- it is totally ok
>> to add too many semantic dependency edges to a program, even those that are
>> not going to be exactly maintained by every compiler, as long as we can
>> ensure that globally, no dependency cycle occurs.
>
> But then how would you characterize semantic dependencies, if you want
> to allow the definition to include some dependencies that aren't
> semantic but not so many that you ever create a cycle? This sounds like
> an even worse problem than we started with!
>
>> So for example, if we merge x = y || y = x, the merge can turn it into
>> x=y=x or y=x=y (and then into an empty program), but not into a cyclic
>> dependency. So even though one side of the dependency may be violated, for
>> sake of OOTA, we could still label both sides as dependent.
>
> They _are_ both semantically dependent (in the original parallel
> version, I mean). I don't see what merging has to do with it.
>
>> For LKMM the problem is of course much easier because you have volatiles and
>> compiler barriers. Again you could maybe add incorrect semantic dependencies
>> between accesses, as long as only the really preserved ones will imply
>> ordering.
>
> I'm not particularly concerned about OOTA or semantic dependencies in
> LKMM.
>
>> So I'm not convinced that for either of the two cases you need to do a
>> compiler-specific definition of dependencies.
>
> For the C++ case, I cannot think of any other way to approach the
> problem. Nor do I see anything wrong with a compiler-specific
> definition, given how nebulous the whole idea is in the first place.
>
>> BTW, for what it's worth, Dat3M in a sense uses the clang dependencies - it
>> first allows the compiler to do its optimizations, and then verifies the
>> llvm-ir (with a more hardware-like dependency definition).
>
> What do you mean by "verifies"? What property of the llvm-ir does it
> verify?
Let me be more precise here. The Dat3M verifier has its own IR and it
can verify 3 kinds of properties:
- assertions written in the code
- lack of liveness violations (e.g., a spinloop does not hang)
- other properties written in the .cat file as "flag ~empty(r)", data
races in LKMM/C fit in this category.
Dat3M has several parsers to create its own IR: it can read the litmus
format of herd, but also C code, by first converting it to llvm-ir and
then parsing that.
In the case of verifying kernel code wrt LKMM we do some magic such that
anything related to the "LKMM concurrency API" (things like
atomic_cmpxchg) appears as a function call in the llvm-ir. This allows us
to parse these calls as the concrete types of events we want.
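As a rough illustration (a sketch only, not actual Dat3M input; the exact
set of recognized calls may differ), in kernel-style code like the
following each LKMM access is kept as a call in the llvm-ir and can then
be mapped to the corresponding event type:

	static atomic_t x;

	static void t0(void)
	{
		int v = atomic_read(&x);	/* marked read         */
		smp_mb();			/* full memory barrier */
		atomic_cmpxchg(&x, v, v + 1);	/* read-modify-write   */
	}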
Hernan
>
>> I think something like that can be a good practical solution with fewer
>> problems than other attempts to approximate the solution.
>
> I do not like the idea of tying the definition of OOTA (which needs to
> apply to every implementation) to a particular clang compiler.
>
> Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 23:02 ` Alan Stern
2025-01-17 8:34 ` Hernan Ponce de Leon
@ 2025-01-17 11:29 ` Jonas Oberhauser
2025-01-17 20:01 ` Alan Stern
2025-01-17 15:52 ` Alan Stern
2 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-17 11:29 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/17/2025 um 12:02 AM schrieb Alan Stern:
> On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
>> I'm well aware that an absolute definition of semantic dependencies is not
>> easy to give.
>
> In fact it's undecidable. No tool is ever going to be able to detect
> semantic dependencies perfectly.
It depends.
Firstly, let's consider that the tool only runs on finite (or
"finitized") programs.
Secondly, your definition depends on the compiler.
So in the sense that we don't know the compiler, it is undecidable.
But if you fix the compiler, you could still enumerate all executions
under that compiler and compute whether that compiler has a dependency
or not.
But as I mentioned before, I think you can define semantic dependencies
more appropriately because you don't really need to preserve semantic
dependencies in C++, and in LKMM (and VMM) you have volatiles that
restrict what kind of dependency-eliminations the compiler can do.
>>
>> int * x[] = { &a, &a };
>> int i = b;
>> *x[i] = 1;
>>
>> here the semantic address dependency from loading b to loading x[i] and the
>> semantic address dependency from loading x[i] and storing to *x[i] do not
>> together form a semantic dependency anymore, because *x[i] is always a. So
>> this whole code can just become a=1, and with mirrored code you can get
>> a=b=1, which is an sdep | rfe cycle.
>
> We regard sdep as extending from loads (or sets of loads) to stores.
> (Perhaps the paper doesn't state this explicitly -- it should.) So an
> sdep ; sdep sequence is not possible.
I see. There is a line in the introduction ("Roughly speaking, there is
a semantic dependency from a given load
to a given store when all other things being equal, a change in the
value loaded can
change the value stored or prevent the store from occurring at all"),
but I read it too carelessly and treated it as an implication rather
than an iff.
I suppose at some point you anyways replace the definition of OOTA with
a version where only semantic dependencies from loads to a store are
considered, so the concern becomes irrelevant then.
>> - On the one hand, that definition makes a lot of sense. On the other hand,
>> at least without the atomics=volatile restriction it would have the downside
>> that a compiler which generates just a single execution for your program can
>> say that there are no dependencies whatsoever and generate all kinds of "out
>> of thin air" behaviors.
>
> That is so. But a compiler which examines only a single thread at a
> time cannot afford to generate just a single execution, because it
> cannot know what values the loads will obtain when the full program
> runs.
>
>> I am not sure if that gets really resolved by the volatile restrictions you
>> put, but either way those seem far stronger than what one would want.
>
> It does get resolved, because treating atomics as volatile also means
> that the compiler cannot afford to generate just a single execution.
> Again, because it cannot know what values the loads will obtain at
> runtime, since volatile loads can yield any value in a non-benign
> environment.
Yes. Actually I wonder if you put this "all loads are volatile"
restriction, can a globally analysing compiler still have any
optimizations that a locally analysing compiler can not?
It rather seems the other way, that the locally analysing quasi-volatile
compiler can at least do some local optimizations, while the global
volatile compiler can not (e.g., x=1; y=x; can not be x=1; y=1; for the
global compiler because x is volatile now).
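Spelled out (an illustrative sketch; x stands for the atomic object that
is now treated as volatile):

	volatile int x;
	int y;

	void t0(void)
	{
		x = 1;
		y = x;	/* must really reload x; cannot become y = 1 */
	}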
>> I would say that the approach with volatile is overzealous because it tries
>> to create a "local" order solution to the problem that only requires a
>> "global" ordering solution. Since not every semantic dependency needs to
>> provide order in C++ -- only the cycle of dependencies -- it is totally ok
>> to add too many semantic dependency edges to a program, even those that are
>> not going to be exactly maintained by every compiler, as long as we can
>> ensure that globally, no dependency cycle occurs.
>
> But then how would you characterize semantic dependencies, if you want
> to allow the definition to include some dependencies that aren't
> semantic but not so many that you ever create a cycle?
I don't know which definition is correct yet, but the point is that you
don't have to avoid making so many dependencies that you would create a
cycle. It just forbids the compiler from looking for cycles and
optimizing based on the existence of the cycle. (Looking for unused vars
and removing those is still allowed, under the informal argument that
this simulates an execution where no OOTA happened)
> This sounds like
> an even worse problem than we started with!
>
>
>> So for example, if we merge x = y || y = x, the merge can turn it into
>> x=y=x or y=x=y (and then into an empty program), but not into a cyclic
>> dependency. So even though one side of the dependency may be violated, for
>> sake of OOTA, we could still label both sides as dependent.
>
> They _are_ both semantically dependent (in the original parallel
> version, I mean). I don't see what merging has to do with it.
Note that I was considering the case where they are not volatile.
With a compiler that is not treating them as volatile, which merges the
two threads, under your definition, there is no semantic dependency in
at least one direction because there is no hardware realization H where
you read something else (of course you exclude such compilers, but I
think realistically they should be allowed).
My point is that we can say they are semantically dependent for the sake
of OOTA, not derive any ordering from these dependencies other than the
cyclical one, and therefore allow compilers to do one of the two
optimizations (make x=y no longer depend on y or make y=x no longer
depend on x) but not do a cycle analysis to remove both dependencies
and generate an OOTA value (it can remove both dependencies by leaving x
and y unchanged though).
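To make the two allowed outcomes concrete (an illustrative sketch, using
the pointer-parameter style of the other examples):

	/* Two possible merges of  P0 { x = y; }  ||  P1 { y = x; }.
	 * Each breaks at most one of the two dependencies. */
	void merged_1(int *x, int *y)
	{
		*x = *y;	/* still depends on the load of y          */
		*y = *x;	/* now just re-stores y's old value, so the
				 * dependency of this store on x is gone   */
	}

	void merged_2(int *x, int *y)
	{
		*y = *x;	/* still depends on the load of x             */
		*x = *y;	/* re-stores x's old value; dependency on y gone */
	}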
>> So I'm not convinced that for either of the two cases you need to do a
>> compiler-specific definition of dependencies.
>
> For the C++ case, I cannot think of any other way to approach the
> problem. Nor do I see anything wrong with a compiler-specific
> definition, given how nebulous the whole idea is in the first place.
>
>> BTW, for what it's worth, Dat3M in a sense uses the clang dependencies - it
>> first allows the compiler to do its optimizations, and then verifies the
>> llvm-ir (with a more hardware-like dependency definition).
>
> What do you mean by "verifies"? What property of the llvm-ir does it
> verify?
Verify that the algorithm, e.g., qspinlock, has no executions that are
buggy, e.g., have a data race on the critical section.
It does so for a fixed test case with a small number of threads, i.e., a
finite program.
>> I think something like that can be a good practical solution with fewer
>> problems than other attempts to approximate the solution.
>
> I do not like the idea of tying the definition of OOTA (which needs to
> apply to every implementation) to a particular clang compiler.
But that is what you have done, no? Whether something is an sdep depends
on the compiler, so compiler A could generate an execution that is OOTA
in the sdep definition of compiler B.
(Of course with the assumption of atomic=volatile, it may just be that
we are back to the beginning and all "naive" semantic dependencies are
actually semantic dependencies for all compilers).
Anyways what I meant is not about tying the definition of OOTA to one
compiler or other. As I mentioned I think it can be fine to define OOTA
in the same way for all compilers.
What I meant is to specialize the memory model to a specific compiler,
as long as that is the compiler that is used in reality.
So long as your code does not depend on the ordering of any semantic
dependencies, the verification can be cross platform.
And although...
> I'm not particularly concerned about OOTA or semantic dependencies in
> LKMM.
... there is code that relies on semantic dependencies, e.g. RCU read
side CS code. (even if we do not care about OOTA).
For that code, the semantic dependencies must be guaranteed to create
ordering.
So you either need a definition of semantic dependency that
a) applies in all cases we practically need and
b) is guaranteed by all compilers
or we need to live with the fact that we do not have a semantic
dependency definition that is independent of the compilation (even of
the same compiler) and need to do our verification for that specific
compilation.
I think for LKMM we could give such a semantic dependency definition
because it uses volatile, and verify RCU-read-side code. But we
currently do not have one. What I meant to say is that using the actual
(whatever compiler you use) optimizations first to remove syntactic
dependencies, and then verifying under the assumption of whatever
dependencies are left, may be better than trying to approximate
dependencies in some way in cat. Given that we want to verify and rely
on the code today, not in X years when we all agree on what a
compiler-independent definition of semantic dependency is.
I think for C++ consume we could also give one by simply restricting
some compiler optimizations for consume loads (and doing whatever needs
to be done on alpha). Or just kick it out and not have any dependency
ordering except the global OOTA case.
Sorry for the confusion, I think there are so many different
combinations/battlefields (OOTA vs just semantic dependencies,
volatile/non-volatile atomics, verifying the model vs verifying a piece
of code etc.) that it becomes hard for me not to confuse myself, let
alone others :))
Best wishes,
jonas
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 19:39 ` Paul E. McKenney
@ 2025-01-17 12:08 ` Jonas Oberhauser
0 siblings, 0 replies; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-17 12:08 UTC (permalink / raw)
To: paulmck
Cc: Alan Stern, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/16/2025 um 8:39 PM schrieb Paul E. McKenney:
> On Thu, Jan 16, 2025 at 08:28:06PM +0100, Jonas Oberhauser wrote:
>>
>>
>> Am 1/16/2025 um 7:40 PM schrieb Paul E. McKenney:
>>> o If a value is read in the seqlock reader and used
>>> across a "you need to retry" indication, that
>>> flags a seqlock data race.
>>
>>
>> This too is insufficient, you also need to prevent dereferencing or having
>> control dependency inside the seqlock. Otherwise you could dereference a torn
>> pointer and...
>
> True, but isn't that prohibition separable from the underlying
> implementation?
Yes, but so is the prohibition of using the value after the failed
reader_exit().
So probably it needs to be added to the specification of what you are
allowed to do with values from the read-side critical section.
Actually this was a bug we had in some code once, and I overlooked it
because I thought the incorrect data isn't used anyways, right?
Luckily I had put the condition into our cat model already and so the
tooling caught the bug before it went out...
>> At this point your definition of data race becomes pretty much the same as
>> we have.
>>
>> https://github.com/open-s4c/libvsync/blob/main/vmm/vmm.cat#L150
>>
>>
>> (also this rule should only concern reads that are actually "data-racy" - if
>> the read is synchronized by some other writes, then you can read & use it
>> just fine across the seqlock [edit: boundary])
>
> Perhaps LKMM should adopt this or something similar, but what do others
> think?
I am not sure how many others are still reading this deep into the
conversation, maybe best to start a new thread.
>> I also noticed that in my previous e-mail I had overlooked the reads inside
>> the CS in the failure case, but you are of course right, there needs to be
>> some mechanism to prevent them from being data racy unless abused.
>>
>> But I am not sure how to formalize that in a way that is simpler than just
>> re-defining data races in general, without adding some special support to
>> herd7 for it.
>>
>> What do you think?
>
> I was thinking in terms of identifying reads in critical sections (sort
> of like LKMM does for RCU read-side critical sections), then identifying
> any dependencies from those reads that cross a failed reader boundary.
> If that set is non-empty, flag it.
Yes, the general idea sounds reasonable, but the details have a lot of
potential for future improvement.
One tricky part of seqlock besides the data race is that it kind of uses
negative message passing - the fact that you have not seen the message
means you also have not seen the flag. (And the message in this case is
the write_enter(), and the flag is the plain access in the critical
section! Fun.)
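For concreteness, a minimal hand-rolled sketch (not the kernel's actual
seqlock implementation) of the pattern being discussed:

	int seq;
	int data;	/* plain accesses inside the critical sections */

	void writer(void)
	{
		WRITE_ONCE(seq, seq + 1);	/* enter: the "message", seq odd */
		smp_wmb();
		data = 42;			/* the "flag": plain write in CS */
		smp_wmb();
		WRITE_ONCE(seq, seq + 1);	/* exit: seq even again          */
	}

	int reader(void)
	{
		int s, d;

		do {
			s = READ_ONCE(seq);
			smp_rmb();
			d = data;		/* plain read */
			smp_rmb();
		} while ((s & 1) || READ_ONCE(seq) != s);
		/* Not having seen the "message" (a changed or odd seq) is
		 * what justifies using the plainly-read "flag" d at all. */
		return d;
	}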
This makes it hard to formalize in the box of LKMM and make it play well
with all the other pieces.
Maybe something like avoiding rw data races also under something like prop?
r-prop-post-bounded ; ((overwrite & ext) ; cumul-fence*) ;
w-prop-pre-bounded
for the cases where the pre-bounded write must propagate after the
overwriting store, and the post-bounded read executes before and on the
same CPU as the overwritten event.
Then you could argue that if the overwriting write has not propagated to
the overwritten event, the pre-bounded write also has not propagated to
that event.
From that you can conclude it also can not have propagated to the
post-bounded read.
I'm not sure if the cases where propagation is handled by a strong-fence
are already covered by the rw-xbstar rule.
Have fun,
jonas
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-16 23:02 ` Alan Stern
2025-01-17 8:34 ` Hernan Ponce de Leon
2025-01-17 11:29 ` Jonas Oberhauser
@ 2025-01-17 15:52 ` Alan Stern
2025-01-17 16:45 ` Jonas Oberhauser
2 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-17 15:52 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Thu, Jan 16, 2025 at 06:02:18PM -0500, Alan Stern wrote:
> On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
> > I would say that the approach with volatile is overzealous because it tries
> > to create a "local" order solution to the problem that only requires a
> > "global" ordering solution. Since not every semantic dependency needs to
> > provide order in C++ -- only the cycle of dependencies -- it is totally ok
> > to add too many semantic dependency edges to a program, even those that are
> > not going to be exactly maintained by every compiler, as long as we can
> > ensure that globally, no dependency cycle occurs.
>
> But then how would you characterize semantic dependencies, if you want
> to allow the definition to include some dependencies that aren't
> semantic but not so many that you ever create a cycle? This sounds like
> an even worse problem than we started with!
An interesting side comment on this issue...
This is a slight variation of the example on page 19 (section 4.3) of
the paper. (Pretend this is actually C++ code, the shared variables are
all atomic, and their accesses are all relaxed.)
bool x, y, z;
void P0(bool *x, bool *y, bool *z) {
bool r1, r2;
r1 = *x;
r2 = *y;
*z = (r1 != r2);
}
The paper points out that although there is an apparent semantic
dependency from the load of x to the store to z, if the compiler is
allowed not to handle atomics as quasi volatile then the dependency
can be broken. Nevertheless, I am not able to think of a program that
could exhibit OOTA as a result of breaking the semantic dependency. The
best I can come up with is this:
[P0 as above]
void P1(bool *x, bool *y, bool *z) {
bool r3;
r3 = *z;
*x = r3;
}
void P2(bool *x, bool *y, bool *z) {
*y = true;
}
exists (x=true /\ z=true)
If P2 were not present, this result could not occur in any physical
execution, even if the dependency in P0 is broken. With P2 this result
isn't OOTA, even in executions where P0 ends up storing z before loading
x, because P2 could have executed first, then P0, then P1.
So perhaps this is an example of what you were talking about -- a
dependency which may or may not be semantic, but either way cannot lead
to OOTA.
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-17 15:52 ` Alan Stern
@ 2025-01-17 16:45 ` Jonas Oberhauser
2025-01-17 19:02 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-17 16:45 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/17/2025 um 4:52 PM schrieb Alan Stern:
> On Thu, Jan 16, 2025 at 06:02:18PM -0500, Alan Stern wrote:
>> On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
>>> I would say that the approach with volatile is overzealous because it tries
>>> to create a "local" order solution to the problem that only requires a
>>> "global" ordering solution. Since not every semantic dependency needs to
>>> provide order in C++ -- only the cycle of dependencies -- it is totally ok
>>> to add too many semantic dependency edges to a program, even those that are
>>> not going to be exactly maintained by every compiler, as long as we can
>>> ensure that globally, no dependency cycle occurs.
>>
>> But then how would you characterize semantic dependencies, if you want
>> to allow the definition to include some dependencies that aren't
>> semantic but not so many that you ever create a cycle? This sounds like
>> an even worse problem than we started with!
>
> An interesting side comment on this issue...
>
> This is a slight variation of the example on page 19 (section 4.3) of
> the paper. (Pretend this is actually C++ code, the shared variables are
> all atomic, and their accesses are all relaxed.)
>
> bool x, y, z;
>
> void P0(bool *x, bool *y, bool *z) {
> bool r1, r2;
>
> r1 = *x;
> r2 = *y;
>
> *z = (r1 != r2);
> }
>
> The paper points out that although there is an apparent semantic
> dependency from the load of x to the store to z, if the compiler is
> allowed not to handle atomics as quasi volatile then the dependency
> can be broken. Nevertheless, I am not able to think of a program that
> could exhibit OOTA as a result of breaking the semantic dependency. The
> best I can come up with is this:
>
> [P0 as above]
>
> void P1(bool *x, bool *y, bool *z) {
> bool r3;
>
> r3 = *z;
> *x = r3;
> }
>
> void P2(bool *x, bool *y, bool *z) {
> *y = true;
> }
>
> exists (x=true /\ z=true)
>
> If P2 were not present, this result could not occur in any physical
> execution, even if the dependency in P0 is broken. With P2 this result
> isn't OOTA, even in executions where P0 ends up storing z before loading
> x, because P2 could have executed first, then P0, then P1.
>
> So perhaps this is an example of what you were talking about -- a
> dependency which may or may not be semantic, but either way cannot lead
> to OOTA.
Yes, that looks like an example of what I have in mind.
If at the model level we just say "yes there is a dependency, but no it
does not give any ordering guarantee", then the compiler is still free
to break the dependency like in your example.
A thread P3 { r1 = z; atomic_thread_fence(); r2 = y; }
could still observe r2 == false, r1 == true, "showing" that the
dependency was broken.
This would not violate such a model.
(if we consider consume, then that would need to restrict the compiler
from eliminating the dependency like this)
That is not to say that I am 100% sure that it is possible to define
sdep correctly to make this work.
One problem is if the compiler merges two threads (with an OOTA cycle of
3+ threads), it can turn sdep;rfe;sdep;rfe into sdep;rfi;sdep;rfe.
If sdep is too naive then it is easy to come up with counter examples
where this sdep;rfi;sdep no longer provides ordering, making the whole
sdep;rfe cycle possible.
I am not sure if sdep can be formalized in a way that ensures that this
sdep;rfi;sdep edge would still need to be preserved.
Of course one could have inter-thread semantic dependencies and only forbid
isdep ; rf (e?)
from being reflexive...
Best wishes,
jonas
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-17 16:45 ` Jonas Oberhauser
@ 2025-01-17 19:02 ` Alan Stern
0 siblings, 0 replies; 59+ messages in thread
From: Alan Stern @ 2025-01-17 19:02 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Fri, Jan 17, 2025 at 05:45:50PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/17/2025 um 4:52 PM schrieb Alan Stern:
> > On Thu, Jan 16, 2025 at 06:02:18PM -0500, Alan Stern wrote:
> > > On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
> > This is a slight variation of the example on page 19 (section 4.3) of
> > the paper. (Pretend this is actually C++ code, the shared variables are
> > all atomic, and their accesses are all relaxed.)
> >
> > bool x, y, z;
> >
> > void P0(bool *x, bool *y, bool *z) {
> > bool r1, r2;
> >
> > r1 = *x;
> > r2 = *y;
> >
> > *z = (r1 != r2);
> > }
> >
> > The paper points out that although there is an apparent semantic
> > dependency from the load of x to the store to z, if the compiler is
> > allowed not to handle atomics as quasi volatile then the dependency
> > can be broken. Nevertheless, I am not able to think of a program that
> > could exhibit OOTA as a result of breaking the semantic dependency. The
> > best I can come up with is this:
> >
> > [P0 as above]
> >
> > void P1(bool *x, bool *y, bool *z) {
> > bool r3;
> >
> > r3 = *z;
> > *x = r3;
> > }
> >
> > void P2(bool *x, bool *y, bool *z) {
> > *y = true;
> > }
> >
> > exists (x=true /\ z=true)
> >
> > If P2 were not present, this result could not occur in any physical
> > execution, even if the dependency in P0 is broken. With P2 this result
> > isn't OOTA, even in executions where P0 ends up storing z before loading
> > x, because P2 could have executed first, then P0, then P1.
> >
> > So perhaps this is an example of what you were talking about -- a
> > dependency which may or may not be semantic, but either way cannot lead
> > to OOTA.
>
> Yes, that looks like an example of what I have in mind.
>
> If at the model level we just say "yes there is a dependency, but no it does
> not give any ordering guarantee", then the compiler is still free to break
> the dependency like in your example.
>
> A thread P3 { r1 = z; atomic_thread_fence(); r2 = y; }
> could still observe r2 == false, r1 == true, "showing" that the dependency
> was broken.
That wouldn't prove anything unless P0 had its own memory barrier
somewhere before it stored z.
At any rate, I don't have any ideas on how to characterize semantic
dependencies that can be broken without risking OOTA. (Some people
would say that if a dependency can be broken at all, that demonstrates
it wasn't semantic to begin with.)
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-17 11:29 ` Jonas Oberhauser
@ 2025-01-17 20:01 ` Alan Stern
2025-01-21 10:36 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-17 20:01 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Fri, Jan 17, 2025 at 12:29:34PM +0100, Jonas Oberhauser wrote:
>
>
> Am 1/17/2025 um 12:02 AM schrieb Alan Stern:
> > On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
> > > I'm well aware that an absolute definition of semantic dependencies is not
> > > easy to give.
> >
> > In fact it's undecidable. No tool is ever going to be able to detect
> > semantic dependencies perfectly.
>
> It depends.
> Firstly, let's consider that the tool only runs on finite (or "finitized")
> programs.
A program may be finite; that doesn't mean its executions are. Programs
can go into infinite loops. Are you saying that the tool would stop
verifying executions after (say) a billion steps? But then it would not
be able to detect semantic dependences in programs that run longer.
> Seceondly, your definition depends on the compiler.
>
> So in the sense that we don't know the compiler, it is undecidable.
> But if you fix the compiler, you could still enumerate all executions under
> that compiler and compute whether that compiler has a dependency or not.
You can't enumerate the executions unless you put an artificial bound on
the length of each execution, as noted above.
> But as I mentioned before, I think you can define semantic dependencies more
> appropriately because you don't really need to preserve semantic
> dependencies in C++, and in LKMM (and VMM) you have volatiles that restrict
> what kind of dependency-eliminations the compiler can do.
What makes you think this "more appropriate" definition of semantic
dependency will be any easier to detect than the original one?
> Yes. Actually I wonder if you put this "all loads are volatile" restriction,
> can a globally analysing compiler still have any optimizations that a
> locally analysing compiler can not?
Yes, although whether they are pertinent is open to question. For
example, a globally optimizing compiler may observe that no thread ever
reads the value of a particular shared variable and then eliminate that
variable.
> It rather seems the other way, that the locally analysing quasi-volatile
> compiler can at least to some local optimizations, while the global volatile
> compiler can not (e.g., x=1; y=x; can not be x=1; y=1; for the global
> compiler because x is volatile now).
In the paper we speculate that it may be sufficient to require globally
optimizing compilers to treat atomics as quasi volatile with the added
restriction that loads must not be omitted.
> > But then how would you characterize semantic dependencies, if you want
> > to allow the definition to include some dependencies that aren't
> > semantic but not so many that you ever create a cycle?
>
> I don't know which definition is correct yet, but the point is that you
> don't have to avoid making so many dependencies that you would create a
> cycle. It just forbids the compiler from looking for cycles and optimizing
> based on the existance of the cycle. (Looking for unused vars and removing
> those is still allowed, under the informal argument that this simulates an
> execution where no OOTA happened)
At the moment, the only underlying idea we have driving our notions of
semantic dependency is that a true semantic dependency should be one
which must give rise to order in the machine code. In turn, this order
prevents OOTA cycles from occurring during execution. That is the
essence of the paper.
Can your ideas be fleshed out to a comparable degree?
> > > So for example, if we merge x = y || y = x, the merge can turn it into
> > > x=y=x or y=x=y (and then into an empty program), but not into a cyclic
> > > dependency. So even though one side of the dependency may be violated, for
> > > sake of OOTA, we could still label both sides as dependent.
> >
> > They _are_ both semantically dependent (in the original parallel
> > version, I mean). I don't see what merging has to do with it.
>
> Note that I was considering the case where they are not volatile.
> With a compiler that is not treating them as volatile, which merges the two
> threads, under your definition, there is no semantic dependency in at least
> one direction because there is no hardware realization H where you read
> something else (of course you exclude such compilers, but I think
> realistically they should be allowed).
>
> My point is that we can say they are semantically dependent for the sake of
> OOTA, not derive any ordering from these dependencies other than the
> cyclical one, and therefore allow compilers to one of the two optimizations
> (make x=y no longer depend on y or make y=x no longer depend on x) but noth
> make a cycle analysis to remove both dependencies and generate an OOTA value
> (it can remove both dependencies by leaving x and y unchanged though).
I don't understand.
> > I do not like the idea of tying the definition of OOTA (which needs to
> > apply to every implementation) to a particular clang compiler.
>
> But that is what you have done, no? Whether something is an sdep depends on
> the compiler, so compiler A could generate an execution that is OOTA in the
> sdep definition of compiler B.
Yes, but this does not tie the definition to _one particular_ compiler.
That is, we don't say "This dependency is semantic because of the way
GCC 14.2.1 handles it." Rather, we define for each compiler whether a
dependency is semantic for _that_ compiler.
> (Of course with the assumption of atomic=volatile, it may just be that we
> are back to the beginning and all "naive" semantic dependencies are actually
> semantic dependencies for all compilers).
>
> Anyways what I meant is not about tying the definition of OOTA to one
> compiler or other. As I mentioned I think it can be fine to define OOTA in
> the same way for all compilers.
> What I meant is to specialize the memory model to a specific compiler, as
> long as that is the compiler that is used in reality.
> So long as your code does not depend on the ordering of any semantic
> dependencies, the verification can be cross platform.
>
> And although...
>
> > I'm not particularly concerned about OOTA or semantic dependencies in
> LKMM.
>
> ... there is code that relies on semantic dependencies, e.g. RCU read side
> CS code. (even if we do not care about OOTA).
> For that code, the semantic dependencies must be guaranteed to create
> ordering.
>
> So you either need a definition of semantic dependency that
> a) applies in all cases we practically need and
> b) is guaranteed by all compilers
>
> or we need to live with the fact that we do not have a semantic dependency
> definition that is independent of the compilation (even of the same
> compiler) and need to do our verification for that specific compilation.
Add the qualification that the definition should be practical to
evaluate, and I agree.
> I think for LKMM we could give such a semantic dependency definition because
> it uses volatile, and verify RCU-read-side code. But we currently do not
> have one. What I meant to say is that using the actual (whatever compiler
> you use) optimizations first to remove syntactic dependencies, and then
> verifying under the assumption of whatever dependencies are left, may be
> better than trying to approximate dependencies in some way in cat. Given
> that we want to verify and rely on the code today, not in X years when we
> all agree on what a compiler-independent definition of semantic dependency
> is.
>
> I think for C++ consume we could also give one by simply restricting some
> compiler optimizations for consume loads (and doing whatever needs to be
> done on alpha). Or just kick it out and not have any dependency ordering
> except the global OOTA case.
>
> Sorry for the confusion, I think there are so many different
> combinations/battlefields (OOTA vs just semantic dependencies,
> volatile/non-volatile atomics, verifying the model vs verifying a piece of
> code etc.) that it becomes hard for me not to confuse myself, let alone
> others :))
I know what that feels like!
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-17 20:01 ` Alan Stern
@ 2025-01-21 10:36 ` Jonas Oberhauser
2025-01-21 16:39 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-21 10:36 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
Am 1/17/2025 um 9:01 PM schrieb Alan Stern:
> On Fri, Jan 17, 2025 at 12:29:34PM +0100, Jonas Oberhauser wrote:
>>
>>
>> Am 1/17/2025 um 12:02 AM schrieb Alan Stern:
>>> On Thu, Jan 16, 2025 at 08:08:22PM +0100, Jonas Oberhauser wrote:
>>>> I'm well aware that an absolute definition of semantic dependencies is not
>>>> easy to give.
>>>
>>> In fact it's undecidable. No tool is ever going to be able to detect
>>> semantic dependencies perfectly.
>>
>> It depends.
>> Firstly, let's consider that the tool only runs on finite (or "finitized")
>> programs.
>
> A program may be finite; that doesn't mean its executions are. Programs
> can go into infinite loops. Are you saying that the tool would stop
> verifying executions after (say) a billion steps? But then it would not
> be able to detect semantic dependences in programs that run longer.
Yes, what I said is not fully correct. Finite (non-trivial-)loop-free
programs, or programs with finite (representative) execution spaces,
would have been more correct.
>
>> Seceondly, your definition depends on the compiler.
>>
>> So in the sense that we don't know the compiler, it is undecidable.
>> But if you fix the compiler, you could still enumerate all executions under
>> that compiler and compute whether that compiler has a dependency or not.
>
> You can't enumerate the executions unless you put an artificial bound on
> the length of each execution, as noted above.
Note that for many practical cases it is possible to write test cases
where the bound is not artificial in the sense that it is at least as
large as the bound needed for exhaustive enumeration.
>> But as I mentioned before, I think you can define semantic dependencies more
>> appropriately because you don't really need to preserve semantic
>> dependencies in C++, and in LKMM (and VMM) you have volatiles that restrict
>> what kind of dependency-eliminations the compiler can do.
>
> What makes you think this "more appropriate" definition of semantic
> dependency will be any easier to detect than the original one?
For starters, as you mentioned, the compiler has to assume any value is
possible.
Which means that if any other value would lead to a "different
behavior", you know you have a semantic dependency. You can detect this
in a per-thread manner, independently of the compiler.
Without the assumption of volatile, even a different value that could
actually be generated in another run of the same program does not prove
a semantic dependency.
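A sketch of that per-thread check (purely illustrative; the driver and
instrumentation below are hypothetical, and "different behavior" is
simplified here to "different stored value"):

	/* Run the thread body twice, pretending the load under test
	 * returned two different values, and compare what it stores.
	 * Returns nonzero if the store depends on the load. */
	int depends_on_load(void (*body)(int loaded, int *stored))
	{
		int s0, s1;

		body(0, &s0);
		body(1, &s1);
		return s0 != s1;
	}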
>> Yes. Actually I wonder if you put this "all loads are volatile" restriction,
>> can a globally analysing compiler still have any optimizations that a
>> locally analysing compiler can not?
>
> Yes, although whether they are pertinent is open to question. For
> example, a globally optimizing compiler may observe that no thread ever
> reads the value of a particular shared variable and then eliminate that
> variable.
Oh, I meant the "all atomic objects are volatile" restriction, not just
the loads. In that case, all stores to the object - even if never read -
still need to be generated.
Are there still any optimizations?
>> It rather seems the other way, that the locally analysing quasi-volatile
>> compiler can at least to some local optimizations, while the global volatile
>> compiler can not (e.g., x=1; y=x; can not be x=1; y=1; for the global
>> compiler because x is volatile now).
>
> In the paper we speculate that it may be sufficient to require globally
> optimizing compilers to treat atomics as quasi volatile with the added
> restriction that loads must not be omitted.
I see.
>>> But then how would you characterize semantic dependencies, if you want
>>> to allow the definition to include some dependencies that aren't
>>> semantic but not so many that you ever create a cycle?
>>
>> I don't know which definition is correct yet, but the point is that you
>> don't have to avoid making so many dependencies that you would create a
>> cycle. It just forbids the compiler from looking for cycles and optimizing
>> based on the existance of the cycle. (Looking for unused vars and removing
>> those is still allowed, under the informal argument that this simulates an
>> execution where no OOTA happened)
>
> At the moment, the only underlying ideas we have driving our notions of
> semantic dependency is that a true semantic dependency should be one
> which must give rise to order in the machine code. In turn, this order
> prevents OOTA cycles from occuring during execution. That is the
> essence of the paper.
>
> Can your ideas be fleshed out to a comparable degree?
Well, I certainly have not put as much deep thought into it as you have,
so it is certainly more fragile. But my current thoughts are along these
lines:
We consider inter-thread semantic dependencies (isdep) based on the set
of allowed executions + thread local optimizations + what would be
allowed to happen if rfe edges along the dependencies become rfi edges
due to merging. So the definition is not compiler-specific.
Those provide order at machine level unless the compiler restricts the
set of executions through its choices, especially cross-thread
optimizations, and then uses the restricted set of executions (i.e.,
possible input range) to optimize the execution further.
I haven't thought deeply about all the different optimizations that are
possible there.
For an example of how it may not provide order, if we have accesses
related by isdep;rfe;isdep, then the compiler could merge all the
involved threads into one, and the accesses could no longer have any
dependency.
So it is important that one does not forbid isdep;rf cycles in general;
the axiom would only be that isdep;rf is irreflexive.
When merging threads, if in an OOTA execution there is an inter-thread
semantic dependency from r in one of the merged threads to w in one of
the merged threads, such that r reads from a non-merged thread and w is
read by a non-merged thread in the OOTA cycle, then it is still an
inter-thread dependency which still preserves the order. This prevents
the full cycle even if some other sub-edges within the merged threads
are now no longer dependencies.
But if r is reading from w in the OOTA cycle, then the compiler (which
is not allowed to look for the cycle) has to put r before w in the
merged po, preventing r from reading w. (it can also omit r by giving it
a value that it could legally read in this po, but it still can't get
its value from w this way).
E.g.,
P0 {
y = x;
}
P1 {
z = y;
}
... (some way to assign x dependent on z)
would have an inter-thread dependency from the load of x to the store to
z. If P0 and P1 are merged, and there are no other accesses to y, then
the intra-thread dependency from loading x to store to y etc. are
eliminated, but the inter-thread dependency from x to z remains.
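An illustrative sketch of such a merge (names as in the example above):

	void P0_P1_merged(int *x, int *y, int *z)
	{
		int r = *x;	/* the load of x                                 */
		*y = r;		/* may even be dropped if y is never read        */
		*z = r;		/* the store to z still depends on the load of x */
	}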
>>>> So for example, if we merge x = y || y = x, the merge can turn it into
>>>> x=y=x or y=x=y (and then into an empty program), but not into a cyclic
>>>> dependency. So even though one side of the dependency may be violated, for
>>>> sake of OOTA, we could still label both sides as dependent.
>>>
>>> They _are_ both semantically dependent (in the original parallel
>>> version, I mean). I don't see what merging has to do with it.
>>
>> Note that I was considering the case where they are not volatile.
>> With a compiler that is not treating them as volatile, which merges the two
>> threads, under your definition, there is no semantic dependency in at least
>> one direction because there is no hardware realization H where you read
>> something else (of course you exclude such compilers, but I think
>> realistically they should be allowed).
>>
>> My point is that we can say they are semantically dependent for the sake of
>> OOTA, not derive any ordering from these dependencies other than the
>> cyclical one, and therefore allow compilers to one of the two optimizations
>> (make x=y no longer depend on y or make y=x no longer depend on x) but not
>> make a cycle analysis to remove both dependencies and generate an OOTA value
>> (it can remove both dependencies by leaving x and y unchanged though).
>
> I don't understand.
Does the explanation above help?
>
>>> I do not like the idea of tying the definition of OOTA (which needs to
>>> apply to every implementation) to a particular clang compiler.
>>
>> But that is what you have done, no? Whether something is an sdep depends on
>> the compiler, so compiler A could generate an execution that is OOTA in the
>> sdep definition of compiler B.
>
> Yes, but this does not tie the definition to _one particular_ compiler.
> That is, we don't say "This dependency is semantic because of the way
> GCC 14.2.1 handles it." Rather, we define for each compiler whether a
> dependency is semantic for _that_ compiler.
Yes, but the result is still that every compiler has its own memory
model (at least with respect to which graphs are being ruled out as
OOTA). So there is no definition of 'G is OOTA', only 'G is OOTA on
compiler x'.
The example I gave of the tool verifying the program relative to one
specific compiler is also not giving a definition of 'G is OOTA', in
fact, it does not specify OOTA at all; it simply says ``with compiler x,
we get the following "semantic dependencies" and the following graphs...''
And as long as compiler x does not generate OOTA, there will be no OOTA
graphs among those.
So it does not solve or tackle the theoretical problem in any way, and
can not do cross-compiler verification. But it will already apply the
'naive-dependency-breaking optimizations' that you would not see in e.g.
herd7.
>> So you either need a definition of semantic dependency that
>> a) applies in all cases we practically need and
>> b) is guaranteed by all compilers
>>
>> or we need to live with the fact that we do not have a semantic dependency
>> definition that is independent of the compilation (even of the same
>> compiler) and need to do our verification for that specific compilation.
>
> Add the qualification that the definition should be practical to
> evaluate, and I agree.
Yes, an important point. And hard to resolve as well.
Best wishes,
jonas
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-21 10:36 ` Jonas Oberhauser
@ 2025-01-21 16:39 ` Alan Stern
2025-01-22 3:46 ` Jonas Oberhauser
0 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-01-21 16:39 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Tue, Jan 21, 2025 at 11:36:01AM +0100, Jonas Oberhauser wrote:
> > What makes you think this "more appropriate" definition of semantic
> > dependency will be any easier to detect than the original one?
>
> For starters, as you mentioned, the compiler has to assume any value is
> possible.
>
> Which means that if any other value would lead to a "different behavior",
> you know you have a semantic dependency. You can detect this in a per-thread
> manner, independently of the compiler.
How? The question is not as simple as it sounds. What counts as
"different behavior"? What if some loads take place with the other
value that didn't take place with the original value? What if
completely different code ends up executing but it stores the same
values as before?
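(A toy fragment of my own, just to illustrate that last case:

	if (r1 == 0)
		WRITE_ONCE(*y, 1);	/* branch taken with one value of r1 */
	else
		WRITE_ONCE(*y, 1);	/* different code runs, same store */

Does flipping the value of r1 count as "different behavior" here, even
though exactly the same store is performed either way?)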
> Without the assumption of volatile, even a different value that could
> actually be generated in another run of the same program does not prove a
> semantic dependency.
So then how do you tell whether there is a semantic dependency?
> > > Yes. Actually I wonder if you put this "all loads are volatile" restriction,
> > > can a globally analysing compiler still have any optimizations that a
> > > locally analysing compiler can not?
> >
> > Yes, although whether they are pertinent is open to question. For
> > example, a globally optimizing compiler may observe that no thread ever
> > reads the value of a particular shared variable and then eliminate that
> > variable.
>
> Oh, I meant the "all atomic objects are volatile" restriction, not just the
> loads. In that case, all stores to the object - even if never read - still need
> to be generated.
>
> Are there still any optimizations?
Perhaps not any that affect shared variable accesses. In a way, that
was the intention.
> Well, I certainly have not put as much deep thought into it as you have, so
> it is certainly more fragile. But my current thoughts are along these lines:
>
> We consider inter-thread semantic dependencies (isdep) based on the set of
> allowed executions + thread local optimizations + what would be allowed to
> happen if rfe edges along the dependencies become rfi edges due to merging.
> So the definition is not compiler-specific.
How do you know what thread-local optimizations may be applied?
Different compilers use different ones, and new ones are constantly
being invented.
Why not consider global optimizations? Yes, they are the same as
thread-local optimizations if all the threads have been merged into one,
but what if they haven't been merged?
For that matter, why bring thread merging into the discussion?
> Those provide order at machine level unless the compiler restricts the set
> of executions through its choices, especially cross-thread optimizations,
> and then uses the restricted set of executions (i.e., possible input range)
> to optimize the execution further.
> I haven't thought deeply about all the different optimizations that are
> possible there.
>
> For an example of how it may not provide order, if we have accesses related
> by isdep;rfe;isdep, then the compiler could merge all the involved threads
> into one, and the accesses could no longer have any dependency.
>
> So it is important that one can not forbid isdep;rf cycles, the axiom would
> be that isdep;rf is irreflexive.
>
>
> When merging threads, if in an OOTA execution there is an inter-thread
> semantic dependency from r in one of the merged threads to w in one of the
> merged threads, such that r reads from a non-merged thread and w is read by
> a non-merged thread in the OOTA cycle, then it is still an inter-thread
> dependency which still preserves the order. This prevents the full cycle
> even if some other sub-edges within the merged threads are now no longer
> dependencies.
>
> But if r is reading from w in the OOTA cycle, then the compiler (which is
> not allowed to look for the cycle) has to put r before w in the merged po,
> preventing r from reading w. (it can also omit r by giving it a value that
> it could legally read in this po, but it still can't get its value from w
> this way).
>
> E.g.,
> P0 {
> y = x;
> }
> P1 {
> z = y;
> }
> ... (some way to assign x dependent on z)
>
> would have an inter-thread dependency from the load of x to the store to z.
> If P0 and P1 are merged, and there are no other accesses to y, then the
> intra-thread dependencies from the load of x to the store to y etc. are eliminated,
> but the inter-thread dependency from x to z remains.
> > I don't understand.
>
> Does the explanation above help?
No. I don't want to think about thread merging, and at first glance it
looks like you don't want to think about anything else.
> > > > I do not like the idea of tying the definition of OOTA (which needs to
> > > > apply to every implementation) to a particular clang compiler.
> > >
> > > But that is what you have done, no? Whether something is an sdep depends on
> > > the compiler, so compiler A could generate an execution that is OOTA in the
> > > sdep definition of compiler B.
> >
> > Yes, but this does not tie the definition to _one particular_ compiler.
> > That is, we don't say "This dependency is semantic because of the way
> > GCC 14.2.1 handles it." Rather, we define for each compiler whether a
> > dependency is semantic for _that_ compiler.
>
> Yes, but the result is still that every compiler has its own memory model
> (at least with respect to which graphs are ruled out as OOTA). So
> there is no definition of 'G is OOTA', only 'G is OOTA on compiler x'.
Our definition of OOTA and semantic dependency does not apply to graphs;
it applies to particular hardware executions (together with the abstract
executions that they realize).
Besides, it is still possible to use our definition to talk about
semantic dependency in a compiler-independent way. Namely, if a
dependency is semantic relative to _every_ compiler then we can say it
is absolutely semantic.
> The example I gave of the tool verifying the program relative to one
> specific compiler is also not giving a definition of 'G is OOTA', in fact,
> it does not specify OOTA at all; it simply says ``with compiler x, we get
> the following "semantic dependencies" and the following graphs...''
>
> And as long as compiler x does not generate OOTA, there will be no OOTA
> graphs among those.
>
> So it does not solve or tackle the theoretical problem in any way, and can
> not do cross-compiler verification. But it will already apply the
> 'naive-dependency-breaking optimizations' that you would not see in e.g.
> herd7.
Okay, fine. But we're trying to come up with general characterizations,
and it appears that you're doing something quite different.
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-21 16:39 ` Alan Stern
@ 2025-01-22 3:46 ` Jonas Oberhauser
2025-01-22 19:11 ` Alan Stern
0 siblings, 1 reply; 59+ messages in thread
From: Jonas Oberhauser @ 2025-01-22 3:46 UTC (permalink / raw)
To: Alan Stern
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On 1/21/2025 at 5:39 PM, Alan Stern wrote:
> On Tue, Jan 21, 2025 at 11:36:01AM +0100, Jonas Oberhauser wrote:
>>> What makes you think this "more appropriate" definition of semantic
>>> dependency will be any easier to detect than the original one?
>>
>> For starters, as you mentioned, the compiler has to assume any value is
>> possible.
>>
>> Which means that if any other value would lead to a "different behavior",
>> you know you have a semantic dependency. You can detect this in a per-thread
>> manner, independently of the compiler.
>
> How? The question is not as simple as it sounds. What counts as
> "different behavior"? What if some loads take place with the other
> value that didn't take place with the original value? What if
> completely different code ends up executing but it stores the same
> values as before?
I agree that it is not easy, mostly because there's no good way to say
whether two stores from two executions are "the same".
Besides the complications mentioned in your paper (e.g., another store
to the same address being on the new code path), I would also need to
think about barriers (e.g., a relaxed store in one execution, but an sc
one in another execution) and about other kinds of synchronization (what
if there are other atomic accesses in between that establish synchronization?).
One could take your definition (with the requirements about H dropped),
and perhaps since in this specific example all atomic accesses are
volatile, the restriction about counting accesses does not sting as much.
>
>> Without the assumption of volatile, even a different value that could
>> actually be generated in another run of the same program does not prove a
>> semantic dependency.
>
> So then how do you tell whether there is a semantic dependency?
>
>>>> Yes. Actually I wonder if you put this "all loads are volatile" restriction,
>>>> can a globally analysing compiler still have any optimizations that a
>>>> locally analysing compiler can not?
>>>
>>> Yes, although whether they are pertinent is open to question. For
>>> example, a globally optimizing compiler may observe that no thread ever
>>> reads the value of a particular shared variable and then eliminate that
>>> variable.
>>
>> Oh, I meant the "all atomic objects are volatile" restriction, not just the
>> loads. In that case, all stores to the object - even if never read - still need
>> to be generated.
>>
>> Are there still any optimizations?
>
> Perhaps not any that affect shared variable accesses. In a way, that
> was the intention.
Yes, but it becomes a bit strange then to treat the "globally analyzing
compiler" as a harder problem. You have made a globally analyzing
compiler that can not globally analyze. I understand that makes the
argument feasible, but I am not sure if this is how compilers really
behave (or at least will still behave in the future).
>> Well, I certainly have not put as much deep thought into it as you have, so
>> it is certainly more fragile. But my current thoughts are along these lines:
>>
>> We consider inter-thread semantic dependencies (isdep) based on the set of
>> allowed executions + thread local optimizations + what would be allowed to
>> happen if rfe edges along the dependencies become rfi edges due to merging.
>> So the definition is not compiler-specific.
>
> How do you know what thread-local optimizations may be applied?
> Different compilers use different ones, and new ones are constantly
> being invented.
I'm not sure it is necessary to know specific optimizations.
I don't have a satisfactory answer now.
But perhaps it is possible to define this through the abstract machine,
maybe with a detour through the set of allowed C realizations.
Something like "given some execution G, and a subset of threads (T)i,
and a read r and write w in those executions, if for every C realization
of threads (T)i together such that all executions of the realization
under all rf relations that exist in some original execution G' modulo
some po-preserving mapping have the same observable side effects as that
original execution G', there is a sequence of syntactic dependency + rf
from r to w, then there is an inter-thread semantic dependency from r to w".
But this makes it really hard to prove that there is a dependency, since
it quantifies over all C programs. And it only takes into account
C-level optimizations, not optimizations specific to some hardware
platform (which for all we know may have some transactional memory or
powerful write speculation which the compiler knows about and uses in
its optimizations).
I am not sure there is a better definition.
> Why not consider global optimizations? Yes, they are the same as
> thread-local optimizations if all the threads have been merged into one,
> but what if they haven't been merged?
Some global optimizations are already accounted for by the fact that we
only consider the set of allowed executions.
So if the compiler can see that in all executions the value of some
variable is always 0 or 1 - for example, because those are the only
kinds of stores to that variable - it might use that to do local
optimizations, such as eliminating some switch cases, which in turn can
eliminate control dependencies.
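For instance (a made-up fragment, only to illustrate the kind of rewrite
I mean):

	int r = *x;	/* suppose every store to *x anywhere is 0 or 1 */

	switch (r) {
	case 0:
		*y = 1;
		break;
	case 1:
		*y = 1;
		break;
	default:
		*y = 2;	/* dead if *x really is always 0 or 1 */
		break;
	}

A compiler that knows the possible values of *x from the whole program
can drop the default case and then collapse the switch into a plain
"*y = 1;", and the control dependency from the load of *x disappears
with it.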
> For that matter, why bring thread merging into the discussion?
For one, it sounds impractical to make a compiler that does advanced
global optimizations without merging the threads. It's probably easy
enough to do on toy examples, but for realistic applications it seems
completely infeasible, and additionally, hard to gain much benefit from it.
So I'm not sure it is necessary to solve the more advanced problem.
For another, thread merging defeats per-thread semantic dependencies.
So unless all optimizations related to that are ruled out (e.g., by
saying all accesses are volatile), it needs to be considered either
specifically or by a more powerful, simpler abstraction, which I don't have.
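To recall the earlier sketch: if P0 { y = x; } and P1 { z = y; } are
merged, and nothing else accesses y, the merged code might become (my
own illustration):

	int tmp = x;	/* the load of x */
	y = tmp;	/* can even be dropped, since nothing else accesses y */
	z = tmp;	/* the store to z; the x -> z ordering survives */

The per-thread dependency through y is gone, which is why I only want
to rely on the inter-thread dependency from the load of x to the store
to z.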
>
> I don't want to think about thread merging, and at first glance it
> looks like you don't want to think about anything else.
I *want* to think about other practical global optimizations that are
not included in the optimizations due to knowing the set of possible
executions.
I just haven't been able to.
>>>>> I do not like the idea of tying the definition of OOTA (which needs to
>>>>> apply to every implementation) to a particular clang compiler.
>>>>
>>>> But that is what you have done, no? Whether something is an sdep depends on
>>>> the compiler, so compiler A could generate an execution that is OOTA in the
>>>> sdep definition of compiler B.
>>>
>>> Yes, but this does not tie the definition to _one particular_ compiler.
>>> That is, we don't say "This dependency is semantic because of the way
>>> GCC 14.2.1 handles it." Rather, we define for each compiler whether a
>>> dependency is semantic for _that_ compiler.
>>
>> Yes, but the result is still that every compiler has its own memory model
>> (at least with respect to which graphs are ruled out as OOTA). So
>> there is no definition of 'G is OOTA', only 'G is OOTA on compiler x'.
>
> Our definition of OOTA and semantic dependency does not apply to graphs;
> it applies to particular hardware executions (together with the abstract
> executions that they realize).
True, but it immediately induces a predicate over graphs indexed by the
compiler (by looking at hardware executions generated by x and the
graphs from the realized abstract executions).
>
> Besides, it is still possible to use our definition to talk about
> semantic dependency in a compiler-independent way. Namely, if a
> dependency is semantic relative to _every_ compiler then we can say it
> is absolutely semantic.
Sure, but that definition is useless.
For example, in the running OOTA example x=y||y=x, there are no
absolutely semantic dependencies. One compiler can compile it so that
x=y no longer depends on y, and another so that y=x no longer depends
on x. In fact, even a naive non-optimizing
compiler has no dependencies here under your definition because there
are no hardware executions with other values.
In my informal definition, there is (intended to be) an inter-thread
dependency from the load from x in T2 to the store to x in T1, and I
would claim there is no execution in which that load can read from that
store. I would not claim anything about the order of the load from y and
the store to x.
>> The example I gave of the tool verifying the program relative to one
>> specific compiler is also not giving a definition of 'G is OOTA', in fact,
>> it does not specify OOTA at all; it simply says ``with compiler x, we get
>> the following "semantic dependencies" and the following graphs...''
>>
>> And as long as compiler x does not generate OOTA, there will be no OOTA
>> graphs among those.
>>
>> So it does not solve or tackle the theoretical problem in any way, and can
>> not do cross-compiler verification. But it will already apply the
>> 'naive-dependency-breaking optimizations' that you would not see in e.g.
>> herd7.
>
> Okay, fine. But we're trying to come up with general characterizations,
> and it appears that you're doing something quite different.
Well, it wasn't me who did this; it was Hernan's group, for other
practical concerns. But yes, I agree.
It's not a solution to the very hard problem you're trying to solve, but
a solution to perhaps the more immediate problem people have while
looking at real code.
Best wishes,
jonas
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-22 3:46 ` Jonas Oberhauser
@ 2025-01-22 19:11 ` Alan Stern
0 siblings, 0 replies; 59+ messages in thread
From: Alan Stern @ 2025-01-22 19:11 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: paulmck, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jan 22, 2025 at 04:46:03AM +0100, Jonas Oberhauser wrote:
> > > > > Yes. Actually I wonder if you put this "all loads are volatile" restriction,
> > > > > can a globally analysing compiler still have any optimizations that a
> > > > > locally analysing compiler can not?
> > > >
> > > > Yes, although whether they are pertinent is open to question. For
> > > > example, a globally optimizing compiler may observe that no thread ever
> > > > reads the value of a particular shared variable and then eliminate that
> > > > variable.
> > >
> > > Oh, I meant the "all atomic objects are volatile" restriction, not just the
> > > loads. In that case, all stores to the object - even if never read - still need
> > > to be generated.
> > >
> > > Are there still any optimizations?
> >
> > Perhaps not any that affect shared variable accesses. In a way, that
> > was the intention.
>
> Yes, but it becomes a bit strange then to treat the "globally analyzing
> compiler" as a harder problem. You have made a globally analyzing compiler
> that can not globally analyze. I understand that makes the argument
> feasible, but I am not sure if this is how compilers really behave (or at
> least will still behave in the future).
There's still an important difference. A globally optimizing compiler
is allowed to generate different object code for a thread (containing
the same source code) in different programs, depending on the source for
the other threads. A locally analyzing compiler is not allowed to do
this; it will always generate the same object code for threads
containing the same source code, regardless of what the rest of the
program does.
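As a toy example of the difference (mine, and not meant to be a
realistic optimization): consider a thread whose entire source is

	*x = *y;	/* plain accesses, for the sake of argument */

A locally analyzing compiler must emit the same object code for this
thread no matter which program it appears in. A globally optimizing
compiler, compiling it as part of a program in which no other thread
ever writes to *y, may fold in the statically known initial value of *y
and emit the equivalent of "*x = 0;", whereas in a program where some
other thread does write to *y it cannot; same source, different object
code.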
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-01-06 21:40 [RFC] tools/memory-model: Rule out OOTA Jonas Oberhauser
` (2 preceding siblings ...)
2025-01-07 16:09 ` Alan Stern
@ 2025-07-23 0:43 ` Paul E. McKenney
2025-07-23 7:26 ` Hernan Ponce de Leon
` (2 more replies)
3 siblings, 3 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-07-23 0:43 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> example shared on this list a few years ago:
Apologies for being slow, but I have finally added the litmus tests in
this email thread to the https://github.com/paulmckrcu/litmus repo.
It is quite likely that I have incorrectly intuited the missing portions
of the litmus tests, especially the two called out in the commit log
below. If you have time, please do double-check.
And the updated (and condensed!) version of the C++ OOTA paper may be
found here, this time with a proposed change to the standard:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3692r1.pdf
Thanx, Paul
------------------------------------------------------------------------
commit fd17e8fceb75326e159ba3aa6fdb344f74f5c7a5
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Tue Jul 22 17:21:19 2025 -0700
manual/oota: Add Jonas and Alan OOTA examples
Each of these new litmus tests contains the URL of the email message
that I took it from.
Please note that I had to tweak the example leading up to
C-JO-OOTA-4.litmus, and I might well have misinterpreted Jonas's "~"
operator.
Also, C-JO-OOTA-7.litmus includes a "*r2 = a" statement that makes herd7
very unhappy. On the other hand, initializing registers to the address
of a variable is straightforward, as shown in the resulting litmus test.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff --git a/manual/oota/C-AS-OOTA-1.litmus b/manual/oota/C-AS-OOTA-1.litmus
new file mode 100644
index 00000000..81a873a7
--- /dev/null
+++ b/manual/oota/C-AS-OOTA-1.litmus
@@ -0,0 +1,40 @@
+C C-AS-OOTA-1
+
+(*
+ * Result: Sometimes
+ *
+ * Because smp_rmb() combined with smp_wmb() does not order earlier
+ * reads against later writes.
+ *
+ * https://lore.kernel.org/all/a3bf910f-509a-4ad3-a1cc-4b14ef9b3259@rowland.harvard.edu
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ *a = *b;
+ }
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ *b = *a;
+ }
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+exists (0:r1=1 /\ 1:r1=1)
diff --git a/manual/oota/C-AS-OOTA-2.litmus b/manual/oota/C-AS-OOTA-2.litmus
new file mode 100644
index 00000000..c672b0e7
--- /dev/null
+++ b/manual/oota/C-AS-OOTA-2.litmus
@@ -0,0 +1,33 @@
+C C-AS-OOTA-2
+
+(*
+ * Result: Always
+ *
+ * If we were using C-language relaxed atomics instead of volatiles,
+ * the compiler *could* eliminate the first WRITE_ONCE() in each process,
+ * then also each process's local variable, thus having an undefined value
+ * for each of those local variables. But this cannot happen given that
+ * we are using Linux-kernel _ONCE() primitives.
+ *
+ * https://lore.kernel.org/all/c2ae9bca-8526-425e-b9b5-135004ad59ad@rowland.harvard.edu/
+ *)
+
+{}
+
+P0(int *a, int *b)
+{
+ int r0 = READ_ONCE(*a);
+
+ WRITE_ONCE(*b, r0);
+ WRITE_ONCE(*b, 2);
+}
+
+P1(int *a, int *b)
+{
+ int r1 = READ_ONCE(*b);
+
+ WRITE_ONCE(*a, r0);
+ WRITE_ONCE(*a, 2);
+}
+
+exists ((0:r0=0 \/ 0:r0=2) /\ (1:r1=0 \/ 1:r1=2))
diff --git a/manual/oota/C-JO-OOTA-1.litmus b/manual/oota/C-JO-OOTA-1.litmus
new file mode 100644
index 00000000..6ab437b4
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-1.litmus
@@ -0,0 +1,40 @@
+C C-JO-OOTA-1
+
+(*
+ * Result: Never
+ *
+ * But Sometimes in LKMM as of early 2025, given that 42 is a possible
+ * value for things like S19..
+ *
+ * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ *a = *b;
+ }
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ *b = *a;
+ }
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+exists (b=42)
diff --git a/manual/oota/C-JO-OOTA-2.litmus b/manual/oota/C-JO-OOTA-2.litmus
new file mode 100644
index 00000000..ad708c60
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-2.litmus
@@ -0,0 +1,44 @@
+C C-JO-OOTA-2
+
+(*
+ * Result: Never
+ *
+ * But Sometimes in LKMM as of early 2025, given that 42 is a possible
+ * value for things like S23.
+ *
+ * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2 = 0;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = *b;
+ }
+ WRITE_ONCE(*a, r2);
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2 = 0;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = *a;
+ }
+ WRITE_ONCE(*b, r2);
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+exists (b=42)
diff --git a/manual/oota/C-JO-OOTA-3.litmus b/manual/oota/C-JO-OOTA-3.litmus
new file mode 100644
index 00000000..633b8334
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-3.litmus
@@ -0,0 +1,46 @@
+C C-JO-OOTA-3
+
+(*
+ * Result: Never
+ *
+ * But LKMM finds the all-ones result, perhaps due to not tracking
+ * control dependencies out of the "if" statement.
+ *
+ * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ r2 = READ_ONCE(*b);
+ if (r1 == 1) {
+ r2 = *b;
+ }
+ WRITE_ONCE(*a, r2);
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ r2 = READ_ONCE(*a);
+ if (r1 == 1) {
+ r2 = *a;
+ }
+ WRITE_ONCE(*b, r2);
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+exists (0:r1=1 /\ 1:r1=1)
diff --git a/manual/oota/C-JO-OOTA-4.litmus b/manual/oota/C-JO-OOTA-4.litmus
new file mode 100644
index 00000000..cab7ebb6
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-4.litmus
@@ -0,0 +1,43 @@
+C C-JO-OOTA-4
+
+(*
+ * Result: Never
+ *
+ * And LKMM agrees, which might be a surprise.
+ *
+ * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+ int r3;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ r2 = *b;
+ r3 = r1 == 0;
+ WRITE_ONCE(*a, (r3 + 1) & r2);
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+ int r3;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ r2 = *a;
+ r3 = r1 == 0;
+ WRITE_ONCE(*b, (r3 + 1) & r2);
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+exists (0:r1=1 /\ 1:r1=1)
diff --git a/manual/oota/C-JO-OOTA-5.litmus b/manual/oota/C-JO-OOTA-5.litmus
new file mode 100644
index 00000000..145c8378
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-5.litmus
@@ -0,0 +1,44 @@
+C C-JO-OOTA-5
+
+(*
+ * Result: Never
+ *
+ * But LKMM finds the all-ones result, perhaps due to r2 being unused.
+ *
+ * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = READ_ONCE(*a);
+ }
+ *b = 1;
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = READ_ONCE(*b);
+ }
+ *a = 1;
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+locations [0:r2;1:r2]
+exists (0:r1=1 /\ 1:r1=1)
diff --git a/manual/oota/C-JO-OOTA-6.litmus b/manual/oota/C-JO-OOTA-6.litmus
new file mode 100644
index 00000000..942e6c82
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-6.litmus
@@ -0,0 +1,44 @@
+C C-JO-OOTA-6
+
+(*
+ * Result: Never
+ *
+ * But LKMM finds the all-ones result, due to OOTA on r2.
+ *
+ * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = READ_ONCE(*a);
+ }
+ *b = r2;
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = READ_ONCE(*b);
+ }
+ *a = r2;
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+locations [0:r2;1:r2]
+exists (0:r1=1 /\ 1:r1=1)
diff --git a/manual/oota/C-JO-OOTA-7.litmus b/manual/oota/C-JO-OOTA-7.litmus
new file mode 100644
index 00000000..31c0b8ae
--- /dev/null
+++ b/manual/oota/C-JO-OOTA-7.litmus
@@ -0,0 +1,47 @@
+C C-JO-OOTA-7
+
+(*
+ * Result: Never
+ *
+ * But LKMM finds the all-ones result, due to OOTA on r2.
+ *
+ * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
+ *)
+
+{
+ 0:r2=a;
+ 1:r2=b;
+}
+
+P0(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = READ_ONCE(*a);
+ }
+ *r2 = a;
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y)
+{
+ int r1;
+ int r2;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ r2 = READ_ONCE(*b);
+ }
+ *r2 = b;
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+locations [0:r2;1:r2]
+exists (0:r1=1 /\ 1:r1=1)
diff --git a/manual/oota/C-PM-OOTA-1.litmus b/manual/oota/C-PM-OOTA-1.litmus
new file mode 100644
index 00000000..e771e3c9
--- /dev/null
+++ b/manual/oota/C-PM-OOTA-1.litmus
@@ -0,0 +1,37 @@
+C C-PM-OOTA-1
+
+(*
+ * Result: Never
+ *
+ * LKMM agrees.
+ *
+ * https://lore.kernel.org/all/9a0dccbb-bfa7-4b33-ac1a-daa9841b609a@paulmck-laptop/
+ *)
+
+{}
+
+P0(int *a, int *b, int *x, int *y) {
+ int r1;
+
+ r1 = READ_ONCE(*x);
+ smp_rmb();
+ if (r1 == 1) {
+ WRITE_ONCE(*a, *b);
+ }
+ smp_wmb();
+ WRITE_ONCE(*y, 1);
+}
+
+P1(int *a, int *b, int *x, int *y) {
+ int r1;
+
+ r1 = READ_ONCE(*y);
+ smp_rmb();
+ if (r1 == 1) {
+ WRITE_ONCE(*b, *a);
+ }
+ smp_wmb();
+ WRITE_ONCE(*x, 1);
+}
+
+exists b=42
^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 0:43 ` Paul E. McKenney
@ 2025-07-23 7:26 ` Hernan Ponce de Leon
2025-07-23 16:39 ` Paul E. McKenney
2025-07-23 17:13 ` Alan Stern
2025-07-23 19:25 ` Alan Stern
2 siblings, 1 reply; 59+ messages in thread
From: Hernan Ponce de Leon @ 2025-07-23 7:26 UTC (permalink / raw)
To: paulmck, Jonas Oberhauser
Cc: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm
On 7/23/2025 2:43 AM, Paul E. McKenney wrote:
> On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
>> The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
>> example shared on this list a few years ago:
>
> Apologies for being slow, but I have finally added the litmus tests in
> this email thread to the https://github.com/paulmckrcu/litmus repo.
I do not understand some of the comments in the preamble of the tests:
(*
* Result: Never
*
* But Sometimes in LKMM as of early 2025, given that 42 is a possible
* value for things like S19..
*
*
https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
*)
I see that herd7 reports one of the states to be [b]=S16. Is this
supposed to be some kind of symbolic state (i.e., any value is possible)?
The value in the "Result" is what we would like the model to say if we
would have a perfect version of dependencies, right?
>
> It is quite likely that I have incorrectly intuited the missing portions
> of the litmus tests, especially the two called out in the commit log
> below. If you have time, please do double-check.
I read the "On the other hand" from the commit log as "this fixes the
problem". However I still get the following error when running
C-JO-OOTA-7 with herd7
Warning: File "manual/oota/C-JO-OOTA-7.litmus": Non-symbolic memory
access found on '[0]' (User error)
Hernan
> And the updated (and condensed!) version of the C++ OOTA paper may be
> found here, this time with a proposed change to the standard:
>
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3692r1.pdf
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit fd17e8fceb75326e159ba3aa6fdb344f74f5c7a5
> Author: Paul E. McKenney <paulmck@kernel.org>
> Date: Tue Jul 22 17:21:19 2025 -0700
>
> manual/oota: Add Jonas and Alan OOTA examples
>
> Each of these new litmus tests contains the URL of the email message
> that I took it from.
>
> Please note that I had to tweak the example leading up to
> C-JO-OOTA-4.litmus, and I might well have misinterpreted Jonas's "~"
> operator.
>
> Also, C-JO-OOTA-7.litmus includes a "*r2 = a" statement that makes herd7
> very unhappy. On the other hand, initializing registers to the address
> of a variable is straightforward, as shown in the resulting litmus test.
>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>
> diff --git a/manual/oota/C-AS-OOTA-1.litmus b/manual/oota/C-AS-OOTA-1.litmus
> new file mode 100644
> index 00000000..81a873a7
> --- /dev/null
> +++ b/manual/oota/C-AS-OOTA-1.litmus
> @@ -0,0 +1,40 @@
> +C C-AS-OOTA-1
> +
> +(*
> + * Result: Sometimes
> + *
> + * Because smp_rmb() combined with smp_wmb() does not order earlier
> + * reads against later writes.
> + *
> + * https://lore.kernel.org/all/a3bf910f-509a-4ad3-a1cc-4b14ef9b3259@rowland.harvard.edu
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + *a = *b;
> + }
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + *b = *a;
> + }
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +exists (0:r1=1 /\ 1:r1=1)
> diff --git a/manual/oota/C-AS-OOTA-2.litmus b/manual/oota/C-AS-OOTA-2.litmus
> new file mode 100644
> index 00000000..c672b0e7
> --- /dev/null
> +++ b/manual/oota/C-AS-OOTA-2.litmus
> @@ -0,0 +1,33 @@
> +C C-AS-OOTA-2
> +
> +(*
> + * Result: Always
> + *
> + * If we were using C-language relaxed atomics instead of volatiles,
> + * the compiler *could* eliminate the first WRITE_ONCE() in each process,
> + * then also each process's local variable, thus having an undefined value
> + * for each of those local variables. But this cannot happen given that
> + * we are using Linux-kernel _ONCE() primitives.
> + *
> + * https://lore.kernel.org/all/c2ae9bca-8526-425e-b9b5-135004ad59ad@rowland.harvard.edu/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b)
> +{
> + int r0 = READ_ONCE(*a);
> +
> + WRITE_ONCE(*b, r0);
> + WRITE_ONCE(*b, 2);
> +}
> +
> +P1(int *a, int *b)
> +{
> + int r1 = READ_ONCE(*b);
> +
> + WRITE_ONCE(*a, r0);
> + WRITE_ONCE(*a, 2);
> +}
> +
> +exists ((0:r0=0 \/ 0:r0=2) /\ (1:r1=0 \/ 1:r1=2))
> diff --git a/manual/oota/C-JO-OOTA-1.litmus b/manual/oota/C-JO-OOTA-1.litmus
> new file mode 100644
> index 00000000..6ab437b4
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-1.litmus
> @@ -0,0 +1,40 @@
> +C C-JO-OOTA-1
> +
> +(*
> + * Result: Never
> + *
> + * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> + * value for things like S19..
> + *
> + * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + *a = *b;
> + }
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + *b = *a;
> + }
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +exists (b=42)
> diff --git a/manual/oota/C-JO-OOTA-2.litmus b/manual/oota/C-JO-OOTA-2.litmus
> new file mode 100644
> index 00000000..ad708c60
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-2.litmus
> @@ -0,0 +1,44 @@
> +C C-JO-OOTA-2
> +
> +(*
> + * Result: Never
> + *
> + * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> + * value for things like S23.
> + *
> + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2 = 0;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = *b;
> + }
> + WRITE_ONCE(*a, r2);
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2 = 0;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = *a;
> + }
> + WRITE_ONCE(*b, r2);
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +exists (b=42)
> diff --git a/manual/oota/C-JO-OOTA-3.litmus b/manual/oota/C-JO-OOTA-3.litmus
> new file mode 100644
> index 00000000..633b8334
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-3.litmus
> @@ -0,0 +1,46 @@
> +C C-JO-OOTA-3
> +
> +(*
> + * Result: Never
> + *
> + * But LKMM finds the all-ones result, perhaps due to not tracking
> + * control dependencies out of the "if" statement.
> + *
> + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + r2 = READ_ONCE(*b);
> + if (r1 == 1) {
> + r2 = *b;
> + }
> + WRITE_ONCE(*a, r2);
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + r2 = READ_ONCE(*a);
> + if (r1 == 1) {
> + r2 = *a;
> + }
> + WRITE_ONCE(*b, r2);
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +exists (0:r1=1 /\ 1:r1=1)
> diff --git a/manual/oota/C-JO-OOTA-4.litmus b/manual/oota/C-JO-OOTA-4.litmus
> new file mode 100644
> index 00000000..cab7ebb6
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-4.litmus
> @@ -0,0 +1,43 @@
> +C C-JO-OOTA-4
> +
> +(*
> + * Result: Never
> + *
> + * And LKMM agrees, which might be a surprise.
> + *
> + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> + int r3;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + r2 = *b;
> + r3 = r1 == 0;
> + WRITE_ONCE(*a, (r3 + 1) & r2);
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> + int r3;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + r2 = *a;
> + r3 = r1 == 0;
> + WRITE_ONCE(*b, (r3 + 1) & r2);
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +exists (0:r1=1 /\ 1:r1=1)
> diff --git a/manual/oota/C-JO-OOTA-5.litmus b/manual/oota/C-JO-OOTA-5.litmus
> new file mode 100644
> index 00000000..145c8378
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-5.litmus
> @@ -0,0 +1,44 @@
> +C C-JO-OOTA-5
> +
> +(*
> + * Result: Never
> + *
> + * But LKMM finds the all-ones result, perhaps due to r2 being unused.
> + *
> + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*a);
> + }
> + *b = 1;
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*b);
> + }
> + *a = 1;
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +locations [0:r2;1:r2]
> +exists (0:r1=1 /\ 1:r1=1)
> diff --git a/manual/oota/C-JO-OOTA-6.litmus b/manual/oota/C-JO-OOTA-6.litmus
> new file mode 100644
> index 00000000..942e6c82
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-6.litmus
> @@ -0,0 +1,44 @@
> +C C-JO-OOTA-6
> +
> +(*
> + * Result: Never
> + *
> + * But LKMM finds the all-ones result, due to OOTA on r2.
> + *
> + * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*a);
> + }
> + *b = r2;
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*b);
> + }
> + *a = r2;
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +locations [0:r2;1:r2]
> +exists (0:r1=1 /\ 1:r1=1)
> diff --git a/manual/oota/C-JO-OOTA-7.litmus b/manual/oota/C-JO-OOTA-7.litmus
> new file mode 100644
> index 00000000..31c0b8ae
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-7.litmus
> @@ -0,0 +1,47 @@
> +C C-JO-OOTA-7
> +
> +(*
> + * Result: Never
> + *
> + * But LKMM finds the all-ones result, due to OOTA on r2.
> + *
> + * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
> + *)
> +
> +{
> + 0:r2=a;
> + 1:r2=b;
> +}
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*a);
> + }
> + *r2 = a;
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*b);
> + }
> + *r2 = b;
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +locations [0:r2;1:r2]
> +exists (0:r1=1 /\ 1:r1=1)
> diff --git a/manual/oota/C-PM-OOTA-1.litmus b/manual/oota/C-PM-OOTA-1.litmus
> new file mode 100644
> index 00000000..e771e3c9
> --- /dev/null
> +++ b/manual/oota/C-PM-OOTA-1.litmus
> @@ -0,0 +1,37 @@
> +C C-PM-OOTA-1
> +
> +(*
> + * Result: Never
> + *
> + * LKMM agrees.
> + *
> + * https://lore.kernel.org/all/9a0dccbb-bfa7-4b33-ac1a-daa9841b609a@paulmck-laptop/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b, int *x, int *y) {
> + int r1;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + WRITE_ONCE(*a, *b);
> + }
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y) {
> + int r1;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + WRITE_ONCE(*b, *a);
> + }
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +exists b=42
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 7:26 ` Hernan Ponce de Leon
@ 2025-07-23 16:39 ` Paul E. McKenney
2025-07-24 14:14 ` Paul E. McKenney
0 siblings, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-07-23 16:39 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Jonas Oberhauser, stern, parri.andrea, will, peterz, boqun.feng,
npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
urezki, quic_neeraju, frederic, linux-kernel, lkmm
On Wed, Jul 23, 2025 at 09:26:32AM +0200, Hernan Ponce de Leon wrote:
> On 7/23/2025 2:43 AM, Paul E. McKenney wrote:
> > On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> > > The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> > > example shared on this list a few years ago:
> >
> > Apologies for being slow, but I have finally added the litmus tests in
> > this email thread to the https://github.com/paulmckrcu/litmus repo.
>
> I do not understand some of the comments in the preamble of the tests:
>
> (*
> * Result: Never
> *
> * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> * value for things like S19..
> *
> * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
> *)
>
> I see that herd7 reports one of the states to be [b]=S16. Is this supposed
> to be some kind of symbolic state (i.e., any value is possible)?
Exactly!
> The value in the "Result" is what we would like the model to say if we would
> have a perfect version of dependencies, right?
In this case, yes.
There are other cases elsewhere in which the "Result:" comment instead
records LKMM's current state, so that any deviation (whether right or
wrong) is noted. Most recently, the 1800+ changes in luc/RelAcq.
> > It is quite likely that I have incorrectly intuited the missing portions
> > of the litmus tests, especially the two called out in the commit log
> > below. If you have time, please do double-check.
>
> I read the "On the other hand" from the commit log as "this fixes the
> problem". However I still get the following error when running C-JO-OOTA-7
> with herd7
>
> Warning: File "manual/oota/C-JO-OOTA-7.litmus": Non-symbolic memory access
> found on '[0]' (User error)
Yes, my interpretation of the example in that URL didn't make any
sense at all to herd7. So I would welcome a fix to this litmus test.
The only potential fixes that I found clearly went against the intent
of this litmus test.
My only real contribution in my coding of manual/oota/C-JO-OOTA-7.litmus
is showing how to initialize a local herd7 variable to contain a pointer
to a global variable. ;-)
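(For reference, that is just the initialization block at the top of the
test:

	{
		0:r2=a;
		1:r2=b;
	}

which starts each process with its r2 holding a pointer to the named
global variable.)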
Thanx, Paul
> Hernan
> > And the updated (and condensed!) version of the C++ OOTA paper may be
> > found here, this time with a proposed change to the standard:
> >
> > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3692r1.pdf
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit fd17e8fceb75326e159ba3aa6fdb344f74f5c7a5
> > Author: Paul E. McKenney <paulmck@kernel.org>
> > Date: Tue Jul 22 17:21:19 2025 -0700
> >
> > manual/oota: Add Jonas and Alan OOTA examples
> > Each of these new litmus tests contains the URL of the email message
> > that I took it from.
> > Please note that I had to tweak the example leading up to
> > C-JO-OOTA-4.litmus, and I might well have misinterpreted Jonas's "~"
> > operator.
> > Also, C-JO-OOTA-7.litmus includes a "*r2 = a" statement that makes herd7
> > very unhappy. On the other hand, initializing registers to the address
> > of a variable is straightforward, as shown in the resulting litmus test.
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> >
> > diff --git a/manual/oota/C-AS-OOTA-1.litmus b/manual/oota/C-AS-OOTA-1.litmus
> > new file mode 100644
> > index 00000000..81a873a7
> > --- /dev/null
> > +++ b/manual/oota/C-AS-OOTA-1.litmus
> > @@ -0,0 +1,40 @@
> > +C C-AS-OOTA-1
> > +
> > +(*
> > + * Result: Sometimes
> > + *
> > + * Because smp_rmb() combined with smp_wmb() does not order earlier
> > + * reads against later writes.
> > + *
> > + * https://lore.kernel.org/all/a3bf910f-509a-4ad3-a1cc-4b14ef9b3259@rowland.harvard.edu
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + *a = *b;
> > + }
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + *b = *a;
> > + }
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +exists (0:r1=1 /\ 1:r1=1)
> > diff --git a/manual/oota/C-AS-OOTA-2.litmus b/manual/oota/C-AS-OOTA-2.litmus
> > new file mode 100644
> > index 00000000..c672b0e7
> > --- /dev/null
> > +++ b/manual/oota/C-AS-OOTA-2.litmus
> > @@ -0,0 +1,33 @@
> > +C C-AS-OOTA-2
> > +
> > +(*
> > + * Result: Always
> > + *
> > + * If we were using C-language relaxed atomics instead of volatiles,
> > + * the compiler *could* eliminate the first WRITE_ONCE() in each process,
> > + * then also each process's local variable, thus having an undefined value
> > + * for each of those local variables. But this cannot happen given that
> > + * we are using Linux-kernel _ONCE() primitives.
> > + *
> > + * https://lore.kernel.org/all/c2ae9bca-8526-425e-b9b5-135004ad59ad@rowland.harvard.edu/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b)
> > +{
> > + int r0 = READ_ONCE(*a);
> > +
> > + WRITE_ONCE(*b, r0);
> > + WRITE_ONCE(*b, 2);
> > +}
> > +
> > +P1(int *a, int *b)
> > +{
> > + int r1 = READ_ONCE(*b);
> > +
> > + WRITE_ONCE(*a, r0);
> > + WRITE_ONCE(*a, 2);
> > +}
> > +
> > +exists ((0:r0=0 \/ 0:r0=2) /\ (1:r1=0 \/ 1:r1=2))
> > diff --git a/manual/oota/C-JO-OOTA-1.litmus b/manual/oota/C-JO-OOTA-1.litmus
> > new file mode 100644
> > index 00000000..6ab437b4
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-1.litmus
> > @@ -0,0 +1,40 @@
> > +C C-JO-OOTA-1
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> > + * value for things like S19..
> > + *
> > + * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + *a = *b;
> > + }
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + *b = *a;
> > + }
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +exists (b=42)
> > diff --git a/manual/oota/C-JO-OOTA-2.litmus b/manual/oota/C-JO-OOTA-2.litmus
> > new file mode 100644
> > index 00000000..ad708c60
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-2.litmus
> > @@ -0,0 +1,44 @@
> > +C C-JO-OOTA-2
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> > + * value for things like S23.
> > + *
> > + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2 = 0;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = *b;
> > + }
> > + WRITE_ONCE(*a, r2);
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2 = 0;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = *a;
> > + }
> > + WRITE_ONCE(*b, r2);
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +exists (b=42)
> > diff --git a/manual/oota/C-JO-OOTA-3.litmus b/manual/oota/C-JO-OOTA-3.litmus
> > new file mode 100644
> > index 00000000..633b8334
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-3.litmus
> > @@ -0,0 +1,46 @@
> > +C C-JO-OOTA-3
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But LKMM finds the all-ones result, perhaps due to not tracking
> > + * control dependencies out of the "if" statement.
> > + *
> > + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + r2 = READ_ONCE(*b);
> > + if (r1 == 1) {
> > + r2 = *b;
> > + }
> > + WRITE_ONCE(*a, r2);
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + r2 = READ_ONCE(*a);
> > + if (r1 == 1) {
> > + r2 = *a;
> > + }
> > + WRITE_ONCE(*b, r2);
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +exists (0:r1=1 /\ 1:r1=1)
> > diff --git a/manual/oota/C-JO-OOTA-4.litmus b/manual/oota/C-JO-OOTA-4.litmus
> > new file mode 100644
> > index 00000000..cab7ebb6
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-4.litmus
> > @@ -0,0 +1,43 @@
> > +C C-JO-OOTA-4
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * And LKMM agrees, which might be a surprise.
> > + *
> > + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > + int r3;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + r2 = *b;
> > + r3 = r1 == 0;
> > + WRITE_ONCE(*a, (r3 + 1) & r2);
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > + int r3;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + r2 = *a;
> > + r3 = r1 == 0;
> > + WRITE_ONCE(*b, (r3 + 1) & r2);
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +exists (0:r1=1 /\ 1:r1=1)
> > diff --git a/manual/oota/C-JO-OOTA-5.litmus b/manual/oota/C-JO-OOTA-5.litmus
> > new file mode 100644
> > index 00000000..145c8378
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-5.litmus
> > @@ -0,0 +1,44 @@
> > +C C-JO-OOTA-5
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But LKMM finds the all-ones result, perhaps due to r2 being unused.
> > + *
> > + * https://lore.kernel.org/all/1daba0ea-0dd6-4e67-923e-fd3c1a62b40b@huaweicloud.com/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*a);
> > + }
> > + *b = 1;
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*b);
> > + }
> > + *a = 1;
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +locations [0:r2;1:r2]
> > +exists (0:r1=1 /\ 1:r1=1)
> > diff --git a/manual/oota/C-JO-OOTA-6.litmus b/manual/oota/C-JO-OOTA-6.litmus
> > new file mode 100644
> > index 00000000..942e6c82
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-6.litmus
> > @@ -0,0 +1,44 @@
> > +C C-JO-OOTA-6
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But LKMM finds the all-ones result, due to OOTA on r2.
> > + *
> > + * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*a);
> > + }
> > + *b = r2;
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*b);
> > + }
> > + *a = r2;
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +locations [0:r2;1:r2]
> > +exists (0:r1=1 /\ 1:r1=1)
> > diff --git a/manual/oota/C-JO-OOTA-7.litmus b/manual/oota/C-JO-OOTA-7.litmus
> > new file mode 100644
> > index 00000000..31c0b8ae
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-7.litmus
> > @@ -0,0 +1,47 @@
> > +C C-JO-OOTA-7
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But LKMM finds the all-ones result, due to OOTA on r2.
> > + *
> > + * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
> > + *)
> > +
> > +{
> > + 0:r2=a;
> > + 1:r2=b;
> > +}
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*a);
> > + }
> > + *r2 = a;
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*b);
> > + }
> > + *r2 = b;
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +locations [0:r2;1:r2]
> > +exists (0:r1=1 /\ 1:r1=1)
> > diff --git a/manual/oota/C-PM-OOTA-1.litmus b/manual/oota/C-PM-OOTA-1.litmus
> > new file mode 100644
> > index 00000000..e771e3c9
> > --- /dev/null
> > +++ b/manual/oota/C-PM-OOTA-1.litmus
> > @@ -0,0 +1,37 @@
> > +C C-PM-OOTA-1
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * LKMM agrees.
> > + *
> > + * https://lore.kernel.org/all/9a0dccbb-bfa7-4b33-ac1a-daa9841b609a@paulmck-laptop/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b, int *x, int *y) {
> > + int r1;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + WRITE_ONCE(*a, *b);
> > + }
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y) {
> > + int r1;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + WRITE_ONCE(*b, *a);
> > + }
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +exists b=42
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 0:43 ` Paul E. McKenney
2025-07-23 7:26 ` Hernan Ponce de Leon
@ 2025-07-23 17:13 ` Alan Stern
2025-07-23 17:27 ` Paul E. McKenney
2025-07-23 19:25 ` Alan Stern
2 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-07-23 17:13 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Tue, Jul 22, 2025 at 05:43:16PM -0700, Paul E. McKenney wrote:
> On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> > The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> > example shared on this list a few years ago:
>
> Apologies for being slow, but I have finally added the litmus tests in
> this email thread to the https://github.com/paulmckrcu/litmus repo.
>
> It is quite likely that I have incorrectly intuited the missing portions
> of the litmus tests, especially the two called out in the commit log
> below. If you have time, please do double-check.
I didn't look very closely when this first came out...
> --- /dev/null
> +++ b/manual/oota/C-AS-OOTA-2.litmus
> @@ -0,0 +1,33 @@
> +C C-AS-OOTA-2
> +
> +(*
> + * Result: Always
> + *
> + * If we were using C-language relaxed atomics instead of volatiles,
> + * the compiler *could* eliminate the first WRITE_ONCE() in each process,
> + * then also each process's local variable, thus having an undefined value
> + * for each of those local variables. But this cannot happen given that
> + * we are using Linux-kernel _ONCE() primitives.
> + *
> + * https://lore.kernel.org/all/c2ae9bca-8526-425e-b9b5-135004ad59ad@rowland.harvard.edu/
> + *)
> +
> +{}
> +
> +P0(int *a, int *b)
> +{
> + int r0 = READ_ONCE(*a);
> +
> + WRITE_ONCE(*b, r0);
> + WRITE_ONCE(*b, 2);
> +}
> +
> +P1(int *a, int *b)
> +{
> + int r1 = READ_ONCE(*b);
> +
> + WRITE_ONCE(*a, r0);
This should be r1 instead of r0.
> + WRITE_ONCE(*a, 2);
> +}
> +
> +exists ((0:r0=0 \/ 0:r0=2) /\ (1:r1=0 \/ 1:r1=2))
Alan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 17:13 ` Alan Stern
@ 2025-07-23 17:27 ` Paul E. McKenney
0 siblings, 0 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-07-23 17:27 UTC (permalink / raw)
To: Alan Stern
Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jul 23, 2025 at 01:13:35PM -0400, Alan Stern wrote:
> On Tue, Jul 22, 2025 at 05:43:16PM -0700, Paul E. McKenney wrote:
> > On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> > > The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> > > example shared on this list a few years ago:
> >
> > Apologies for being slow, but I have finally added the litmus tests in
> > this email thread to the https://github.com/paulmckrcu/litmus repo.
> >
> > It is quite likely that I have incorrectly intuited the missing portions
> > of the litmus tests, especially the two called out in the commit log
> > below. If you have time, please do double-check.
>
> I didn't look very closely when this first came out...
>
> > --- /dev/null
> > +++ b/manual/oota/C-AS-OOTA-2.litmus
> > @@ -0,0 +1,33 @@
> > +C C-AS-OOTA-2
> > +
> > +(*
> > + * Result: Always
> > + *
> > + * If we were using C-language relaxed atomics instead of volatiles,
> > + * the compiler *could* eliminate the first WRITE_ONCE() in each process,
> > + * then also each process's local variable, thus having an undefined value
> > + * for each of those local variables. But this cannot happen given that
> > + * we are using Linux-kernel _ONCE() primitives.
> > + *
> > + * https://lore.kernel.org/all/c2ae9bca-8526-425e-b9b5-135004ad59ad@rowland.harvard.edu/
> > + *)
> > +
> > +{}
> > +
> > +P0(int *a, int *b)
> > +{
> > + int r0 = READ_ONCE(*a);
> > +
> > + WRITE_ONCE(*b, r0);
> > + WRITE_ONCE(*b, 2);
> > +}
> > +
> > +P1(int *a, int *b)
> > +{
> > + int r1 = READ_ONCE(*b);
> > +
> > + WRITE_ONCE(*a, r0);
>
> This should be r1 instead of r0.
Ah, good eyes, thank you!
With that change, I still get "Always" as shown below, which I believe
makes sense, given that LKMM deals with volatile atomics in contrast to
the C++ relaxed atomics that you were discussing in the email.
Please let me know if I am still missing something.
> > + WRITE_ONCE(*a, 2);
> > +}
> > +
> > +exists ((0:r0=0 \/ 0:r0=2) /\ (1:r1=0 \/ 1:r1=2))
>
> Alan
------------------------------------------------------------------------
$ herd7 -conf linux-kernel.cfg ~/paper/scalability/LWNLinuxMM/litmus/manual/oota/C-AS-OOTA-2.litmus
Test C-AS-OOTA-2 Allowed
States 3
0:r0=0; 1:r1=0;
0:r0=0; 1:r1=2;
0:r0=2; 1:r1=0;
Ok
Witnesses
Positive: 5 Negative: 0
Condition exists ((0:r0=0 \/ 0:r0=2) /\ (1:r1=0 \/ 1:r1=2))
Observation C-AS-OOTA-2 Always 5 0
Time C-AS-OOTA-2 0.01
Hash=7b4c046bc861c102997a87e32907fa80
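For reference, here is a minimal sketch of P1 with that r0 -> r1 fix folded
in (illustrative only, mirroring the quoted test rather than re-deriving it):

P1(int *a, int *b)
{
	int r1 = READ_ONCE(*b);

	WRITE_ONCE(*a, r1);	/* forward the value this process read, not P0's r0 */
	WRITE_ONCE(*a, 2);	/* then overwrite with a constant, as in P0 */
}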
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 0:43 ` Paul E. McKenney
2025-07-23 7:26 ` Hernan Ponce de Leon
2025-07-23 17:13 ` Alan Stern
@ 2025-07-23 19:25 ` Alan Stern
2025-07-23 19:57 ` Paul E. McKenney
2 siblings, 1 reply; 59+ messages in thread
From: Alan Stern @ 2025-07-23 19:25 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Tue, Jul 22, 2025 at 05:43:16PM -0700, Paul E. McKenney wrote:
> Also, C-JO-OOTA-7.litmus includes a "*r2 = a" statement that makes herd7
> very unhappy. On the other hand, initializing registers to the address
> of a variable is straightforward, as shown in the resulting litmus test.
...
> diff --git a/manual/oota/C-JO-OOTA-7.litmus b/manual/oota/C-JO-OOTA-7.litmus
> new file mode 100644
> index 00000000..31c0b8ae
> --- /dev/null
> +++ b/manual/oota/C-JO-OOTA-7.litmus
> @@ -0,0 +1,47 @@
> +C C-JO-OOTA-7
> +
> +(*
> + * Result: Never
> + *
> + * But LKMM finds the all-ones result, due to OOTA on r2.
> + *
> + * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
> + *)
> +
> +{
> + 0:r2=a;
> + 1:r2=b;
> +}
In this litmus test a and b are never assigned any values, so they
always contain 0.
> +
> +P0(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*x);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*a);
If this executes then r2 now contains 0.
> + }
> + *r2 = a;
And so what is supposed to happen here? No wonder herd7 is unhappy!
> + smp_wmb();
> + WRITE_ONCE(*y, 1);
> +}
> +
> +P1(int *a, int *b, int *x, int *y)
> +{
> + int r1;
> + int r2;
> +
> + r1 = READ_ONCE(*y);
> + smp_rmb();
> + if (r1 == 1) {
> + r2 = READ_ONCE(*b);
> + }
> + *r2 = b;
Same here.
> + smp_wmb();
> + WRITE_ONCE(*x, 1);
> +}
> +
> +locations [0:r2;1:r2]
> +exists (0:r1=1 /\ 1:r1=1)
Alan
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 19:25 ` Alan Stern
@ 2025-07-23 19:57 ` Paul E. McKenney
0 siblings, 0 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-07-23 19:57 UTC (permalink / raw)
To: Alan Stern
Cc: Jonas Oberhauser, parri.andrea, will, peterz, boqun.feng, npiggin,
dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, urezki,
quic_neeraju, frederic, linux-kernel, lkmm, hernan.poncedeleon
On Wed, Jul 23, 2025 at 03:25:13PM -0400, Alan Stern wrote:
> On Tue, Jul 22, 2025 at 05:43:16PM -0700, Paul E. McKenney wrote:
> > Also, C-JO-OOTA-7.litmus includes a "*r2 = a" statement that makes herd7
> > very unhappy. On the other hand, initializing registers to the address
> > of a variable is straightforward, as shown in the resulting litmus test.
>
> ...
>
> > diff --git a/manual/oota/C-JO-OOTA-7.litmus b/manual/oota/C-JO-OOTA-7.litmus
> > new file mode 100644
> > index 00000000..31c0b8ae
> > --- /dev/null
> > +++ b/manual/oota/C-JO-OOTA-7.litmus
> > @@ -0,0 +1,47 @@
> > +C C-JO-OOTA-7
> > +
> > +(*
> > + * Result: Never
> > + *
> > + * But LKMM finds the all-ones result, due to OOTA on r2.
> > + *
> > + * https://lore.kernel.org/all/1147ad3e-e3ad-4fa1-9a63-772ba136ea9a@huaweicloud.com/
> > + *)
> > +
> > +{
> > + 0:r2=a;
> > + 1:r2=b;
> > +}
>
> In this litmus test a and b are never assigned any values, so they
> always contain 0.
>
> > +
> > +P0(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*x);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*a);
>
> If this executes then r2 now contains 0.
>
> > + }
> > + *r2 = a;
>
> And so what is supposed to happen here? No wonder herd7 is unhappy!
Nothing good, I will admit! Good eyes, and thank you!
> > + smp_wmb();
> > + WRITE_ONCE(*y, 1);
> > +}
> > +
> > +P1(int *a, int *b, int *x, int *y)
> > +{
> > + int r1;
> > + int r2;
> > +
> > + r1 = READ_ONCE(*y);
> > + smp_rmb();
> > + if (r1 == 1) {
> > + r2 = READ_ONCE(*b);
> > + }
> > + *r2 = b;
>
> Same here.
>
> > + smp_wmb();
> > + WRITE_ONCE(*x, 1);
> > +}
> > +
> > +locations [0:r2;1:r2]
> > +exists (0:r1=1 /\ 1:r1=1)
Yes, I did misinterpret Jonas's initialization advice, which reads
as follows: "unless you know how to initialize *a and *b to valid
addresses, you may need to add something like `if (r2 == 0) r2 = a`
to run this in herd7".
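Read that way, the fallback would go into the process body rather than the
init block; a rough sketch for P0, untested and with the test's loose
int/pointer mixing left as-is (P1 would get the analogous
"if (r2 == 0) r2 = b;"):

P0(int *a, int *b, int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*x);
	smp_rmb();
	if (r1 == 1) {
		r2 = READ_ONCE(*a);
	}
	if (r2 == 0)
		r2 = a;		/* fall back to a known-valid address, per the advice above */
	*r2 = a;
	smp_wmb();
	WRITE_ONCE(*y, 1);
}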
Given that there are two instances of r2, there are a number of
possible combinations of initialization. I picked the one shown
in the patch below, and got this:
$ herd7 -conf linux-kernel.cfg ~/paper/scalability/LWNLinuxMM/litmus/manual/oota/C-JO-OOTA-7.litmus
Test C-JO-OOTA-7 Allowed
States 3
0:r1=0; 0:r2=a; 1:r1=0; 1:r2=b;
0:r1=0; 0:r2=a; 1:r1=1; 1:r2=b;
0:r1=1; 0:r2=a; 1:r1=0; 1:r2=b;
No
Witnesses
Positive: 0 Negative: 3
Flag mixed-accesses
Condition exists (0:r1=1 /\ 1:r1=1)
Observation C-JO-OOTA-7 Never 0 3
Time C-JO-OOTA-7 0.01
Hash=d9bb35335e45b31b1a39bab88eca837c
I get something very similar if I cross-initialize them, that is,
a=b; b=a.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
diff --git a/manual/oota/C-JO-OOTA-7.litmus b/manual/oota/C-JO-OOTA-7.litmus
index 31c0b8ae..d7fe0f94 100644
--- a/manual/oota/C-JO-OOTA-7.litmus
+++ b/manual/oota/C-JO-OOTA-7.litmus
@@ -11,6 +11,8 @@ C C-JO-OOTA-7
{
0:r2=a;
1:r2=b;
+ a=a;
+ b=b;
}
P0(int *a, int *b, int *x, int *y)
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-23 16:39 ` Paul E. McKenney
@ 2025-07-24 14:14 ` Paul E. McKenney
2025-07-25 5:23 ` Hernan Ponce de Leon
0 siblings, 1 reply; 59+ messages in thread
From: Paul E. McKenney @ 2025-07-24 14:14 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Jonas Oberhauser, stern, parri.andrea, will, peterz, boqun.feng,
npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
urezki, quic_neeraju, frederic, linux-kernel, lkmm
On Wed, Jul 23, 2025 at 09:39:05AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 23, 2025 at 09:26:32AM +0200, Hernan Ponce de Leon wrote:
> > On 7/23/2025 2:43 AM, Paul E. McKenney wrote:
> > > On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> > > > The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> > > > example shared on this list a few years ago:
> > >
> > > Apologies for being slow, but I have finally added the litmus tests in
> > > this email thread to the https://github.com/paulmckrcu/litmus repo.
> >
> > I do not understand some of the comments in the preamble of the tests:
> >
> > (*
> > * Result: Never
> > *
> > * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> > * value for things like S19..
> > *
> > * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
> > *)
> >
> > I see that herd7 reports one of the states to be [b]=S16. Is this supposed
> > to be some kind of symbolic state (i.e., any value is possible)?
>
> Exactly!
>
> > The value in the "Result" is what we would like the model to say if we would
> > have a perfect version of dependencies, right?
>
> In this case, yes.
I should hasten to add that, compiler optimizations being what they are,
"perfect" may or may not be attainable, and even if attainable, might
not be maintainable.
I am pretty sure that you all already understood that, but I felt the
need to make it explicit. ;-)
Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-24 14:14 ` Paul E. McKenney
@ 2025-07-25 5:23 ` Hernan Ponce de Leon
2025-07-29 20:34 ` Paul E. McKenney
0 siblings, 1 reply; 59+ messages in thread
From: Hernan Ponce de Leon @ 2025-07-25 5:23 UTC (permalink / raw)
To: paulmck
Cc: Jonas Oberhauser, stern, parri.andrea, will, peterz, boqun.feng,
npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
urezki, quic_neeraju, frederic, linux-kernel, lkmm
On 7/24/2025 4:14 PM, Paul E. McKenney wrote:
> On Wed, Jul 23, 2025 at 09:39:05AM -0700, Paul E. McKenney wrote:
>> On Wed, Jul 23, 2025 at 09:26:32AM +0200, Hernan Ponce de Leon wrote:
>>> On 7/23/2025 2:43 AM, Paul E. McKenney wrote:
>>>> On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
>>>>> The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
>>>>> example shared on this list a few years ago:
>>>>
>>>> Apologies for being slow, but I have finally added the litmus tests in
>>>> this email thread to the https://github.com/paulmckrcu/litmus repo.
>>>
>>> I do not understand some of the comments in the preamble of the tests:
>>>
>>> (*
>>> * Result: Never
>>> *
>>> * But Sometimes in LKMM as of early 2025, given that 42 is a possible
>>> * value for things like S19..
>>> *
>>> * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
>>> *)
>>>
>>> I see that herd7 reports one of the states to be [b]=S16. Is this supposed
>>> to be some kind of symbolic state (i.e., any value is possible)?
>>
>> Exactly!
>>
>>> The value in the "Result" is what we would like the model to say if we would
>>> have a perfect version of dependencies, right?
>>
>> In this case, yes.
>
> I should hasten to add that, compiler optimizations being what they are,
> "perfect" may or may not be attainable, and even if attainable, might
> not be maintainable.
Yes, I just wanted to clarify whether this is what herd7 + the current model
are saying or what developers should expect.
Hernan
>
> I am pretty sure that you all already understood that, but I felt the
> need to make it explicit. ;-)
>
> Thanx, Paul
* Re: [RFC] tools/memory-model: Rule out OOTA
2025-07-25 5:23 ` Hernan Ponce de Leon
@ 2025-07-29 20:34 ` Paul E. McKenney
0 siblings, 0 replies; 59+ messages in thread
From: Paul E. McKenney @ 2025-07-29 20:34 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Jonas Oberhauser, stern, parri.andrea, will, peterz, boqun.feng,
npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
urezki, quic_neeraju, frederic, linux-kernel, lkmm
On Fri, Jul 25, 2025 at 07:23:23AM +0200, Hernan Ponce de Leon wrote:
> On 7/24/2025 4:14 PM, Paul E. McKenney wrote:
> > On Wed, Jul 23, 2025 at 09:39:05AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 23, 2025 at 09:26:32AM +0200, Hernan Ponce de Leon wrote:
> > > > On 7/23/2025 2:43 AM, Paul E. McKenney wrote:
> > > > > On Mon, Jan 06, 2025 at 10:40:03PM +0100, Jonas Oberhauser wrote:
> > > > > > The current LKMM allows out-of-thin-air (OOTA), as evidenced in the following
> > > > > > example shared on this list a few years ago:
> > > > >
> > > > > Apologies for being slow, but I have finally added the litmus tests in
> > > > > this email thread to the https://github.com/paulmckrcu/litmus repo.
> > > >
> > > > I do not understand some of the comments in the preamble of the tests:
> > > >
> > > > (*
> > > > * Result: Never
> > > > *
> > > > * But Sometimes in LKMM as of early 2025, given that 42 is a possible
> > > > * value for things like S19..
> > > > *
> > > > * https://lore.kernel.org/all/20250106214003.504664-1-jonas.oberhauser@huaweicloud.com/
> > > > *)
> > > >
> > > > I see that herd7 reports one of the states to be [b]=S16. Is this supposed
> > > > to be some kind of symbolic state (i.e., any value is possible)?
> > >
> > > Exactly!
> > >
> > > > The value in the "Result" is what we would like the model to say if we would
> > > > have a perfect version of dependencies, right?
> > >
> > > In this case, yes.
> >
> > I should hasten to add that, compiler optimizations being what they are,
> > "perfect" may or may not be attainable, and even if attainable, might
> > not be maintainable.
>
> Yes, I just wanted to clarify if this is what herd7 + the current model are
> saying or what developers should expect.
Good point, and I added explicit words to this effect in the comments
of those aspirational OOTA litmus tests, so thank you!
Thanx, Paul
> Hernan
>
> >
> > I am pretty sure that you all already understood that, but I felt the
> > need to make it explicit. ;-)
> >
> > Thanx, Paul
>
End of thread (newest message: 2025-07-29 20:34 UTC)
Thread overview: 59+ messages
2025-01-06 21:40 [RFC] tools/memory-model: Rule out OOTA Jonas Oberhauser
2025-01-07 10:06 ` Peter Zijlstra
2025-01-07 11:02 ` Jonas Oberhauser
2025-01-07 15:46 ` Jonas Oberhauser
2025-01-07 16:09 ` Alan Stern
2025-01-07 18:47 ` Paul E. McKenney
2025-01-08 17:39 ` Jonas Oberhauser
2025-01-08 18:09 ` Paul E. McKenney
2025-01-08 19:17 ` Jonas Oberhauser
2025-01-09 17:54 ` Paul E. McKenney
2025-01-09 18:35 ` Jonas Oberhauser
2025-01-10 14:54 ` Paul E. McKenney
2025-01-10 16:21 ` Jonas Oberhauser
2025-01-13 22:04 ` Paul E. McKenney
2025-01-16 18:40 ` Paul E. McKenney
2025-01-16 19:13 ` Jonas Oberhauser
2025-01-16 19:31 ` Paul E. McKenney
2025-01-16 20:21 ` Jonas Oberhauser
2025-01-16 19:28 ` Jonas Oberhauser
2025-01-16 19:39 ` Paul E. McKenney
2025-01-17 12:08 ` Jonas Oberhauser
2025-01-16 19:08 ` Jonas Oberhauser
2025-01-16 23:02 ` Alan Stern
2025-01-17 8:34 ` Hernan Ponce de Leon
2025-01-17 11:29 ` Jonas Oberhauser
2025-01-17 20:01 ` Alan Stern
2025-01-21 10:36 ` Jonas Oberhauser
2025-01-21 16:39 ` Alan Stern
2025-01-22 3:46 ` Jonas Oberhauser
2025-01-22 19:11 ` Alan Stern
2025-01-17 15:52 ` Alan Stern
2025-01-17 16:45 ` Jonas Oberhauser
2025-01-17 19:02 ` Alan Stern
2025-01-09 20:37 ` Peter Zijlstra
2025-01-09 21:13 ` Paul E. McKenney
2025-01-08 17:33 ` Jonas Oberhauser
2025-01-08 18:47 ` Alan Stern
2025-01-08 19:22 ` Jonas Oberhauser
2025-01-09 16:17 ` Alan Stern
2025-01-09 16:44 ` Jonas Oberhauser
2025-01-09 19:27 ` Alan Stern
2025-01-09 20:09 ` Jonas Oberhauser
2025-01-10 3:12 ` Alan Stern
2025-01-10 12:21 ` Jonas Oberhauser
2025-01-10 21:51 ` Alan Stern
2025-01-11 12:46 ` Jonas Oberhauser
2025-01-11 21:19 ` Alan Stern
2025-01-12 15:55 ` Jonas Oberhauser
2025-01-13 19:43 ` Alan Stern
2025-07-23 0:43 ` Paul E. McKenney
2025-07-23 7:26 ` Hernan Ponce de Leon
2025-07-23 16:39 ` Paul E. McKenney
2025-07-24 14:14 ` Paul E. McKenney
2025-07-25 5:23 ` Hernan Ponce de Leon
2025-07-29 20:34 ` Paul E. McKenney
2025-07-23 17:13 ` Alan Stern
2025-07-23 17:27 ` Paul E. McKenney
2025-07-23 19:25 ` Alan Stern
2025-07-23 19:57 ` Paul E. McKenney