* "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
@ 2022-08-26 12:48 Paul E. McKenney
2022-08-26 16:21 ` Boqun Feng
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-26 12:48 UTC (permalink / raw)
To: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel
Cc: linux-kernel, linux-arch
Hello!
I have not yet done more than glance at this one, but figured I should
send it along sooner rather than later.
"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
https://arxiv.org/abs/2111.15240
The claim is that the queued spinlocks implementation with CNA violates
LKMM but actually works on all architectures having a formal hardware
memory model.
Thoughts?
Thanx, Paul
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
@ 2022-08-26 16:21 ` Boqun Feng
  2022-08-26 16:23 ` Peter Zijlstra
  2022-08-29  2:33 ` Andrea Parri
  2 siblings, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2022-08-26 16:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, parri.andrea, will, peterz, npiggin, dhowells, j.alglave,
    luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> 	https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>

Translating one of their litmus tests into a runnable one (there is a
typo in it):

	C queued-spin-lock

	(*
	 * P0 is lock-release
	 * P1 is xchg_tail()
	 * P2 is lock-acquire
	 *)

	{}

	P0(int *x, atomic_t *l)
	{
		WRITE_ONCE(*x, 1);
		atomic_set_release(l, 1);
	}

	P1(int *x, atomic_t *l)
	{
		int val;
		int new;
		int old;

		val = atomic_read(l);
		new = val + 2;
		old = atomic_cmpxchg_relaxed(l, val, new);
	}

	P2(int *x, atomic_t *l)
	{
		int r0 = atomic_read_acquire(l);
		int r1 = READ_ONCE(*x);
	}

	exists (2:r0=3 /\ 2:r1=0)

According to LKMM, the exists-clause could be triggered, because

	po-rel; coe; rfe; acq-po

does not establish happens-before ordering in LKMM.
Alan actually explained why in a response to a GitHub issue:

	https://github.com/paulmckrcu/litmus/issues/11#issuecomment-1115235860

(Pasting Alan's reply)

"""
As for why the LKMM doesn't require ordering for this test...  I do not
believe this is related to 2+2W.  Think instead in terms of the LKMM's
operational model:

The store-release in P0 means that the x=1 write will propagate to each
CPU before the y=1 write does.

Since y=3 at the end, we know that y=1 (and hence x=1 too) propagates to
P1 before the addition occurs.  And we know that y=3 propagates to P2
before the load-acquire executes.  But we _don't_ know that either y=1
or x=1 propagates to P2 before y=3 does!  If the store in P1 were a
store-release then this would have to be true (as you saw in your
tests), but it isn't.

In other words, the litmus test could execute with the following
temporal ordering:

	P0		P1		P2
	----------	---------	----------
	Write x=1
	Write y=1

	[x=1 and y=1 propagate to P1]

			Read y=1
			Write y=3

	[y=3 propagates to P2]

					Read y=3
					Read x=0

	[x=1 and y=1 propagate to P2]

(Note that when y=1 propagates to P2, it doesn't do anything because it
won't overwrite the coherence-later store y=3.)

It may be true that none of the architectures supported by Linux will
allow this outcome for the test (although I suspect that the PPC-weird
version _would_ be allowed if you fixed it).  At any rate, disallowing
this result in the memory model would probably require more effort than
would be worthwhile.

Alan
"""

The question is whether we should "fix" LKMM because of this, or
explicitly document this as something not yet "supported" by LKMM.

Regards,
Boqun

> Thoughts?
>
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
  2022-08-26 16:21 ` Boqun Feng
@ 2022-08-26 16:23 ` Peter Zijlstra
  2022-08-26 17:10   ` Alan Stern
  2022-08-29  2:33 ` Andrea Parri
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2022-08-26 16:23 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> 	https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>
> Thoughts?

So the paper mentions the following defects:

 - LKMM doesn't carry a release-acquire chain across a relaxed op

 - some babbling about a missing propagation -- ISTR Linux is stuffed
   full of them; specifically, we require stores to auto-propagate
   without help from barriers

 - some handoff that is CNA specific and I've not looked too hard at
   presently.

I think we should address that first one in LKMM; it seems very weird to
me that an RmW would break the chain like that.  Is there actual
hardware that doesn't behave?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 16:23 ` Peter Zijlstra
@ 2022-08-26 17:10   ` Alan Stern
  2022-08-26 20:42     ` Paul E. McKenney
  2022-08-26 23:47     ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Alan Stern @ 2022-08-26 17:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, parri.andrea, will, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > I have not yet done more than glance at this one, but figured I should
> > send it along sooner rather than later.
> >
> > 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > 	https://arxiv.org/abs/2111.15240
> >
> > The claim is that the queued spinlocks implementation with CNA violates
> > LKMM but actually works on all architectures having a formal hardware
> > memory model.
> >
> > Thoughts?
>
> So the paper mentions the following defects:
>
>  - LKMM doesn't carry a release-acquire chain across a relaxed op

That's right, although I'm not so sure this should be considered a
defect...

>  - some babbling about a missing propagation -- ISTR Linux is stuffed
>    full of them; specifically, we require stores to auto-propagate
>    without help from barriers

Not a missing propagation; a late one.

Don't understand what you mean by "auto propagate without help from
barriers".

>  - some handoff that is CNA specific and I've not looked too hard at
>    presently.
>
> I think we should address that first one in LKMM, it seems very weird to
> me a RmW would break the chain like that.

An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).

If the authors wanted to keep the release-acquire chain intact, why not
use a cmpxchg version that has release semantics instead of going out of
their way to use a relaxed version?

To put it another way, RMW accesses and release-acquire accesses are
unrelated concepts.  You can have one without the other (in principle,
anyway).  So a relaxed RMW is just as capable of breaking a
release-acquire chain as any other relaxed operation is.

> Is there actual hardware that
> doesn't behave?

Not as far as I know, although that isn't very far.  Certainly an
other-multicopy-atomic architecture would make the litmus test succeed.
But the LKMM does not require other-multicopy-atomicity.

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 17:10 ` Alan Stern
@ 2022-08-26 20:42   ` Paul E. McKenney
  2022-08-27 16:00     ` Alan Stern
  2022-09-13 11:24     ` Will Deacon
  2022-08-26 23:47   ` Peter Zijlstra
  1 sibling, 2 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-26 20:42 UTC (permalink / raw)
To: Alan Stern
Cc: Peter Zijlstra, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > I have not yet done more than glance at this one, but figured I should
> > > send it along sooner rather than later.
> > >
> > > 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > > 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > > 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > > 	https://arxiv.org/abs/2111.15240
> > >
> > > The claim is that the queued spinlocks implementation with CNA violates
> > > LKMM but actually works on all architectures having a formal hardware
> > > memory model.
> > >
> > > Thoughts?
> >
> > So the paper mentions the following defects:
> >
> >  - LKMM doesn't carry a release-acquire chain across a relaxed op
>
> That's right, although I'm not so sure this should be considered a
> defect...
>
> >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> >    full of them; specifically, we require stores to auto-propagate
> >    without help from barriers
>
> Not a missing propagation; a late one.
>
> Don't understand what you mean by "auto propagate without help from
> barriers".
>
> >  - some handoff that is CNA specific and I've not looked too hard at
> >    presently.
> >
> > I think we should address that first one in LKMM, it seems very weird to
> > me a RmW would break the chain like that.
>
> An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
>
> If the authors wanted to keep the release-acquire chain intact, why not
> use a cmpxchg version that has release semantics instead of going out of
> their way to use a relaxed version?
>
> To put it another way, RMW accesses and release-acquire accesses are
> unrelated concepts.  You can have one without the other (in principle,
> anyway).  So a relaxed RMW is just as capable of breaking a
> release-acquire chain as any other relaxed operation is.
>
> > Is there actual hardware that
> > doesn't behave?
>
> Not as far as I know, although that isn't very far.  Certainly an
> other-multicopy-atomic architecture would make the litmus test succeed.
> But the LKMM does not require other-multicopy-atomicity.

My first attempt with ppcmem suggests that powerpc does -not- behave
this way.  But that surprises me, just on general principles.  Most
likely I blew the litmus test shown below.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

PPC MP+lwsyncs+atomic
"LwSyncdWW Rfe LwSyncdRR Fre"
Cycle=Rfe LwSyncdRR Fre LwSyncdWW
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r5=2;
2:r2=y; 2:r4=x;
}
 P0           | P1              | P2           ;
 li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
 stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
 lwsync       |                 | lwz r3,0(r4) ;
 li r3,1      |                 |              ;
 stw r3,0(r4) |                 |              ;
exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)

------------------------------------------------------------------------

$ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
...
Test MP+lwsyncs+atomic Allowed
States 9
1:r1=0; 2:r1=0; 2:r3=0;
1:r1=0; 2:r1=0; 2:r3=1;
1:r1=0; 2:r1=1; 2:r3=1;
1:r1=0; 2:r1=2; 2:r3=0;
1:r1=0; 2:r1=2; 2:r3=1;
1:r1=1; 2:r1=0; 2:r3=0;
1:r1=1; 2:r1=0; 2:r3=1;
1:r1=1; 2:r1=1; 2:r3=1;
1:r1=1; 2:r1=2; 2:r3=1;
No (allowed not found)
Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
Observation MP+lwsyncs+atomic Never 0 9

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 20:42 ` Paul E. McKenney
@ 2022-08-27 16:00   ` Alan Stern
  2022-08-27 17:04     ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2022-08-27 16:00 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > I think we should address that first one in LKMM, it seems very weird to
> > > me a RmW would break the chain like that.
> >
> > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> >
> > If the authors wanted to keep the release-acquire chain intact, why not
> > use a cmpxchg version that has release semantics instead of going out of
> > their way to use a relaxed version?
> >
> > To put it another way, RMW accesses and release-acquire accesses are
> > unrelated concepts.  You can have one without the other (in principle,
> > anyway).  So a relaxed RMW is just as capable of breaking a
> > release-acquire chain as any other relaxed operation is.
> >
> > > Is there actual hardware that
> > > doesn't behave?
> >
> > Not as far as I know, although that isn't very far.  Certainly an
> > other-multicopy-atomic architecture would make the litmus test succeed.
> > But the LKMM does not require other-multicopy-atomicity.
>
> My first attempt with ppcmem suggests that powerpc does -not- behave
> this way.  But that surprises me, just on general principles.  Most
> likely I blew the litmus test shown below.
>
> Thoughts?

The litmus test looks okay.

As for your surprise, remember that PPC is B-cumulative, another
property which the LKMM does not require.  B-cumulativity will also
force the original litmus test to succeed.  (The situation is like ISA2
in the infamous test6.pdf, except that y and z are separate variables in
ISA2 but are the same here.  The RMW nature of lwarx/stwcx provides the
necessary R-W ordering in P1.)

Alan

> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> PPC MP+lwsyncs+atomic
> "LwSyncdWW Rfe LwSyncdRR Fre"
> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> {
> 0:r2=x; 0:r4=y;
> 1:r2=y; 1:r5=2;
> 2:r2=y; 2:r4=x;
> }
>  P0           | P1              | P2           ;
>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
>  lwsync       |                 | lwz r3,0(r4) ;
>  li r3,1      |                 |              ;
>  stw r3,0(r4) |                 |              ;
> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
>
> ------------------------------------------------------------------------
>
> $ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
> ...
> Test MP+lwsyncs+atomic Allowed
> States 9
> 1:r1=0; 2:r1=0; 2:r3=0;
> 1:r1=0; 2:r1=0; 2:r3=1;
> 1:r1=0; 2:r1=1; 2:r3=1;
> 1:r1=0; 2:r1=2; 2:r3=0;
> 1:r1=0; 2:r1=2; 2:r3=1;
> 1:r1=1; 2:r1=0; 2:r3=0;
> 1:r1=1; 2:r1=0; 2:r3=1;
> 1:r1=1; 2:r1=1; 2:r3=1;
> 1:r1=1; 2:r1=2; 2:r3=1;
> No (allowed not found)
> Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
> Observation MP+lwsyncs+atomic Never 0 9

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-27 16:00 ` Alan Stern
@ 2022-08-27 17:04   ` Paul E. McKenney
  0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-27 17:04 UTC (permalink / raw)
To: Alan Stern
Cc: Peter Zijlstra, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Sat, Aug 27, 2022 at 12:00:15PM -0400, Alan Stern wrote:
> On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > > I think we should address that first one in LKMM, it seems very weird to
> > > > me a RmW would break the chain like that.
> > >
> > > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> > >
> > > If the authors wanted to keep the release-acquire chain intact, why not
> > > use a cmpxchg version that has release semantics instead of going out of
> > > their way to use a relaxed version?
> > >
> > > To put it another way, RMW accesses and release-acquire accesses are
> > > unrelated concepts.  You can have one without the other (in principle,
> > > anyway).  So a relaxed RMW is just as capable of breaking a
> > > release-acquire chain as any other relaxed operation is.
> > >
> > > > Is there actual hardware that
> > > > doesn't behave?
> > >
> > > Not as far as I know, although that isn't very far.  Certainly an
> > > other-multicopy-atomic architecture would make the litmus test succeed.
> > > But the LKMM does not require other-multicopy-atomicity.
> >
> > My first attempt with ppcmem suggests that powerpc does -not- behave
> > this way.  But that surprises me, just on general principles.  Most
> > likely I blew the litmus test shown below.
> >
> > Thoughts?
>
> The litmus test looks okay.
>
> As for your surprise, remember that PPC is B-cumulative, another
> property which the LKMM does not require.  B-cumulativity will also
> force the original litmus test to succeed.  (The situation is like ISA2
> in the infamous test6.pdf, except that y and z are separate variables in
> ISA2 but are the same here.  The RMW nature of lwarx/stwcx provides
> the necessary R-W ordering in P1.)

Got it, thank you!

							Thanx, Paul

> > ------------------------------------------------------------------------
> >
> > PPC MP+lwsyncs+atomic
> > "LwSyncdWW Rfe LwSyncdRR Fre"
> > Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> > {
> > 0:r2=x; 0:r4=y;
> > 1:r2=y; 1:r5=2;
> > 2:r2=y; 2:r4=x;
> > }
> >  P0           | P1              | P2           ;
> >  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
> >  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
> >  lwsync       |                 | lwz r3,0(r4) ;
> >  li r3,1      |                 |              ;
> >  stw r3,0(r4) |                 |              ;
> > exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> >
> > ------------------------------------------------------------------------
> >
> > $ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
> > ...
> > Test MP+lwsyncs+atomic Allowed
> > States 9
> > 1:r1=0; 2:r1=0; 2:r3=0;
> > 1:r1=0; 2:r1=0; 2:r3=1;
> > 1:r1=0; 2:r1=1; 2:r3=1;
> > 1:r1=0; 2:r1=2; 2:r3=0;
> > 1:r1=0; 2:r1=2; 2:r3=1;
> > 1:r1=1; 2:r1=0; 2:r3=0;
> > 1:r1=1; 2:r1=0; 2:r3=1;
> > 1:r1=1; 2:r1=1; 2:r3=1;
> > 1:r1=1; 2:r1=2; 2:r3=1;
> > No (allowed not found)
> > Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> > Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
> > Observation MP+lwsyncs+atomic Never 0 9

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 20:42 ` Paul E. McKenney
  2022-08-27 16:00 ` Alan Stern
@ 2022-09-13 11:24 ` Will Deacon
  2022-09-13 12:21   ` Dan Lustig
  1 sibling, 1 reply; 28+ messages in thread
From: Will Deacon @ 2022-09-13 11:24 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Alan Stern, Peter Zijlstra, parri.andrea, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > I have not yet done more than glance at this one, but figured I should
> > > > send it along sooner rather than later.
> > > >
> > > > 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > > > 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > > > 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > > > 	https://arxiv.org/abs/2111.15240
> > > >
> > > > The claim is that the queued spinlocks implementation with CNA violates
> > > > LKMM but actually works on all architectures having a formal hardware
> > > > memory model.
> > > >
> > > > Thoughts?
> > >
> > > So the paper mentions the following defects:
> > >
> > >  - LKMM doesn't carry a release-acquire chain across a relaxed op
> >
> > That's right, although I'm not so sure this should be considered a
> > defect...
> >
> > >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> > >    full of them; specifically, we require stores to auto-propagate
> > >    without help from barriers
> >
> > Not a missing propagation; a late one.
> >
> > Don't understand what you mean by "auto propagate without help from
> > barriers".
> >
> > >  - some handoff that is CNA specific and I've not looked too hard at
> > >    presently.
> > >
> > > I think we should address that first one in LKMM, it seems very weird to
> > > me a RmW would break the chain like that.
> >
> > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> >
> > If the authors wanted to keep the release-acquire chain intact, why not
> > use a cmpxchg version that has release semantics instead of going out of
> > their way to use a relaxed version?
> >
> > To put it another way, RMW accesses and release-acquire accesses are
> > unrelated concepts.  You can have one without the other (in principle,
> > anyway).  So a relaxed RMW is just as capable of breaking a
> > release-acquire chain as any other relaxed operation is.
> >
> > > Is there actual hardware that
> > > doesn't behave?
> >
> > Not as far as I know, although that isn't very far.  Certainly an
> > other-multicopy-atomic architecture would make the litmus test succeed.
> > But the LKMM does not require other-multicopy-atomicity.
>
> My first attempt with ppcmem suggests that powerpc does -not- behave
> this way.  But that surprises me, just on general principles.  Most
> likely I blew the litmus test shown below.
>
> Thoughts?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> PPC MP+lwsyncs+atomic
> "LwSyncdWW Rfe LwSyncdRR Fre"
> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> {
> 0:r2=x; 0:r4=y;
> 1:r2=y; 1:r5=2;
> 2:r2=y; 2:r4=x;
> }
>  P0           | P1              | P2           ;
>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
>  lwsync       |                 | lwz r3,0(r4) ;
>  li r3,1      |                 |              ;
>  stw r3,0(r4) |                 |              ;
> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)

Just catching up on this, but one possible gotcha here is if you have an
architecture with native load-acquire on P2 and then you move P2 to the
end of P1, e.g. at a high-level:

	P0			P1
	Wx = 1			RmW(y)	// xchg() 1 => 2
	WyRel = 1		RyAcq = 2
				Rx = 0

arm64 forbids this, but it's not "natural" to the hardware and I don't
know what e.g. risc-v would say about it.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-13 11:24 ` Will Deacon
@ 2022-09-13 12:21   ` Dan Lustig
  2022-09-16  8:18     ` Will Deacon
  0 siblings, 1 reply; 28+ messages in thread
From: Dan Lustig @ 2022-09-13 12:21 UTC (permalink / raw)
To: Will Deacon, Paul E. McKenney
Cc: Alan Stern, Peter Zijlstra, parri.andrea, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, joel, linux-kernel,
    linux-arch

On 9/13/2022 7:24 AM, Will Deacon wrote:
> On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
>> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
>>> On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
>>>> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
>>>>> Hello!
>>>>>
>>>>> I have not yet done more than glance at this one, but figured I should
>>>>> send it along sooner rather than later.
>>>>>
>>>>> 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
>>>>> 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
>>>>> 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
>>>>> 	https://arxiv.org/abs/2111.15240
>>>>>
>>>>> The claim is that the queued spinlocks implementation with CNA violates
>>>>> LKMM but actually works on all architectures having a formal hardware
>>>>> memory model.
>>>>>
>>>>> Thoughts?
>>>>
>>>> So the paper mentions the following defects:
>>>>
>>>>  - LKMM doesn't carry a release-acquire chain across a relaxed op
>>>
>>> That's right, although I'm not so sure this should be considered a
>>> defect...
>>>
>>>>  - some babbling about a missing propagation -- ISTR Linux is stuffed
>>>>    full of them; specifically, we require stores to auto-propagate
>>>>    without help from barriers
>>>
>>> Not a missing propagation; a late one.
>>>
>>> Don't understand what you mean by "auto propagate without help from
>>> barriers".
>>>
>>>>  - some handoff that is CNA specific and I've not looked too hard at
>>>>    presently.
>>>>
>>>> I think we should address that first one in LKMM, it seems very weird to
>>>> me a RmW would break the chain like that.
>>>
>>> An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
>>>
>>> If the authors wanted to keep the release-acquire chain intact, why not
>>> use a cmpxchg version that has release semantics instead of going out of
>>> their way to use a relaxed version?
>>>
>>> To put it another way, RMW accesses and release-acquire accesses are
>>> unrelated concepts.  You can have one without the other (in principle,
>>> anyway).  So a relaxed RMW is just as capable of breaking a
>>> release-acquire chain as any other relaxed operation is.
>>>
>>>> Is there actual hardware that
>>>> doesn't behave?
>>>
>>> Not as far as I know, although that isn't very far.  Certainly an
>>> other-multicopy-atomic architecture would make the litmus test succeed.
>>> But the LKMM does not require other-multicopy-atomicity.
>>
>> My first attempt with ppcmem suggests that powerpc does -not- behave
>> this way.  But that surprises me, just on general principles.  Most
>> likely I blew the litmus test shown below.
>>
>> Thoughts?
>>
>> 							Thanx, Paul
>>
>> ------------------------------------------------------------------------
>>
>> PPC MP+lwsyncs+atomic
>> "LwSyncdWW Rfe LwSyncdRR Fre"
>> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
>> {
>> 0:r2=x; 0:r4=y;
>> 1:r2=y; 1:r5=2;
>> 2:r2=y; 2:r4=x;
>> }
>>  P0           | P1              | P2           ;
>>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
>>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
>>  lwsync       |                 | lwz r3,0(r4) ;
>>  li r3,1      |                 |              ;
>>  stw r3,0(r4) |                 |              ;
>> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
>
> Just catching up on this, but one possible gotcha here is if you have an
> architecture with native load-acquire on P2 and then you move P2 to the
> end of P1, e.g. at a high-level:
>
> 	P0			P1
> 	Wx = 1			RmW(y)	// xchg() 1 => 2
> 	WyRel = 1		RyAcq = 2
> 				Rx = 0
>
> arm64 forbids this, but it's not "natural" to the hardware and I don't
> know what e.g. risc-v would say about it.
>
> Will

RISC-V doesn't currently have native load-acquire instructions other
than load-reserve-acquire, but if it did, it would forbid this outcome
as well.

To the broader question, RISC-V is other-multi-copy-atomic, so questions
about propagation order and B-cumulativity and so on aren't generally
problematic, just as they generally aren't an issue for ARMv8.

Dan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-13 12:21 ` Dan Lustig
@ 2022-09-16  8:18   ` Will Deacon
  0 siblings, 0 replies; 28+ messages in thread
From: Will Deacon @ 2022-09-16 8:18 UTC (permalink / raw)
To: Dan Lustig
Cc: Paul E. McKenney, Alan Stern, Peter Zijlstra, parri.andrea,
    boqun.feng, npiggin, dhowells, j.alglave, luc.maranget, akiyks,
    joel, linux-kernel, linux-arch

On Tue, Sep 13, 2022 at 08:21:02AM -0400, Dan Lustig wrote:
> On 9/13/2022 7:24 AM, Will Deacon wrote:
> > On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> >> PPC MP+lwsyncs+atomic
> >> "LwSyncdWW Rfe LwSyncdRR Fre"
> >> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> >> {
> >> 0:r2=x; 0:r4=y;
> >> 1:r2=y; 1:r5=2;
> >> 2:r2=y; 2:r4=x;
> >> }
> >>  P0           | P1              | P2           ;
> >>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
> >>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
> >>  lwsync       |                 | lwz r3,0(r4) ;
> >>  li r3,1      |                 |              ;
> >>  stw r3,0(r4) |                 |              ;
> >> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> >
> > Just catching up on this, but one possible gotcha here is if you have an
> > architecture with native load-acquire on P2 and then you move P2 to the
> > end of P1, e.g. at a high-level:
> >
> > 	P0			P1
> > 	Wx = 1			RmW(y)	// xchg() 1 => 2
> > 	WyRel = 1		RyAcq = 2
> > 				Rx = 0
> >
> > arm64 forbids this, but it's not "natural" to the hardware and I don't
> > know what e.g. risc-v would say about it.
>
> RISC-V doesn't currently have native load-acquire instructions other
> than load-reserve-acquire, but if it did, it would forbid this outcome
> as well.

Thanks for chipping in, Dan.  Somehow, I hadn't realised that you didn't
have native load-acquire instructions.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 17:10 ` Alan Stern
  2022-08-26 20:42 ` Paul E. McKenney
@ 2022-08-26 23:47 ` Peter Zijlstra
  2022-08-27 16:05   ` Alan Stern
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2022-08-26 23:47 UTC (permalink / raw)
To: Alan Stern
Cc: Paul E. McKenney, parri.andrea, will, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> >    full of them; specifically, we require stores to auto-propagate
> >    without help from barriers
>
> Not a missing propagation; a late one.
>
> Don't understand what you mean by "auto propagate without help from
> barriers".

Linux hard relies on:

	CPU0			CPU1

	WRITE_ONCE(foo, 1);	while (!READ_ONCE(foo));

making forward progress.

There were a few 'funny' uarchs that were broken, see for example commit
a30718868915f.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 23:47 ` Peter Zijlstra
@ 2022-08-27 16:05   ` Alan Stern
  2022-08-27 16:44     ` Boqun Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2022-08-27 16:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, parri.andrea, will, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Sat, Aug 27, 2022 at 01:47:48AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> > >    full of them; specifically, we require stores to auto-propagate
> > >    without help from barriers
> >
> > Not a missing propagation; a late one.
> >
> > Don't understand what you mean by "auto propagate without help from
> > barriers".
>
> Linux hard relies on:
>
> 	CPU0			CPU1
>
> 	WRITE_ONCE(foo, 1);	while (!READ_ONCE(foo));
>
> making forward progress.

Indeed yes.  As far as I can tell, this requirement is not explicitly
mentioned in the LKMM, although it certainly is implicit.  I can't even
think of a way to express it in a form Herd could verify.

> There were a few 'funny' uarchs that were broken, see for example commit
> a30718868915f.

Ha!  That commit should be a lesson in something, although I'm not sure
what.  :-)

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-08-27 16:05 ` Alan Stern @ 2022-08-27 16:44 ` Boqun Feng 2022-08-29 2:15 ` Andrea Parri 2022-09-09 11:45 ` Jonas Oberhauser 0 siblings, 2 replies; 28+ messages in thread From: Boqun Feng @ 2022-08-27 16:44 UTC (permalink / raw) To: Alan Stern Cc: Peter Zijlstra, Paul E. McKenney, parri.andrea, will, npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch On Sat, Aug 27, 2022 at 12:05:33PM -0400, Alan Stern wrote: > On Sat, Aug 27, 2022 at 01:47:48AM +0200, Peter Zijlstra wrote: > > On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote: > > > > > > - some babbling about a missing propagation -- ISTR Linux if stuffed > > > > full of them, specifically we require stores to auto propagate > > > > without help from barriers > > > > > > Not a missing propagation; a late one. > > > > > > Don't understand what you mean by "auto propagate without help from > > > barriers". > > > > Linux hard relies on: > > > > CPU0 CPU1 > > > > WRITE_ONCE(foo, 1); while (!READ_ONCE(foo)); > > > > making forward progress. > > Indeed yes. As far as I can tell, this requirement is not explicitly > mentioned in the LKMM, although it certainly is implicit. I can't even > think of a way to express it in a form Herd could verify. > FWIW, C++ defines this as (in https://eel.is/c++draft/atomics#order-11): Implementations should make atomic stores visible to atomic loads within a reasonable amount of time. in other words: if one thread does an atomic store, then all other threads must see that store eventually. (from: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/Rust.20forward.20progress.20guarantees/near/294702950) Should we add something somewhere in our model, maybe in the explanation.txt? 
Plus, I think we cannot express this in Herd, because Herd uses a graph-based
(axiomatic) model rather than an operational model: an axiomatic model cannot
describe "something will eventually happen". There was also some discussion in
the Zulip stream of the Rust unsafe-code-guidelines group.

Regards,
Boqun

> > There were a few 'funny' uarchs that were broken, see for example commit
> > a30718868915f.
>
> Ha! That commit should be a lesson in something, although I'm not sure
> what. :-)
>
> Alan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-08-27 16:44 ` Boqun Feng @ 2022-08-29 2:15 ` Andrea Parri 2022-09-09 11:45 ` Jonas Oberhauser 1 sibling, 0 replies; 28+ messages in thread From: Andrea Parri @ 2022-08-29 2:15 UTC (permalink / raw) To: Boqun Feng Cc: Alan Stern, Peter Zijlstra, Paul E. McKenney, will, npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch > FWIW, C++ defines this as (in https://eel.is/c++draft/atomics#order-11): > > Implementations should make atomic stores visible to atomic > loads within a reasonable amount of time. > > in other words: > > if one thread does an atomic store, then all other threads must see that > store eventually. > > (from: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/Rust.20forward.20progress.20guarantees/near/294702950) > > Should we add something somewhere in our model, maybe in the > explanation.txt? FYI, that's briefly mentioned in Section 11, "CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe": "sequential consistency per variable and cache coherence mean the same thing except that cache coherence includes an extra requirement that every store eventually becomes visible to every CPU" Andrea > Plus, I think we cannot express this in Herd because Herd uses > graph-based model (axiomatic model) instead of an operational model to > describe the model: axiomatic model cannot describe "something will > eventually happen". There was also some discussion in the zulip steam > of Rust unsafe-code-guidelines. ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-08-27 16:44 ` Boqun Feng 2022-08-29 2:15 ` Andrea Parri @ 2022-09-09 11:45 ` Jonas Oberhauser 2022-09-10 12:11 ` Hernan Luis Ponce de Leon 1 sibling, 1 reply; 28+ messages in thread From: Jonas Oberhauser @ 2022-09-09 11:45 UTC (permalink / raw) To: Boqun Feng, Alan Stern Cc: Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Hi Boqun, hi everyone, > > > > - some babbling about a missing propagation -- ISTR Linux if stuffed > > > > full of them, specifically we require stores to auto propagate > > > > without help from barriers > > > > > > Not a missing propagation; a late one. > > > > > > Don't understand what you mean by "auto propagate without help > > > from barriers". after a quick look at the claim in question, it seems to be about a different kind of issue. The problem alluded to seems not to be that the store doesn't propagate in finite time to the other core. The tools used by the authors of that work already assume every (marked) access propagates given enough time. What they mean seems to be that a prop relation followed only by wmb (not mb) doesn't enforce the order of some writes to the same location, leading to the claimed hang in qspinlock (at least as far as LKMM is concerned). Without having looked at the code or trying to understand the bug in more detail, this may be similar to the bug fixed by Ingo in qspinlock in 2018 where the initialization of the locked bit in the node could overwrite the setting of the locked bit in the other thread, unless a release barrier/wmb is added at some point. That being said: > Should we add something somewhere in our model, maybe in the explanation.txt? 
> Plus, I think we cannot express this in Herd because Herd uses graph-based model (axiomatic model) instead of an operational model to describe the model: axiomatic model cannot describe "something will eventually happen". There was also some discussion in the zulip steam of Rust unsafe-code-guidelines. It is possible to express this kind of progress property in a graph based model. Some work on this was done in the RISC-V manual, as well as for some stronger graph based models ("making weak memory models fair", https://dl.acm.org/doi/10.1145/3485475 ) and in a more practical way for verification of liveness ( https://dl.acm.org/doi/abs/10.1145/3445814.3446748 ). From a theoretical perspective, the suggestion made in "making weak memory models fair" would be to define prefix-finiteness of all the explicitly acyclic relations of LKMM, which includes the sc-per-loc relation and thus in particular fr. If fr is prefix-finite, meaning each store has only finitely many reads observing an older value, the infinite behaviour in the small example posted elsewhere in this thread becomes forbidden. From an engineering perspective, I think the only issue is that cat *currently* does not have any syntax for this, nor does herd currently implement the await model checking techniques proposed in those works (c.f. Theorem 5.3. in the "making weak memory models fair" paper, which says that for this kind of loop, iff the mo-maximal reads in some graph are read in a loop iteration that does not exit the loop, the loop can run forever). However GenMC and I believe also Dat3M and recently also Nidhugg support such techniques. It may not even be too much effort to implement something like this in herd if desired. Best regards, jonas ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-09 11:45 ` Jonas Oberhauser @ 2022-09-10 12:11 ` Hernan Luis Ponce de Leon 2022-09-10 15:03 ` Alan Stern 0 siblings, 1 reply; 28+ messages in thread From: Hernan Luis Ponce de Leon @ 2022-09-10 12:11 UTC (permalink / raw) To: Jonas Oberhauser, Boqun Feng, Alan Stern Cc: Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org What they mean seems to be that a prop relation followed only by wmb (not mb) doesn't enforce the order of some writes to the same location, leading to the claimed hang in qspinlock (at least as far as LKMM is concerned). What we mean is that wmb does not give the same propagation properties as mb. The claim is based on these relations from the memory model let strong-fence = mb | gp ... let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb | po-unlock-lock-po) ; [Marked] let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; [Marked] ; rfe? ; [Marked] From an engineering perspective, I think the only issue is that cat *currently* does not have any syntax for this, nor does herd currently implement the await model checking techniques proposed in those works (c.f. Theorem 5.3. in the "making weak memory models fair" paper, which says that for this kind of loop, iff the mo-maximal reads in some graph are read in a loop iteration that does not exit the loop, the loop can run forever). However GenMC and I believe also Dat3M and recently also Nidhugg support such techniques. It may not even be too much effort to implement something like this in herd if desired. The Dartagnan model checker uses the Theorem 5.3 from above to detect liveness violations. 
We did not try to come up with a litmus test about the behavior because herd7 cannot reason about liveness. However, if anybody is interested, the violating execution is shown here https://github.com/huawei-drc/cna-verification/blob/master/verification-output/BUG1.png Hernan ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-10 12:11 ` Hernan Luis Ponce de Leon @ 2022-09-10 15:03 ` Alan Stern 2022-09-10 20:41 ` Hernan Luis Ponce de Leon 2022-09-12 10:46 ` Jonas Oberhauser 0 siblings, 2 replies; 28+ messages in thread From: Alan Stern @ 2022-09-10 15:03 UTC (permalink / raw) To: Hernan Luis Ponce de Leon Cc: Jonas Oberhauser, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org On Sat, Sep 10, 2022 at 12:11:36PM +0000, Hernan Luis Ponce de Leon wrote: > > What they mean seems to be that a prop relation followed only by wmb > (not mb) doesn't enforce the order of some writes to the same > location, leading to the claimed hang in qspinlock (at least as far as > LKMM is concerned). You were quoting Jonas here, right? The email doesn't make this obvious because it doesn't have two levels of "> > " markings. > What we mean is that wmb does not give the same propagation properties as mb. In general, _no_ two distinct relations in the LKMM have the same propagation properties. If wmb always behaved the same way as mb, we wouldn't use two separate words for them. > The claim is based on these relations from the memory model > > let strong-fence = mb | gp > ... > let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb | > po-unlock-lock-po) ; [Marked] > let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; > [Marked] ; rfe? ; [Marked] Please be more specific. What difference between mb and wmb are you concerned about? Can you give a small litmus test that illustrates this difference? Can you explain in more detail how this difference affects the qspinlock implementation? 
> From an engineering perspective, I think the only issue is that cat
> *currently* does not have any syntax for this,

Syntax for what? The difference between wmb and mb?

> nor does herd currently
> implement the await model checking techniques proposed in those works
> (c.f. Theorem 5.3. in the "making weak memory models fair" paper,
> which says that for this kind of loop, iff the mo-maximal reads in
> some graph are read in a loop iteration that does not exit the loop,
> the loop can run forever). However GenMC and I believe also Dat3M and
> recently also Nidhugg support such techniques. It may not even be too
> much effort to implement something like this in herd if desired.

I believe that herd has no way to express the idea of a program running
forever. On the other hand, it's certainly true (in all of these models)
that for any finite number N, there is a feasible execution in which a
loop runs for more than N iterations before the termination condition
eventually becomes true.

Alan

> The Dartagnan model checker uses the Theorem 5.3 from above to detect
> liveness violations.
>
> We did not try to come up with a litmus test about the behavior
> because herd7 cannot reason about liveness.
> However, if anybody is interested, the violating execution is shown here
> https://github.com/huawei-drc/cna-verification/blob/master/verification-output/BUG1.png
>
> Hernan

^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-10 15:03 ` Alan Stern @ 2022-09-10 20:41 ` Hernan Luis Ponce de Leon 2022-09-11 10:20 ` Joel Fernandes 2022-09-11 14:53 ` Andrea Parri 2022-09-12 10:46 ` Jonas Oberhauser 1 sibling, 2 replies; 28+ messages in thread From: Hernan Luis Ponce de Leon @ 2022-09-10 20:41 UTC (permalink / raw) To: Alan Stern Cc: Jonas Oberhauser, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org > You were quoting Jonas here, right? The email doesn't make this obvious > because it doesn't have two levels of "> > " markings. Yes, I was quoting Jonas. It seems my mail client did not format the email correctly and I did not notice. Sorry for that. > In general, _no_ two distinct relations in the LKMM have the same propagation > properties. If wmb always behaved the same way as mb, we wouldn't use two > separate words for them. I understand that relations with different names are intended to be different. What I meant was "wmb gives weaker propagation guarantees than mb and because of this, liveness of qspinlock is not guaranteed in LKMM" > > > The claim is based on these relations from the memory model > > > > let strong-fence = mb | gp > > ... > > let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb | > > po-unlock-lock-po) ; [Marked] > > let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; > > [Marked] ; rfe? ; [Marked] > > Please be more specific. What difference between mb and wmb are you > concerned about? Since the code uses wmb, there are certain pairs of events that will not be in strong-fence. 
Since strong-fence contributes to cumul-fence, cumul-fence to prop, and prop
to hb, fewer pairs of events are related by hb than if the code used mb
instead. Because of this, there are executions (in particular the one that
violates liveness) that have an acyclic hb relation, but would create a cycle
(and thus the memory model would not accept the behavior) if mb had been used.

> Can you give a small litmus test that illustrates this
> difference? Can you explain in more detail how this difference affects the
> qspinlock implementation?

Here is a litmus test showing the problem (I hope the comments are enough to
relate it back to qspinlock):

C Liveness
{
	atomic_t x = ATOMIC_INIT(0);
	int z = 0;
}

P0(atomic_t *x) {
	// clear_pending_set_locked
	int r0 = atomic_fetch_add(2, x);
}

P1(atomic_t *x, int *z) {
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialization of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
	// link node into the waitqueue
	WRITE_ONCE(*z, 1);
}

P2(atomic_t *x, int *z) {
	// node initialization
	WRITE_ONCE(*z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialization of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
}

exists (0:r0=24 /\ 1:r0=26 /\ z=2)

herd7 says that the behavior is observable. However, if you change wmb to mb,
it is not observable anymore.

>
> > From an engineering perspective, I think the only issue is that cat
> > *currently* does not have any syntax for this,
>
> Syntax for what? The difference between wmb and mb?
>
> > nor does herd currently
> > implement the await model checking techniques proposed in those works
> > (c.f. Theorem 5.3. in the "making weak memory models fair" paper,
> > which says that for this kind of loop, iff the mo-maximal reads in
> > some graph are read in a loop iteration that does not exit the loop,
> > the loop can run forever). However GenMC and I believe also Dat3M and
> > recently also Nidhugg support such techniques. It may not even be too
> > much effort to implement something like this in herd if desired.
>
> I believe that herd has no way to express the idea of a program running forever.
> On the other hand, it's certainly true (in all of these
> models) than for any finite number N, there is a feasible execution in which a
> loop runs for more than N iterations before the termination condition eventually
> becomes true.

Here I was again quoting Jonas. I think his intention was to make a
distinction between graph-based semantics and tools. While herd7 cannot
reason about this kind of property, it is possible to define the property
using graph-based semantics, and there are tools already using this.

Hernan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-10 20:41 ` Hernan Luis Ponce de Leon @ 2022-09-11 10:20 ` Joel Fernandes 2022-09-12 10:13 ` Jonas Oberhauser 2022-09-11 14:53 ` Andrea Parri 1 sibling, 1 reply; 28+ messages in thread From: Joel Fernandes @ 2022-09-11 10:20 UTC (permalink / raw) To: Hernan Luis Ponce de Leon Cc: Alan Stern, Jonas Oberhauser, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org On Sat, Sep 10, 2022 at 4:41 PM Hernan Luis Ponce de Leon <hernanl.leon@huawei.com> wrote: > > > You were quoting Jonas here, right? The email doesn't make this obvious > > because it doesn't have two levels of "> > " markings. > > Yes, I was quoting Jonas. > It seems my mail client did not format the email correctly and I did not notice. > Sorry for that. > > > In general, _no_ two distinct relations in the LKMM have the same propagation > > properties. If wmb always behaved the same way as mb, we wouldn't use two > > separate words for them. > > I understand that relations with different names are intended to be different. > What I meant was > "wmb gives weaker propagation guarantees than mb and because of this, liveness of qspinlock is not guaranteed in LKMM" > I wonder if this sort of liveness guarantee (or lack thereof) is really a problem in practice, where writes will eventually propagate even though they may not for a bit. Is it possible to write a liveness test case on any hardware, or is this more in the realms of theory? Either way, quite intriguing! Thanks, - Joel ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-11 10:20 ` Joel Fernandes
@ 2022-09-12 10:13 ` Jonas Oberhauser
  2022-09-12 11:10 ` Hernan Luis Ponce de Leon
  2022-09-12 12:01 ` Alan Stern
  0 siblings, 2 replies; 28+ messages in thread
From: Jonas Oberhauser @ 2022-09-12 10:13 UTC (permalink / raw)
To: Joel Fernandes, Hernan Luis Ponce de Leon
Cc: Alan Stern, Boqun Feng, Peter Zijlstra, Paul E. McKenney,
    parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com,
    dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr,
    akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org

Hi Joel,

> I wonder if this sort of liveness guarantee (or lack thereof) is really a
> problem in practice, where writes will eventually propagate even though
> they may not for a bit. Is it possible to write a liveness test case on
> any hardware, or is this more in the realms of theory? Either way, quite
> intriguing!

As I tried to explain before, this problem has nothing to do with stores
propagating within a given time to another core. Rather, it is due to two
stores to the same location happening in a surprising order. I.e., both
stores propagate quickly to other cores, but in a surprising coherence
order. And if a wmb in the code is replaced by an mb, then this co will
create a pb cycle and become forbidden.

Therefore this hang should be observable on a hypothetical LKMM processor
which makes use of all the relaxed liberty the LKMM allows. However,
according to the authors of that paper (who are my colleagues, but I haven't
been involved deeply in that work), not even Power+gcc allows this reordering
to happen, and if that's true it is probably because the wmb is mapped to
lwsync, which is fully cumulative on Power but not in LKMM.

Best wishes and hope this clears it up,
jonas

^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-12 10:13 ` Jonas Oberhauser @ 2022-09-12 11:10 ` Hernan Luis Ponce de Leon 2022-09-14 14:41 ` Alan Stern 2022-09-12 12:01 ` Alan Stern 1 sibling, 1 reply; 28+ messages in thread From: Hernan Luis Ponce de Leon @ 2022-09-12 11:10 UTC (permalink / raw) To: Jonas Oberhauser, Joel Fernandes Cc: Alan Stern, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org > Therefore this hang should be observable on a hypothetical LKMM processor > which makes use of all the relaxed liberty the LKMM allows. However according > to the authors of that paper (who are my colleagues but I haven't been involved > deeply in that work), not even Power+gcc allow this reordering to happen, and if > that's true it is probably because the wmb is mapped to lwsync which is fully > cumulative in Power but not in LKMM. All the "issues" we mention in the technical report are according to LKMM. As shown by (*) below, as soon as the code gets compiled and verified against the corresponding hardware memory model, the code is correct. Here is a small variant of the litmus test I sent earlier where not only the "problematic behavior" is allowed by LKMM, but where liveness is actually violated. The code is written in C (main function and headers missing) and cannot be used directly with herd7 (since I am not sure if the end of thread_3 can be written using herd7 syntax). 
------------------------------------------------------------------------
int y, z;
atomic_t x;

void *thread_1(void *unused)
{
	// clear_pending_set_locked
	int r0 = atomic_fetch_add(2, &x);
}

void *thread_2(void *unused)
{
	// this store breaks liveness
	WRITE_ONCE(y, 1);
	// queued_spin_trylock
	int r0 = atomic_read(&x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(&x, r0, 42);
	// link node into the waitqueue
	WRITE_ONCE(z, 1);
}

void *thread_3(void *unused)
{
	// node initialisation
	WRITE_ONCE(z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(&x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// if we read z==2 we expect to read this store
	WRITE_ONCE(y, 0);
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(&x, r0, 24);
	// spinloop
	while (READ_ONCE(y) == 1 && (READ_ONCE(z) == 2)) {}
}
------------------------------------------------------------------------

Liveness is violated (following Theorem 5.3 of the "Making weak memory
models fair" paper) because the reads from the spinloop can get their values
from writes which come last in the coherence / modification order, and those
values do not stop the spinning.

------------------------------------------------------------------------
$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/linux-kernel.cat --target=lkmm --property=liveness liveness.c
...
Liveness violation found
FAIL
------------------------------------------------------------------------

(*) However, if the code is compiled (this transformation is done
automatically and internally by the tool, notice the --target option) and we
use some hardware memory model, the tool says the code is correct

------------------------------------------------------------------------
$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/aarch64.cat --target=arm8 --property=liveness liveness.c
...
PASS

$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/power.cat --target=power --property=liveness liveness.c
...
PASS

$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/riscv.cat --target=riscv --property=liveness liveness.c
...
PASS
------------------------------------------------------------------------

I think it is somehow possible to show the liveness violation using herd7
and the following variant of the code

------------------------------------------------------------------------
C Liveness
{
	atomic_t x = ATOMIC_INIT(0);
	int y = 0;
	int z = 0;
}

P0(atomic_t *x) {
	// clear_pending_set_locked
	int r0 = atomic_fetch_add(2, x);
}

P1(atomic_t *x, int *z, int *y) {
	// this store breaks liveness
	WRITE_ONCE(*y, 1);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
	// link node into the waitqueue
	WRITE_ONCE(*z, 1);
}

P2(atomic_t *x, int *z, int *y) {
	// node initialisation
	WRITE_ONCE(*z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// if we read z==2 we expect to read this store
	WRITE_ONCE(*y, 0);
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
	// spinloop
	int r2 = READ_ONCE(*y);
	int r3 = READ_ONCE(*z);
}

exists (z=2 /\ y=1 /\ 2:r2=1 /\ 2:r3=2)
------------------------------------------------------------------------

Condition "2:r3=2" forces the spinloop to read from the first write in P2,
and "z=2" forces this write to be last in the coherence order. Conditions
"2:r2=1" and "y=1" force the same for the other read. herd7 says this
behavior is allowed by LKMM, showing that liveness can be violated.

In all the examples above, if we use mb() instead of wmb(), LKMM does not
accept the behavior and thus liveness is guaranteed.

Hernan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-12 11:10 ` Hernan Luis Ponce de Leon @ 2022-09-14 14:41 ` Alan Stern 0 siblings, 0 replies; 28+ messages in thread From: Alan Stern @ 2022-09-14 14:41 UTC (permalink / raw) To: Hernan Luis Ponce de Leon Cc: Jonas Oberhauser, Joel Fernandes, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org On Mon, Sep 12, 2022 at 11:10:17AM +0000, Hernan Luis Ponce de Leon wrote: > I think it is somehow possible to show the liveness violation using herd7 and the following variant of the code > > ------------------------------------------------------------------------ > C Liveness > { > atomic_t x = ATOMIC_INIT(0); > atomic_t y = ATOMIC_INIT(0); > } > > > P0(atomic_t *x) { > // clear_pending_set_locked > int r0 = atomic_fetch_add(2,x) ; > } > > P1(atomic_t *x, int *z, int *y) { > // this store breaks liveness > WRITE_ONCE(*y, 1); > // queued_spin_trylock > int r0 = atomic_read(x); > // barrier after the initialisation of nodes > smp_wmb(); > // xchg_tail > int r1 = atomic_cmpxchg_relaxed(x,r0,42); > // link node into the waitqueue > WRITE_ONCE(*z, 1); > } > > P2(atomic_t *x,int *z, int *y) { > // node initialisation > WRITE_ONCE(*z, 2); > // queued_spin_trylock > int r0 = atomic_read(x); > // barrier after the initialisation of nodes > smp_wmb(); > // if we read z==2 we expect to read this store > WRITE_ONCE(*y, 0); > // xchg_tail > int r1 = atomic_cmpxchg_relaxed(x,r0,24); > // spinloop > int r2 = READ_ONCE(*y); > int r3 = READ_ONCE(*z); > } > > exists (z=2 /\ y=1 /\ 2:r2=1 /\ 2:r3=2) > ------------------------------------------------------------------------ > > Condition "2:r3=2" forces the spinloop to read from the first write in P2 and "z=2" forces this write > to be last in the 
coherence order. Conditions "2:r2=1" and "y=1" force the same for the other read.
> herd7 says this behavior is allowed by LKMM, showing that liveness can be violated.
>
> In all the examples above, if we use mb() instead of wmb(), LKMM does not accept
> the behavior and thus liveness is guaranteed.

That's right. The issue may be somewhat confused by the fact that there have
been _two_ separate problems covered in this discussion. One has to do with
the use of relaxed vs. non-relaxed atomic accesses, and the other -- this
one -- has to do with liveness (eventual termination of a spin loop).

You can see the distinction quite clearly by noticing that, in the litmus
test above, the variable x plays absolutely no role. There are no
dependencies from it, it isn't accessed by any instructions that include an
acquire or release memory barrier, and it isn't used in the "exists" clause.

If we remove x from the test (and remove P0, which is now vacuous), and we
also remove the unneeded reads at the end of P2 (unneeded because they
observe the co-final values stored in y and z), what remains is:

P1(int *z, int *y)
{
	WRITE_ONCE(*y, 1);
	smp_wmb();
	WRITE_ONCE(*z, 1);
}

P2(int *z, int *y)
{
	WRITE_ONCE(*z, 2);
	smp_wmb();
	WRITE_ONCE(*y, 0);
}

exists (z=2 /\ y=1)

which is exactly the 2+2W pattern. As others have pointed out, this pattern
is permitted by LKMM because we never found a good reason to forbid it, even
though it cannot occur on any real hardware that supports Linux.

On the other hand, the simplicity of this "liveness" test leads me to wonder
if it isn't missing some crucial feature of the actual qspinlock
implementation.

Alan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-12 10:13 ` Jonas Oberhauser
  2022-09-12 11:10 ` Hernan Luis Ponce de Leon
@ 2022-09-12 12:01 ` Alan Stern
  1 sibling, 0 replies; 28+ messages in thread
From: Alan Stern @ 2022-09-12 12:01 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Joel Fernandes, Hernan Luis Ponce de Leon, Boqun Feng,
    Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com,
    will@kernel.org, npiggin@gmail.com, dhowells@redhat.com,
    j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com,
    dlustig@nvidia.com, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org

On Mon, Sep 12, 2022 at 10:13:33AM +0000, Jonas Oberhauser wrote:
> As I tried to explain before, this problem has nothing to do with
> stores propagating within a given time to another core. Rather it is
> due to two stores to the same location happening in a surprising
> order. I.e., both stores propagate quickly to other cores, but in a
> surprising coherence order.And if a wmb in the code is replaced by an
> mb, then this co will create a pb cycle and become forbidden.
>
> Therefore this hang should be observable on a hypothetical LKMM
> processor which makes use of all the relaxed liberty the LKMM allows.
> However according to the authors of that paper (who are my colleagues
> but I haven't been involved deeply in that work), not even Power+gcc
> allow this reordering to happen, and if that's true it is probably
> because the wmb is mapped to lwsync which is fully cumulative in Power
> but not in LKMM.

Yes, that's right. On ARM64 architectures the reordering is forbidden by
other-multicopy atomicity, and on PPC it is forbidden by B-cumulativity
(neither of which is part of the LKMM).

If I'm not mistaken, another way to forbid it is to replace one of the
relaxed atomic accesses with an atomic access having release semantics.
Perhaps this change will find its way into the kernel source, since it has
less overhead than replacing wmb with mb.
Alan ^ permalink raw reply [flat|nested] 28+ messages in thread
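One concrete (untested) way to experiment with Alan's suggestion would be to take the "Liveness" litmus test Hernan posted elsewhere in this thread and strengthen the relaxed cmpxchg that models xchg_tail into a release one, for instance in P2. The choice of which access to strengthen is only an illustration here, not a claim about the eventual kernel patch:

```
P2(atomic_t *x, int *z)
{
	// node initialization
	WRITE_ONCE(*z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// xchg_tail, now with release semantics
	int r1 = atomic_cmpxchg_release(x, r0, 24);
}
```

herd7 can then be asked whether the exists clause of that test is still reachable; which relaxed access in the real qspinlock code would need strengthening (and whether the smp_wmb() then becomes redundant) is a separate question.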
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-10 20:41 ` Hernan Luis Ponce de Leon
  2022-09-11 10:20 ` Joel Fernandes
@ 2022-09-11 14:53 ` Andrea Parri
  1 sibling, 0 replies; 28+ messages in thread
From: Andrea Parri @ 2022-09-11 14:53 UTC (permalink / raw)
To: Hernan Luis Ponce de Leon
Cc: Alan Stern, Jonas Oberhauser, Boqun Feng, Peter Zijlstra,
    Paul E. McKenney, will@kernel.org, npiggin@gmail.com,
    dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr,
    akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org,
    linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org

> Here is a litmus test showing the problem (I hope the comments are
> enough to relate it back to qspinlock):
>
> C Liveness
>
> {
> 	atomic_t x = ATOMIC_INIT(0);
> 	int z = 0;
> }
>
> P0(atomic_t *x)
> {
> 	// clear_pending_set_locked
> 	int r0 = atomic_fetch_add(2, x);
> }
>
> P1(atomic_t *x, int *z)
> {
> 	// queued_spin_trylock
> 	int r0 = atomic_read(x);
> 	// barrier after the initialization of nodes
> 	smp_wmb();
> 	// xchg_tail
> 	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
> 	// link node into the waitqueue
> 	WRITE_ONCE(*z, 1);
> }
>
> P2(atomic_t *x, int *z)
> {
> 	// node initialization
> 	WRITE_ONCE(*z, 2);
> 	// queued_spin_trylock
> 	int r0 = atomic_read(x);
> 	// barrier after the initialization of nodes
> 	smp_wmb();
> 	// xchg_tail
> 	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
> }
>
> exists (0:r0=24 /\ 1:r0=26 /\ z=2)
>
> herd7 says that the behavior is observable.
> However, if you replace wmb with mb, it is not observable anymore.

Indeed.  For more context, this is a 3-thread extension of the 2+2W
test/behavior, cf.

  https://github.com/paulmckrcu/litmus/blob/master/manual/lwn573436/C-2+2w+o-wb-o+o-wb-o.litmus

which is also allowed by the LKMM.  The basic rationale for allowing
such behaviors was that we "don't think we need to care" (cf. the
comment in the link above), except that it seems the developers of the
code at stake disagreed.
;-)  So this does look like a good time to review that design
choice/code.

  Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread
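For reference, the smp_mb() variant that Hernan reports as no longer observable differs from his Liveness test only in the two fences; a sketch of the changed processes:

```
P1(atomic_t *x, int *z)
{
	int r0 = atomic_read(x);
	smp_mb();	// was smp_wmb()
	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
	WRITE_ONCE(*z, 1);
}

P2(atomic_t *x, int *z)
{
	WRITE_ONCE(*z, 2);
	int r0 = atomic_read(x);
	smp_mb();	// was smp_wmb()
	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
}
```

With these fences, per Hernan's message, herd7 no longer reports the exists clause as reachable, so LKMM forbids the 2+2W-shaped cycle and liveness is restored.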
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-10 15:03 ` Alan Stern
  2022-09-10 20:41 ` Hernan Luis Ponce de Leon
@ 2022-09-12 10:46 ` Jonas Oberhauser
  2022-09-12 12:02 ` Alan Stern
  1 sibling, 1 reply; 28+ messages in thread
From: Jonas Oberhauser @ 2022-09-12 10:46 UTC (permalink / raw)
To: Alan Stern, Hernan Luis Ponce de Leon
Cc: Boqun Feng, Peter Zijlstra, Paul E. McKenney,
    parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com,
    dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr,
    akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org,
    linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org

Hi Alan,

Sorry for the confusion.

> [...] it's certainly true (in all of these models) that for any finite
> number N, there is a feasible execution in which a loop runs for more
> than N iterations before the termination condition eventually becomes
> true.

Indeed; but more interestingly, as Theorem 5.3 in "Making Weak Memory
Models Fair" states, under certain conditions it suffices to look at
graphs where N=1 to decide whether a loop can run forever (despite all
writes propagating to all cores eventually) or will always terminate.
And since herd can generate all such graphs, herd could be extended to
make that decision and output it, just like many other tools already do.

To illustrate this with an example, consider the program sent by Peter
earlier:

	T1: WRITE_ONCE(foo, 1);
	T2: while (!READ_ONCE(foo));

Without the assumption that fr is prefix-finite, the graph with
infinitely many reads of thread 2, all reading the initial value (and
hence being fr-before the store foo=1), would be allowed.  However, the
tools used to analyze the qspinlock all assume that fr is prefix-finite,
and hence that such a graph with infinitely many fr-before events does
not exist.  Because of that, all of the tools will say that this loop
always terminates.
However, if you change the code into the following:

	T1: WRITE_ONCE(foo, 1);
	T2: WRITE_ONCE(foo, 0); while (!READ_ONCE(foo));

then even under the assumption of fr-prefix-finiteness, the coherence
order in which WRITE_ONCE(foo, 1); is overwritten by WRITE_ONCE(foo, 0);
of thread 2 would lead to non-terminating behaviors, and these are
detected by those tools.  Furthermore, if we synchronize the two stores
as follows:

	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
	T2: WRITE_ONCE(foo, 0); smp_store_release(&bar, 1); while (!READ_ONCE(foo));

then the graphs with that co become prohibited, as they all have hb
cycles, and the tools again will not detect any liveness violations.
But if we go further and relax the release barrier as below:

	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
	T2: WRITE_ONCE(foo, 0); WRITE_ONCE(bar, 1); while (!READ_ONCE(foo));

then the hb cycle disappears, and the coherence order in which foo=0
overwrites foo=1 becomes allowed.  Again, the tools will detect this and
state that thread 2 could be stuck in its while loop forever.

In each of these cases, the decision can be made by looking for a graph
in which the loop is executed for one iteration which reads from foo=0,
and checking whether that store is co-maximal.  So in some sense this is
all that is needed to express the idea that a program can run forever,
even though only in some very limited (but common) circumstances, namely
that the loop iteration that repeats forever does not create
modifications to state that are observed outside the loop.  Luckily this
is a very common case, so these checks have proven quite useful in
practice.
The same kind of check could be implemented in herd, together with
either the implicit assumption that fr is prefix-finite (like in other
tools) or with some special syntax like

	prefix-finite fr | co as fairness

which hopefully clears up the question below:

> > From an engineering perspective, I think the only issue is that cat
> > *currently* does not have any syntax for this,

> Syntax for what?  The difference between wmb and mb?
> [...]

Thanks for your patience, and hoping I explained it more clearly,
jonas

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-12 10:46 ` Jonas Oberhauser
@ 2022-09-12 12:02 ` Alan Stern
  0 siblings, 0 replies; 28+ messages in thread
From: Alan Stern @ 2022-09-12 12:02 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Hernan Luis Ponce de Leon, Boqun Feng, Peter Zijlstra,
    Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org,
    npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk,
    luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com,
    joel@joelfernandes.org, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org

On Mon, Sep 12, 2022 at 10:46:38AM +0000, Jonas Oberhauser wrote:
> Hi Alan,
>
> Sorry for the confusion.
>
> > [...] it's certainly true (in all of these models) that for any
> > finite number N, there is a feasible execution in which a loop runs
> > for more than N iterations before the termination condition
> > eventually becomes true.
>
> Indeed; but more interestingly, as Theorem 5.3 in "Making Weak Memory
> Models Fair" states, under certain conditions it suffices to look at
> graphs where N=1 to decide whether a loop can run forever (despite all
> writes propagating to all cores eventually) or will always terminate.

Cool!

> And since herd can generate all such graphs, herd could be extended to
> make that decision and output it, just like many other tools already
> do.
>
> To illustrate this with an example, consider the program sent by Peter
> earlier:
>
>	T1: WRITE_ONCE(foo, 1);
>	T2: while (!READ_ONCE(foo));
>
> Without the assumption that fr is prefix-finite, the graph with
> infinitely many reads of thread 2, all reading the initial value (and
> hence being fr-before the store foo=1), would be allowed.  However,
> the tools used to analyze the qspinlock all assume that fr is
> prefix-finite, and hence that such a graph with infinitely many
> fr-before events does not exist.  Because of that, all of the tools
> will say that this loop always terminates.
> However, if you change the code into the following:
>
>	T1: WRITE_ONCE(foo, 1);
>	T2: WRITE_ONCE(foo, 0); while (!READ_ONCE(foo));
>
> then even under the assumption of fr-prefix-finiteness, the coherence
> order in which WRITE_ONCE(foo, 1); is overwritten by
> WRITE_ONCE(foo, 0); of thread 2 would lead to non-terminating
> behaviors, and these are detected by those tools.  Furthermore, if we
> synchronize the two stores as follows:
>
>	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
>	T2: WRITE_ONCE(foo, 0); smp_store_release(&bar, 1); while (!READ_ONCE(foo));
>
> then the graphs with that co become prohibited, as they all have hb
> cycles, and the tools again will not detect any liveness violations.
> But if we go further and relax the release barrier as below:
>
>	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
>	T2: WRITE_ONCE(foo, 0); WRITE_ONCE(bar, 1); while (!READ_ONCE(foo));
>
> then the hb cycle disappears, and the coherence order in which foo=0
> overwrites foo=1 becomes allowed.  Again, the tools will detect this
> and state that thread 2 could be stuck in its while loop forever.

Neat stuff; I'll have to think about this.

> In each of these cases, the decision can be made by looking for a
> graph in which the loop is executed for one iteration which reads from
> foo=0, and checking whether that store is co-maximal.  So in some
> sense this is all that is needed to express the idea that a program
> can run forever, even though only in some very limited (but common)
> circumstances, namely that the loop iteration that repeats forever
> does not create modifications to state that are observed outside the
> loop.  Luckily this is a very common case, so these checks have proven
> quite useful in practice.
> The same kind of check could be implemented in herd together with
> either the implicit assumption that fr is prefix-finite (like in other
> tools) or with some special syntax like
>
>	prefix-finite fr | co as fairness
>
> which hopefully clears up the question below:
>
> > > From an engineering perspective, I think the only issue is that cat
> > > *currently* does not have any syntax for this,
>
> > Syntax for what?  The difference between wmb and mb?
> > [...]
>
> Thanks for your patience, and hoping I explained it more clearly,

Yes indeed, thank you very much.

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
  2022-08-26 16:21 ` Boqun Feng
  2022-08-26 16:23 ` Peter Zijlstra
@ 2022-08-29  2:33 ` Andrea Parri
  2022-08-29 12:25 ` Paul E. McKenney
  2 siblings, 1 reply; 28+ messages in thread
From: Andrea Parri @ 2022-08-29 2:33 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, will, peterz, boqun.feng, npiggin, dhowells, j.alglave,
    luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>
> Thoughts?

Section 4 ends with a discussion about certain "spurious" data races.
Do we have litmus tests with them?  (I could repro with Dartagnan...)

  Thanks,
  Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-29  2:33 ` Andrea Parri
@ 2022-08-29 12:25 ` Paul E. McKenney
  0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-29 12:25 UTC (permalink / raw)
To: Andrea Parri
Cc: stern, will, peterz, boqun.feng, npiggin, dhowells, j.alglave,
    luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch

On Mon, Aug 29, 2022 at 04:33:23AM +0200, Andrea Parri wrote:
> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > I have not yet done more than glance at this one, but figured I should
> > send it along sooner rather than later.
> >
> > "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > https://arxiv.org/abs/2111.15240
> >
> > The claim is that the queued spinlocks implementation with CNA violates
> > LKMM but actually works on all architectures having a formal hardware
> > memory model.
> >
> > Thoughts?
>
> Section 4 ends with a discussion about certain "spurious" data races.
> Do we have litmus tests with them?  (I could repro with Dartagnan...)

Their Figure 5 clearly shows a data race, but agreed, their claim is
that this race is prevented by other code not in this litmus test.

Me, I currently suspect that the spurious data race might be due to the
failure to guarantee mutual exclusion, though I have not yet read that
paper carefully.  That is scheduled for tomorrow morning.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2022-09-16  8:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
2022-08-26 16:21 ` Boqun Feng
2022-08-26 16:23 ` Peter Zijlstra
2022-08-26 17:10 ` Alan Stern
2022-08-26 20:42 ` Paul E. McKenney
2022-08-27 16:00 ` Alan Stern
2022-08-27 17:04 ` Paul E. McKenney
2022-09-13 11:24 ` Will Deacon
2022-09-13 12:21 ` Dan Lustig
2022-09-16  8:18 ` Will Deacon
2022-08-26 23:47 ` Peter Zijlstra
2022-08-27 16:05 ` Alan Stern
2022-08-27 16:44 ` Boqun Feng
2022-08-29  2:15 ` Andrea Parri
2022-09-09 11:45 ` Jonas Oberhauser
2022-09-10 12:11 ` Hernan Luis Ponce de Leon
2022-09-10 15:03 ` Alan Stern
2022-09-10 20:41 ` Hernan Luis Ponce de Leon
2022-09-11 10:20 ` Joel Fernandes
2022-09-12 10:13 ` Jonas Oberhauser
2022-09-12 11:10 ` Hernan Luis Ponce de Leon
2022-09-14 14:41 ` Alan Stern
2022-09-12 12:01 ` Alan Stern
2022-09-11 14:53 ` Andrea Parri
2022-09-12 10:46 ` Jonas Oberhauser
2022-09-12 12:02 ` Alan Stern
2022-08-29  2:33 ` Andrea Parri
2022-08-29 12:25 ` Paul E. McKenney