* "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
@ 2022-08-26 12:48 Paul E. McKenney
2022-08-26 16:21 ` Boqun Feng
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-26 12:48 UTC (permalink / raw)
To: stern, parri.andrea, will, peterz, boqun.feng, npiggin, dhowells,
j.alglave, luc.maranget, akiyks, dlustig, joel
Cc: linux-kernel, linux-arch
Hello!
I have not yet done more than glance at this one, but figured I should
send it along sooner rather than later.
"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
https://arxiv.org/abs/2111.15240
The claim is that the queued spinlocks implementation with CNA violates
LKMM but actually works on all architectures having a formal hardware
memory model.
Thoughts?
Thanx, Paul
^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
@ 2022-08-26 16:21 ` Boqun Feng
  2022-08-26 16:23 ` Peter Zijlstra
  2022-08-29  2:33 ` Andrea Parri
  2 siblings, 0 replies; 28+ messages in thread
From: Boqun Feng @ 2022-08-26 16:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, parri.andrea, will, peterz, npiggin, dhowells, j.alglave,
    luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> 	https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>

Translating one of their litmus tests into a runnable one (there is a
typo in it):

	C queued-spin-lock

	(*
	 * P0 is lock-release
	 * P1 is xchg_tail()
	 * P2 is lock-acquire
	 *)

	{}

	P0(int *x, atomic_t *l)
	{
		WRITE_ONCE(*x, 1);
		atomic_set_release(l, 1);
	}

	P1(int *x, atomic_t *l)
	{
		int val;
		int new;
		int old;

		val = atomic_read(l);
		new = val + 2;
		old = atomic_cmpxchg_relaxed(l, val, new);
	}

	P2(int *x, atomic_t *l)
	{
		int r0 = atomic_read_acquire(l);
		int r1 = READ_ONCE(*x);
	}

	exists (2:r0=3 /\ 2:r1=0)

According to LKMM, the exists-clause could be triggered, because

	po-rel; coe; rfe; acq-po

does not establish happens-before ordering in LKMM.
Alan actually explained why in a response to a GitHub issue:

	https://github.com/paulmckrcu/litmus/issues/11#issuecomment-1115235860

(Pasting Alan's reply)

"""
As for why the LKMM doesn't require ordering for this test...  I do not
believe this is related to 2+2W.  Think instead in terms of the LKMM's
operational model:

The store-release in P0 means that the x=1 write will propagate to each
CPU before the y=1 write does.

Since y=3 at the end, we know that y=1 (and hence x=1 too) propagates to
P1 before the addition occurs.  And we know that y=3 propagates to P2
before the load-acquire executes.  But we _don't_ know that either y=1
or x=1 propagates to P2 before y=3 does!  If the store in P1 were a
store-release then this would have to be true (as you saw in your
tests), but it isn't.

In other words, the litmus test could execute with the following
temporal ordering:

	P0		P1		P2
	----------	---------	----------
	Write x=1
	Write y=1

	[x=1 and y=1 propagate to P1]

			Read y=1
			Write y=3

	[y=3 propagates to P2]

					Read y=3
					Read x=0

	[x=1 and y=1 propagate to P2]

(Note that when y=1 propagates to P2, it doesn't do anything because it
won't overwrite the coherence-later store y=3.)

It may be true that none of the architectures supported by Linux will
allow this outcome for the test (although I suspect that the PPC-weird
version _would_ be allowed if you fixed it).  At any rate, disallowing
this result in the memory model would probably require more effort than
would be worthwhile.

Alan
"""

The question is whether we should "fix" LKMM because of this, or
explicitly document this as something not yet "supported" by LKMM.

Regards,
Boqun

> Thoughts?
>
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
  2022-08-26 16:21 ` Boqun Feng
@ 2022-08-26 16:23 ` Peter Zijlstra
  2022-08-26 17:10   ` Alan Stern
  2022-08-29  2:33 ` Andrea Parri
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2022-08-26 16:23 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> 	https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>
> Thoughts?

So the paper mentions the following defects:

 - LKMM doesn't carry a release-acquire chain across a relaxed op

 - some babbling about a missing propagation -- ISTR Linux is stuffed
   full of them; specifically, we require stores to auto-propagate
   without help from barriers

 - some handoff that is CNA specific and I've not looked too hard at
   presently.

I think we should address that first one in LKMM; it seems very weird to
me that an RmW would break the chain like that.  Is there actual
hardware that doesn't behave?

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 16:23 ` Peter Zijlstra
@ 2022-08-26 17:10   ` Alan Stern
  2022-08-26 20:42     ` Paul E. McKenney
  2022-08-26 23:47     ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Alan Stern @ 2022-08-26 17:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, parri.andrea, will, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > I have not yet done more than glance at this one, but figured I should
> > send it along sooner rather than later.
> >
> > 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > 	https://arxiv.org/abs/2111.15240
> >
> > The claim is that the queued spinlocks implementation with CNA violates
> > LKMM but actually works on all architectures having a formal hardware
> > memory model.
> >
> > Thoughts?
>
> So the paper mentions the following defects:
>
>  - LKMM doesn't carry a release-acquire chain across a relaxed op

That's right, although I'm not so sure this should be considered a
defect...

>  - some babbling about a missing propagation -- ISTR Linux is stuffed
>    full of them; specifically, we require stores to auto-propagate
>    without help from barriers

Not a missing propagation; a late one.

Don't understand what you mean by "auto propagate without help from
barriers".

>  - some handoff that is CNA specific and I've not looked too hard at
>    presently.
>
> I think we should address that first one in LKMM, it seems very weird to
> me a RmW would break the chain like that.

An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).

If the authors wanted to keep the release-acquire chain intact, why not
use a cmpxchg version that has release semantics instead of going out of
their way to use a relaxed version?

To put it another way, RMW accesses and release-acquire accesses are
unrelated concepts.  You can have one without the other (in principle,
anyway).  So a relaxed RMW is just as capable of breaking a
release-acquire chain as any other relaxed operation is.

> Is there actual hardware that
> doesn't behave?

Not as far as I know, although that isn't very far.  Certainly an
other-multicopy-atomic architecture would make the litmus test succeed.
But the LKMM does not require other-multicopy-atomicity.

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 17:10 ` Alan Stern
@ 2022-08-26 20:42   ` Paul E. McKenney
  2022-08-27 16:00     ` Alan Stern
  2022-09-13 11:24     ` Will Deacon
  2022-08-26 23:47   ` Peter Zijlstra
  1 sibling, 2 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-26 20:42 UTC (permalink / raw)
To: Alan Stern
Cc: Peter Zijlstra, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > I have not yet done more than glance at this one, but figured I should
> > > send it along sooner rather than later.
> > >
> > > 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > > 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > > 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > > 	https://arxiv.org/abs/2111.15240
> > >
> > > The claim is that the queued spinlocks implementation with CNA violates
> > > LKMM but actually works on all architectures having a formal hardware
> > > memory model.
> > >
> > > Thoughts?
> >
> > So the paper mentions the following defects:
> >
> >  - LKMM doesn't carry a release-acquire chain across a relaxed op
>
> That's right, although I'm not so sure this should be considered a
> defect...
>
> >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> >    full of them; specifically, we require stores to auto-propagate
> >    without help from barriers
>
> Not a missing propagation; a late one.
>
> Don't understand what you mean by "auto propagate without help from
> barriers".
>
> >  - some handoff that is CNA specific and I've not looked too hard at
> >    presently.
> >
> > I think we should address that first one in LKMM, it seems very weird to
> > me a RmW would break the chain like that.
>
> An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
>
> If the authors wanted to keep the release-acquire chain intact, why not
> use a cmpxchg version that has release semantics instead of going out of
> their way to use a relaxed version?
>
> To put it another way, RMW accesses and release-acquire accesses are
> unrelated concepts.  You can have one without the other (in principle,
> anyway).  So a relaxed RMW is just as capable of breaking a
> release-acquire chain as any other relaxed operation is.
>
> > Is there actual hardware that
> > doesn't behave?
>
> Not as far as I know, although that isn't very far.  Certainly an
> other-multicopy-atomic architecture would make the litmus test succeed.
> But the LKMM does not require other-multicopy-atomicity.

My first attempt with ppcmem suggests that powerpc does -not- behave
this way.  But that surprises me, just on general principles.  Most
likely I blew the litmus test shown below.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

PPC MP+lwsyncs+atomic
"LwSyncdWW Rfe LwSyncdRR Fre"
Cycle=Rfe LwSyncdRR Fre LwSyncdWW
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r5=2;
2:r2=y; 2:r4=x;
}
 P0           | P1              | P2           ;
 li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
 stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
 lwsync       |                 | lwz r3,0(r4) ;
 li r3,1      |                 |              ;
 stw r3,0(r4) |                 |              ;
exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)

------------------------------------------------------------------------

$ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
...
Test MP+lwsyncs+atomic Allowed
States 9
1:r1=0; 2:r1=0; 2:r3=0;
1:r1=0; 2:r1=0; 2:r3=1;
1:r1=0; 2:r1=1; 2:r3=1;
1:r1=0; 2:r1=2; 2:r3=0;
1:r1=0; 2:r1=2; 2:r3=1;
1:r1=1; 2:r1=0; 2:r3=0;
1:r1=1; 2:r1=0; 2:r3=1;
1:r1=1; 2:r1=1; 2:r3=1;
1:r1=1; 2:r1=2; 2:r3=1;
No (allowed not found)
Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
Observation MP+lwsyncs+atomic Never 0 9

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 20:42 ` Paul E. McKenney
@ 2022-08-27 16:00   ` Alan Stern
  2022-08-27 17:04     ` Paul E. McKenney
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2022-08-27 16:00 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > I think we should address that first one in LKMM, it seems very weird to
> > > me a RmW would break the chain like that.
> >
> > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> >
> > If the authors wanted to keep the release-acquire chain intact, why not
> > use a cmpxchg version that has release semantics instead of going out of
> > their way to use a relaxed version?
> >
> > To put it another way, RMW accesses and release-acquire accesses are
> > unrelated concepts.  You can have one without the other (in principle,
> > anyway).  So a relaxed RMW is just as capable of breaking a
> > release-acquire chain as any other relaxed operation is.
> >
> > > Is there actual hardware that
> > > doesn't behave?
> >
> > Not as far as I know, although that isn't very far.  Certainly an
> > other-multicopy-atomic architecture would make the litmus test succeed.
> > But the LKMM does not require other-multicopy-atomicity.
>
> My first attempt with ppcmem suggests that powerpc does -not- behave
> this way.  But that surprises me, just on general principles.  Most
> likely I blew the litmus test shown below.
>
> Thoughts?

The litmus test looks okay.

As for your surprise, remember that PPC is B-cumulative, another
property which the LKMM does not require.  B-cumulativity will also
force the original litmus test to succeed.  (The situation is like ISA2
in the infamous test6.pdf, except that y and z are separate variables in
ISA2 but are the same here.  The RMW nature of lwarx/stwcx provides the
necessary R-W ordering in P1.)

Alan

> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> PPC MP+lwsyncs+atomic
> "LwSyncdWW Rfe LwSyncdRR Fre"
> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> {
> 0:r2=x; 0:r4=y;
> 1:r2=y; 1:r5=2;
> 2:r2=y; 2:r4=x;
> }
>  P0           | P1              | P2           ;
>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
>  lwsync       |                 | lwz r3,0(r4) ;
>  li r3,1      |                 |              ;
>  stw r3,0(r4) |                 |              ;
> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
>
> ------------------------------------------------------------------------
>
> $ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
> ...
> Test MP+lwsyncs+atomic Allowed
> States 9
> 1:r1=0; 2:r1=0; 2:r3=0;
> 1:r1=0; 2:r1=0; 2:r3=1;
> 1:r1=0; 2:r1=1; 2:r3=1;
> 1:r1=0; 2:r1=2; 2:r3=0;
> 1:r1=0; 2:r1=2; 2:r3=1;
> 1:r1=1; 2:r1=0; 2:r3=0;
> 1:r1=1; 2:r1=0; 2:r3=1;
> 1:r1=1; 2:r1=1; 2:r3=1;
> 1:r1=1; 2:r1=2; 2:r3=1;
> No (allowed not found)
> Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
> Observation MP+lwsyncs+atomic Never 0 9

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-27 16:00 ` Alan Stern
@ 2022-08-27 17:04   ` Paul E. McKenney
  0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-27 17:04 UTC (permalink / raw)
To: Alan Stern
Cc: Peter Zijlstra, parri.andrea, will, boqun.feng, npiggin, dhowells,
    j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel,
    linux-arch

On Sat, Aug 27, 2022 at 12:00:15PM -0400, Alan Stern wrote:
> On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > > I think we should address that first one in LKMM, it seems very weird to
> > > > me a RmW would break the chain like that.
> > >
> > > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> > >
> > > If the authors wanted to keep the release-acquire chain intact, why not
> > > use a cmpxchg version that has release semantics instead of going out of
> > > their way to use a relaxed version?
> > >
> > > To put it another way, RMW accesses and release-acquire accesses are
> > > unrelated concepts.  You can have one without the other (in principle,
> > > anyway).  So a relaxed RMW is just as capable of breaking a
> > > release-acquire chain as any other relaxed operation is.
> > >
> > > > Is there actual hardware that
> > > > doesn't behave?
> > >
> > > Not as far as I know, although that isn't very far.  Certainly an
> > > other-multicopy-atomic architecture would make the litmus test succeed.
> > > But the LKMM does not require other-multicopy-atomicity.
> >
> > My first attempt with ppcmem suggests that powerpc does -not- behave
> > this way.  But that surprises me, just on general principles.  Most
> > likely I blew the litmus test shown below.
> >
> > Thoughts?
>
> The litmus test looks okay.
>
> As for your surprise, remember that PPC is B-cumulative, another
> property which the LKMM does not require.  B-cumulativity will also
> force the original litmus test to succeed.  (The situation is like ISA2
> in the infamous test6.pdf, except that y and z are separate variables in
> ISA2 but are the same here.  The RMW nature of lwarx/stwcx provides
> the necessary R-W ordering in P1.)

Got it, thank you!

							Thanx, Paul

> > ------------------------------------------------------------------------
> >
> > PPC MP+lwsyncs+atomic
> > "LwSyncdWW Rfe LwSyncdRR Fre"
> > Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> > {
> > 0:r2=x; 0:r4=y;
> > 1:r2=y; 1:r5=2;
> > 2:r2=y; 2:r4=x;
> > }
> >  P0           | P1              | P2           ;
> >  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
> >  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
> >  lwsync       |                 | lwz r3,0(r4) ;
> >  li r3,1      |                 |              ;
> >  stw r3,0(r4) |                 |              ;
> > exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> >
> > ------------------------------------------------------------------------
> >
> > $ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
> > ...
> > Test MP+lwsyncs+atomic Allowed
> > States 9
> > 1:r1=0; 2:r1=0; 2:r3=0;
> > 1:r1=0; 2:r1=0; 2:r3=1;
> > 1:r1=0; 2:r1=1; 2:r3=1;
> > 1:r1=0; 2:r1=2; 2:r3=0;
> > 1:r1=0; 2:r1=2; 2:r3=1;
> > 1:r1=1; 2:r1=0; 2:r3=0;
> > 1:r1=1; 2:r1=0; 2:r3=1;
> > 1:r1=1; 2:r1=1; 2:r3=1;
> > 1:r1=1; 2:r1=2; 2:r3=1;
> > No (allowed not found)
> > Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> > Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
> > Observation MP+lwsyncs+atomic Never 0 9

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 20:42 ` Paul E. McKenney
  2022-08-27 16:00 ` Alan Stern
@ 2022-09-13 11:24 ` Will Deacon
  2022-09-13 12:21   ` Dan Lustig
  1 sibling, 1 reply; 28+ messages in thread
From: Will Deacon @ 2022-09-13 11:24 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Alan Stern, Peter Zijlstra, parri.andrea, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > I have not yet done more than glance at this one, but figured I should
> > > > send it along sooner rather than later.
> > > >
> > > > 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > > > 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > > > 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > > > 	https://arxiv.org/abs/2111.15240
> > > >
> > > > The claim is that the queued spinlocks implementation with CNA violates
> > > > LKMM but actually works on all architectures having a formal hardware
> > > > memory model.
> > > >
> > > > Thoughts?
> > >
> > > So the paper mentions the following defects:
> > >
> > >  - LKMM doesn't carry a release-acquire chain across a relaxed op
> >
> > That's right, although I'm not so sure this should be considered a
> > defect...
> >
> > >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> > >    full of them; specifically, we require stores to auto-propagate
> > >    without help from barriers
> >
> > Not a missing propagation; a late one.
> >
> > Don't understand what you mean by "auto propagate without help from
> > barriers".
> >
> > >  - some handoff that is CNA specific and I've not looked too hard at
> > >    presently.
> > >
> > > I think we should address that first one in LKMM, it seems very weird to
> > > me a RmW would break the chain like that.
> >
> > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> >
> > If the authors wanted to keep the release-acquire chain intact, why not
> > use a cmpxchg version that has release semantics instead of going out of
> > their way to use a relaxed version?
> >
> > To put it another way, RMW accesses and release-acquire accesses are
> > unrelated concepts.  You can have one without the other (in principle,
> > anyway).  So a relaxed RMW is just as capable of breaking a
> > release-acquire chain as any other relaxed operation is.
> >
> > > Is there actual hardware that
> > > doesn't behave?
> >
> > Not as far as I know, although that isn't very far.  Certainly an
> > other-multicopy-atomic architecture would make the litmus test succeed.
> > But the LKMM does not require other-multicopy-atomicity.
>
> My first attempt with ppcmem suggests that powerpc does -not- behave
> this way.  But that surprises me, just on general principles.  Most
> likely I blew the litmus test shown below.
>
> Thoughts?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> PPC MP+lwsyncs+atomic
> "LwSyncdWW Rfe LwSyncdRR Fre"
> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> {
> 0:r2=x; 0:r4=y;
> 1:r2=y; 1:r5=2;
> 2:r2=y; 2:r4=x;
> }
>  P0           | P1              | P2           ;
>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
>  lwsync       |                 | lwz r3,0(r4) ;
>  li r3,1      |                 |              ;
>  stw r3,0(r4) |                 |              ;
> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)

Just catching up on this, but one possible gotcha here is if you have an
architecture with native load-acquire on P2 and then you move P2 to the
end of P1, e.g. at a high-level:

	P0			P1
	Wx = 1			RmW(y)	// xchg() 1 => 2
	WyRel = 1		RyAcq = 2
				Rx = 0

arm64 forbids this, but it's not "natural" to the hardware and I don't
know what e.g. risc-v would say about it.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-13 11:24 ` Will Deacon
@ 2022-09-13 12:21   ` Dan Lustig
  2022-09-16  8:18     ` Will Deacon
  0 siblings, 1 reply; 28+ messages in thread
From: Dan Lustig @ 2022-09-13 12:21 UTC (permalink / raw)
To: Will Deacon, Paul E. McKenney
Cc: Alan Stern, Peter Zijlstra, parri.andrea, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, joel, linux-kernel,
    linux-arch

On 9/13/2022 7:24 AM, Will Deacon wrote:
> On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
>> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
>>> On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
>>>> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
>>>>> Hello!
>>>>>
>>>>> I have not yet done more than glance at this one, but figured I should
>>>>> send it along sooner rather than later.
>>>>>
>>>>> 	"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
>>>>> 	Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
>>>>> 	Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
>>>>> 	https://arxiv.org/abs/2111.15240
>>>>>
>>>>> The claim is that the queued spinlocks implementation with CNA violates
>>>>> LKMM but actually works on all architectures having a formal hardware
>>>>> memory model.
>>>>>
>>>>> Thoughts?
>>>>
>>>> So the paper mentions the following defects:
>>>>
>>>>  - LKMM doesn't carry a release-acquire chain across a relaxed op
>>>
>>> That's right, although I'm not so sure this should be considered a
>>> defect...
>>>
>>>>  - some babbling about a missing propagation -- ISTR Linux is stuffed
>>>>    full of them; specifically, we require stores to auto-propagate
>>>>    without help from barriers
>>>
>>> Not a missing propagation; a late one.
>>>
>>> Don't understand what you mean by "auto propagate without help from
>>> barriers".
>>>
>>>>  - some handoff that is CNA specific and I've not looked too hard at
>>>>    presently.
>>>>
>>>> I think we should address that first one in LKMM, it seems very weird to
>>>> me a RmW would break the chain like that.
>>>
>>> An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
>>>
>>> If the authors wanted to keep the release-acquire chain intact, why not
>>> use a cmpxchg version that has release semantics instead of going out of
>>> their way to use a relaxed version?
>>>
>>> To put it another way, RMW accesses and release-acquire accesses are
>>> unrelated concepts.  You can have one without the other (in principle,
>>> anyway).  So a relaxed RMW is just as capable of breaking a
>>> release-acquire chain as any other relaxed operation is.
>>>
>>>> Is there actual hardware that
>>>> doesn't behave?
>>>
>>> Not as far as I know, although that isn't very far.  Certainly an
>>> other-multicopy-atomic architecture would make the litmus test succeed.
>>> But the LKMM does not require other-multicopy-atomicity.
>>
>> My first attempt with ppcmem suggests that powerpc does -not- behave
>> this way.  But that surprises me, just on general principles.  Most
>> likely I blew the litmus test shown below.
>>
>> Thoughts?
>>
>> 							Thanx, Paul
>>
>> ------------------------------------------------------------------------
>>
>> PPC MP+lwsyncs+atomic
>> "LwSyncdWW Rfe LwSyncdRR Fre"
>> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
>> {
>> 0:r2=x; 0:r4=y;
>> 1:r2=y; 1:r5=2;
>> 2:r2=y; 2:r4=x;
>> }
>>  P0           | P1              | P2           ;
>>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
>>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
>>  lwsync       |                 | lwz r3,0(r4) ;
>>  li r3,1      |                 |              ;
>>  stw r3,0(r4) |                 |              ;
>> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
>
> Just catching up on this, but one possible gotcha here is if you have an
> architecture with native load-acquire on P2 and then you move P2 to the
> end of P1, e.g. at a high-level:
>
> 	P0			P1
> 	Wx = 1			RmW(y)	// xchg() 1 => 2
> 	WyRel = 1		RyAcq = 2
> 				Rx = 0
>
> arm64 forbids this, but it's not "natural" to the hardware and I don't
> know what e.g. risc-v would say about it.
>
> Will

RISC-V doesn't currently have native load-acquire instructions other
than load-reserve-acquire, but if it did, it would forbid this outcome
as well.

To the broader question, RISC-V is other-multi-copy-atomic, so questions
about propagation order and B-cumulativity and so on aren't generally
problematic, just as they generally aren't an issue for ARMv8.

Dan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-13 12:21 ` Dan Lustig
@ 2022-09-16  8:18   ` Will Deacon
  0 siblings, 0 replies; 28+ messages in thread
From: Will Deacon @ 2022-09-16 8:18 UTC (permalink / raw)
To: Dan Lustig
Cc: Paul E. McKenney, Alan Stern, Peter Zijlstra, parri.andrea,
    boqun.feng, npiggin, dhowells, j.alglave, luc.maranget, akiyks,
    joel, linux-kernel, linux-arch

On Tue, Sep 13, 2022 at 08:21:02AM -0400, Dan Lustig wrote:
> On 9/13/2022 7:24 AM, Will Deacon wrote:
> > On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> >> PPC MP+lwsyncs+atomic
> >> "LwSyncdWW Rfe LwSyncdRR Fre"
> >> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> >> {
> >> 0:r2=x; 0:r4=y;
> >> 1:r2=y; 1:r5=2;
> >> 2:r2=y; 2:r4=x;
> >> }
> >>  P0           | P1              | P2           ;
> >>  li r1,1      | lwarx r1,r0,r2  | lwz r1,0(r2) ;
> >>  stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync       ;
> >>  lwsync       |                 | lwz r3,0(r4) ;
> >>  li r3,1      |                 |              ;
> >>  stw r3,0(r4) |                 |              ;
> >> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> >
> > Just catching up on this, but one possible gotcha here is if you have an
> > architecture with native load-acquire on P2 and then you move P2 to the
> > end of P1, e.g. at a high-level:
> >
> > 	P0			P1
> > 	Wx = 1			RmW(y)	// xchg() 1 => 2
> > 	WyRel = 1		RyAcq = 2
> > 				Rx = 0
> >
> > arm64 forbids this, but it's not "natural" to the hardware and I don't
> > know what e.g. risc-v would say about it.
>
> RISC-V doesn't currently have native load-acquire instructions other
> than load-reserve-acquire, but if it did, it would forbid this outcome
> as well.

Thanks for chipping in, Dan.  Somehow, I hadn't realised that you didn't
have native load-acquire instructions.

Will

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 17:10 ` Alan Stern
  2022-08-26 20:42 ` Paul E. McKenney
@ 2022-08-26 23:47 ` Peter Zijlstra
  2022-08-27 16:05   ` Alan Stern
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2022-08-26 23:47 UTC (permalink / raw)
To: Alan Stern
Cc: Paul E. McKenney, parri.andrea, will, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> >    full of them; specifically, we require stores to auto-propagate
> >    without help from barriers
>
> Not a missing propagation; a late one.
>
> Don't understand what you mean by "auto propagate without help from
> barriers".

Linux hard relies on:

	CPU0			CPU1

	WRITE_ONCE(foo, 1);	while (!READ_ONCE(foo));

making forward progress.

There were a few 'funny' uarchs that were broken, see for example commit
a30718868915f.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 23:47 ` Peter Zijlstra
@ 2022-08-27 16:05   ` Alan Stern
  2022-08-27 16:44     ` Boqun Feng
  0 siblings, 1 reply; 28+ messages in thread
From: Alan Stern @ 2022-08-27 16:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul E. McKenney, parri.andrea, will, boqun.feng, npiggin,
    dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel,
    linux-kernel, linux-arch

On Sat, Aug 27, 2022 at 01:47:48AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > >  - some babbling about a missing propagation -- ISTR Linux is stuffed
> > >    full of them; specifically, we require stores to auto-propagate
> > >    without help from barriers
> >
> > Not a missing propagation; a late one.
> >
> > Don't understand what you mean by "auto propagate without help from
> > barriers".
>
> Linux hard relies on:
>
> 	CPU0			CPU1
>
> 	WRITE_ONCE(foo, 1);	while (!READ_ONCE(foo));
>
> making forward progress.

Indeed yes.  As far as I can tell, this requirement is not explicitly
mentioned in the LKMM, although it certainly is implicit.  I can't even
think of a way to express it in a form Herd could verify.

> There were a few 'funny' uarchs that were broken, see for example commit
> a30718868915f.

Ha!  That commit should be a lesson in something, although I'm not sure
what.  :-)

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-08-27 16:05 ` Alan Stern @ 2022-08-27 16:44 ` Boqun Feng 2022-08-29 2:15 ` Andrea Parri 2022-09-09 11:45 ` Jonas Oberhauser 0 siblings, 2 replies; 28+ messages in thread From: Boqun Feng @ 2022-08-27 16:44 UTC (permalink / raw) To: Alan Stern Cc: Peter Zijlstra, Paul E. McKenney, parri.andrea, will, npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch On Sat, Aug 27, 2022 at 12:05:33PM -0400, Alan Stern wrote: > On Sat, Aug 27, 2022 at 01:47:48AM +0200, Peter Zijlstra wrote: > > On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote: > > > > > > - some babbling about a missing propagation -- ISTR Linux if stuffed > > > > full of them, specifically we require stores to auto propagate > > > > without help from barriers > > > > > > Not a missing propagation; a late one. > > > > > > Don't understand what you mean by "auto propagate without help from > > > barriers". > > > > Linux hard relies on: > > > > CPU0 CPU1 > > > > WRITE_ONCE(foo, 1); while (!READ_ONCE(foo)); > > > > making forward progress. > > Indeed yes. As far as I can tell, this requirement is not explicitly > mentioned in the LKMM, although it certainly is implicit. I can't even > think of a way to express it in a form Herd could verify. > FWIW, C++ defines this as (in https://eel.is/c++draft/atomics#order-11): Implementations should make atomic stores visible to atomic loads within a reasonable amount of time. in other words: if one thread does an atomic store, then all other threads must see that store eventually. (from: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/Rust.20forward.20progress.20guarantees/near/294702950) Should we add something somewhere in our model, maybe in the explanation.txt? 
Plus, I think we cannot express this in Herd, because Herd uses a graph-based
(axiomatic) model rather than an operational model: an axiomatic model cannot
describe "something will eventually happen". There was also some discussion in
the Zulip stream of the Rust unsafe-code-guidelines group.

Regards,
Boqun

> > There were a few 'funny' uarchs that were broken, see for example commit
> > a30718868915f.
>
> Ha! That commit should be a lesson in something, although I'm not sure
> what. :-)
>
> Alan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-08-27 16:44 ` Boqun Feng @ 2022-08-29 2:15 ` Andrea Parri 2022-09-09 11:45 ` Jonas Oberhauser 1 sibling, 0 replies; 28+ messages in thread From: Andrea Parri @ 2022-08-29 2:15 UTC (permalink / raw) To: Boqun Feng Cc: Alan Stern, Peter Zijlstra, Paul E. McKenney, will, npiggin, dhowells, j.alglave, luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch > FWIW, C++ defines this as (in https://eel.is/c++draft/atomics#order-11): > > Implementations should make atomic stores visible to atomic > loads within a reasonable amount of time. > > in other words: > > if one thread does an atomic store, then all other threads must see that > store eventually. > > (from: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/Rust.20forward.20progress.20guarantees/near/294702950) > > Should we add something somewhere in our model, maybe in the > explanation.txt? FYI, that's briefly mentioned in Section 11, "CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe": "sequential consistency per variable and cache coherence mean the same thing except that cache coherence includes an extra requirement that every store eventually becomes visible to every CPU" Andrea > Plus, I think we cannot express this in Herd because Herd uses > graph-based model (axiomatic model) instead of an operational model to > describe the model: axiomatic model cannot describe "something will > eventually happen". There was also some discussion in the zulip steam > of Rust unsafe-code-guidelines. ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-08-27 16:44 ` Boqun Feng 2022-08-29 2:15 ` Andrea Parri @ 2022-09-09 11:45 ` Jonas Oberhauser 2022-09-10 12:11 ` Hernan Luis Ponce de Leon 1 sibling, 1 reply; 28+ messages in thread From: Jonas Oberhauser @ 2022-09-09 11:45 UTC (permalink / raw) To: Boqun Feng, Alan Stern Cc: Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Hi Boqun, hi everyone, > > > > - some babbling about a missing propagation -- ISTR Linux if stuffed > > > > full of them, specifically we require stores to auto propagate > > > > without help from barriers > > > > > > Not a missing propagation; a late one. > > > > > > Don't understand what you mean by "auto propagate without help > > > from barriers". after a quick look at the claim in question, it seems to be about a different kind of issue. The problem alluded to seems not to be that the store doesn't propagate in finite time to the other core. The tools used by the authors of that work already assume every (marked) access propagates given enough time. What they mean seems to be that a prop relation followed only by wmb (not mb) doesn't enforce the order of some writes to the same location, leading to the claimed hang in qspinlock (at least as far as LKMM is concerned). Without having looked at the code or trying to understand the bug in more detail, this may be similar to the bug fixed by Ingo in qspinlock in 2018 where the initialization of the locked bit in the node could overwrite the setting of the locked bit in the other thread, unless a release barrier/wmb is added at some point. That being said: > Should we add something somewhere in our model, maybe in the explanation.txt? 
> Plus, I think we cannot express this in Herd because Herd uses graph-based model (axiomatic model) instead of an operational model to describe the model: axiomatic model cannot describe "something will eventually happen". There was also some discussion in the zulip steam of Rust unsafe-code-guidelines. It is possible to express this kind of progress property in a graph based model. Some work on this was done in the RISC-V manual, as well as for some stronger graph based models ("making weak memory models fair", https://dl.acm.org/doi/10.1145/3485475 ) and in a more practical way for verification of liveness ( https://dl.acm.org/doi/abs/10.1145/3445814.3446748 ). From a theoretical perspective, the suggestion made in "making weak memory models fair" would be to define prefix-finiteness of all the explicitly acyclic relations of LKMM, which includes the sc-per-loc relation and thus in particular fr. If fr is prefix-finite, meaning each store has only finitely many reads observing an older value, the infinite behaviour in the small example posted elsewhere in this thread becomes forbidden. From an engineering perspective, I think the only issue is that cat *currently* does not have any syntax for this, nor does herd currently implement the await model checking techniques proposed in those works (c.f. Theorem 5.3. in the "making weak memory models fair" paper, which says that for this kind of loop, iff the mo-maximal reads in some graph are read in a loop iteration that does not exit the loop, the loop can run forever). However GenMC and I believe also Dat3M and recently also Nidhugg support such techniques. It may not even be too much effort to implement something like this in herd if desired. Best regards, jonas ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-09 11:45 ` Jonas Oberhauser @ 2022-09-10 12:11 ` Hernan Luis Ponce de Leon 2022-09-10 15:03 ` Alan Stern 0 siblings, 1 reply; 28+ messages in thread From: Hernan Luis Ponce de Leon @ 2022-09-10 12:11 UTC (permalink / raw) To: Jonas Oberhauser, Boqun Feng, Alan Stern Cc: Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org What they mean seems to be that a prop relation followed only by wmb (not mb) doesn't enforce the order of some writes to the same location, leading to the claimed hang in qspinlock (at least as far as LKMM is concerned). What we mean is that wmb does not give the same propagation properties as mb. The claim is based on these relations from the memory model let strong-fence = mb | gp ... let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb | po-unlock-lock-po) ; [Marked] let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; [Marked] ; rfe? ; [Marked] From an engineering perspective, I think the only issue is that cat *currently* does not have any syntax for this, nor does herd currently implement the await model checking techniques proposed in those works (c.f. Theorem 5.3. in the "making weak memory models fair" paper, which says that for this kind of loop, iff the mo-maximal reads in some graph are read in a loop iteration that does not exit the loop, the loop can run forever). However GenMC and I believe also Dat3M and recently also Nidhugg support such techniques. It may not even be too much effort to implement something like this in herd if desired. The Dartagnan model checker uses the Theorem 5.3 from above to detect liveness violations. 
We did not try to come up with a litmus test about the behavior because herd7 cannot reason about liveness. However, if anybody is interested, the violating execution is shown here https://github.com/huawei-drc/cna-verification/blob/master/verification-output/BUG1.png Hernan ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-10 12:11 ` Hernan Luis Ponce de Leon @ 2022-09-10 15:03 ` Alan Stern 2022-09-10 20:41 ` Hernan Luis Ponce de Leon 2022-09-12 10:46 ` Jonas Oberhauser 0 siblings, 2 replies; 28+ messages in thread From: Alan Stern @ 2022-09-10 15:03 UTC (permalink / raw) To: Hernan Luis Ponce de Leon Cc: Jonas Oberhauser, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org On Sat, Sep 10, 2022 at 12:11:36PM +0000, Hernan Luis Ponce de Leon wrote: > > What they mean seems to be that a prop relation followed only by wmb > (not mb) doesn't enforce the order of some writes to the same > location, leading to the claimed hang in qspinlock (at least as far as > LKMM is concerned). You were quoting Jonas here, right? The email doesn't make this obvious because it doesn't have two levels of "> > " markings. > What we mean is that wmb does not give the same propagation properties as mb. In general, _no_ two distinct relations in the LKMM have the same propagation properties. If wmb always behaved the same way as mb, we wouldn't use two separate words for them. > The claim is based on these relations from the memory model > > let strong-fence = mb | gp > ... > let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb | > po-unlock-lock-po) ; [Marked] > let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; > [Marked] ; rfe? ; [Marked] Please be more specific. What difference between mb and wmb are you concerned about? Can you give a small litmus test that illustrates this difference? Can you explain in more detail how this difference affects the qspinlock implementation? 
> From an engineering perspective, I think the only issue is that cat
> *currently* does not have any syntax for this,

Syntax for what? The difference between wmb and mb?

> nor does herd currently
> implement the await model checking techniques proposed in those works
> (c.f. Theorem 5.3. in the "making weak memory models fair" paper,
> which says that for this kind of loop, iff the mo-maximal reads in
> some graph are read in a loop iteration that does not exit the loop,
> the loop can run forever). However GenMC and I believe also Dat3M and
> recently also Nidhugg support such techniques. It may not even be too
> much effort to implement something like this in herd if desired.

I believe that herd has no way to express the idea of a program running
forever. On the other hand, it's certainly true (in all of these models)
that for any finite number N, there is a feasible execution in which a
loop runs for more than N iterations before the termination condition
eventually becomes true.

Alan

> The Dartagnan model checker uses the Theorem 5.3 from above to detect
> liveness violations.
>
> We did not try to come up with a litmus test about the behavior
> because herd7 cannot reason about liveness.
> However, if anybody is interested, the violating execution is shown here
> https://github.com/huawei-drc/cna-verification/blob/master/verification-output/BUG1.png
>
> Hernan

^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-10 15:03 ` Alan Stern @ 2022-09-10 20:41 ` Hernan Luis Ponce de Leon 2022-09-11 10:20 ` Joel Fernandes 2022-09-11 14:53 ` Andrea Parri 2022-09-12 10:46 ` Jonas Oberhauser 1 sibling, 2 replies; 28+ messages in thread From: Hernan Luis Ponce de Leon @ 2022-09-10 20:41 UTC (permalink / raw) To: Alan Stern Cc: Jonas Oberhauser, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org > You were quoting Jonas here, right? The email doesn't make this obvious > because it doesn't have two levels of "> > " markings. Yes, I was quoting Jonas. It seems my mail client did not format the email correctly and I did not notice. Sorry for that. > In general, _no_ two distinct relations in the LKMM have the same propagation > properties. If wmb always behaved the same way as mb, we wouldn't use two > separate words for them. I understand that relations with different names are intended to be different. What I meant was "wmb gives weaker propagation guarantees than mb and because of this, liveness of qspinlock is not guaranteed in LKMM" > > > The claim is based on these relations from the memory model > > > > let strong-fence = mb | gp > > ... > > let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb | > > po-unlock-lock-po) ; [Marked] > > let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; > > [Marked] ; rfe? ; [Marked] > > Please be more specific. What difference between mb and wmb are you > concerned about? Since the code uses wmb, there are certain pairs of events that will not be in strong-fence. 
Since strong-fence contributes to cumul-fence, cumul-fence to prop, and prop
to hb, fewer pairs of events are related by hb than if the code used mb
instead. Because of this, there are executions (in particular the one that
violates liveness) that have an acyclic hb relation, but would create a cycle
(and thus the memory model would not accept the behavior) if mb had been used.

> Can you give a small litmus test that illustrates this
> difference? Can you explain in more detail how this difference affects the
> qspinlock implementation?

Here is a litmus test showing the problem (I hope the comments are enough to
relate it back to qspinlock):

C Liveness
{
	atomic_t x = ATOMIC_INIT(0);
	int z = 0;
}

P0(atomic_t *x) {
	// clear_pending_set_locked
	int r0 = atomic_fetch_add(2, x);
}

P1(atomic_t *x, int *z) {
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialization of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
	// link node into the waitqueue
	WRITE_ONCE(*z, 1);
}

P2(atomic_t *x, int *z) {
	// node initialization
	WRITE_ONCE(*z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialization of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
}

exists (0:r0=24 /\ 1:r0=26 /\ z=2)

herd7 says that the behavior is observable. However, if you change wmb to mb,
it is not observable anymore.

>
> > From an engineering perspective, I think the only issue is that cat
> > *currently* does not have any syntax for this,
>
> Syntax for what? The difference between wmb and mb?
>
> > nor does herd currently
> > implement the await model checking techniques proposed in those works
> > (c.f. Theorem 5.3. in the "making weak memory models fair" paper,
> > which says that for this kind of loop, iff the mo-maximal reads in
> > some graph are read in a loop iteration that does not exit the loop,
> > the loop can run forever). However GenMC and I believe also Dat3M and
> > recently also Nidhugg support such techniques. It may not even be too
> > much effort to implement something like this in herd if desired.
>
> I believe that herd has no way to express the idea of a program running forever.
> On the other hand, it's certainly true (in all of these
> models) than for any finite number N, there is a feasible execution in which a
> loop runs for more than N iterations before the termination condition eventually
> becomes true.

Here I was again quoting Jonas. I think his intention was to make a
distinction between graph-based semantics and tools. While herd7 cannot
reason about this kind of property, it is possible to define the property
using graph-based semantics, and there are tools already using this.

Hernan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-10 20:41 ` Hernan Luis Ponce de Leon @ 2022-09-11 10:20 ` Joel Fernandes 2022-09-12 10:13 ` Jonas Oberhauser 2022-09-11 14:53 ` Andrea Parri 1 sibling, 1 reply; 28+ messages in thread From: Joel Fernandes @ 2022-09-11 10:20 UTC (permalink / raw) To: Hernan Luis Ponce de Leon Cc: Alan Stern, Jonas Oberhauser, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org On Sat, Sep 10, 2022 at 4:41 PM Hernan Luis Ponce de Leon <hernanl.leon@huawei.com> wrote: > > > You were quoting Jonas here, right? The email doesn't make this obvious > > because it doesn't have two levels of "> > " markings. > > Yes, I was quoting Jonas. > It seems my mail client did not format the email correctly and I did not notice. > Sorry for that. > > > In general, _no_ two distinct relations in the LKMM have the same propagation > > properties. If wmb always behaved the same way as mb, we wouldn't use two > > separate words for them. > > I understand that relations with different names are intended to be different. > What I meant was > "wmb gives weaker propagation guarantees than mb and because of this, liveness of qspinlock is not guaranteed in LKMM" > I wonder if this sort of liveness guarantee (or lack thereof) is really a problem in practice, where writes will eventually propagate even though they may not for a bit. Is it possible to write a liveness test case on any hardware, or is this more in the realms of theory? Either way, quite intriguing! Thanks, - Joel ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-11 10:20 ` Joel Fernandes
@ 2022-09-12 10:13 ` Jonas Oberhauser
  2022-09-12 11:10 ` Hernan Luis Ponce de Leon
  2022-09-12 12:01 ` Alan Stern
  0 siblings, 2 replies; 28+ messages in thread
From: Jonas Oberhauser @ 2022-09-12 10:13 UTC (permalink / raw)
To: Joel Fernandes, Hernan Luis Ponce de Leon
Cc: Alan Stern, Boqun Feng, Peter Zijlstra, Paul E. McKenney,
    parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com,
    dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr,
    akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org

Hi Joel,

> I wonder if this sort of liveness guarantee (or lack thereof) is really a
> problem in practice, where writes will eventually propagate even though
> they may not for a bit. Is it possible to write a liveness test case on
> any hardware, or is this more in the realms of theory? Either way, quite
> intriguing!

As I tried to explain before, this problem has nothing to do with stores
propagating within a given time to another core. Rather, it is due to two
stores to the same location happening in a surprising order. I.e., both
stores propagate quickly to other cores, but in a surprising coherence
order. And if a wmb in the code is replaced by an mb, then this co will
create a pb cycle and become forbidden.

Therefore this hang should be observable on a hypothetical LKMM processor
which makes use of all the relaxed liberty the LKMM allows. However,
according to the authors of that paper (who are my colleagues, but I haven't
been involved deeply in that work), not even Power+gcc allows this reordering
to happen, and if that's true it is probably because the wmb is mapped to
lwsync, which is fully cumulative on Power but not in LKMM.

Best wishes and hope this clears it up,
jonas

^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-12 10:13 ` Jonas Oberhauser @ 2022-09-12 11:10 ` Hernan Luis Ponce de Leon 2022-09-14 14:41 ` Alan Stern 2022-09-12 12:01 ` Alan Stern 1 sibling, 1 reply; 28+ messages in thread From: Hernan Luis Ponce de Leon @ 2022-09-12 11:10 UTC (permalink / raw) To: Jonas Oberhauser, Joel Fernandes Cc: Alan Stern, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org > Therefore this hang should be observable on a hypothetical LKMM processor > which makes use of all the relaxed liberty the LKMM allows. However according > to the authors of that paper (who are my colleagues but I haven't been involved > deeply in that work), not even Power+gcc allow this reordering to happen, and if > that's true it is probably because the wmb is mapped to lwsync which is fully > cumulative in Power but not in LKMM. All the "issues" we mention in the technical report are according to LKMM. As shown by (*) below, as soon as the code gets compiled and verified against the corresponding hardware memory model, the code is correct. Here is a small variant of the litmus test I sent earlier where not only the "problematic behavior" is allowed by LKMM, but where liveness is actually violated. The code is written in C (main function and headers missing) and cannot be used directly with herd7 (since I am not sure if the end of thread_3 can be written using herd7 syntax). 
------------------------------------------------------------------------
int y, z;
atomic_t x;

void *thread_1(void *unused)
{
	// clear_pending_set_locked
	int r0 = atomic_fetch_add(2, &x);
}

void *thread_2(void *unused)
{
	// this store breaks liveness
	WRITE_ONCE(y, 1);
	// queued_spin_trylock
	int r0 = atomic_read(&x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(&x, r0, 42);
	// link node into the waitqueue
	WRITE_ONCE(z, 1);
}

void *thread_3(void *unused)
{
	// node initialisation
	WRITE_ONCE(z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(&x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// if we read z==2 we expect to read this store
	WRITE_ONCE(y, 0);
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(&x, r0, 24);
	// spinloop
	while (READ_ONCE(y) == 1 && (READ_ONCE(z) == 2)) {}
}
------------------------------------------------------------------------

Liveness is violated (following Theorem 5.3 of the "Making weak memory
models fair" paper) because the reads from the spinloop can get their values
from writes which come last in the coherence / modification order, and those
values do not stop the spinning.

------------------------------------------------------------------------
$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/linux-kernel.cat --target=lkmm --property=liveness liveness.c
...
Liveness violation found
FAIL
------------------------------------------------------------------------

(*) However, if the code is compiled (this transformation is done
automatically and internally by the tool, notice the --target option) and we
use some hardware memory model, the tool says the code is correct

------------------------------------------------------------------------
$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/aarch64.cat --target=arm8 --property=liveness liveness.c
...
PASS

$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/power.cat --target=power --property=liveness liveness.c
...
PASS

$ java -jar $DAT3M_HOME/dartagnan/target/dartagnan-3.1.0.jar cat/riscv.cat --target=riscv --property=liveness liveness.c
...
PASS
------------------------------------------------------------------------

I think it is somehow possible to show the liveness violation using herd7
and the following variant of the code

------------------------------------------------------------------------
C Liveness
{
	atomic_t x = ATOMIC_INIT(0);
	int y = 0;
	int z = 0;
}

P0(atomic_t *x) {
	// clear_pending_set_locked
	int r0 = atomic_fetch_add(2, x);
}

P1(atomic_t *x, int *z, int *y) {
	// this store breaks liveness
	WRITE_ONCE(*y, 1);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
	// link node into the waitqueue
	WRITE_ONCE(*z, 1);
}

P2(atomic_t *x, int *z, int *y) {
	// node initialisation
	WRITE_ONCE(*z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// barrier after the initialisation of nodes
	smp_wmb();
	// if we read z==2 we expect to read this store
	WRITE_ONCE(*y, 0);
	// xchg_tail
	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
	// spinloop
	int r2 = READ_ONCE(*y);
	int r3 = READ_ONCE(*z);
}

exists (z=2 /\ y=1 /\ 2:r2=1 /\ 2:r3=2)
------------------------------------------------------------------------

Condition "2:r3=2" forces the spinloop to read from the first write in P2,
and "z=2" forces this write to be last in the coherence order. Conditions
"2:r2=1" and "y=1" force the same for the other read. herd7 says this
behavior is allowed by LKMM, showing that liveness can be violated.

In all the examples above, if we use mb() instead of wmb(), LKMM does not
accept the behavior and thus liveness is guaranteed.

Hernan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" 2022-09-12 11:10 ` Hernan Luis Ponce de Leon @ 2022-09-14 14:41 ` Alan Stern 0 siblings, 0 replies; 28+ messages in thread From: Alan Stern @ 2022-09-14 14:41 UTC (permalink / raw) To: Hernan Luis Ponce de Leon Cc: Jonas Oberhauser, Joel Fernandes, Boqun Feng, Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org On Mon, Sep 12, 2022 at 11:10:17AM +0000, Hernan Luis Ponce de Leon wrote: > I think it is somehow possible to show the liveness violation using herd7 and the following variant of the code > > ------------------------------------------------------------------------ > C Liveness > { > atomic_t x = ATOMIC_INIT(0); > atomic_t y = ATOMIC_INIT(0); > } > > > P0(atomic_t *x) { > // clear_pending_set_locked > int r0 = atomic_fetch_add(2,x) ; > } > > P1(atomic_t *x, int *z, int *y) { > // this store breaks liveness > WRITE_ONCE(*y, 1); > // queued_spin_trylock > int r0 = atomic_read(x); > // barrier after the initialisation of nodes > smp_wmb(); > // xchg_tail > int r1 = atomic_cmpxchg_relaxed(x,r0,42); > // link node into the waitqueue > WRITE_ONCE(*z, 1); > } > > P2(atomic_t *x,int *z, int *y) { > // node initialisation > WRITE_ONCE(*z, 2); > // queued_spin_trylock > int r0 = atomic_read(x); > // barrier after the initialisation of nodes > smp_wmb(); > // if we read z==2 we expect to read this store > WRITE_ONCE(*y, 0); > // xchg_tail > int r1 = atomic_cmpxchg_relaxed(x,r0,24); > // spinloop > int r2 = READ_ONCE(*y); > int r3 = READ_ONCE(*z); > } > > exists (z=2 /\ y=1 /\ 2:r2=1 /\ 2:r3=2) > ------------------------------------------------------------------------ > > Condition "2:r3=2" forces the spinloop to read from the first write in P2 and "z=2" forces this write > to be last in the 
coherence order. Conditions "2:r2=1" and "y=1" force the same for the other read.
> herd7 says this behavior is allowed by LKMM, showing that liveness can be violated.
>
> In all the examples above, if we use mb() instead of wmb(), LKMM does not accept
> the behavior and thus liveness is guaranteed.

That's right. The issue may be somewhat confused by the fact that there have
been _two_ separate problems covered in this discussion. One has to do with
the use of relaxed vs. non-relaxed atomic accesses, and the other -- this
one -- has to do with liveness (eventual termination of a spin loop).

You can see the distinction quite clearly by noticing that, in the litmus
test above, the variable x plays absolutely no role. There are no
dependencies from it, it isn't accessed by any instructions that include an
acquire or release memory barrier, and it isn't used in the "exists" clause.

If we remove x from the test (and remove P0, which is now vacuous), and we
also remove the unneeded reads at the end of P2 (unneeded because they
observe the co-final values stored in y and z), what remains is:

P1(int *z, int *y)
{
	WRITE_ONCE(*y, 1);
	smp_wmb();
	WRITE_ONCE(*z, 1);
}

P2(int *z, int *y)
{
	WRITE_ONCE(*z, 2);
	smp_wmb();
	WRITE_ONCE(*y, 0);
}

exists (z=2 /\ y=1)

which is exactly the 2+2W pattern. As others have pointed out, this pattern
is permitted by LKMM because we never found a good reason to forbid it, even
though it cannot occur on any real hardware that supports Linux.

On the other hand, the simplicity of this "liveness" test leads me to wonder
if it isn't missing some crucial feature of the actual qspinlock
implementation.

Alan

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-12 10:13 ` Jonas Oberhauser
  2022-09-12 11:10 ` Hernan Luis Ponce de Leon
@ 2022-09-12 12:01 ` Alan Stern
  1 sibling, 0 replies; 28+ messages in thread
From: Alan Stern @ 2022-09-12 12:01 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Joel Fernandes, Hernan Luis Ponce de Leon, Boqun Feng,
    Peter Zijlstra, Paul E. McKenney, parri.andrea@gmail.com,
    will@kernel.org, npiggin@gmail.com, dhowells@redhat.com,
    j.alglave@ucl.ac.uk, luc.maranget@inria.fr, akiyks@gmail.com,
    dlustig@nvidia.com, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org

On Mon, Sep 12, 2022 at 10:13:33AM +0000, Jonas Oberhauser wrote:
> As I tried to explain before, this problem has nothing to do with
> stores propagating within a given time to another core. Rather it is
> due to two stores to the same location happening in a surprising
> order. I.e., both stores propagate quickly to other cores, but in a
> surprising coherence order.And if a wmb in the code is replaced by an
> mb, then this co will create a pb cycle and become forbidden.
>
> Therefore this hang should be observable on a hypothetical LKMM
> processor which makes use of all the relaxed liberty the LKMM allows.
> However according to the authors of that paper (who are my colleagues
> but I haven't been involved deeply in that work), not even Power+gcc
> allow this reordering to happen, and if that's true it is probably
> because the wmb is mapped to lwsync which is fully cumulative in Power
> but not in LKMM.

Yes, that's right. On ARM64 architectures the reordering is forbidden by
other-multicopy atomicity, and on PPC it is forbidden by B-cumulativity
(neither of which is part of the LKMM).

If I'm not mistaken, another way to forbid it is to replace one of the
relaxed atomic accesses with an atomic access having release semantics.
Perhaps this change will find its way into the kernel source, since it has
less overhead than replacing wmb with mb.
Alan ^ permalink raw reply [flat|nested] 28+ messages in thread
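One concrete (untested) way to experiment with Alan's suggestion would be to take the "Liveness" litmus test Hernan posted elsewhere in this thread and strengthen the relaxed cmpxchg that models xchg_tail into a release one, for instance in P2. The choice of which access to strengthen is only an illustration here, not a claim about the eventual kernel patch:

```
P2(atomic_t *x, int *z)
{
	// node initialization
	WRITE_ONCE(*z, 2);
	// queued_spin_trylock
	int r0 = atomic_read(x);
	// xchg_tail, now with release semantics
	int r1 = atomic_cmpxchg_release(x, r0, 24);
}
```

herd7 can then be asked whether the exists clause of that test is still reachable; which relaxed access in the real qspinlock code would need strengthening (and whether the smp_wmb() then becomes redundant) is a separate question.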
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-10 20:41 ` Hernan Luis Ponce de Leon
  2022-09-11 10:20 ` Joel Fernandes
@ 2022-09-11 14:53 ` Andrea Parri
  1 sibling, 0 replies; 28+ messages in thread
From: Andrea Parri @ 2022-09-11 14:53 UTC (permalink / raw)
To: Hernan Luis Ponce de Leon
Cc: Alan Stern, Jonas Oberhauser, Boqun Feng, Peter Zijlstra,
    Paul E. McKenney, will@kernel.org, npiggin@gmail.com,
    dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr,
    akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org,
    linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org

> Here is a litmus test showing the problem (I hope the comments are
> enough to relate it back to qspinlock):
>
> C Liveness
>
> {
> 	atomic_t x = ATOMIC_INIT(0);
> 	int z = 0;
> }
>
> P0(atomic_t *x)
> {
> 	// clear_pending_set_locked
> 	int r0 = atomic_fetch_add(2, x);
> }
>
> P1(atomic_t *x, int *z)
> {
> 	// queued_spin_trylock
> 	int r0 = atomic_read(x);
> 	// barrier after the initialization of nodes
> 	smp_wmb();
> 	// xchg_tail
> 	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
> 	// link node into the waitqueue
> 	WRITE_ONCE(*z, 1);
> }
>
> P2(atomic_t *x, int *z)
> {
> 	// node initialization
> 	WRITE_ONCE(*z, 2);
> 	// queued_spin_trylock
> 	int r0 = atomic_read(x);
> 	// barrier after the initialization of nodes
> 	smp_wmb();
> 	// xchg_tail
> 	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
> }
>
> exists (0:r0=24 /\ 1:r0=26 /\ z=2)
>
> herd7 says that the behavior is observable.
> However, if you replace wmb with mb, it is not observable anymore.

Indeed.  For more context, this is a 3-thread extension of the 2+2W
test/behavior, cf.

  https://github.com/paulmckrcu/litmus/blob/master/manual/lwn573436/C-2+2w+o-wb-o+o-wb-o.litmus

which is also allowed by the LKMM.  The basic rationale for allowing
such behaviors was that we "don't think we need to care" (cf. the
comment in the link above), except that it seems the developers of the
code at stake disagreed.
;-)  So this does look like a good time to review that design
choice/code.

  Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread
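For reference, the smp_mb() variant that Hernan reports as no longer observable differs from his Liveness test only in the two fences; a sketch of the changed processes:

```
P1(atomic_t *x, int *z)
{
	int r0 = atomic_read(x);
	smp_mb();	// was smp_wmb()
	int r1 = atomic_cmpxchg_relaxed(x, r0, 42);
	WRITE_ONCE(*z, 1);
}

P2(atomic_t *x, int *z)
{
	WRITE_ONCE(*z, 2);
	int r0 = atomic_read(x);
	smp_mb();	// was smp_wmb()
	int r1 = atomic_cmpxchg_relaxed(x, r0, 24);
}
```

With these fences, per Hernan's message, herd7 no longer reports the exists clause as reachable, so LKMM forbids the 2+2W-shaped cycle and liveness is restored.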
* RE: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-10 15:03 ` Alan Stern
  2022-09-10 20:41 ` Hernan Luis Ponce de Leon
@ 2022-09-12 10:46 ` Jonas Oberhauser
  2022-09-12 12:02 ` Alan Stern
  1 sibling, 1 reply; 28+ messages in thread
From: Jonas Oberhauser @ 2022-09-12 10:46 UTC (permalink / raw)
To: Alan Stern, Hernan Luis Ponce de Leon
Cc: Boqun Feng, Peter Zijlstra, Paul E. McKenney,
    parri.andrea@gmail.com, will@kernel.org, npiggin@gmail.com,
    dhowells@redhat.com, j.alglave@ucl.ac.uk, luc.maranget@inria.fr,
    akiyks@gmail.com, dlustig@nvidia.com, joel@joelfernandes.org,
    linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org

Hi Alan,

Sorry for the confusion.

> [...] it's certainly true (in all of these models) that for any finite
> number N, there is a feasible execution in which a loop runs for more
> than N iterations before the termination condition eventually becomes
> true.

Indeed; but more interestingly, as Theorem 5.3 in "Making Weak Memory
Models Fair" states, under certain conditions it suffices to look at
graphs where N=1 to decide whether a loop can run forever (despite all
writes propagating to all cores eventually) or will always terminate.
And since herd can generate all such graphs, herd could be extended to
make that decision and output it, just like many other tools already do.

To illustrate this with an example, consider the program sent by Peter
earlier:

	T1: WRITE_ONCE(foo, 1);
	T2: while (!READ_ONCE(foo));

Without the assumption that fr is prefix-finite, the graph with
infinitely many reads of thread 2, all reading the initial value (and
hence being fr-before the store foo=1), would be allowed.  However, the
tools used to analyze the qspinlock all assume that fr is prefix-finite,
and hence that such a graph with infinitely many fr-before events does
not exist.  Because of that, all of the tools will say that this loop
always terminates.
However, if you change the code into the following:

	T1: WRITE_ONCE(foo, 1);
	T2: WRITE_ONCE(foo, 0); while (!READ_ONCE(foo));

then even under the assumption of fr-prefix-finiteness, the coherence
order in which WRITE_ONCE(foo, 1); is overwritten by WRITE_ONCE(foo, 0);
of thread 2 would lead to non-terminating behaviors, and these are
detected by those tools.  Furthermore, if we synchronize the two stores
as follows:

	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
	T2: WRITE_ONCE(foo, 0); smp_store_release(&bar, 1); while (!READ_ONCE(foo));

then the graphs with that co become prohibited, as they all have hb
cycles, and the tools again will not detect any liveness violations.
But if we go further and relax the release barrier as below:

	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
	T2: WRITE_ONCE(foo, 0); WRITE_ONCE(bar, 1); while (!READ_ONCE(foo));

then the hb cycle disappears, and the coherence order in which foo=0
overwrites foo=1 becomes allowed.  Again, the tools will detect this and
state that thread 2 could be stuck in its while loop forever.

In each of these cases, the decision can be made by looking for a graph
in which the loop is executed for one iteration which reads from foo=0,
and checking whether that store is co-maximal.  So in some sense this is
all that is needed to express the idea that a program can run forever,
even though only in some very limited (but common) circumstances, namely
that the loop iteration that repeats forever does not create
modifications to state that are observed outside the loop.  Luckily this
is a very common case, so these checks have proven quite useful in
practice.
The same kind of check could be implemented in herd, together with
either the implicit assumption that fr is prefix-finite (like in other
tools) or with some special syntax like

	prefix-finite fr | co as fairness

which hopefully clears up the question below:

> > From an engineering perspective, I think the only issue is that cat
> > *currently* does not have any syntax for this,

> Syntax for what?  The difference between wmb and mb?
> [...]

Thanks for your patience, and hoping I explained it more clearly,
jonas

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-09-12 10:46 ` Jonas Oberhauser
@ 2022-09-12 12:02 ` Alan Stern
  0 siblings, 0 replies; 28+ messages in thread
From: Alan Stern @ 2022-09-12 12:02 UTC (permalink / raw)
To: Jonas Oberhauser
Cc: Hernan Luis Ponce de Leon, Boqun Feng, Peter Zijlstra,
    Paul E. McKenney, parri.andrea@gmail.com, will@kernel.org,
    npiggin@gmail.com, dhowells@redhat.com, j.alglave@ucl.ac.uk,
    luc.maranget@inria.fr, akiyks@gmail.com, dlustig@nvidia.com,
    joel@joelfernandes.org, linux-kernel@vger.kernel.org,
    linux-arch@vger.kernel.org

On Mon, Sep 12, 2022 at 10:46:38AM +0000, Jonas Oberhauser wrote:
> Hi Alan,
>
> Sorry for the confusion.
>
> > [...] it's certainly true (in all of these models) that for any
> > finite number N, there is a feasible execution in which a loop runs
> > for more than N iterations before the termination condition
> > eventually becomes true.
>
> Indeed; but more interestingly, as Theorem 5.3 in "Making Weak Memory
> Models Fair" states, under certain conditions it suffices to look at
> graphs where N=1 to decide whether a loop can run forever (despite all
> writes propagating to all cores eventually) or will always terminate.

Cool!

> And since herd can generate all such graphs, herd could be extended to
> make that decision and output it, just like many other tools already
> do.
>
> To illustrate this with an example, consider the program sent by Peter
> earlier:
>
>	T1: WRITE_ONCE(foo, 1);
>	T2: while (!READ_ONCE(foo));
>
> Without the assumption that fr is prefix-finite, the graph with
> infinitely many reads of thread 2, all reading the initial value (and
> hence being fr-before the store foo=1), would be allowed.  However,
> the tools used to analyze the qspinlock all assume that fr is
> prefix-finite, and hence that such a graph with infinitely many
> fr-before events does not exist.  Because of that, all of the tools
> will say that this loop always terminates.
> However, if you change the code into the following:
>
>	T1: WRITE_ONCE(foo, 1);
>	T2: WRITE_ONCE(foo, 0); while (!READ_ONCE(foo));
>
> then even under the assumption of fr-prefix-finiteness, the coherence
> order in which WRITE_ONCE(foo, 1); is overwritten by
> WRITE_ONCE(foo, 0); of thread 2 would lead to non-terminating
> behaviors, and these are detected by those tools.  Furthermore, if we
> synchronize the two stores as follows:
>
>	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
>	T2: WRITE_ONCE(foo, 0); smp_store_release(&bar, 1); while (!READ_ONCE(foo));
>
> then the graphs with that co become prohibited, as they all have hb
> cycles, and the tools again will not detect any liveness violations.
> But if we go further and relax the release barrier as below:
>
>	T1: while (!READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1);
>	T2: WRITE_ONCE(foo, 0); WRITE_ONCE(bar, 1); while (!READ_ONCE(foo));
>
> then the hb cycle disappears, and the coherence order in which foo=0
> overwrites foo=1 becomes allowed.  Again, the tools will detect this
> and state that thread 2 could be stuck in its while loop forever.

Neat stuff; I'll have to think about this.

> In each of these cases, the decision can be made by looking for a
> graph in which the loop is executed for one iteration which reads from
> foo=0, and checking whether that store is co-maximal.  So in some
> sense this is all that is needed to express the idea that a program
> can run forever, even though only in some very limited (but common)
> circumstances, namely that the loop iteration that repeats forever
> does not create modifications to state that are observed outside the
> loop.  Luckily this is a very common case, so these checks have proven
> quite useful in practice.
> The same kind of check could be implemented in herd together with
> either the implicit assumption that fr is prefix-finite (like in other
> tools) or with some special syntax like
>
>	prefix-finite fr | co as fairness
>
> which hopefully clears up the question below:
>
> > > From an engineering perspective, I think the only issue is that cat
> > > *currently* does not have any syntax for this,
>
> > Syntax for what?  The difference between wmb and mb?
> > [...]
>
> Thanks for your patience, and hoping I explained it more clearly,

Yes indeed, thank you very much.

Alan

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
  2022-08-26 16:21 ` Boqun Feng
  2022-08-26 16:23 ` Peter Zijlstra
@ 2022-08-29  2:33 ` Andrea Parri
  2022-08-29 12:25 ` Paul E. McKenney
  2 siblings, 1 reply; 28+ messages in thread
From: Andrea Parri @ 2022-08-29 2:33 UTC (permalink / raw)
To: Paul E. McKenney
Cc: stern, will, peterz, boqun.feng, npiggin, dhowells, j.alglave,
    luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>
> Thoughts?

Section 4 ends with a discussion about certain "spurious" data races.
Do we have litmus tests with them?  (I could repro with Dartagnan...)

  Thanks,
  Andrea

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"
  2022-08-29  2:33 ` Andrea Parri
@ 2022-08-29 12:25 ` Paul E. McKenney
  0 siblings, 0 replies; 28+ messages in thread
From: Paul E. McKenney @ 2022-08-29 12:25 UTC (permalink / raw)
To: Andrea Parri
Cc: stern, will, peterz, boqun.feng, npiggin, dhowells, j.alglave,
    luc.maranget, akiyks, dlustig, joel, linux-kernel, linux-arch

On Mon, Aug 29, 2022 at 04:33:23AM +0200, Andrea Parri wrote:
> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > I have not yet done more than glance at this one, but figured I should
> > send it along sooner rather than later.
> >
> > "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > https://arxiv.org/abs/2111.15240
> >
> > The claim is that the queued spinlocks implementation with CNA violates
> > LKMM but actually works on all architectures having a formal hardware
> > memory model.
> >
> > Thoughts?
>
> Section 4 ends with a discussion about certain "spurious" data races.
> Do we have litmus tests with them?  (I could repro with Dartagnan...)

Their Figure 5 clearly shows a data race, but agreed, their claim is
that this race is prevented by other code not in this litmus test.

Me, I currently suspect that the spurious data race might be due to the
failure to guarantee mutual exclusion, though I have not yet read that
paper carefully.  That is scheduled for tomorrow morning.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2022-09-16  8:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2022-08-26 12:48 "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models" Paul E. McKenney
2022-08-26 16:21 ` Boqun Feng
2022-08-26 16:23 ` Peter Zijlstra
2022-08-26 17:10 ` Alan Stern
2022-08-26 20:42 ` Paul E. McKenney
2022-08-27 16:00 ` Alan Stern
2022-08-27 17:04 ` Paul E. McKenney
2022-09-13 11:24 ` Will Deacon
2022-09-13 12:21 ` Dan Lustig
2022-09-16  8:18 ` Will Deacon
2022-08-26 23:47 ` Peter Zijlstra
2022-08-27 16:05 ` Alan Stern
2022-08-27 16:44 ` Boqun Feng
2022-08-29  2:15 ` Andrea Parri
2022-09-09 11:45 ` Jonas Oberhauser
2022-09-10 12:11 ` Hernan Luis Ponce de Leon
2022-09-10 15:03 ` Alan Stern
2022-09-10 20:41 ` Hernan Luis Ponce de Leon
2022-09-11 10:20 ` Joel Fernandes
2022-09-12 10:13 ` Jonas Oberhauser
2022-09-12 11:10 ` Hernan Luis Ponce de Leon
2022-09-14 14:41 ` Alan Stern
2022-09-12 12:01 ` Alan Stern
2022-09-11 14:53 ` Andrea Parri
2022-09-12 10:46 ` Jonas Oberhauser
2022-09-12 12:02 ` Alan Stern
2022-08-29  2:33 ` Andrea Parri
2022-08-29 12:25 ` Paul E. McKenney