* [RFC] Potential problem in qspinlock due to mixed-size accesses
@ 2025-06-12 14:55 Thomas Haas
2025-06-13 7:55 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Haas @ 2025-06-12 14:55 UTC (permalink / raw)
To: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm
Cc: hernan.poncedeleon, jonas.oberhauser, r.maseli@tu-bs.de
We have been taking a look at whether mixed-size accesses (MSA) can affect
the correctness of qspinlock.
We are focusing on aarch64, which is the only memory model with MSA
support [1].
For this we extended the dartagnan [2] tool to support MSA, and it now
reports liveness, synchronization, and mutex issues.
Notice that we did something similar in the past for LKMM, but there we
were ignoring MSA [3].
The culprit of all these issues is that the atomicity of single load
instructions is not guaranteed in the presence of smaller-sized stores
(observed on real hardware according to [1], Fig. 21/22).
Consider the following pseudo code:
int16 old = xchg16_rlx(&lock, 42);
int32 l = load32_acq(&lock);
Then the hardware can treat the code as (likely due to store-forwarding)
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer arithmetic
and reorder it to
int16 l2 = load16_acq(&lock + 2);
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
Now another thread can overwrite "lock" in between the first two
accesses so that the original l (l1 and l2) ends up containing
parts of a lock value that is older than what the xchg observed.
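To make this concrete, here is one possible interleaving with an observer
thread (the observer and the concrete values are made up for illustration):

    // Initially: lock = 0x00010001
    MSA:      int16 l2  = load16_acq(&lock + 2); // runs early: old upper half 0x0001
    Observer: store32_rlx(&lock, 0x00020002);    // full-width overwrite
    MSA:      int16 old = xchg16_rlx(&lock, 42); // reads the new lower half 0x0002
    MSA:      int16 l1  = load16_acq(&lock);     // forwarded from the xchg: 42

    // The reassembled l = (l2 << 16) | l1 pairs an upper half that is
    // older than the store the xchg observed in the lower half.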
Let us now explain how this can break qspinlock.
We need 3 threads going the slowpath, T1, T2, and T3.
T1 is a setup-thread that takes the lock just so that T2 and T3 observe
contention and try to enqueue themselves.
So consider a situation where
- T3's node is the only enqueued node (due to previous contention
with T1).
- T2 is about to enqueue its node.
- T3 is about to take the lock because it sees no contention (lock
is free, no pending bits are set, it has the only node in the queue).
- the lock is released (T1 is already done)
We focus on the following lines of code in qspinlock.c:
        // <- T2 is here, but executes part of line 328 first.
277:    old = xchg_tail(lock, tail);    // ~ xchg_rlx(&lock->tail, tail)
        // ...
328:    val = atomic_cond_read_acquire(&lock->val,
                                       !(VAL & _Q_LOCKED_PENDING_MASK));
        // ...
        // <- T3 is here
352:    if ((val & _Q_TAIL_MASK) == tail) {
            if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }
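For orientation, the relevant lock layout (a sketch, roughly as in
include/asm-generic/qspinlock_types.h for little-endian and < 16k cpus;
the field order flips on big-endian):

    typedef struct qspinlock {
        union {
            atomic_t val;               /* full 32-bit view */
            struct {
                u8 locked;              /* byte 0 */
                u8 pending;             /* byte 1 */
            };
            struct {
                u16 locked_pending;     /* bytes 0-1 */
                u16 tail;               /* bytes 2-3: xchg_tail's target */
            };
        };
    } arch_spinlock_t;

so lock->tail and lock->locked_pending are the two 16-bit halves of the
32-bit lock->val.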
Then the following sequence of operations is problematic:
(1) T2 reads half of the lock (the half that is not lock->tail, i.e., it
"only" reads lock->locked_pending)

328:    val = atomic_cond_read_acquire(&lock->val,
                                       !(VAL & _Q_LOCKED_PENDING_MASK)); // P1
        // Roughly: val_locked_pending = load_acquire(&lock->locked_pending);

T2 observes a free lock and no pending bits set.
(2) T3 takes the lock because it does not observe contention

352:    if ((val & _Q_TAIL_MASK) == tail) {    // Satisfied because only T3's
                                               // node is enqueued, lock is free
            if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }

T3 clears the tail and claims the lock. The queue is empty now.
(3) T2 enqueues itself
277: old = xchg_tail(lock, tail);
T2 sees an empty queue (old == 0) because T3 just cleared it, and
enqueues its node.
(4) T2 reads the remaining half of the lock from (1), reading the tail
(lock->tail) it just inserted.

328:    val = atomic_cond_read_acquire(&lock->val,
                                       !(VAL & _Q_LOCKED_PENDING_MASK)); // P2
        // Roughly: val_tail = load_acquire(&lock->tail);

T2 observes its own tail + a free lock with no pending bits set (from (1)).
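In compressed form, the problematic interleaving of steps (1)-(4) is:

    T2: load16_acq(&lock->locked_pending)    (1)  sees: lock free, no pending
    T3: atomic_try_cmpxchg_relaxed(...)      (2)  clears tail, takes the lock
    T2: xchg16_rlx(&lock->tail, tail)        (3)  sees empty queue, enqueues
    T2: load16_acq(&lock->tail)              (4)  reads back its own tail

so T2's reassembled "val" claims "own tail + free lock" although T3 now
holds the lock.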
Now there are two continuations, one leading to broken synchronisation,
another leading to non-termination or failure of mutual exclusion.
We first consider the synchronisation issue.
(5a) T3 finishes its critical section and releases the lock
(6a) T2 takes the lock with the same code as in point (2):

352:    if ((val & _Q_TAIL_MASK) == tail) {    // Satisfied because only T2's
                                               // node is enqueued, lock is free
            if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }
Notice that the "atomic_try_cmpxchg_relaxed" would fail if the
lock was still taken by T3, because "val" has no lock bits set (as T2
observed the lock to be free).
Although T2 now properly enters the CS after T3 has finished, the
synchronisation is broken since T2 did not perform an acquire operation
to synchronise with the lock release.
Indeed, dartagnan reports no safety violations (with 3 threads)
if the CAS is made an acquire or the CS itself contains an acquire
barrier (smp_rmb or smp_mb).
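For illustration, a sketch of the acquire-CAS variant of the fix (just the
code above with the relaxed CAS upgraded; not a vetted patch):

352:    if ((val & _Q_TAIL_MASK) == tail) {
            /* acquire: taking the lock here now synchronises
             * with the previous holder's release */
            if (atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))
                goto release; /* No contention */
355:    }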
Now, let's consider the alternative continuation that leads to
non-termination or failure of mutual exclusion.
(5b) T2 tries to take the lock as above in (6a) but the CAS fails
because the lock is still taken by T3.
(6b) Due to the failing CAS, T2 observes contention (at least it thinks
so). It sets the lock (although T3 might still have it!) and waits until
the (non-existent) "contending thread" enqueues its node:
        /*
         * Either somebody is queued behind us or _Q_PENDING_VAL got set
         * which will then detect the remaining tail and queue behind us
         * ensuring we'll see a @next.
         */
362:    set_locked(lock);

        /*
         * contended path; wait for next if not observed yet, release.
         */
367:    if (!next)
368:        next = smp_cond_load_relaxed(&node->next, (VAL));
The first comment suggests that the assumed situation is that either
another thread enqueued a node or the pending bits got set. But neither is
true in the current execution: we got here because the lock was taken.
Now there are two outcomes:
- Non-termination: the "smp_cond_load_relaxed" waits forever,
because there is no other thread that enqueues another node.
- Broken mutex: another thread (T4) enqueues a node and therefore
releases T2 from its loop. Now T2 enters the CS while T3 still executes
its CS.
Indeed, dartagnan reports this safety violation only with 4+
threads.
NOTE: For the above examples we forced all threads to take the slowpath
of qspinlock by removing the fastpath. With the fastpath present,
another thread (T0) is needed to force the other threads into the
slowpath, i.e., we need 4-5 threads to witness the issues.
### Solutions
The problematic executions rely on the fact that T2 can move half of its
load operation (1) to before the xchg_tail (3).
Preventing this reordering solves all issues. Possible solutions are:
- make the xchg_tail full-sized (i.e., also touch lock/pending bits).
  Note that if the kernel is configured with >= 16k cpus, then the
  tail becomes larger than 16 bits and needs to be encoded in parts of the
  pending byte as well.
  In this case, the kernel makes a full-sized (32-bit) access for
  the xchg. So the above bugs are only present in the < 16k cpus setting.
- make the xchg_tail an acquire operation.
- make the xchg_tail a release operation (this is an odd solution
  by itself but works for aarch64 because it preserves REL->ACQ ordering;
  see the sketch after this list). In this case, maybe the preceding
  "smp_wmb()" can be removed.
- put some other read-read barrier between the xchg_tail and the load.
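As a sketch of the release option (against the _Q_PENDING_BITS == 8
variant of xchg_tail() in qspinlock.c; illustrative, not a vetted patch):

    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
    {
        /*
         * Was xchg_relaxed(). Release ordering keeps the earlier MCS
         * node initialisation ordered before publishing the tail and,
         * on arm64, also orders this store against the later acquire
         * load of lock->val (REL->ACQ).
         */
        return (u32)xchg_release(&lock->tail,
                                 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
    }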
### Implications for qspinlock executed on non-ARM architectures.
Unfortunately, there are no MSA extensions for other hardware memory
models, so we have to speculate based on whether the problematic
reordering is permitted if the problematic load was treated as two
individual instructions.
It seems Power and RISC-V would have no problem reordering the
instructions, so qspinlock might also break on those architectures.
TSO, on the other hand, does not permit such reordering. Also, the
xchg_tail is an rmw operation, which acts like a full memory barrier under
TSO, so even if load-load reordering were permitted, the rmw would
prevent it.
[1] https://dl.acm.org/doi/10.1145/3458926
[2] https://github.com/hernanponcedeleon/Dat3M
[3] https://lkml.org/lkml/2022/8/26/597
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-12 14:55 [RFC] Potential problem in qspinlock due to mixed-size accesses Thomas Haas
@ 2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-06-13 7:55 UTC (permalink / raw)
To: Thomas Haas
Cc: Alan Stern, Andrea Parri, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
> We have been taking a look at whether mixed-size accesses (MSA) can affect
> the correctness of qspinlock.
> We are focusing on aarch64 which is the only memory model with MSA support
> [1].
> For this we extended the dartagnan [2] tool to support MSA and now it
> reports liveness, synchronization, and mutex issues.
> Notice that we did something similar in the past for LKMM, but we were
> ignoring MSA [3].
>
> The culprit of all these issues is that atomicity of single load
> instructions is not guaranteed in the presence of smaller-sized stores
> (observed on real hardware according to [1] and Fig. 21/22)
> Consider the following pseudo code:
>
> int16 old = xchg16_rlx(&lock, 42);
> int32 l = load32_acq(&lock);
>
> Then the hardware can treat the code as (likely due to store-forwarding)
>
> int16 old = xchg16_rlx(&lock, 42);
> int16 l1 = load16_acq(&lock);
> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
> arithmetic
>
> and reorder it to
>
> int16 l2 = load16_acq(&lock + 2);
> int16 old = xchg16_rlx(&lock, 42);
> int16 l1 = load16_acq(&lock);
>
> Now another thread can overwrite "lock" in between the first two accesses so
> that the original l (l1 and l2) ends up containing
> parts of a lock value that is older than what the xchg observed.
Oops :-(
(snip the excellent details)
> ### Solutions
>
> The problematic executions rely on the fact that T2 can move half of its
> load operation (1) to before the xchg_tail (3).
> Preventing this reordering solves all issues. Possible solutions are:
> - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> Note that if the kernel is configured with >= 16k cpus, then the tail
> becomes larger than 16 bits and needs to be encoded in parts of the pending
> byte as well.
> In this case, the kernel makes a full-sized (32-bit) access for the
> xchg. So the above bugs are only present in the < 16k cpus setting.
Right, but that is the more expensive option for some.
> - make the xchg_tail an acquire operation.
> - make the xchg_tail a release operation (this is an odd solution by
> itself but works for aarch64 because it preserves REL->ACQ ordering). In
> this case, maybe the preceding "smp_wmb()" can be removed.
I think I prefer this one; it moves a barrier, not really adding
additional overhead. Will?
> - put some other read-read barrier between the xchg_tail and the load.
>
>
> ### Implications for qspinlock executed on non-ARM architectures.
>
> Unfortunately, there are no MSA extensions for other hardware memory models,
> so we have to speculate based on whether the problematic reordering is
> permitted if the problematic load was treated as two individual
> instructions.
> It seems Power and RISCV would have no problem reordering the instructions,
> so qspinlock might also break on those architectures.
Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
word LL/SC and should be good.
But yes, ZABHA might be equally broken.
> TSO, on the other hand, does not permit such reordering. Also, the xchg_tail
> is a rmw operation which acts like a full memory barrier under TSO, so even
> if load-load reordering was permitted, the rmw would prevent this.
Right.
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-13 7:55 ` Peter Zijlstra
@ 2025-06-13 11:17 ` Andrea Parri
[not found] ` <9264df13-36db-4b25-b2c4-7a9701df2f4d@tu-bs.de>
2025-06-16 14:23 ` Will Deacon
2025-06-18 6:51 ` Paul E. McKenney
2 siblings, 1 reply; 21+ messages in thread
From: Andrea Parri @ 2025-06-13 11:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Will Deacon, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
> (snip the excellent details)
Indeed, joining in praising this report - Great work, Thomas!
> > ### Solutions
> >
> > The problematic executions rely on the fact that T2 can move half of its
> > load operation (1) to before the xchg_tail (3).
> > Preventing this reordering solves all issues. Possible solutions are:
> > - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> > Note that if the kernel is configured with >= 16k cpus, then the tail
> > becomes larger than 16 bits and needs to be encoded in parts of the pending
> > byte as well.
> > In this case, the kernel makes a full-sized (32-bit) access for the
> > xchg. So the above bugs are only present in the < 16k cpus setting.
>
> Right, but that is the more expensive option for some.
>
> > - make the xchg_tail an acquire operation.
> > - make the xchg_tail a release operation (this is an odd solution by
> > itself but works for aarch64 because it preserves REL->ACQ ordering). In
> > this case, maybe the preceding "smp_wmb()" can be removed.
>
> I think I prefer this one; it moves a barrier, not really adding
> additional overhead. Will?
>
> > - put some other read-read barrier between the xchg_tail and the load.
> >
> >
> > ### Implications for qspinlock executed on non-ARM architectures.
> >
> > Unfortunately, there are no MSA extensions for other hardware memory models,
> > so we have to speculate based on whether the problematic reordering is
> > permitted if the problematic load was treated as two individual
> > instructions.
> > It seems Power and RISCV would have no problem reordering the instructions,
> > so qspinlock might also break on those architectures.
>
> Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> word LL/SC and should be good.
>
> But yes, ZABHA might be equally broken.
RISC-V forbids store-forwarding from AMOs or SCs; certain (non-normative)
commentary in the spec clarifies that the same ordering rule applies when
the memory accesses in question only overlap partially.
I am not aware of any "RISC-V implementation" manifesting the load-load
re-ordering in question. IAC, notice that making xchg_tail() a release
operation might not suffice to fix such an implementation given that the
arch has no plain load-acquire instruction yet and relies on the generic
(fence-based) code for atomic_cond_read_acquire().
Andrea
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
[not found] ` <9264df13-36db-4b25-b2c4-7a9701df2f4d@tu-bs.de>
@ 2025-06-16 6:21 ` Andrea Parri
2025-06-16 14:11 ` Alan Stern
0 siblings, 1 reply; 21+ messages in thread
From: Andrea Parri @ 2025-06-16 6:21 UTC (permalink / raw)
To: Thomas Haas
Cc: Peter Zijlstra, Alan Stern, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
> Thanks for the praise. I expected more questioning/discussion and less
> immediate acceptance :)
Well, the discussion isn't closed yet. ;-)
> Maybe one should also take into consideration a hypothetical extension of
> LKMM to MSA.
> I think LKMM (and also C11) do not preserve REL->ACQ ordering because this
> would disallow their implementation as simple stores/loads on TSO.
> That being said, maybe preserving "rmw;[REL];po;[ACQ]" on the
> language-level would be fine and sufficient for qspinlock.
On PPC, say, the expression can translate to a sequence "lwsync ; lwarx ;
stwcx. ; ... ; lwz ; lwsync", in which the order of the two loads is not
necessarily preserved.
MSAs have been on the LKMM TODO list for quite some time. I'm confident
this thread will help to make some progress or at least to reinforce the
interest in the topic.
Andrea
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-16 6:21 ` Andrea Parri
@ 2025-06-16 14:11 ` Alan Stern
2025-06-17 14:09 ` Andrea Parri
0 siblings, 1 reply; 21+ messages in thread
From: Alan Stern @ 2025-06-16 14:11 UTC (permalink / raw)
To: Andrea Parri
Cc: Thomas Haas, Peter Zijlstra, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Mon, Jun 16, 2025 at 08:21:50AM +0200, Andrea Parri wrote:
> > Thanks for the praise. I expected more questioning/discussion and less
> > immediate acceptance :)
>
> Well, the discussion isn't closed yet. ;-)
>
>
> > Maybe one should also take into consideration a hypothetical extension of
> > LKMM to MSA.
> > I think LKMM (and also C11) do not preserve REL->ACQ ordering because this
> > would disallow their implementation as simple stores/loads on TSO.
> > That being said, maybe preserving "rmw;[REL];po;[ACQ]" on the
> > language-level would be fine and sufficient for qspinlock.
>
> On PPC say, the expression can translate to a sequence "lwsync ; lwarx ;
> stwcx. ; ... ; lwz ; lwsync", in which the order of the two loads is not
> necessarily preserved.
>
> MSAs have been on the LKMM TODO list for quite some time. I'm confident
> this thread will help to make some progress or at least to reinforce the
> interest in the topic.
Indeed. I was surprised to learn that CPUs can sometimes change a
32-bit load into two 16-bit loads.
My question is: Do we have enough general knowledge at this point about
how the various types of hardware handle mixed-size accesses to come up
with a memory model everyone can agree on?
Alan
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
@ 2025-06-16 14:23 ` Will Deacon
2025-06-17 6:19 ` Thomas Haas
2025-06-18 6:51 ` Paul E. McKenney
2 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2025-06-16 14:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
> > We have been taking a look at whether mixed-size accesses (MSA) can affect
> > the correctness of qspinlock.
> > We are focusing on aarch64 which is the only memory model with MSA support
> > [1].
> > For this we extended the dartagnan [2] tool to support MSA and now it
> > reports liveness, synchronization, and mutex issues.
> > Notice that we did something similar in the past for LKMM, but we were
> > ignoring MSA [3].
> >
> > The culprit of all these issues is that atomicity of single load
> > instructions is not guaranteed in the presence of smaller-sized stores
> > (observed on real hardware according to [1] and Fig. 21/22)
> > Consider the following pseudo code:
> >
> > int16 old = xchg16_rlx(&lock, 42);
> > int32 l = load32_acq(&lock);
> >
> > Then the hardware can treat the code as (likely due to store-forwarding)
> >
> > int16 old = xchg16_rlx(&lock, 42);
> > int16 l1 = load16_acq(&lock);
> > int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
> > arithmetic
> >
> > and reorder it to
> >
> > int16 l2 = load16_acq(&lock + 2);
> > int16 old = xchg16_rlx(&lock, 42);
> > int16 l1 = load16_acq(&lock);
> >
> > Now another thread can overwrite "lock" in between the first two accesses so
> > that the original l (l1 and l2) ends up containing
> > parts of a lock value that is older than what the xchg observed.
>
> Oops :-(
>
> (snip the excellent details)
>
> > ### Solutions
> >
> > The problematic executions rely on the fact that T2 can move half of its
> > load operation (1) to before the xchg_tail (3).
> > Preventing this reordering solves all issues. Possible solutions are:
> > - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> > Note that if the kernel is configured with >= 16k cpus, then the tail
> > becomes larger than 16 bits and needs to be encoded in parts of the pending
> > byte as well.
> > In this case, the kernel makes a full-sized (32-bit) access for the
> > xchg. So the above bugs are only present in the < 16k cpus setting.
>
> Right, but that is the more expensive option for some.
>
> > - make the xchg_tail an acquire operation.
> > - make the xchg_tail a release operation (this is an odd solution by
> > itself but works for aarch64 because it preserves REL->ACQ ordering). In
> > this case, maybe the preceding "smp_wmb()" can be removed.
>
> I think I prefer this one; it moves a barrier, not really adding
> additional overhead. Will?
I'm half inclined to think that the Arm memory model should be tightened
here; I can raise that with Arm and see what they say.
Although the cited paper does give examples of store-forwarding from a
narrow store to a wider load, the case in qspinlock is further
constrained by having the store come from an atomic rmw and the load
having acquire semantics. Setting aside the MSA part, that specific case
_is_ ordered in the Arm memory model (and C++ release sequences rely on
it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
an atomic rmw to an acquire load.
Given that, I don't see how this is going to occur in practice.
Will
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-16 14:23 ` Will Deacon
@ 2025-06-17 6:19 ` Thomas Haas
2025-06-17 8:42 ` Hernan Ponce de Leon
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Haas @ 2025-06-17 6:19 UTC (permalink / raw)
To: Will Deacon, Peter Zijlstra
Cc: Alan Stern, Andrea Parri, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On 16.06.25 16:23, Will Deacon wrote:
> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>>> We have been taking a look at whether mixed-size accesses (MSA) can affect
>>> the correctness of qspinlock.
>>> We are focusing on aarch64 which is the only memory model with MSA support
>>> [1].
>>> For this we extended the dartagnan [2] tool to support MSA and now it
>>> reports liveness, synchronization, and mutex issues.
>>> Notice that we did something similar in the past for LKMM, but we were
>>> ignoring MSA [3].
>>>
>>> The culprit of all these issues is that atomicity of single load
>>> instructions is not guaranteed in the presence of smaller-sized stores
>>> (observed on real hardware according to [1] and Fig. 21/22)
>>> Consider the following pseudo code:
>>>
>>> int16 old = xchg16_rlx(&lock, 42);
>>> int32 l = load32_acq(&lock);
>>>
>>> Then the hardware can treat the code as (likely due to store-forwarding)
>>>
>>> int16 old = xchg16_rlx(&lock, 42);
>>> int16 l1 = load16_acq(&lock);
>>> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
>>> arithmetic
>>>
>>> and reorder it to
>>>
>>> int16 l2 = load16_acq(&lock + 2);
>>> int16 old = xchg16_rlx(&lock, 42);
>>> int16 l1 = load16_acq(&lock);
>>>
>>> Now another thread can overwrite "lock" in between the first two accesses so
>>> that the original l (l1 and l2) ends up containing
>>> parts of a lock value that is older than what the xchg observed.
>>
>> Oops :-(
>>
>> (snip the excellent details)
>>
>>> ### Solutions
>>>
>>> The problematic executions rely on the fact that T2 can move half of its
>>> load operation (1) to before the xchg_tail (3).
>>> Preventing this reordering solves all issues. Possible solutions are:
>>> - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
>>> Note that if the kernel is configured with >= 16k cpus, then the tail
>>> becomes larger than 16 bits and needs to be encoded in parts of the pending
>>> byte as well.
>>> In this case, the kernel makes a full-sized (32-bit) access for the
>>> xchg. So the above bugs are only present in the < 16k cpus setting.
>>
>> Right, but that is the more expensive option for some.
>>
>>> - make the xchg_tail an acquire operation.
>>> - make the xchg_tail a release operation (this is an odd solution by
>>> itself but works for aarch64 because it preserves REL->ACQ ordering). In
>>> this case, maybe the preceding "smp_wmb()" can be removed.
>>
>> I think I prefer this one; it moves a barrier, not really adding
>> additional overhead. Will?
>
> I'm half inclined to think that the Arm memory model should be tightened
> here; I can raise that with Arm and see what they say.
>
> Although the cited paper does give examples of store-forwarding from a
> narrow store to a wider load, the case in qspinlock is further
> constrained by having the store come from an atomic rmw and the load
> having acquire semantics. Setting aside the MSA part, that specific case
> _is_ ordered in the Arm memory model (and C++ release sequences rely on
> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
> an atomic rmw to an acquire load.
>
> Given that, I don't see how this is going to occur in practice.
>
> Will
You are probably right. The ARM model's atomic-ordered-before relation
let aob = rmw | [range(rmw)]; lrs; [A | Q]
clearly orders the rmw-store with subsequent acquire loads (lrs =
local-read-successor, A = acquire).
If we treat this relation (at least the second part) as a "global
ordering" and extend it by "si" (same-instruction), then the problematic
reordering under MSA should be gone.
I quickly ran Dartagnan on the MSA litmus tests with this change to the
ARM model and all the tests still pass.
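For concreteness, one way to render this change, applying "si" to the
second disjunct, would be:

    let aob = rmw | ([range(rmw)]; lrs; [A | Q]); si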
We should definitely ask ARM about this. I did send an email to Jade
before writing about this issue here, but she was (and still is) busy
and told me to ask at memory-model@arm.com.
I will ask them.
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 6:19 ` Thomas Haas
@ 2025-06-17 8:42 ` Hernan Ponce de Leon
2025-06-17 14:17 ` Will Deacon
0 siblings, 1 reply; 21+ messages in thread
From: Hernan Ponce de Leon @ 2025-06-17 8:42 UTC (permalink / raw)
To: Thomas Haas, Will Deacon, Peter Zijlstra
Cc: Alan Stern, Andrea Parri, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, jonas.oberhauser, r.maseli@tu-bs.de
On 6/17/2025 8:19 AM, Thomas Haas wrote:
>
>
> On 16.06.25 16:23, Will Deacon wrote:
>> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
>>> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>>>> We have been taking a look at whether mixed-size accesses (MSA) can affect
>>>> the correctness of qspinlock.
>>>> We are focusing on aarch64 which is the only memory model with MSA
>>>> support
>>>> [1].
>>>> For this we extended the dartagnan [2] tool to support MSA and now it
>>>> reports liveness, synchronization, and mutex issues.
>>>> Notice that we did something similar in the past for LKMM, but we were
>>>> ignoring MSA [3].
>>>>
>>>> The culprit of all these issues is that atomicity of single load
>>>> instructions is not guaranteed in the presence of smaller-sized stores
>>>> (observed on real hardware according to [1] and Fig. 21/22)
>>>> Consider the following pseudo code:
>>>>
>>>> int16 old = xchg16_rlx(&lock, 42);
>>>> int32 l = load32_acq(&lock);
>>>>
>>>> Then the hardware can treat the code as (likely due to store-
>>>> forwarding)
>>>>
>>>> int16 old = xchg16_rlx(&lock, 42);
>>>> int16 l1 = load16_acq(&lock);
>>>> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
>>>> arithmetic
>>>>
>>>> and reorder it to
>>>>
>>>> int16 l2 = load16_acq(&lock + 2);
>>>> int16 old = xchg16_rlx(&lock, 42);
>>>> int16 l1 = load16_acq(&lock);
>>>>
>>>> Now another thread can overwrite "lock" in between the first two
>>>> accesses so
>>>> that the original l (l1 and l2) ends up containing
>>>> parts of a lock value that is older than what the xchg observed.
>>>
>>> Oops :-(
>>>
>>> (snip the excellent details)
>>>
>>>> ### Solutions
>>>>
>>>> The problematic executions rely on the fact that T2 can move half of
>>>> its
>>>> load operation (1) to before the xchg_tail (3).
>>>> Preventing this reordering solves all issues. Possible solutions are:
>>>> - make the xchg_tail full-sized (i.e, also touch lock/pending
>>>> bits).
>>>> Note that if the kernel is configured with >= 16k cpus, then
>>>> the tail
>>>> becomes larger than 16 bits and needs to be encoded in parts of the
>>>> pending
>>>> byte as well.
>>>> In this case, the kernel makes a full-sized (32-bit) access
>>>> for the
>>>> xchg. So the above bugs are only present in the < 16k cpus setting.
>>>
>>> Right, but that is the more expensive option for some.
>>>
>>>> - make the xchg_tail an acquire operation.
>>>> - make the xchg_tail a release operation (this is an odd
>>>> solution by
>>>> itself but works for aarch64 because it preserves REL->ACQ
>>>> ordering). In
>>>> this case, maybe the preceding "smp_wmb()" can be removed.
>>>
>>> I think I prefer this one; it moves a barrier, not really adding
>>> additional overhead. Will?
>>
>> I'm half inclined to think that the Arm memory model should be tightened
>> here; I can raise that with Arm and see what they say.
>>
>> Although the cited paper does give examples of store-forwarding from a
>> narrow store to a wider load, the case in qspinlock is further
>> constrained by having the store come from an atomic rmw and the load
>> having acquire semantics. Setting aside the MSA part, that specific case
>> _is_ ordered in the Arm memory model (and C++ release sequences rely on
>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
>> an atomic rmw to an acquire load.
>>
>> Given that, I don't see how this is going to occur in practice.
>>
>> Will
>
> You are probably right. The ARM model's atomic-ordered-before relation
>
> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>
> clearly orders the rmw-store with subsequent acquire loads (lrs = local-
> read-successor, A = acquire).
> If we treat this relation (at least the second part) as a "global
> ordering" and extend it by "si" (same-instruction), then the problematic
> reordering under MSA should be gone.
> I quickly ran Dartagnan on the MSA litmus tests with this change to the
> ARM model and all the tests still pass.
Even with this change I still get violations (both safety and
termination) for qspinlock with dartagnan.
I think the problem is actually with the Internal visibility axiom,
because only making that one stronger seems to remove the violations.
Hernan
>
> We should definitely ask ARM about this. I did send an email to Jade
> before writing about this issue here, but she was (and still is) busy
> and told me to ask at memory-model@arm.com.
> I will ask them.
>
>
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-16 14:11 ` Alan Stern
@ 2025-06-17 14:09 ` Andrea Parri
2025-06-19 14:27 ` Thomas Haas
0 siblings, 1 reply; 21+ messages in thread
From: Andrea Parri @ 2025-06-17 14:09 UTC (permalink / raw)
To: Alan Stern
Cc: Thomas Haas, Peter Zijlstra, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
> My question is: Do we have enough general knowledge at this point about
> how the various types of hardware handle mixed-size accesses to come up
> with a memory model everyone can agree on?
You know, I can imagine a conversation along the following lines if I
were to turn this question to certain "hardware people":

    [Me/LKMM] People, how do you order such and those MSAs?
    [RTL/DV]  What are Linux's uses and requirements?

which is to say that the work mentioned is probably more "interactive"
and more dynamic than your question may suggest. ;)
That said, I do like to think that we (LKMM supporters and followers)
have enough knowledge to approach that effort. It would require some
changes to herd7 (and klitmus7), starting from teaching the tools the
new MSAs syntax and how to generate rf, co and other basic relations
(while monitoring potential non-MSA regressions). Non-trivial maybe,
but seems doable. Suffice it to say that herd7 can't currently parse
the following C test, but it can run its "lowerings"/assembly against
a bunch of hardware models and implementations, including arm64, x86,
powerpc and riscv! Any volunteers with ocaml expertise interested in
contributing to the LKMM? ;)
C C-thomas-haas
{
u32 x;
u16 *x_lh = x; // herd7 dialect for "... = &x;"
}
P0(u32 *x)
{
WRITE_ONCE(*x, 0x10001);
}
P1(u16 **x_lh, u32 *x)
{
u16 r0;
u32 r1;
r0 = xchg_relaxed(*x_lh, 0x2);
r1 = smp_load_acquire(x);
}
exists (1:r0=0x1 /\ 1:r1=0x2)
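(* Reading the exists clause: 1:r0=0x1 means the 16-bit xchg read the low
   half of P0's 0x10001 store, while 1:r1=0x2 means the full-width acquire
   load returned its own 0x2 in the low half together with the initial 0x0
   in the high half, i.e., a high half older than the store the xchg
   observed. *)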
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 8:42 ` Hernan Ponce de Leon
@ 2025-06-17 14:17 ` Will Deacon
2025-06-17 14:23 ` Thomas Haas
0 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2025-06-17 14:17 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Thomas Haas, Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
> On 6/17/2025 8:19 AM, Thomas Haas wrote:
> > On 16.06.25 16:23, Will Deacon wrote:
> > > I'm half inclined to think that the Arm memory model should be tightened
> > > here; I can raise that with Arm and see what they say.
> > >
> > > Although the cited paper does give examples of store-forwarding from a
> > > narrow store to a wider load, the case in qspinlock is further
> > > constrained by having the store come from an atomic rmw and the load
> > > having acquire semantics. Setting aside the MSA part, that specific case
> > > _is_ ordered in the Arm memory model (and C++ release sequences rely on
> > > it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
> > > an atomic rmw to an acquire load.
> > >
> > > Given that, I don't see how this is going to occur in practice.
> >
> > You are probably right. The ARM model's atomic-ordered-before relation
> >
> > let aob = rmw | [range(rmw)]; lrs; [A | Q]
> >
> > clearly orders the rmw-store with subsequent acquire loads (lrs = local-
> > read-successor, A = acquire).
> > If we treat this relation (at least the second part) as a "global
> > ordering" and extend it by "si" (same-instruction), then the problematic
> > reordering under MSA should be gone.
> > I quickly ran Dartagnan on the MSA litmus tests with this change to the
> > ARM model and all the tests still pass.
>
> Even with this change I still get violations (both safety and termination)
> for qspinlock with dartagnan.
Please can you be more specific about the problems you see?
> I think the problem is actually with the Internal visibility axiom, because
> only making that one stronger seems to remove the violations.
That sounds surprising to me, as there's nothing particularly weird about
Arm's coherence requirements when compared to other architectures, as far
as I'm aware.
Will
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 14:17 ` Will Deacon
@ 2025-06-17 14:23 ` Thomas Haas
2025-06-17 19:00 ` Hernan Ponce de Leon
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Haas @ 2025-06-17 14:23 UTC (permalink / raw)
To: Will Deacon, Hernan Ponce de Leon
Cc: Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On 17.06.25 16:17, Will Deacon wrote:
> On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
>> On 6/17/2025 8:19 AM, Thomas Haas wrote:
>>> On 16.06.25 16:23, Will Deacon wrote:
>>>> I'm half inclined to think that the Arm memory model should be tightened
>>>> here; I can raise that with Arm and see what they say.
>>>>
>>>> Although the cited paper does give examples of store-forwarding from a
>>>> narrow store to a wider load, the case in qspinlock is further
>>>> constrained by having the store come from an atomic rmw and the load
>>>> having acquire semantics. Setting aside the MSA part, that specific case
>>>> _is_ ordered in the Arm memory model (and C++ release sequences rely on
>>>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
>>>> an atomic rmw to an acquire load.
>>>>
>>>> Given that, I don't see how this is going to occur in practice.
>>>
>>> You are probably right. The ARM model's atomic-ordered-before relation
>>>
>>> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>>>
>>> clearly orders the rmw-store with subsequent acquire loads (lrs = local-
>>> read-successor, A = acquire).
>>> If we treat this relation (at least the second part) as a "global
>>> ordering" and extend it by "si" (same-instruction), then the problematic
>>> reordering under MSA should be gone.
>>> I quickly ran Dartagnan on the MSA litmus tests with this change to the
>>> ARM model and all the tests still pass.
>>
>> Even with this change I still get violations (both safety and termination)
>> for qspinlock with dartagnan.
>
> Please can you be more specific about the problems you see?
I talked to Hernán personally, and it turned out that he used the generic
implementation of smp_cond_acquire (not sure if the name is correct),
which uses a relaxed load followed by a barrier. In that case, replacing
aob by aob;si does not change anything.
Indeed, even in the reported problem we used the generic implementation
(I was unaware of this), though it is easy to check that changing the
relaxed load to acquire does not give sufficient orderings.
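For reference, the generic helper (presumably smp_cond_load_acquire() from
include/asm-generic/barrier.h) is roughly:

    #define smp_cond_load_acquire(ptr, cond_expr) ({          \
        __unqual_scalar_typeof(*ptr) _val;                    \
        _val = smp_cond_load_relaxed(ptr, cond_expr);         \
        smp_acquire__after_ctrl_dep(); /* i.e. smp_rmb() */   \
        (typeof(*ptr))_val;                                   \
    })

i.e., a relaxed polling load followed by a barrier, as described above.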
>
>> I think the problem is actually with the Internal visibility axiom, because
>> only making that one stronger seems to remove the violations.
>
> That sounds surprising to me, as there's nothing particularly weird about
> Arm's coherence requirements when compared to other architectures, as far
> as I'm aware.
>
I agree. The internal visibility axiom is not the issue, I think.
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 14:23 ` Thomas Haas
@ 2025-06-17 19:00 ` Hernan Ponce de Leon
2025-06-19 12:30 ` Will Deacon
0 siblings, 1 reply; 21+ messages in thread
From: Hernan Ponce de Leon @ 2025-06-17 19:00 UTC (permalink / raw)
To: Thomas Haas, Will Deacon
Cc: Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On 6/17/2025 4:23 PM, Thomas Haas wrote:
>
>
> On 17.06.25 16:17, Will Deacon wrote:
>> On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
>>> On 6/17/2025 8:19 AM, Thomas Haas wrote:
>>>> On 16.06.25 16:23, Will Deacon wrote:
>>>>> I'm half inclined to think that the Arm memory model should be
>>>>> tightened
>>>>> here; I can raise that with Arm and see what they say.
>>>>>
>>>>> Although the cited paper does give examples of store-forwarding from a
>>>>> narrow store to a wider load, the case in qspinlock is further
>>>>> constrained by having the store come from an atomic rmw and the load
>>>>> having acquire semantics. Setting aside the MSA part, that specific
>>>>> case
>>>>> _is_ ordered in the Arm memory model (and C++ release sequences
>>>>> rely on
>>>>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding
>>>>> from
>>>>> an atomic rmw to an acquire load.
>>>>>
>>>>> Given that, I don't see how this is going to occur in practice.
>>>>
>>>> You are probably right. The ARM model's atomic-ordered-before relation
>>>>
>>>> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>>>>
>>>> clearly orders the rmw-store with subsequent acquire loads (lrs =
>>>> local-
>>>> read-successor, A = acquire).
>>>> If we treat this relation (at least the second part) as a "global
>>>> ordering" and extend it by "si" (same-instruction), then the
>>>> problematic
>>>> reordering under MSA should be gone.
>>>> I quickly ran Dartagnan on the MSA litmus tests with this change to the
>>>> ARM model and all the tests still pass.
>>>
>>> Even with this change I still get violations (both safety and
>>> termination)
>>> for qspinlock with dartagnan.
>>
>> Please can you be more specific about the problems you see?
>
> I talked to Hernán personally and it turned out that he used the generic
> implementation of smp_cond_acquire (not sure if the name is correct)
> which uses a relaxed load followed by a barrier. In that case, replacing
> aob by aob;si does not change anything.
> Indeed, even in the reported problem we used the generic implementation
> (I was unaware of this), though it is easy to check that changing the
> relaxed load to acquire does not give sufficient orderings.
Yes, my bad. I was using the generic header rather than the aarch64-specific
one, so the changes to the model were having no effect (as they should).
Now I am using the aarch64-specific ones, and I can confirm that dartagnan
still reports the violations with the current model, and that making the
change proposed by Thomas (adding ";si" just to the second part seems to
be enough) indeed removes all violations.
Hernan
>
>>
>>> I think the problem is actually with the Internal visibility axiom,
>>> because
>>> only making that one stronger seems to remove the violations.
>>
>> That sounds surprising to me, as there's nothing particularly weird about
>> Arm's coherence requirements when compared to other architectures, as far
>> as I'm aware.
>>
>
> I agree. The internal visibility axiom is not the issue I think.
>
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
2025-06-16 14:23 ` Will Deacon
@ 2025-06-18 6:51 ` Paul E. McKenney
2025-06-18 12:11 ` Paul E. McKenney
2 siblings, 1 reply; 21+ messages in thread
From: Paul E. McKenney @ 2025-06-18 6:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
[ . . . ]
> > - put some other read-read barrier between the xchg_tail and the load.
> >
> >
> > ### Implications for qspinlock executed on non-ARM architectures.
> >
> > Unfortunately, there are no MSA extensions for other hardware memory models,
> > so we have to speculate based on whether the problematic reordering is
> > permitted if the problematic load was treated as two individual
> > instructions.
> > It seems Power and RISCV would have no problem reordering the instructions,
> > so qspinlock might also break on those architectures.
>
> Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> word LL/SC and should be good.
>
> But yes, ZABHA might be equally broken.
All architectures handle eight-bit atomics and stores, but last I checked,
there were a few systems still around that failed to support 16-bit
atomics and stores. I will check again.
(But those systems' architectures can simply avoid supporting kernel
features requiring these 16-bit operations.)
It would be good to add multiple sizes to LKMM, and even more so once we
have 16-bit support across the board.
Thanx, Paul
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-18 6:51 ` Paul E. McKenney
@ 2025-06-18 12:11 ` Paul E. McKenney
0 siblings, 0 replies; 21+ messages in thread
From: Paul E. McKenney @ 2025-06-18 12:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On Tue, Jun 17, 2025 at 11:51:49PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>
> [ . . . ]
>
> > > - put some other read-read barrier between the xchg_tail and the load.
> > >
> > >
> > > ### Implications for qspinlock executed on non-ARM architectures.
> > >
> > > Unfortunately, there are no MSA extensions for other hardware memory models,
> > > so we have to speculate based on whether the problematic reordering is
> > > permitted if the problematic load was treated as two individual
> > > instructions.
> > > It seems Power and RISCV would have no problem reordering the instructions,
> > > so qspinlock might also break on those architectures.
> >
> > Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> > word LL/SC and should be good.
> >
> > But yes, ZABHA might be equally broken.
>
> All architectures handle eight-bit atomics and stores, but last I checked,
> there were a few systems still around that failed to support 16-bit
> atomics and stores. I will check again.
>
> (But those systems' architectures can simply avoid supporting kernel
> features requiring these 16-bit operations.)
>
> It would be good to add multiple sizes to LKMM, and even more so once we
> have 16-bit support across the board.
And Arnd tells me that the Linux kernel might be safe for 16-bit stores
in core code perhaps as early as the end of this year. ;-)
Thanx, Paul
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 19:00 ` Hernan Ponce de Leon
@ 2025-06-19 12:30 ` Will Deacon
2025-06-19 14:11 ` Thomas Haas
0 siblings, 1 reply; 21+ messages in thread
From: Will Deacon @ 2025-06-19 12:30 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Thomas Haas, Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On Tue, Jun 17, 2025 at 09:00:30PM +0200, Hernan Ponce de Leon wrote:
> On 6/17/2025 4:23 PM, Thomas Haas wrote:
> >
> >
> > On 17.06.25 16:17, Will Deacon wrote:
> > > On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
> > > > On 6/17/2025 8:19 AM, Thomas Haas wrote:
> > > > > On 16.06.25 16:23, Will Deacon wrote:
> > > > > > I'm half inclined to think that the Arm memory model
> > > > > > should be tightened
> > > > > > here; I can raise that with Arm and see what they say.
> > > > > >
> > > > > > Although the cited paper does give examples of store-forwarding from a
> > > > > > narrow store to a wider load, the case in qspinlock is further
> > > > > > constrained by having the store come from an atomic rmw and the load
> > > > > > having acquire semantics. Setting aside the MSA part,
> > > > > > that specific case
> > > > > > _is_ ordered in the Arm memory model (and C++ release
> > > > > > sequences rely on
> > > > > > it iirc), so it's fair to say that Arm CPUs don't permit
> > > > > > forwarding from
> > > > > > an atomic rmw to an acquire load.
> > > > > >
> > > > > > Given that, I don't see how this is going to occur in practice.
> > > > >
> > > > > You are probably right. The ARM model's atomic-ordered-before relation
> > > > >
> > > > > let aob = rmw | [range(rmw)]; lrs; [A | Q]
> > > > >
> > > > > clearly orders the rmw-store with subsequent acquire loads
> > > > > (lrs = local-
> > > > > read-successor, A = acquire).
> > > > > If we treat this relation (at least the second part) as a "global
> > > > > ordering" and extend it by "si" (same-instruction), then the
> > > > > problematic
> > > > > reordering under MSA should be gone.
> > > > > I quickly ran Dartagnan on the MSA litmus tests with this change to the
> > > > > ARM model and all the tests still pass.
> > > >
> > > > Even with this change I still get violations (both safety and
> > > > termination)
> > > > for qspinlock with dartagnan.
> > >
> > > Please can you be more specific about the problems you see?
> >
> > I talked to Hernán personally and it turned out that he used the generic
> > implementation of smp_cond_acquire (not sure if the name is correct)
> > which uses a relaxed load followed by a barrier. In that case, replacing
> > aob by aob;si does not change anything.
> > Indeed, even in the reported problem we used the generic implementation
> > (I was unaware of this), though it is easy to check that changing the
> > relaxed load to acquire does not give sufficient orderings.
>
> Yes, my bad. I was using the generic header rather than the aarch64-specific
> one, so the changes to the model were having no effect (as they should).
>
> Now I am using the aarch64 specific ones and I can confirm dartagnan still
> reports the violations with the current model and making the change proposed
> by Thomas (adding ;si just to the second part seems to be enough) indeed
> removes all violations.
That's great! Thanks for working together to get to the bottom of it. I
was worried that this was the tip of the iceberg.
I'll try to follow up with Arm to see if that ';si' addition to
atomic-ordered-before is acceptable. If not, we'll absorb the smp_wmb()
into xchg_release() with a fat comment about RCsc and mixed-size
accesses.
Will
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-19 12:30 ` Will Deacon
@ 2025-06-19 14:11 ` Thomas Haas
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Haas @ 2025-06-19 14:11 UTC (permalink / raw)
To: Will Deacon, Hernan Ponce de Leon
Cc: Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On 19.06.25 14:30, Will Deacon wrote:
> On Tue, Jun 17, 2025 at 09:00:30PM +0200, Hernan Ponce de Leon wrote:
>> On 6/17/2025 4:23 PM, Thomas Haas wrote:
>>>
>>>
>>> On 17.06.25 16:17, Will Deacon wrote:
>>>> On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
>>>>> On 6/17/2025 8:19 AM, Thomas Haas wrote:
>>>>>> On 16.06.25 16:23, Will Deacon wrote:
>>>>>>> I'm half inclined to think that the Arm memory model
>>>>>>> should be tightened
>>>>>>> here; I can raise that with Arm and see what they say.
>>>>>>>
>>>>>>> Although the cited paper does give examples of store-forwarding from a
>>>>>>> narrow store to a wider load, the case in qspinlock is further
>>>>>>> constrained by having the store come from an atomic rmw and the load
>>>>>>> having acquire semantics. Setting aside the MSA part,
>>>>>>> that specific case
>>>>>>> _is_ ordered in the Arm memory model (and C++ release
>>>>>>> sequences rely on
>>>>>>> it iirc), so it's fair to say that Arm CPUs don't permit
>>>>>>> forwarding from
>>>>>>> an atomic rmw to an acquire load.
>>>>>>>
>>>>>>> Given that, I don't see how this is going to occur in practice.
>>>>>>
>>>>>> You are probably right. The ARM model's atomic-ordered-before relation
>>>>>>
>>>>>> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>>>>>>
>>>>>> clearly orders the rmw-store with subsequent acquire loads
>>>>>> (lrs = local-
>>>>>> read-successor, A = acquire).
>>>>>> If we treat this relation (at least the second part) as a "global
>>>>>> ordering" and extend it by "si" (same-instruction), then the
>>>>>> problematic
>>>>>> reordering under MSA should be gone.
>>>>>> I quickly ran Dartagnan on the MSA litmus tests with this change to the
>>>>>> ARM model and all the tests still pass.
>>>>>
>>>>> Even with this change I still get violations (both safety and
>>>>> termination)
>>>>> for qspinlock with dartagnan.
>>>>
>>>> Please can you be more specific about the problems you see?
>>>
>>> I talked to Hernán personally and it turned out that he used the generic
>>> implementation of smp_cond_acquire (not sure if the name is correct)
>>> which uses a relaxed load followed by a barrier. In that case, replacing
>>> aob by aob;si does not change anything.
>>> Indeed, even in the reported problem we used the generic implementation
>>> (I was unaware of this), though it is easy to check that changing the
>>> relaxed load to acquire does not give sufficient orderings.
>>
>> Yes, my bad. I was using the generic header rather than the aarch64 specific
>> one and then the changes to the model were having not effect (as they
>> should).
>>
>> Now I am using the aarch64 specific ones and I can confirm dartagnan still
>> reports the violations with the current model and making the change proposed
>> by Thomas (adding ;si just to the second part seems to be enough) indeed
>> removes all violations.
>
> That's great! Thanks for working together to get to the bottom of it. I
> was worried that this was the tip of the iceberg.
>
> I'll try to follow up with Arm to see if that ';si' addition to
> atomic-ordered-before is acceptable. If not, we'll absorb the smp_wmb()
> into xchg_release() with a fat comment about RCsc and mixed-size
> accesses.
>
> Will
I have already contacted ARM, but I think it will take some time to get
an answer.
Interestingly, the definition of aob has changed in a newer version of
the ARM cat model:
let aob = [Exp & M]; rmw; [Exp & M]
| [Exp & M]; rmw; lrs; [A | Q]
| [Imp & TTD & R]; rmw; [HU]
Ignoring all the extra Exp/Imp/TTD stuff, you can see that the second
part of the definition only orders the load-part of the rmw with the
following acquire load. This suggests that a form of store forwarding
might indeed be expected, seeing that they deliberately removed the
rmw_store-acq_load ordering they had previously.
Nevertheless, for qspinlock to work we just need the rmw_load-acq_load
ordering that is still provided as long as we extend it with ";si".
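Rendered on the newer definition, the strengthening we have in mind would
look roughly like:

    let aob = [Exp & M]; rmw; [Exp & M]
            | ([Exp & M]; rmw; lrs; [A | Q]); si
            | [Imp & TTD & R]; rmw; [HU]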
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 14:09 ` Andrea Parri
@ 2025-06-19 14:27 ` Thomas Haas
2025-06-19 14:32 ` Alan Stern
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Haas @ 2025-06-19 14:27 UTC (permalink / raw)
To: Andrea Parri, Alan Stern
Cc: Peter Zijlstra, Will Deacon, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On 17.06.25 16:09, Andrea Parri wrote:
>> My question is: Do we have enough general knowledge at this point about
>> how the various types of hardware handle mixed-size accesses to come up
>> with a memory model everyone can agree on?
>
> You know, I can imagine a conversation along the following lines if I
> were to turn this question to certain "hardware people":
>
> [Me/LKMM] People, how do you order such and those MSAs?
> [RTL/DV] What are Linux's uses and requirements?
>
> which is to say that the work mentioned is probably more "interactive"
> and more dynamic than your question may suggest. ;)
>
> That said, I do like to think that we (LKMM supporters and followers)
> have enough knowledge to approach that effort. It would require some
> changes to herd7 (and klitmus7), starting from teaching the tools the
> new MSA syntax and how to generate rf, co and other basic relations
> (while monitoring potential non-MSA regressions). Non-trivial maybe,
> but it seems doable. Suffice it to say that herd7 can't currently parse
> the following C test, but it can run its "lowerings"/assembly against
> a bunch of hardware models and implementations, including arm64, x86,
> powerpc and riscv! Any volunteers with OCaml expertise interested in
> contributing to the LKMM? ;)
>
> C C-thomas-haas
>
> {
> u32 x;
> u16 *x_lh = x; // herd7 dialect for "... = &x;"
> }
>
> P0(u32 *x)
> {
> WRITE_ONCE(*x, 0x10001);
> }
>
> P1(u16 **x_lh, u32 *x)
> {
> u16 r0;
> u32 r1;
>
> r0 = xchg_relaxed(*x_lh, 0x2);
> r1 = smp_load_acquire(x);
> }
>
> exists (1:r0=0x1 /\ 1:r1=0x2)
I support this endeavor, but from the Dartagnan side :).
We already support MSA in real C/Linux code and so extending our
supported Litmus fragment to MSA does not sound too hard to me.
We are just missing an LKMM cat model that supports MSA.
We cannot automatically generate and run tests on real hardware though,
so the support in herd7 and co. is definitely needed.
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-19 14:27 ` Thomas Haas
@ 2025-06-19 14:32 ` Alan Stern
2025-06-19 14:59 ` Thomas Haas
0 siblings, 1 reply; 21+ messages in thread
From: Alan Stern @ 2025-06-19 14:32 UTC (permalink / raw)
To: Thomas Haas
Cc: Andrea Parri, Peter Zijlstra, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Thu, Jun 19, 2025 at 04:27:56PM +0200, Thomas Haas wrote:
> I support this endeavor, but from the Dartagnan side :).
> We already support MSA in real C/Linux code and so extending our supported
> Litmus fragment to MSA does not sound too hard to me.
> We are just missing a LKMM cat model that supports MSA.
To me, it doesn't feel all that easy. I'm not even sure where to start
changing the LKMM.
Probably the best way to keep things organized would be to begin with
changes to the informal operational model and then figure out how to
formalize them. But what changes are needed to the operational model?
Alan
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-19 14:32 ` Alan Stern
@ 2025-06-19 14:59 ` Thomas Haas
2025-06-19 17:56 ` Alan Stern
0 siblings, 1 reply; 21+ messages in thread
From: Thomas Haas @ 2025-06-19 14:59 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Peter Zijlstra, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On 19.06.25 16:32, Alan Stern wrote:
> On Thu, Jun 19, 2025 at 04:27:56PM +0200, Thomas Haas wrote:
>> I support this endeavor, but from the Dartagnan side :).
>> We already support MSA in real C/Linux code and so extending our supported
>> Litmus fragment to MSA does not sound too hard to me.
>> We are just missing a LKMM cat model that supports MSA.
>
> To me, it doesn't feel all that easy. I'm not even sure where to start
> changing the LKMM.
>
> Probably the best way to keep things organized would be to begin with
> changes to the informal operational model and then figure out how to
> formalize them. But what changes are needed to the operational model?
>
> Alan
Of course, the difficult part is to get the model right. Maybe I
shouldn't have said that we are "just" missing the model :).
I'm only saying that we already have some tooling to validate changes to
the formal model.
I think it makes sense to first play around with the tooling and changes
to the formal model to just get a feeling of what can go wrong and what
needs to go right. Then it might become clearer how the informal
operational model needs to change.
A good starting point might be to lift the existing Armv8 MSA litmus
tests to corresponding C/LKMM litmus tests and go from there.
If the informal operational model fails to explain them, then it needs
to change. This goes only one way though: if ARM permits a behavior, then
so should LKMM. If ARM does not, then it is not so clear whether LKMM
should or not.
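For example, the qspinlock pattern with the generic
smp_cond_load_acquire() implementation (a relaxed load followed by a
barrier, as discussed earlier in the thread) could be lifted roughly as
below. This is only a sketch modelled on Andrea's C-thomas-haas test;
the dialect and the test name are hypothetical, since herd7 cannot
parse MSA tests yet:

C MSA-generic-cond-load

{
u32 x;
u16 *x_lh = x; // herd7 dialect for "... = &x;"
}

P0(u32 *x)
{
WRITE_ONCE(*x, 0x10001);
}

P1(u16 **x_lh, u32 *x)
{
u16 r0;
u32 r1;

r0 = xchg_relaxed(*x_lh, 0x2);
r1 = READ_ONCE(*x); // generic variant: relaxed load ...
smp_acquire__after_ctrl_dep(); // ... followed by the acquire barrier
}

exists (1:r0=0x1 /\ 1:r1=0x2)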
Thomas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-19 14:59 ` Thomas Haas
@ 2025-06-19 17:56 ` Alan Stern
2025-06-19 18:21 ` Thomas Haas
0 siblings, 1 reply; 21+ messages in thread
From: Alan Stern @ 2025-06-19 17:56 UTC (permalink / raw)
To: Thomas Haas
Cc: Andrea Parri, Peter Zijlstra, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Thu, Jun 19, 2025 at 04:59:38PM +0200, Thomas Haas wrote:
>
>
> On 19.06.25 16:32, Alan Stern wrote:
> > On Thu, Jun 19, 2025 at 04:27:56PM +0200, Thomas Haas wrote:
> > > I support this endeavor, but from the Dartagnan side :).
> > > We already support MSA in real C/Linux code and so extending our supported
> > > Litmus fragment to MSA does not sound too hard to me.
> > > We are just missing a LKMM cat model that supports MSA.
> >
> > To me, it doesn't feel all that easy. I'm not even sure where to start
> > changing the LKMM.
> >
> > Probably the best way to keep things organized would be to begin with
> > changes to the informal operational model and then figure out how to
> > formalize them. But what changes are needed to the operational model?
> >
> > Alan
>
> Of course, the difficult part is to get the model right. Maybe I shouldn't
> have said that we are "just" missing the model :).
> I'm only saying that we already have some tooling to validate changes to the
> formal model.
>
> I think it makes sense to first play around with the tooling and changes to
> the formal model to just get a feeling of what can go wrong and what needs
> to go right. Then it might become clearer how the informal operational
> model needs to change.
>
> A good starting point might be to lift the existing Armv8 MSA litmus tests to
> corresponding C/LKMM litmus tests and go from there.
> If the informal operational model fails to explain them, then it needs to
> change. This goes only one way though: if ARM permits a behavior then so
> should LKMM. If ARM does not, then it is not so clear if LKMM should or not.
Okay, that seems reasonable.
BTW, I don't want to disagree with what you wrote ... but doesn't your
last paragraph contradict the paragraph before it? Is starting with the
various MSA litmus tests and seeing how the operational model fails to
explain them not the opposite of first playing around with the tooling
and changes to the formal model?
Alan
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-19 17:56 ` Alan Stern
@ 2025-06-19 18:21 ` Thomas Haas
0 siblings, 0 replies; 21+ messages in thread
From: Thomas Haas @ 2025-06-19 18:21 UTC (permalink / raw)
To: Alan Stern
Cc: Andrea Parri, Peter Zijlstra, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On 19.06.25 19:56, Alan Stern wrote:
> On Thu, Jun 19, 2025 at 04:59:38PM +0200, Thomas Haas wrote:
>>
>>
>> On 19.06.25 16:32, Alan Stern wrote:
>>> On Thu, Jun 19, 2025 at 04:27:56PM +0200, Thomas Haas wrote:
>>>> I support this endeavor, but from the Dartagnan side :).
>>>> We already support MSA in real C/Linux code and so extending our supported
>>>> Litmus fragment to MSA does not sound too hard to me.
>>>> We are just missing a LKMM cat model that supports MSA.
>>>
>>> To me, it doesn't feel all that easy. I'm not even sure where to start
>>> changing the LKMM.
>>>
>>> Probably the best way to keep things organized would be to begin with
>>> changes to the informal operational model and then figure out how to
>>> formalize them. But what changes are needed to the operational model?
>>>
>>> Alan
>>
>> Of course, the difficult part is to get the model right. Maybe I shouldn't
>> have said that we are "just" missing the model :).
>> I'm only saying that we already have some tooling to validate changes to the
>> formal model.
>>
>> I think it makes sense to first play around with the tooling and changes to
>> the formal model to just get a feeling of what can go wrong and what needs
>> to go right. Then it might become clearer how the informal operational
>> model needs to change.
>>
>> A good starting point might be to lift the existing Armv8 MSA litmus tests to
>> corresponding C/LKMM litmus tests and go from there.
>> If the informal operational model fails to explain them, then it needs to
>> change. This goes only one way though: if ARM permits a behavior then so
>> should LKMM. If ARM does not, then it is not so clear if LKMM should or not.
>
> Okay, that seems reasonable.
>
> BTW, I don't want to disagree with what you wrote ... but doesn't your
> last paragraph contradict the paragraph before it? Is starting with the
> various MSA litmus tests and seeing how the operational model fails to
> explain them not the opposite of first playing around with the tooling
> and changes to the formal model?
>
> Alan
I would rather see the approaches as complementary.
If you lift ARM tests to LKMM tests, you can use them to reason about
both the informal operational model and the formal axiomatic model.
I mean, there are going to be litmus tests that show behavior that
should reasonably be forbidden on all platforms, and trying to adapt the
formal LKMM to also forbid them seems like a good exercise.
Even if it turns out that the changes are unsound, I think you will end
up with a better understanding, no?
Either way, I'm just suggesting possible ways to start, but I myself
don't have the time (right now) to get this project going.
Thomas
Thread overview: 21+ messages
2025-06-12 14:55 [RFC] Potential problem in qspinlock due to mixed-size accesses Thomas Haas
2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
[not found] ` <9264df13-36db-4b25-b2c4-7a9701df2f4d@tu-bs.de>
2025-06-16 6:21 ` Andrea Parri
2025-06-16 14:11 ` Alan Stern
2025-06-17 14:09 ` Andrea Parri
2025-06-19 14:27 ` Thomas Haas
2025-06-19 14:32 ` Alan Stern
2025-06-19 14:59 ` Thomas Haas
2025-06-19 17:56 ` Alan Stern
2025-06-19 18:21 ` Thomas Haas
2025-06-16 14:23 ` Will Deacon
2025-06-17 6:19 ` Thomas Haas
2025-06-17 8:42 ` Hernan Ponce de Leon
2025-06-17 14:17 ` Will Deacon
2025-06-17 14:23 ` Thomas Haas
2025-06-17 19:00 ` Hernan Ponce de Leon
2025-06-19 12:30 ` Will Deacon
2025-06-19 14:11 ` Thomas Haas
2025-06-18 6:51 ` Paul E. McKenney
2025-06-18 12:11 ` Paul E. McKenney