* [RFC] Potential problem in qspinlock due to mixed-size accesses
@ 2025-06-12 14:55 Thomas Haas
2025-06-13 7:55 ` Peter Zijlstra
2025-12-17 19:05 ` Thomas Haas
0 siblings, 2 replies; 23+ messages in thread
From: Thomas Haas @ 2025-06-12 14:55 UTC (permalink / raw)
To: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm
Cc: hernan.poncedeleon, jonas.oberhauser, r.maseli@tu-bs.de
We have been taking a look at whether mixed-size accesses (MSA) can affect
the correctness of qspinlock.
We are focusing on aarch64, which is the only memory model with MSA
support [1].
For this we extended the dartagnan [2] tool to support MSA, and it now
reports liveness, synchronization, and mutual-exclusion issues.
Note that we did something similar in the past for LKMM, but we were
ignoring MSA [3].
The culprit of all these issues is that the atomicity of a single load
instruction is not guaranteed in the presence of smaller-sized stores
(observed on real hardware according to [1], Fig. 21/22).
Consider the following pseudo code:
int16 old = xchg16_rlx(&lock, 42);
int32 l = load32_acq(&lock);
Then the hardware can treat the code as (likely due to store-forwarding)
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer arithmetic
and reorder it to
int16 l2 = load16_acq(&lock + 2);
int16 old = xchg16_rlx(&lock, 42);
int16 l1 = load16_acq(&lock);
Now another thread can overwrite "lock" in between the first two
accesses, so that the original l (composed of l1 and l2) ends up
containing parts of a lock value that is older than what the xchg observed.
Let us now explain how this can break qspinlock.
We need 3 threads going the slowpath, T1, T2, and T3.
T1 is a setup-thread that takes the lock just so that T2 and T3 observe
contention and try to enqueue themselves.
So consider a situation where
- T3's node is the only enqueued node (due to previous contention
with T1).
- T2 is about to enqueue its node.
- T3 is about to take the lock because it sees no contention (lock
is free, no pending bits are set, it has the only node in the queue).
- the lock is released (T1 is already done)
We focus on the following lines of code in qspinlock.c:
// <- T2 is here, but executes part of line 328 first.
277: old = xchg_tail(lock, tail); // ~ xchg_rlx(&lock->tail, tail)
// ...
328: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
// ...
// <- T3 is here
352: if ((val & _Q_TAIL_MASK) == tail) {
         if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
             goto release; /* No contention */
355: }
Then the following sequence of operations is problematic:
(1) T2 reads half of the lock (the half that is not lock->tail, i.e., it
"only" reads lock->locked_pending)
328: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); // P1
// Roughly: val_locked_pending = load_acquire(&lock->locked_pending);
T2 observes a free lock and no pending bits set.
(2) T3 takes the lock because it does not observe contention
352: if ((val & _Q_TAIL_MASK) == tail) { // Satisfied because only T3's node is enqueued and lock is free
         if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
             goto release; /* No contention */
355: }
T3 clears the tail and claims the lock. The queue is empty now.
(3) T2 enqueues itself
277: old = xchg_tail(lock, tail);
T2 sees an empty queue (old == 0) because T3 just cleared it, and
enqueues its node.
(4) T2 reads the remaining half of the lock from (1), reading the tail
(lock->tail) it just inserted.
328: val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK)); // P2
// Roughly: val_tail = load_acquire(&lock->tail);
T2 observes its own tail, plus a free lock and no pending bits (from (1)).
Now there are two continuations, one leading to broken synchronisation,
another leading to non-termination or failure of mutual exclusion.
We first consider the synchronisation issue.
(5a) T3 finishes its critical section and releases the lock
(6a) T2 takes the lock with the same code as in point (2):
352: if ((val & _Q_TAIL_MASK) == tail) { // Satisfied because only T2's node is enqueued and lock is free
         if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
             goto release; /* No contention */
355: }
Notice that the "atomic_try_cmpxchg_relaxed" would fail if the
lock was still taken by T3, because "val" has no lock bits set (as T2
observed the lock to be free).
Although T2 now properly enters the CS after T3 has finished, the
synchronisation is broken since T2 did not perform an acquire operation
to synchronise with the lock release.
Indeed, dartagnan reports no safety violations (with 3 threads)
if the CAS is made an acquire or the CS itself contains an acquire
barrier (smp_rmb or smp_mb).
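For illustration, the acquire-CAS variant of that fix can be sketched with C11 atomics. This is a hedged sketch with our own names and an illustrative lock-value encoding; the kernel itself would use atomic_try_cmpxchg_acquire rather than these builtins:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_VAL 1u  /* illustrative encoding of "locked, no tail" */

/* Sketch of the uncontended lock handoff at line 352, with the CAS
 * upgraded from relaxed to acquire: on success, T2 now synchronises
 * with the releasing unlock even if its earlier 32-bit read of ->val
 * was satisfied in halves. */
static bool try_lock_uncontended(_Atomic uint32_t *val, uint32_t *old)
{
    return atomic_compare_exchange_strong_explicit(
        val, old, Q_LOCKED_VAL,
        memory_order_acquire,   /* success: pairs with the unlock's release */
        memory_order_relaxed);  /* failure: only reloads the value */
}
```

As in the scenario above, the CAS still fails whenever the lock bits in *old do not match the current word, so the acquire only strengthens the success path.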
Now, let's consider the alternative continuation that leads to
non-termination or failure of mutual exclusion.
(5b) T2 tries to take the lock as above in (6a) but the CAS fails
because the lock is still taken by T3.
(6b) Due to the failing CAS, T2 observes contention (at least it thinks
so). It sets the lock (although T3 might still have it!) and waits until
the (non-existent) "contending thread" enqueues its node:
/*
* Either somebody is queued behind us or _Q_PENDING_VAL got set
* which will then detect the remaining tail and queue behind us
* ensuring we'll see a @next.
*/
362: set_locked(lock);
/*
* contended path; wait for next if not observed yet, release.
*/
367: if (!next)
368: next = smp_cond_load_relaxed(&node->next, (VAL));
The first comment suggests that the assumed situation is that either
another thread enqueued a node or the pending bits got set. But neither is
true in the current execution: we got here because the lock was taken.
Now there are two outcomes:
- Non-termination: the "smp_cond_load_relaxed" waits forever,
because there is no other thread that enqueues another node.
- Broken mutex: another thread (T4) enqueues a node and therefore
releases T2 from its loop. Now T2 enters the CS while T3 still executes
its CS.
Indeed, dartagnan reports this safety violation only with 4+
threads.
NOTE: For the above examples we forced all threads to take the slowpath
of qspinlock by removing the fastpath. With the fastpath present,
another thread (T0) is needed to force the other threads into the
slowpath, i.e., we need 4-5 threads to witness the issues.
### Solutions
The problematic executions rely on the fact that T2 can move half of its
load operation (1) to before the xchg_tail (3).
Preventing this reordering solves all issues. Possible solutions are:
- make the xchg_tail full-sized (i.e., also touch the lock/pending bits).
Note that if the kernel is configured for >= 16k CPUs, then the tail
becomes larger than 16 bits and needs to be encoded in parts of the
pending byte as well.
In this case, the kernel already makes a full-sized (32-bit) access for
the xchg, so the above bugs are only present in the < 16k CPUs setting.
- make the xchg_tail an acquire operation.
- make the xchg_tail a release operation (this is an odd solution by
itself, but works for aarch64 because it preserves REL->ACQ ordering).
In this case, maybe the preceding "smp_wmb()" can be removed.
- put some other read-read barrier between the xchg_tail and the load.
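The acquire and release options could be sketched with C11 atomics on a stand-alone 16-bit tail field (field names and layout are ours, not the kernel's; this only illustrates the intended memory orderings):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the tail half of the lock word. */
struct qsl {
    _Atomic uint16_t tail;
};

/* Option "acquire": the xchg itself orders against the later load
 * of ->val, so no half of that load can be hoisted above it. */
static uint16_t xchg_tail_acquire(struct qsl *l, uint16_t t)
{
    return atomic_exchange_explicit(&l->tail, t, memory_order_acquire);
}

/* Option "release": odd in isolation, but aarch64 preserves REL->ACQ
 * ordering, so a subsequent acquire load of ->val still cannot be
 * (even partially) satisfied before this store. */
static uint16_t xchg_tail_release(struct qsl *l, uint16_t t)
{
    return atomic_exchange_explicit(&l->tail, t, memory_order_release);
}
```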
### Implications for qspinlock executed on non-ARM architectures.
Unfortunately, there are no MSA extensions for other hardware memory
models, so we have to speculate based on whether the problematic
reordering would be permitted if the problematic load were treated as
two individual instructions.
It seems Power and RISC-V would have no problem reordering the
instructions, so qspinlock might also break on those architectures.
TSO, on the other hand, does not permit such reordering. Also, the
xchg_tail is an rmw operation, which acts as a full memory barrier under
TSO, so even if load-load reordering were permitted, the rmw would
prevent it.
[1] https://dl.acm.org/doi/10.1145/3458926
[2] https://github.com/hernanponcedeleon/Dat3M
[3] https://lkml.org/lkml/2022/8/26/597
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
^ permalink raw reply [flat|nested] 23+ messages in thread

* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-12 14:55 [RFC] Potential problem in qspinlock due to mixed-size accesses Thomas Haas
@ 2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
` (2 more replies)
2025-12-17 19:05 ` Thomas Haas
1 sibling, 3 replies; 23+ messages in thread
From: Peter Zijlstra @ 2025-06-13 7:55 UTC (permalink / raw)
To: Thomas Haas
Cc: Alan Stern, Andrea Parri, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
> We have been taking a look if mixed-size accesses (MSA) can affect the
> correctness of qspinlock.
> We are focusing on aarch64 which is the only memory model with MSA support
> [1].
> For this we extended the dartagnan [2] tool to support MSA and now it
> reports liveness, synchronization, and mutex issues.
> Notice that we did something similar in the past for LKMM, but we were
> ignoring MSA [3].
>
> The culprit of all these issues is that atomicity of single load
> instructions is not guaranteed in the presence of smaller-sized stores
> (observed on real hardware according to [1] and Fig. 21/22)
> Consider the following pseudo code:
>
> int16 old = xchg16_rlx(&lock, 42);
> int32 l = load32_acq(&lock);
>
> Then the hardware can treat the code as (likely due to store-forwarding)
>
> int16 old = xchg16_rlx(&lock, 42);
> int16 l1 = load16_acq(&lock);
> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
> arithmetic
>
> and reorder it to
>
> int16 l2 = load16_acq(&lock + 2);
> int16 old = xchg16_rlx(&lock, 42);
> int16 l1 = load16_acq(&lock);
>
> Now another thread can overwrite "lock" in between the first two accesses so
> that the original l (l1 and l2) ends up containing
> parts of a lock value that is older than what the xchg observed.
Oops :-(
(snip the excellent details)
> ### Solutions
>
> The problematic executions rely on the fact that T2 can move half of its
> load operation (1) to before the xchg_tail (3).
> Preventing this reordering solves all issues. Possible solutions are:
> - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> Note that if the kernel is configured with >= 16k cpus, then the tail
> becomes larger than 16 bits and needs to be encoded in parts of the pending
> byte as well.
> In this case, the kernel makes a full-sized (32-bit) access for the
> xchg. So the above bugs are only present in the < 16k cpus setting.
Right, but that is the more expensive option for some.
> - make the xchg_tail an acquire operation.
> - make the xchg_tail a release operation (this is an odd solution by
> itself but works for aarch64 because it preserves REL->ACQ ordering). In
> this case, maybe the preceding "smp_wmb()" can be removed.
I think I prefer this one, it moves a barrier rather than adding
additional overhead. Will?
> - put some other read-read barrier between the xchg_tail and the load.
>
>
> ### Implications for qspinlock executed on non-ARM architectures.
>
> Unfortunately, there are no MSA extensions for other hardware memory models,
> so we have to speculate based on whether the problematic reordering is
> permitted if the problematic load was treated as two individual
> instructions.
> It seems Power and RISCV would have no problem reordering the instructions,
> so qspinlock might also break on those architectures.
Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
word LL/SC and should be good.
But yes, ZABHA might be equally broken.
> TSO, on the other hand, does not permit such reordering. Also, the xchg_tail
> is a rmw operation which acts like a full memory barrier under TSO, so even
> if load-load reordering was permitted, the rmw would prevent this.
Right.
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-13 7:55 ` Peter Zijlstra
@ 2025-06-13 11:17 ` Andrea Parri
[not found] ` <9264df13-36db-4b25-b2c4-7a9701df2f4d@tu-bs.de>
2025-06-16 14:23 ` Will Deacon
2025-06-18 6:51 ` Paul E. McKenney
2 siblings, 1 reply; 23+ messages in thread
From: Andrea Parri @ 2025-06-13 11:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Will Deacon, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
> (snip the excellent details)
Indeed, joining in praising this report - Great work, Thomas!
> > ### Solutions
> >
> > The problematic executions rely on the fact that T2 can move half of its
> > load operation (1) to before the xchg_tail (3).
> > Preventing this reordering solves all issues. Possible solutions are:
> > - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> > Note that if the kernel is configured with >= 16k cpus, then the tail
> > becomes larger than 16 bits and needs to be encoded in parts of the pending
> > byte as well.
> > In this case, the kernel makes a full-sized (32-bit) access for the
> > xchg. So the above bugs are only present in the < 16k cpus setting.
>
> Right, but that is the more expensive option for some.
>
> > - make the xchg_tail an acquire operation.
> > - make the xchg_tail a release operation (this is an odd solution by
> > itself but works for aarch64 because it preserves REL->ACQ ordering). In
> > this case, maybe the preceding "smp_wmb()" can be removed.
>
> I think I prefer this one, it move a barrier, not really adding
> additional overhead. Will?
>
> > - put some other read-read barrier between the xchg_tail and the load.
> >
> >
> > ### Implications for qspinlock executed on non-ARM architectures.
> >
> > Unfortunately, there are no MSA extensions for other hardware memory models,
> > so we have to speculate based on whether the problematic reordering is
> > permitted if the problematic load was treated as two individual
> > instructions.
> > It seems Power and RISCV would have no problem reordering the instructions,
> > so qspinlock might also break on those architectures.
>
> Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> word LL/SC and should be good.
>
> But yes, ZABHA might be equally broken.
RISC-V forbids store-forwarding from AMOs or SCs; certain (non-normative)
commentary in the spec clarifies that the same ordering rule applies when
the memory accesses in question only overlap partially.
I am not aware of any "RISC-V implementation" manifesting the load-load
re-ordering in question. IAC, notice that making xchg_tail() a release
operation might not suffice to fix such an implementation given that the
arch has no plain load-acquire instruction yet and relies on the generic
(fence-based) code for atomic_cond_read_acquire().
Andrea
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
@ 2025-06-16 14:23 ` Will Deacon
2025-06-17 6:19 ` Thomas Haas
2025-06-18 6:51 ` Paul E. McKenney
2 siblings, 1 reply; 23+ messages in thread
From: Will Deacon @ 2025-06-16 14:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
> > We have been taking a look if mixed-size accesses (MSA) can affect the
> > correctness of qspinlock.
> > We are focusing on aarch64 which is the only memory model with MSA support
> > [1].
> > For this we extended the dartagnan [2] tool to support MSA and now it
> > reports liveness, synchronization, and mutex issues.
> > Notice that we did something similar in the past for LKMM, but we were
> > ignoring MSA [3].
> >
> > The culprit of all these issues is that atomicity of single load
> > instructions is not guaranteed in the presence of smaller-sized stores
> > (observed on real hardware according to [1] and Fig. 21/22)
> > Consider the following pseudo code:
> >
> > int16 old = xchg16_rlx(&lock, 42);
> > int32 l = load32_acq(&lock);
> >
> > Then the hardware can treat the code as (likely due to store-forwarding)
> >
> > int16 old = xchg16_rlx(&lock, 42);
> > int16 l1 = load16_acq(&lock);
> > int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
> > arithmetic
> >
> > and reorder it to
> >
> > int16 l2 = load16_acq(&lock + 2);
> > int16 old = xchg16_rlx(&lock, 42);
> > int16 l1 = load16_acq(&lock);
> >
> > Now another thread can overwrite "lock" in between the first two accesses so
> > that the original l (l1 and l2) ends up containing
> > parts of a lock value that is older than what the xchg observed.
>
> Oops :-(
>
> (snip the excellent details)
>
> > ### Solutions
> >
> > The problematic executions rely on the fact that T2 can move half of its
> > load operation (1) to before the xchg_tail (3).
> > Preventing this reordering solves all issues. Possible solutions are:
> > - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
> > Note that if the kernel is configured with >= 16k cpus, then the tail
> > becomes larger than 16 bits and needs to be encoded in parts of the pending
> > byte as well.
> > In this case, the kernel makes a full-sized (32-bit) access for the
> > xchg. So the above bugs are only present in the < 16k cpus setting.
>
> Right, but that is the more expensive option for some.
>
> > - make the xchg_tail an acquire operation.
> > - make the xchg_tail a release operation (this is an odd solution by
> > itself but works for aarch64 because it preserves REL->ACQ ordering). In
> > this case, maybe the preceding "smp_wmb()" can be removed.
>
> I think I prefer this one, it move a barrier, not really adding
> additional overhead. Will?
I'm half inclined to think that the Arm memory model should be tightened
here; I can raise that with Arm and see what they say.
Although the cited paper does give examples of store-forwarding from a
narrow store to a wider load, the case in qspinlock is further
constrained by having the store come from an atomic rmw and the load
having acquire semantics. Setting aside the MSA part, that specific case
_is_ ordered in the Arm memory model (and C++ release sequences rely on
it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
an atomic rmw to an acquire load.
Given that, I don't see how this is going to occur in practice.
Will
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-16 14:23 ` Will Deacon
@ 2025-06-17 6:19 ` Thomas Haas
2025-06-17 8:42 ` Hernan Ponce de Leon
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Haas @ 2025-06-17 6:19 UTC (permalink / raw)
To: Will Deacon, Peter Zijlstra
Cc: Alan Stern, Andrea Parri, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On 16.06.25 16:23, Will Deacon wrote:
> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
>> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>>> We have been taking a look if mixed-size accesses (MSA) can affect the
>>> correctness of qspinlock.
>>> We are focusing on aarch64 which is the only memory model with MSA support
>>> [1].
>>> For this we extended the dartagnan [2] tool to support MSA and now it
>>> reports liveness, synchronization, and mutex issues.
>>> Notice that we did something similar in the past for LKMM, but we were
>>> ignoring MSA [3].
>>>
>>> The culprit of all these issues is that atomicity of single load
>>> instructions is not guaranteed in the presence of smaller-sized stores
>>> (observed on real hardware according to [1] and Fig. 21/22)
>>> Consider the following pseudo code:
>>>
>>> int16 old = xchg16_rlx(&lock, 42);
>>> int32 l = load32_acq(&lock);
>>>
>>> Then the hardware can treat the code as (likely due to store-forwarding)
>>>
>>> int16 old = xchg16_rlx(&lock, 42);
>>> int16 l1 = load16_acq(&lock);
>>> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
>>> arithmetic
>>>
>>> and reorder it to
>>>
>>> int16 l2 = load16_acq(&lock + 2);
>>> int16 old = xchg16_rlx(&lock, 42);
>>> int16 l1 = load16_acq(&lock);
>>>
>>> Now another thread can overwrite "lock" in between the first two accesses so
>>> that the original l (l1 and l2) ends up containing
>>> parts of a lock value that is older than what the xchg observed.
>>
>> Oops :-(
>>
>> (snip the excellent details)
>>
>>> ### Solutions
>>>
>>> The problematic executions rely on the fact that T2 can move half of its
>>> load operation (1) to before the xchg_tail (3).
>>> Preventing this reordering solves all issues. Possible solutions are:
>>> - make the xchg_tail full-sized (i.e, also touch lock/pending bits).
>>> Note that if the kernel is configured with >= 16k cpus, then the tail
>>> becomes larger than 16 bits and needs to be encoded in parts of the pending
>>> byte as well.
>>> In this case, the kernel makes a full-sized (32-bit) access for the
>>> xchg. So the above bugs are only present in the < 16k cpus setting.
>>
>> Right, but that is the more expensive option for some.
>>
>>> - make the xchg_tail an acquire operation.
>>> - make the xchg_tail a release operation (this is an odd solution by
>>> itself but works for aarch64 because it preserves REL->ACQ ordering). In
>>> this case, maybe the preceding "smp_wmb()" can be removed.
>>
>> I think I prefer this one, it move a barrier, not really adding
>> additional overhead. Will?
>
> I'm half inclined to think that the Arm memory model should be tightened
> here; I can raise that with Arm and see what they say.
>
> Although the cited paper does give examples of store-forwarding from a
> narrow store to a wider load, the case in qspinlock is further
> constrained by having the store come from an atomic rmw and the load
> having acquire semantics. Setting aside the MSA part, that specific case
> _is_ ordered in the Arm memory model (and C++ release sequences rely on
> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
> an atomic rmw to an acquire load.
>
> Given that, I don't see how this is going to occur in practice.
>
> Will
You are probably right. The ARM model's atomic-ordered-before relation
let aob = rmw | [range(rmw)]; lrs; [A | Q]
clearly orders the rmw-store with subsequent acquire loads (lrs =
local-read-successor, A = acquire).
If we treat this relation (at least the second part) as a "global
ordering" and extend it by "si" (same-instruction), then the problematic
reordering under MSA should be gone.
I quickly ran Dartagnan on the MSA litmus tests with this change to the
ARM model and all the tests still pass.
We should definitely ask ARM about this. I did send an email to Jade
before writing about this issue here, but she was (and still is) busy
and told me to ask at memory-model@arm.com .
I will ask them.
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 6:19 ` Thomas Haas
@ 2025-06-17 8:42 ` Hernan Ponce de Leon
2025-06-17 14:17 ` Will Deacon
0 siblings, 1 reply; 23+ messages in thread
From: Hernan Ponce de Leon @ 2025-06-17 8:42 UTC (permalink / raw)
To: Thomas Haas, Will Deacon, Peter Zijlstra
Cc: Alan Stern, Andrea Parri, Boqun Feng, Nicholas Piggin,
David Howells, Jade Alglave, Luc Maranget, Paul E. McKenney,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, jonas.oberhauser, r.maseli@tu-bs.de
On 6/17/2025 8:19 AM, Thomas Haas wrote:
>
>
> On 16.06.25 16:23, Will Deacon wrote:
>> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
>>> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>>>> We have been taking a look if mixed-size accesses (MSA) can affect the
>>>> correctness of qspinlock.
>>>> We are focusing on aarch64 which is the only memory model with MSA
>>>> support
>>>> [1].
>>>> For this we extended the dartagnan [2] tool to support MSA and now it
>>>> reports liveness, synchronization, and mutex issues.
>>>> Notice that we did something similar in the past for LKMM, but we were
>>>> ignoring MSA [3].
>>>>
>>>> The culprit of all these issues is that atomicity of single load
>>>> instructions is not guaranteed in the presence of smaller-sized stores
>>>> (observed on real hardware according to [1] and Fig. 21/22)
>>>> Consider the following pseudo code:
>>>>
>>>> int16 old = xchg16_rlx(&lock, 42);
>>>> int32 l = load32_acq(&lock);
>>>>
>>>> Then the hardware can treat the code as (likely due to store-
>>>> forwarding)
>>>>
>>>> int16 old = xchg16_rlx(&lock, 42);
>>>> int16 l1 = load16_acq(&lock);
>>>> int16 l2 = load16_acq(&lock + 2); // Assuming byte-precise pointer
>>>> arithmetic
>>>>
>>>> and reorder it to
>>>>
>>>> int16 l2 = load16_acq(&lock + 2);
>>>> int16 old = xchg16_rlx(&lock, 42);
>>>> int16 l1 = load16_acq(&lock);
>>>>
>>>> Now another thread can overwrite "lock" in between the first two
>>>> accesses so
>>>> that the original l (l1 and l2) ends up containing
>>>> parts of a lock value that is older than what the xchg observed.
>>>
>>> Oops :-(
>>>
>>> (snip the excellent details)
>>>
>>>> ### Solutions
>>>>
>>>> The problematic executions rely on the fact that T2 can move half of
>>>> its
>>>> load operation (1) to before the xchg_tail (3).
>>>> Preventing this reordering solves all issues. Possible solutions are:
>>>> - make the xchg_tail full-sized (i.e, also touch lock/pending
>>>> bits).
>>>> Note that if the kernel is configured with >= 16k cpus, then
>>>> the tail
>>>> becomes larger than 16 bits and needs to be encoded in parts of the
>>>> pending
>>>> byte as well.
>>>> In this case, the kernel makes a full-sized (32-bit) access
>>>> for the
>>>> xchg. So the above bugs are only present in the < 16k cpus setting.
>>>
>>> Right, but that is the more expensive option for some.
>>>
>>>> - make the xchg_tail an acquire operation.
>>>> - make the xchg_tail a release operation (this is an odd
>>>> solution by
>>>> itself but works for aarch64 because it preserves REL->ACQ
>>>> ordering). In
>>>> this case, maybe the preceding "smp_wmb()" can be removed.
>>>
>>> I think I prefer this one, it move a barrier, not really adding
>>> additional overhead. Will?
>>
>> I'm half inclined to think that the Arm memory model should be tightened
>> here; I can raise that with Arm and see what they say.
>>
>> Although the cited paper does give examples of store-forwarding from a
>> narrow store to a wider load, the case in qspinlock is further
>> constrained by having the store come from an atomic rmw and the load
>> having acquire semantics. Setting aside the MSA part, that specific case
>> _is_ ordered in the Arm memory model (and C++ release sequences rely on
>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
>> an atomic rmw to an acquire load.
>>
>> Given that, I don't see how this is going to occur in practice.
>>
>> Will
>
> You are probably right. The ARM model's atomic-ordered-before relation
>
> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>
> clearly orders the rmw-store with subsequent acquire loads (lrs = local-
> read-successor, A = acquire).
> If we treat this relation (at least the second part) as a "global
> ordering" and extend it by "si" (same-instruction), then the problematic
> reordering under MSA should be gone.
> I quickly ran Dartagnan on the MSA litmus tests with this change to the
> ARM model and all the tests still pass.
Even with this change I still get violations (both safety and
termination) for qspinlock with dartagnan.
I think the problem is actually with the Internal visibility axiom,
because only making that one stronger seems to remove the violations.
Hernan
>
> We should definitely ask ARM about this. I did sent an email to Jade
> before writing about this issue here, but she was (and still is) busy
> and told me to ask at memory-model@arm.com .
> I will ask them.
>
>
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 8:42 ` Hernan Ponce de Leon
@ 2025-06-17 14:17 ` Will Deacon
2025-06-17 14:23 ` Thomas Haas
0 siblings, 1 reply; 23+ messages in thread
From: Will Deacon @ 2025-06-17 14:17 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Thomas Haas, Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
> On 6/17/2025 8:19 AM, Thomas Haas wrote:
> > On 16.06.25 16:23, Will Deacon wrote:
> > > I'm half inclined to think that the Arm memory model should be tightened
> > > here; I can raise that with Arm and see what they say.
> > >
> > > Although the cited paper does give examples of store-forwarding from a
> > > narrow store to a wider load, the case in qspinlock is further
> > > constrained by having the store come from an atomic rmw and the load
> > > having acquire semantics. Setting aside the MSA part, that specific case
> > > _is_ ordered in the Arm memory model (and C++ release sequences rely on
> > > it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
> > > an atomic rmw to an acquire load.
> > >
> > > Given that, I don't see how this is going to occur in practice.
> >
> > You are probably right. The ARM model's atomic-ordered-before relation
> >
> > let aob = rmw | [range(rmw)]; lrs; [A | Q]
> >
> > clearly orders the rmw-store with subsequent acquire loads (lrs = local-
> > read-successor, A = acquire).
> > If we treat this relation (at least the second part) as a "global
> > ordering" and extend it by "si" (same-instruction), then the problematic
> > reordering under MSA should be gone.
> > I quickly ran Dartagnan on the MSA litmus tests with this change to the
> > ARM model and all the tests still pass.
>
> Even with this change I still get violations (both safety and termination)
> for qspinlock with dartagnan.
Please can you be more specific about the problems you see?
> I think the problem is actually with the Internal visibility axiom, because
> only making that one stronger seems to remove the violations.
That sounds surprising to me, as there's nothing particularly weird about
Arm's coherence requirements when compared to other architectures, as far
as I'm aware.
Will
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 14:17 ` Will Deacon
@ 2025-06-17 14:23 ` Thomas Haas
2025-06-17 19:00 ` Hernan Ponce de Leon
0 siblings, 1 reply; 23+ messages in thread
From: Thomas Haas @ 2025-06-17 14:23 UTC (permalink / raw)
To: Will Deacon, Hernan Ponce de Leon
Cc: Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On 17.06.25 16:17, Will Deacon wrote:
> On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
>> On 6/17/2025 8:19 AM, Thomas Haas wrote:
>>> On 16.06.25 16:23, Will Deacon wrote:
>>>> I'm half inclined to think that the Arm memory model should be tightened
>>>> here; I can raise that with Arm and see what they say.
>>>>
>>>> Although the cited paper does give examples of store-forwarding from a
>>>> narrow store to a wider load, the case in qspinlock is further
>>>> constrained by having the store come from an atomic rmw and the load
>>>> having acquire semantics. Setting aside the MSA part, that specific case
>>>> _is_ ordered in the Arm memory model (and C++ release sequences rely on
>>>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding from
>>>> an atomic rmw to an acquire load.
>>>>
>>>> Given that, I don't see how this is going to occur in practice.
>>>
>>> You are probably right. The ARM model's atomic-ordered-before relation
>>>
>>> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>>>
>>> clearly orders the rmw-store with subsequent acquire loads (lrs = local-
>>> read-successor, A = acquire).
>>> If we treat this relation (at least the second part) as a "global
>>> ordering" and extend it by "si" (same-instruction), then the problematic
>>> reordering under MSA should be gone.
>>> I quickly ran Dartagnan on the MSA litmus tests with this change to the
>>> ARM model and all the tests still pass.
>>
>> Even with this change I still get violations (both safety and termination)
>> for qspinlock with dartagnan.
>
> Please can you be more specific about the problems you see?
I talked to Hernán personally and it turned out that he used the generic
implementation of smp_cond_acquire (not sure if the name is correct)
which uses a relaxed load followed by a barrier. In that case, replacing
aob by aob;si does not change anything.
Indeed, even in the reported problem we used the generic implementation
(I was unaware of this), though it is easy to check that changing the
relaxed load to acquire does not give sufficient orderings.
>
>> I think the problem is actually with the Internal visibility axiom, because
>> only making that one stronger seems to remove the violations.
>
> That sounds surprising to me, as there's nothing particularly weird about
> Arm's coherence requirements when compared to other architectures, as far
> as I'm aware.
>
I agree. The internal visibility axiom is not the issue I think.
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 14:23 ` Thomas Haas
@ 2025-06-17 19:00 ` Hernan Ponce de Leon
2025-06-19 12:30 ` Will Deacon
0 siblings, 1 reply; 23+ messages in thread
From: Hernan Ponce de Leon @ 2025-06-17 19:00 UTC (permalink / raw)
To: Thomas Haas, Will Deacon
Cc: Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On 6/17/2025 4:23 PM, Thomas Haas wrote:
>
>
> On 17.06.25 16:17, Will Deacon wrote:
>> On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
>>> On 6/17/2025 8:19 AM, Thomas Haas wrote:
>>>> On 16.06.25 16:23, Will Deacon wrote:
>>>>> I'm half inclined to think that the Arm memory model should be
>>>>> tightened
>>>>> here; I can raise that with Arm and see what they say.
>>>>>
>>>>> Although the cited paper does give examples of store-forwarding from a
>>>>> narrow store to a wider load, the case in qspinlock is further
>>>>> constrained by having the store come from an atomic rmw and the load
>>>>> having acquire semantics. Setting aside the MSA part, that specific
>>>>> case
>>>>> _is_ ordered in the Arm memory model (and C++ release sequences
>>>>> rely on
>>>>> it iirc), so it's fair to say that Arm CPUs don't permit forwarding
>>>>> from
>>>>> an atomic rmw to an acquire load.
>>>>>
>>>>> Given that, I don't see how this is going to occur in practice.
>>>>
>>>> You are probably right. The ARM model's atomic-ordered-before relation
>>>>
>>>> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>>>>
>>>> clearly orders the rmw-store with subsequent acquire loads (lrs =
>>>> local-
>>>> read-successor, A = acquire).
>>>> If we treat this relation (at least the second part) as a "global
>>>> ordering" and extend it by "si" (same-instruction), then the
>>>> problematic
>>>> reordering under MSA should be gone.
>>>> I quickly ran Dartagnan on the MSA litmus tests with this change to the
>>>> ARM model and all the tests still pass.
>>>
>>> Even with this change I still get violations (both safety and
>>> termination)
>>> for qspinlock with dartagnan.
>>
>> Please can you be more specific about the problems you see?
>
> I talked to Hernán personally and it turned out that he used the generic
> implementation of smp_cond_acquire (not sure if the name is correct)
> which uses a relaxed load followed by a barrier. In that case, replacing
> aob by aob;si does not change anything.
> Indeed, even in the reported problem we used the generic implementation
> (I was unaware of this), though it is easy to check that changing the
> relaxed load to acquire does not give sufficient orderings.
Yes, my bad. I was using the generic header rather than the aarch64
specific one, and then the changes to the model were having no effect
(as they should).
Now I am using the aarch64 specific ones and I can confirm dartagnan
still reports the violations with the current model and making the
change proposed by Thomas (adding ;si just to the second part seems to
be enough) indeed removes all violations.
Hernan
>
>>
>>> I think the problem is actually with the Internal visibility axiom,
>>> because
>>> only making that one stronger seems to remove the violations.
>>
>> That sounds surprising to me, as there's nothing particularly weird about
>> Arm's coherence requirements when compared to other architectures, as far
>> as I'm aware.
>>
>
> I agree. The internal visibility axiom is not the issue I think.
>
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-17 19:00 ` Hernan Ponce de Leon
@ 2025-06-19 12:30 ` Will Deacon
2025-06-19 14:11 ` Thomas Haas
0 siblings, 1 reply; 23+ messages in thread
From: Will Deacon @ 2025-06-19 12:30 UTC (permalink / raw)
To: Hernan Ponce de Leon
Cc: Thomas Haas, Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On Tue, Jun 17, 2025 at 09:00:30PM +0200, Hernan Ponce de Leon wrote:
> On 6/17/2025 4:23 PM, Thomas Haas wrote:
> >
> >
> > On 17.06.25 16:17, Will Deacon wrote:
> > > On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
> > > > On 6/17/2025 8:19 AM, Thomas Haas wrote:
> > > > > On 16.06.25 16:23, Will Deacon wrote:
> > > > > > I'm half inclined to think that the Arm memory model
> > > > > > should be tightened
> > > > > > here; I can raise that with Arm and see what they say.
> > > > > >
> > > > > > Although the cited paper does give examples of store-forwarding from a
> > > > > > narrow store to a wider load, the case in qspinlock is further
> > > > > > constrained by having the store come from an atomic rmw and the load
> > > > > > having acquire semantics. Setting aside the MSA part,
> > > > > > that specific case
> > > > > > _is_ ordered in the Arm memory model (and C++ release
> > > > > > sequences rely on
> > > > > > it iirc), so it's fair to say that Arm CPUs don't permit
> > > > > > forwarding from
> > > > > > an atomic rmw to an acquire load.
> > > > > >
> > > > > > Given that, I don't see how this is going to occur in practice.
> > > > >
> > > > > You are probably right. The ARM model's atomic-ordered-before relation
> > > > >
> > > > > let aob = rmw | [range(rmw)]; lrs; [A | Q]
> > > > >
> > > > > clearly orders the rmw-store with subsequent acquire loads
> > > > > (lrs = local-
> > > > > read-successor, A = acquire).
> > > > > If we treat this relation (at least the second part) as a "global
> > > > > ordering" and extend it by "si" (same-instruction), then the
> > > > > problematic
> > > > > reordering under MSA should be gone.
> > > > > I quickly ran Dartagnan on the MSA litmus tests with this change to the
> > > > > ARM model and all the tests still pass.
> > > >
> > > > Even with this change I still get violations (both safety and
> > > > termination)
> > > > for qspinlock with dartagnan.
> > >
> > > Please can you be more specific about the problems you see?
> >
> > I talked to Hernán personally and it turned out that he used the generic
> > implementation of smp_cond_acquire (not sure if the name is correct)
> > which uses a relaxed load followed by a barrier. In that case, replacing
> > aob by aob;si does not change anything.
> > Indeed, even in the reported problem we used the generic implementation
> > (I was unaware of this), though it is easy to check that changing the
> > relaxed load to acquire does not give sufficient orderings.
>
> Yes, my bad. I was using the generic header rather than the aarch64 specific
> one, and then the changes to the model were having no effect (as they
> should).
>
> Now I am using the aarch64 specific ones and I can confirm dartagnan still
> reports the violations with the current model and making the change proposed
> by Thomas (adding ;si just to the second part seems to be enough) indeed
> removes all violations.
That's great! Thanks for working together to get to the bottom of it. I
was worried that this was the tip of the iceberg.
I'll try to follow up with Arm to see if that ';si' addition to
atomic-ordered-before is acceptable. If not, we'll absorb the smp_wmb()
into xchg_release() with a fat comment about RCsc and mixed-size
accesses.
Will
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-19 12:30 ` Will Deacon
@ 2025-06-19 14:11 ` Thomas Haas
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Haas @ 2025-06-19 14:11 UTC (permalink / raw)
To: Will Deacon, Hernan Ponce de Leon
Cc: Peter Zijlstra, Alan Stern, Andrea Parri, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, jonas.oberhauser,
r.maseli@tu-bs.de
On 19.06.25 14:30, Will Deacon wrote:
> On Tue, Jun 17, 2025 at 09:00:30PM +0200, Hernan Ponce de Leon wrote:
>> On 6/17/2025 4:23 PM, Thomas Haas wrote:
>>>
>>>
>>> On 17.06.25 16:17, Will Deacon wrote:
>>>> On Tue, Jun 17, 2025 at 10:42:04AM +0200, Hernan Ponce de Leon wrote:
>>>>> On 6/17/2025 8:19 AM, Thomas Haas wrote:
>>>>>> On 16.06.25 16:23, Will Deacon wrote:
>>>>>>> I'm half inclined to think that the Arm memory model
>>>>>>> should be tightened
>>>>>>> here; I can raise that with Arm and see what they say.
>>>>>>>
>>>>>>> Although the cited paper does give examples of store-forwarding from a
>>>>>>> narrow store to a wider load, the case in qspinlock is further
>>>>>>> constrained by having the store come from an atomic rmw and the load
>>>>>>> having acquire semantics. Setting aside the MSA part,
>>>>>>> that specific case
>>>>>>> _is_ ordered in the Arm memory model (and C++ release
>>>>>>> sequences rely on
>>>>>>> it iirc), so it's fair to say that Arm CPUs don't permit
>>>>>>> forwarding from
>>>>>>> an atomic rmw to an acquire load.
>>>>>>>
>>>>>>> Given that, I don't see how this is going to occur in practice.
>>>>>>
>>>>>> You are probably right. The ARM model's atomic-ordered-before relation
>>>>>>
>>>>>> let aob = rmw | [range(rmw)]; lrs; [A | Q]
>>>>>>
>>>>>> clearly orders the rmw-store with subsequent acquire loads
>>>>>> (lrs = local-
>>>>>> read-successor, A = acquire).
>>>>>> If we treat this relation (at least the second part) as a "global
>>>>>> ordering" and extend it by "si" (same-instruction), then the
>>>>>> problematic
>>>>>> reordering under MSA should be gone.
>>>>>> I quickly ran Dartagnan on the MSA litmus tests with this change to the
>>>>>> ARM model and all the tests still pass.
>>>>>
>>>>> Even with this change I still get violations (both safety and
>>>>> termination)
>>>>> for qspinlock with dartagnan.
>>>>
>>>> Please can you be more specific about the problems you see?
>>>
>>> I talked to Hernán personally and it turned out that he used the generic
>>> implementation of smp_cond_acquire (not sure if the name is correct)
>>> which uses a relaxed load followed by a barrier. In that case, replacing
>>> aob by aob;si does not change anything.
>>> Indeed, even in the reported problem we used the generic implementation
>>> (I was unaware of this), though it is easy to check that changing the
>>> relaxed load to acquire does not give sufficient orderings.
>>
>> Yes, my bad. I was using the generic header rather than the aarch64 specific
>> one, and then the changes to the model were having no effect (as they
>> should).
>>
>> Now I am using the aarch64 specific ones and I can confirm dartagnan still
>> reports the violations with the current model and making the change proposed
>> by Thomas (adding ;si just to the second part seems to be enough) indeed
>> removes all violations.
>
> That's great! Thanks for working together to get to the bottom of it. I
> was worried that this was the tip of the iceberg.
>
> I'll try to follow up with Arm to see if that ';si' addition to
> atomic-ordered-before is acceptable. If not, we'll absorb the smp_wmb()
> into xchg_release() with a fat comment about RCsc and mixed-size
> accesses.
>
> Will
I have already contacted ARM, but I think it will take some time to get
an answer.
Interestingly, the definition of aob has changed in a newer version of
the ARM cat model:
let aob = [Exp & M]; rmw; [Exp & M]
| [Exp & M]; rmw; lrs; [A | Q]
| [Imp & TTD & R]; rmw; [HU]
Ignoring all the extra Exp/Imp/TTD stuff, you can see that the second
part of the definition only orders the load-part of the rmw with the
following acquire load. This suggests that a form of store forwarding
might indeed be expected, seeing that they deliberately removed the
rmw_store-acq_load ordering they had previously.
Nevertheless, for qspinlock to work we just need the rmw_load-acq_load
ordering that is still provided as long as we extend it with ";si".
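In cat terms, the strengthening amounts to composing the acquire end of the relation with si, so the ordering covers every read event generated by the acquiring instruction rather than only the slice that reads from the rmw store. A sketch only; the exact formulation is of course Arm's call:

let aob = [Exp & M]; rmw; [Exp & M]
        | ([Exp & M]; rmw; lrs; [A | Q]); si
        | [Imp & TTD & R]; rmw; [HU]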
--
=====================================
Thomas Haas
Technische Universität Braunschweig
Institut für Theoretische Informatik
Mühlenpfordtstr. 23, Raum IZ 343
38106 Braunschweig | Germany
t.haas@tu-braunschweig.de
https://www.tu-braunschweig.de/tcs/team/thomas-haas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
2025-06-16 14:23 ` Will Deacon
@ 2025-06-18 6:51 ` Paul E. McKenney
2025-06-18 12:11 ` Paul E. McKenney
2 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2025-06-18 6:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
[ . . . ]
> > - put some other read-read barrier between the xchg_tail and the load.
> >
> >
> > ### Implications for qspinlock executed on non-ARM architectures.
> >
> > Unfortunately, there are no MSA extensions for other hardware memory models,
> > so we have to speculate based on whether the problematic reordering is
> > permitted if the problematic load was treated as two individual
> > instructions.
> > It seems Power and RISCV would have no problem reordering the instructions,
> > so qspinlock might also break on those architectures.
>
> Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> word LL/SC and should be good.
>
> But yes, ZABHA might be equally broken.
All architectures handle eight-bit atomics and stores, but last I checked,
there were a few systems still around that failed to support 16-bit
atomics and stores. I will check again.
(But those systems' architectures can simply avoid supporting kernel
features requiring these 16-bit operations.)
It would be good to add multiple sizes to LKMM, and even more so once we
have 16-bit support across the board.
Thanx, Paul
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-18 6:51 ` Paul E. McKenney
@ 2025-06-18 12:11 ` Paul E. McKenney
0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2025-06-18 12:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Haas, Alan Stern, Andrea Parri, Will Deacon, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Akira Yokosawa, Daniel Lustig, Joel Fernandes, linux-kernel,
linux-arch, lkmm, hernan.poncedeleon, jonas.oberhauser,
r.maseli@tu-bs.de
On Tue, Jun 17, 2025 at 11:51:49PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 13, 2025 at 09:55:01AM +0200, Peter Zijlstra wrote:
> > On Thu, Jun 12, 2025 at 04:55:28PM +0200, Thomas Haas wrote:
>
> [ . . . ]
>
> > > - put some other read-read barrier between the xchg_tail and the load.
> > >
> > >
> > > ### Implications for qspinlock executed on non-ARM architectures.
> > >
> > > Unfortunately, there are no MSA extensions for other hardware memory models,
> > > so we have to speculate based on whether the problematic reordering is
> > > permitted if the problematic load was treated as two individual
> > > instructions.
> > > It seems Power and RISCV would have no problem reordering the instructions,
> > > so qspinlock might also break on those architectures.
> >
> > Power (and RiscV without ZABHA) 'emulate' the short XCHG using a full
> > word LL/SC and should be good.
> >
> > But yes, ZABHA might be equally broken.
>
> All architectures handle eight-bit atomics and stores, but last I checked,
> there were a few systems still around that failed to support 16-bit
> atomics and stores. I will check again.
>
> (But those systems' architectures can simply avoid supporting kernel
> features requiring these 16-bit operations.)
>
> It would be good to add multiple sizes to LKMM, and even more so once we
> have 16-bit support across the board.
And Arnd tells me that the Linux kernel might be safe for 16-bit stores
in core code perhaps as early as the end of this year. ;-)
Thanx, Paul
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-06-12 14:55 [RFC] Potential problem in qspinlock due to mixed-size accesses Thomas Haas
2025-06-13 7:55 ` Peter Zijlstra
@ 2025-12-17 19:05 ` Thomas Haas
2025-12-18 22:02 ` Andrea Parri
1 sibling, 1 reply; 23+ messages in thread
From: Thomas Haas @ 2025-12-17 19:05 UTC (permalink / raw)
To: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm
Cc: hernan.poncedeleon, jonas.oberhauser, r.maseli@tu-bs.de
On 12.06.25 16:55, Thomas Haas wrote:
> We have been taking a look if mixed-size accesses (MSA) can affect the
> correctness of qspinlock.
> We are focusing on aarch64 which is the only memory model with MSA
> support [1].
> For this we extended the dartagnan [2] tool to support MSA and now it
> reports liveness, synchronization, and mutex issues.
--- cut off ---
ARM has recently fixed the issue on their side by strengthening the
memory model (see
https://github.com/herd/herdtools7/commit/2b7921a44a61766e23a1234767d28af696b436a0)
With the updated ARM-MSA model, Dartagnan no longer finds violations
in qspinlock. No patches needed :)
With best regards,
Thomas
* Re: [RFC] Potential problem in qspinlock due to mixed-size accesses
2025-12-17 19:05 ` Thomas Haas
@ 2025-12-18 22:02 ` Andrea Parri
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Parri @ 2025-12-18 22:02 UTC (permalink / raw)
To: Thomas Haas
Cc: Alan Stern, Will Deacon, Peter Zijlstra, Boqun Feng,
Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
linux-kernel, linux-arch, lkmm, hernan.poncedeleon,
jonas.oberhauser, r.maseli@tu-bs.de
> ARM has recently fixed the issue on their side by strengthening the memory
> model (see https://github.com/herd/herdtools7/commit/2b7921a44a61766e23a1234767d28af696b436a0)
>
> With the updated ARM-MSA model, Dartagnan no longer finds violations in
> qspinlock. No patches needed :)
Thanks for the heads-up, Thomas.
This reminds me: After your post, I wrote some script/files to test MSA
against the LKMM by leveraging the MSA support for AArch64 available in
herd7 and requiring no changes to herd7 internals. I've put these files
in the repository:
https://github.com/andrea-parri/lkmm-msa.git
in case someone here feels particularly bored during the Christmas
holidays and desires to have a look! ;-)
I am hoping to be able to resume this study (and to write a proper README)
soonish - but the starting point was to "lift" ARMv8 execution graphs to
(corresponding) LKMM graphs, cf. the file cats/lkmm-from-armv8.cat; then
the usual "test, model and iterate" process, with unmarked accesses still
needing several iterations I'm afraid. Any feedback's welcome.
Andrea
end of thread, other threads:[~2025-12-18 22:02 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-12 14:55 [RFC] Potential problem in qspinlock due to mixed-size accesses Thomas Haas
2025-06-13 7:55 ` Peter Zijlstra
2025-06-13 11:17 ` Andrea Parri
[not found] ` <9264df13-36db-4b25-b2c4-7a9701df2f4d@tu-bs.de>
2025-06-16 6:21 ` Andrea Parri
2025-06-16 14:11 ` Alan Stern
2025-06-17 14:09 ` Andrea Parri
2025-06-19 14:27 ` Thomas Haas
2025-06-19 14:32 ` Alan Stern
2025-06-19 14:59 ` Thomas Haas
2025-06-19 17:56 ` Alan Stern
2025-06-19 18:21 ` Thomas Haas
2025-06-16 14:23 ` Will Deacon
2025-06-17 6:19 ` Thomas Haas
2025-06-17 8:42 ` Hernan Ponce de Leon
2025-06-17 14:17 ` Will Deacon
2025-06-17 14:23 ` Thomas Haas
2025-06-17 19:00 ` Hernan Ponce de Leon
2025-06-19 12:30 ` Will Deacon
2025-06-19 14:11 ` Thomas Haas
2025-06-18 6:51 ` Paul E. McKenney
2025-06-18 12:11 ` Paul E. McKenney
2025-12-17 19:05 ` Thomas Haas
2025-12-18 22:02 ` Andrea Parri