* [RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations
@ 2025-07-24 12:06 Yicong Yang
2025-08-08 11:35 ` Will Deacon
0 siblings, 1 reply; 4+ messages in thread
From: Yicong Yang @ 2025-07-24 12:06 UTC (permalink / raw)
To: will, mark.rutland, catalin.marinas, maz, broonie,
linux-arm-kernel
Cc: wangkefeng.wang, baohua, jonathan.cameron,
shameerali.kolothum.thodi, prime.zeng, xuwei5, yangyicong,
linuxarm, tiantao6
From: Yicong Yang <yangyicong@hisilicon.com>
commit 0ea366f5e1b6 ("arm64: atomics: prefetch the destination word for write prior to stxr")
added a prefetch prior to LL/SC operations due to performance concerns:
upgrading the cacheline to exclusive state on demand can be costly. The
same is true for LSE operations, so prefetch the destination prior to
LSE operations as well.
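For illustration, with this patch an LSE atomic_add() changes from
roughly (a sketch only; exact registers are up to the compiler):

	stadd	w0, [x1]		// v->counter += i

to:

	prfm	pstl1strm, [x1]		// prefetch for store into L1 (streaming)
	stadd	w0, [x1]		// v->counter += i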
Tested on my HIP08 server (2 * 64 CPUs) using `perf bench -r 100 futex all`,
which stresses the spinlock of the futex hash bucket:
                           6.16-rc7   patched
 futex/hash(ops/sec)       171843     204757    +19.15%
 futex/wake(ms)            0.4630     0.4216    +8.94%
 futex/wake-parallel(ms)   0.0048     0.0039    +18.75%
 futex/requeue(ms)         0.1487     0.1508    -1.41%
   (2nd validation)                   0.1484    +0.2%
 futex/lock-pi(ops/sec)    125        126       +0.8%
For a single wake test with different thread counts, using `perf bench
-r 100 futex wake -t <threads>`:
 threads   6.16-rc7   patched
 1         0.0035     0.0032    +8.57%
 48        0.1454     0.1221    +16.02%
 96        0.3047     0.2304    +24.38%
 160       0.5489     0.5012    +8.69%
 192       0.6675     0.5906    +11.52%
 256       0.9445     0.8092    +14.33%

There is some variation between close runs, but overall the results
look positive.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
Sent as RFT for testing and feedback, since I'm not sure whether this is
a general win or an optimization specific to certain implementations.
arch/arm64/include/asm/atomic_lse.h | 7 +++++++
arch/arm64/include/asm/cmpxchg.h | 3 ++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 87f568a94e55..a45e49d5d857 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -16,6 +16,7 @@ __lse_atomic_##op(int i, atomic_t *v) \
{ \
asm volatile( \
__LSE_PREAMBLE \
+ " prfm pstl1strm, %[v]\n" \
" " #asm_op " %w[i], %[v]\n" \
: [v] "+Q" (v->counter) \
: [i] "r" (i)); \
@@ -41,6 +42,7 @@ __lse_atomic_fetch_##op##name(int i, atomic_t *v) \
\
asm volatile( \
__LSE_PREAMBLE \
+ " prfm pstl1strm, %[v]\n" \
" " #asm_op #mb " %w[i], %w[old], %[v]" \
: [v] "+Q" (v->counter), \
[old] "=r" (old) \
@@ -123,6 +125,7 @@ __lse_atomic64_##op(s64 i, atomic64_t *v) \
{ \
asm volatile( \
__LSE_PREAMBLE \
+ " prfm pstl1strm, %[v]\n" \
" " #asm_op " %[i], %[v]\n" \
: [v] "+Q" (v->counter) \
: [i] "r" (i)); \
@@ -148,6 +151,7 @@ __lse_atomic64_fetch_##op##name(s64 i, atomic64_t *v) \
\
asm volatile( \
__LSE_PREAMBLE \
+ " prfm pstl1strm, %[v]\n" \
" " #asm_op #mb " %[i], %[old], %[v]" \
: [v] "+Q" (v->counter), \
[old] "=r" (old) \
@@ -230,6 +234,7 @@ static __always_inline s64 __lse_atomic64_dec_if_positive(atomic64_t *v)
asm volatile(
__LSE_PREAMBLE
+ " prfm pstl1strm, %[v]\n" \
"1: ldr %x[tmp], %[v]\n"
" subs %[ret], %x[tmp], #1\n"
" b.lt 2f\n"
@@ -253,6 +258,7 @@ __lse__cmpxchg_case_##name##sz(volatile void *ptr, \
{ \
asm volatile( \
__LSE_PREAMBLE \
+ " prfm pstl1strm, %[v]\n" \
" cas" #mb #sfx " %" #w "[old], %" #w "[new], %[v]\n" \
: [v] "+Q" (*(u##sz *)ptr), \
[old] "+r" (old) \
@@ -295,6 +301,7 @@ __lse__cmpxchg128##name(volatile u128 *ptr, u128 old, u128 new) \
\
asm volatile( \
__LSE_PREAMBLE \
+ " prfm pstl1strm, %[v]\n" \
" casp" #mb "\t%[old1], %[old2], %[new1], %[new2], %[v]\n"\
: [old1] "+&r" (x0), [old2] "+&r" (x1), \
[v] "+Q" (*(u128 *)ptr) \
diff --git a/arch/arm64/include/asm/cmpxchg.h b/arch/arm64/include/asm/cmpxchg.h
index d7a540736741..daacbabeadb7 100644
--- a/arch/arm64/include/asm/cmpxchg.h
+++ b/arch/arm64/include/asm/cmpxchg.h
@@ -32,8 +32,9 @@ static inline u##sz __xchg_case_##name##sz(u##sz x, volatile void *ptr) \
" cbnz %w1, 1b\n" \
" " #mb, \
/* LSE atomics */ \
+ " prfm pstl1strm, %2\n" \
" swp" #acq_lse #rel #sfx "\t%" #w "3, %" #w "0, %2\n" \
- __nops(3) \
+ __nops(2) \
" " #nop_lse) \
: "=&r" (ret), "=&r" (tmp), "+Q" (*(u##sz *)ptr) \
: "r" (x) \
--
2.24.0
* Re: [RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations
2025-07-24 12:06 [RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations Yicong Yang
@ 2025-08-08 11:35 ` Will Deacon
2025-08-09 9:48 ` Yicong Yang
0 siblings, 1 reply; 4+ messages in thread
From: Will Deacon @ 2025-08-08 11:35 UTC (permalink / raw)
To: Yicong Yang
Cc: mark.rutland, catalin.marinas, maz, broonie, linux-arm-kernel,
wangkefeng.wang, baohua, jonathan.cameron,
shameerali.kolothum.thodi, prime.zeng, xuwei5, yangyicong,
linuxarm, tiantao6
On Thu, Jul 24, 2025 at 08:06:51PM +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> commit 0ea366f5e1b6 ("arm64: atomics: prefetch the destination word for write prior to stxr")
> added a prefetch prior to LL/SC operations due to performance concerns:
> upgrading the cacheline to exclusive state on demand can be costly. The
> same is true for LSE operations, so prefetch the destination prior to
> LSE operations as well.
>
> Tested on my HIP08 server (2 * 64 CPUs) using `perf bench -r 100 futex all`,
> which stresses the spinlock of the futex hash bucket:
>                            6.16-rc7   patched
>  futex/hash(ops/sec)       171843     204757    +19.15%
>  futex/wake(ms)            0.4630     0.4216    +8.94%
>  futex/wake-parallel(ms)   0.0048     0.0039    +18.75%
>  futex/requeue(ms)         0.1487     0.1508    -1.41%
>    (2nd validation)                   0.1484    +0.2%
>  futex/lock-pi(ops/sec)    125        126       +0.8%
>
> For a single wake test with different thread counts, using `perf bench
> -r 100 futex wake -t <threads>`:
>  threads   6.16-rc7   patched
>  1         0.0035     0.0032    +8.57%
>  48        0.1454     0.1221    +16.02%
>  96        0.3047     0.2304    +24.38%
>  160       0.5489     0.5012    +8.69%
>  192       0.6675     0.5906    +11.52%
>  256       0.9445     0.8092    +14.33%
>
> There is some variation between close runs, but overall the results
> look positive.
>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
>
> Sent as RFT for testing and feedback, since I'm not sure whether this is
> a general win or an optimization specific to certain implementations.
>
> arch/arm64/include/asm/atomic_lse.h | 7 +++++++
> arch/arm64/include/asm/cmpxchg.h | 3 ++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
One of the motivations behind rmw instructions (as opposed to ldxr/stxr
loops) is so that the atomic operation can be performed at different
places in the memory hierarchy depending upon where the data resides.
For example, if a shared counter is sitting at a level of system cache,
it may be optimal to leave it there so that CPUs around the system can
post atomic increments to it without forcing the line up and down the
cache hierarchy every time.
So, although adding an L1 prefetch may help some specific benchmarks on
a specific system, I don't think this is generally a good idea for
scalability. The hardware should be able to figure out the best place to
do the operation and, if you have a system where that means it should
always be performed within the CPU, then you should probably configure
it not to send the atomic remotely rather than force that in the kernel
for everybody.
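To make that concrete, here's a made-up example of the pattern I have in
mind (the names are hypothetical, not from real code):

/*
 * A counter bumped from every CPU but rarely read. With far atomics
 * the STADD can be serviced at the point of coherency (e.g. a system
 * cache) without the line ever entering a CPU's L1; an L1 prefetch
 * before the STADD instead migrates the line on every increment.
 */
static atomic64_t rx_packets = ATOMIC64_INIT(0);

static inline void account_rx(void)
{
	atomic64_inc(&rx_packets);	/* stadd when LSE atomics are in use */
}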
Will
* Re: [RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations
2025-08-08 11:35 ` Will Deacon
@ 2025-08-09 9:48 ` Yicong Yang
2025-11-06 22:23 ` Palmer Dabbelt
0 siblings, 1 reply; 4+ messages in thread
From: Yicong Yang @ 2025-08-09 9:48 UTC (permalink / raw)
To: Will Deacon
Cc: yangyicong, mark.rutland, catalin.marinas, maz, broonie,
linux-arm-kernel, wangkefeng.wang, baohua, jonathan.cameron,
shameerali.kolothum.thodi, prime.zeng, xuwei5, linuxarm, tiantao6
On 2025/8/8 19:35, Will Deacon wrote:
> On Thu, Jul 24, 2025 at 08:06:51PM +0800, Yicong Yang wrote:
>> From: Yicong Yang <yangyicong@hisilicon.com>
>>
>> commit 0ea366f5e1b6 ("arm64: atomics: prefetch the destination word for write prior to stxr")
>> added a prefetch prior to LL/SC operations due to performance concerns:
>> upgrading the cacheline to exclusive state on demand can be costly. The
>> same is true for LSE operations, so prefetch the destination prior to
>> LSE operations as well.
>>
>> Tested on my HIP08 server (2 * 64 CPUs) using `perf bench -r 100 futex all`,
>> which stresses the spinlock of the futex hash bucket:
>>                            6.16-rc7   patched
>>  futex/hash(ops/sec)       171843     204757    +19.15%
>>  futex/wake(ms)            0.4630     0.4216    +8.94%
>>  futex/wake-parallel(ms)   0.0048     0.0039    +18.75%
>>  futex/requeue(ms)         0.1487     0.1508    -1.41%
>>    (2nd validation)                   0.1484    +0.2%
>>  futex/lock-pi(ops/sec)    125        126       +0.8%
>>
>> For a single wake test with different thread counts, using `perf bench
>> -r 100 futex wake -t <threads>`:
>>  threads   6.16-rc7   patched
>>  1         0.0035     0.0032    +8.57%
>>  48        0.1454     0.1221    +16.02%
>>  96        0.3047     0.2304    +24.38%
>>  160       0.5489     0.5012    +8.69%
>>  192       0.6675     0.5906    +11.52%
>>  256       0.9445     0.8092    +14.33%
>>
>> There is some variation between close runs, but overall the results
>> look positive.
>>
>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>> ---
>>
>> Sent as RFT for testing and feedback, since I'm not sure whether this is
>> a general win or an optimization specific to certain implementations.
>>
>> arch/arm64/include/asm/atomic_lse.h | 7 +++++++
>> arch/arm64/include/asm/cmpxchg.h | 3 ++-
>> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> One of the motivations behind rmw instructions (as opposed to ldxr/stxr
> loops) is so that the atomic operation can be performed at different
> places in the memory hierarchy depending upon where the data resides.
>
> For example, if a shared counter is sitting at a level of system cache,
> it may be optimal to leave it there so that CPUs around the system can
> post atomic increments to it without forcing the line up and down the
> cache hierarchy every time.
Yes, that's true. On a CHI-based system the atomic can be performed in
the CPU (the RN-F), which is termed a near atomic, or outside the CPU on
a system component (system cache, etc.), which is termed a far atomic [1].
The example above refers to the far atomic case, where the atomic
operation doesn't need to complete in the CPU cache.
[1] https://developer.arm.com/documentation/102714/0100/Atomic-fundamentals
>
> So, although adding an L1 prefetch may help some specific benchmarks on
> a specific system, I don't think this is generally a good idea for
> scalability. The hardware should be able to figure out the best place to
> do the operation and, if you have a system where that means it should
> always be performed within the CPU, then you should probably configure
> it not to send the atomic remotely rather than force that in the kernel
> for everybody.
>
The prefetch may not benefit far atomics, since the atomic operation
isn't performed in the CPU cache, but it should help systems implemented
with near atomics, since the data can be loaded into the CPU cache prior
to the atomic operation. So, alternatively, instead of enabling this all
the time, would it be acceptable to make it a Kconfig/cmdline option as
an optimization for near-atomic systems, so those users can benefit?
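As a rough sketch of what I mean (the option name is just a placeholder):

config ARM64_PREFETCH_LSE_ATOMICS
	bool "Prefetch the destination cacheline before LSE atomics"
	depends on ARM64_LSE_ATOMICS
	default n
	help
	  On implementations that execute LSE atomics near (inside the
	  CPU), issuing PRFM PSTL1STRM before the atomic avoids upgrading
	  the cacheline on demand. On systems that use far atomics this
	  may hurt scalability, so leave it disabled unless benchmarking
	  shows a win.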
Thanks.
* Re: [RFT PATCH] arm64: atomics: prefetch the destination prior to LSE operations
2025-08-09 9:48 ` Yicong Yang
@ 2025-11-06 22:23 ` Palmer Dabbelt
0 siblings, 0 replies; 4+ messages in thread
From: Palmer Dabbelt @ 2025-11-06 22:23 UTC (permalink / raw)
To: yangyicong
Cc: Will Deacon, yangyicong, Mark Rutland, Catalin Marinas,
Marc Zyngier, broonie, linux-arm-kernel, wangkefeng.wang, baohua,
jonathan.cameron, shameerali.kolothum.thodi, prime.zeng, xuwei5,
linuxarm, tiantao6
On Sat, 09 Aug 2025 02:48:41 PDT (-0700), yangyicong@huawei.com wrote:
> On 2025/8/8 19:35, Will Deacon wrote:
>> On Thu, Jul 24, 2025 at 08:06:51PM +0800, Yicong Yang wrote:
>>> From: Yicong Yang <yangyicong@hisilicon.com>
>>>
>>> commit 0ea366f5e1b6 ("arm64: atomics: prefetch the destination word for write prior to stxr")
>>> added a prefetch prior to LL/SC operations due to performance concerns:
>>> upgrading the cacheline to exclusive state on demand can be costly. The
>>> same is true for LSE operations, so prefetch the destination prior to
>>> LSE operations as well.
>>>
>>> Tested on my HIP08 server (2 * 64 CPUs) using `perf bench -r 100 futex all`,
>>> which stresses the spinlock of the futex hash bucket:
>>>                            6.16-rc7   patched
>>>  futex/hash(ops/sec)       171843     204757    +19.15%
>>>  futex/wake(ms)            0.4630     0.4216    +8.94%
>>>  futex/wake-parallel(ms)   0.0048     0.0039    +18.75%
>>>  futex/requeue(ms)         0.1487     0.1508    -1.41%
>>>    (2nd validation)                   0.1484    +0.2%
>>>  futex/lock-pi(ops/sec)    125        126       +0.8%
>>>
>>> For a single wake test with different thread counts, using `perf bench
>>> -r 100 futex wake -t <threads>`:
>>>  threads   6.16-rc7   patched
>>>  1         0.0035     0.0032    +8.57%
>>>  48        0.1454     0.1221    +16.02%
>>>  96        0.3047     0.2304    +24.38%
>>>  160       0.5489     0.5012    +8.69%
>>>  192       0.6675     0.5906    +11.52%
>>>  256       0.9445     0.8092    +14.33%
>>>
>>> There is some variation between close runs, but overall the results
>>> look positive.
>>>
>>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>>> ---
>>>
>>> Sent as RFT for testing and feedback, since I'm not sure whether this is
>>> a general win or an optimization specific to certain implementations.
>>>
>>> arch/arm64/include/asm/atomic_lse.h | 7 +++++++
>>> arch/arm64/include/asm/cmpxchg.h | 3 ++-
>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> One of the motivations behind rmw instructions (as opposed to ldxr/stxr
>> loops) is so that the atomic operation can be performed at different
>> places in the memory hierarchy depending upon where the data resides.
>>
>> For example, if a shared counter is sitting at a level of system cache,
>> it may be optimal to leave it there so that CPUs around the system can
>> post atomic increments to it without forcing the line up and down the
>> cache hierarchy every time.
A few of us were over here
https://lore.kernel.org/all/e7d539ed-ced0-4b96-8ecd-048a5b803b85@paulmck-laptop/
talking about similar things. It doesn't actually have anything to do with
Paul's issue, as that's in percpu, but I recently ran into some cases where
ATOMIC_ST_NEAR produced better application-level throughput, so I thought
I'd poke around here too.
Over there we also found that microbenchmarks report better performance for a
bunch of different flavors of these atomics (LDADD, prefetches, and
ATOMIC_ST_NEAR on my end). This was true even for the contended cases, which I
found kind of surprising.
I benchmarked this patch with schbench and found about 10% worse p99
latency (and also slightly worse latency at the other tiers). I see the
same thing with ATOMIC_ST_NEAR (which IIUC basically just does this in HW).
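For reference, an invocation along these lines (parameters illustrative,
not the exact ones from my runs):

	# 4 message threads, 64 workers each, 30 second runtime; compare
	# the reported p99 latencies across kernels.
	schbench -m 4 -t 64 -r 30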
I also converted everything to LDADD-style routines (i.e., not just the
percpu ones). Those were the best in the microbenchmarks, but they don't
show any difference compared to the STADD-style routines.
Here's the code in case anyone's interested, though:
diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
index 87f568a94e55..03fddf5fa46f 100644
--- a/arch/arm64/include/asm/atomic_lse.h
+++ b/arch/arm64/include/asm/atomic_lse.h
@@ -14,17 +14,18 @@
static __always_inline void \
__lse_atomic_##op(int i, atomic_t *v) \
{ \
+ long tmp; \
asm volatile( \
__LSE_PREAMBLE \
- " " #asm_op " %w[i], %[v]\n" \
- : [v] "+Q" (v->counter) \
+ " " #asm_op " %w[i], %w[t], %[v]\n" \
+ : [v] "+Q" (v->counter), [t] "=r" (tmp) \
: [i] "r" (i)); \
}
-ATOMIC_OP(andnot, stclr)
-ATOMIC_OP(or, stset)
-ATOMIC_OP(xor, steor)
-ATOMIC_OP(add, stadd)
+ATOMIC_OP(andnot, ldclr)
+ATOMIC_OP(or, ldset)
+ATOMIC_OP(xor, ldeor)
+ATOMIC_OP(add, ldadd)
static __always_inline void __lse_atomic_sub(int i, atomic_t *v)
{
@@ -121,17 +122,18 @@ ATOMIC_FETCH_OP_AND( , al, "memory")
static __always_inline void \
__lse_atomic64_##op(s64 i, atomic64_t *v) \
{ \
+ long tmp; \
asm volatile( \
__LSE_PREAMBLE \
- " " #asm_op " %[i], %[v]\n" \
- : [v] "+Q" (v->counter) \
+ " " #asm_op " %[i], %[t], %[v]\n" \
+ : [v] "+Q" (v->counter), [t] "=r" (tmp) \
: [i] "r" (i)); \
}
-ATOMIC64_OP(andnot, stclr)
-ATOMIC64_OP(or, stset)
-ATOMIC64_OP(xor, steor)
-ATOMIC64_OP(add, stadd)
+ATOMIC64_OP(andnot, ldclr)
+ATOMIC64_OP(or, ldset)
+ATOMIC64_OP(xor, ldeor)
+ATOMIC64_OP(add, ldadd)
static __always_inline void __lse_atomic64_sub(s64 i, atomic64_t *v)
{
> Yes, that's true. On a CHI-based system the atomic can be performed in
> the CPU (the RN-F), which is termed a near atomic, or outside the CPU on
> a system component (system cache, etc.), which is termed a far atomic [1].
> The example above refers to the far atomic case, where the atomic
> operation doesn't need to complete in the CPU cache.
>
> [1] https://developer.arm.com/documentation/102714/0100/Atomic-fundamentals
>
>>
>> So, although adding an L1 prefetch may help some specific benchmarks on
>> a specific system, I don't think this is generally a good idea for
>> scalability. The hardware should be able to figure out the best place to
>> do the operation and, if you have a system where that means it should
>> always be performed within the CPU, then you should probably configure
>> it not to send the atomic remotely rather than force that in the kernel
>> for everybody.
>>
>
> The prefetch may not benefit far atomics, since the atomic operation
> isn't performed in the CPU cache, but it should help systems implemented
> with near atomics, since the data can be loaded into the CPU cache prior
> to the atomic operation. So, alternatively, instead of enabling this all
> the time, would it be acceptable to make it a Kconfig/cmdline option as
> an optimization for near-atomic systems, so those users can benefit?
>
> Thanks.