* Overhead of arm64 LSE per-CPU atomics?
@ 2025-10-30 22:37 Paul E. McKenney
2025-10-31 18:30 ` Catalin Marinas
0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-30 22:37 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel
Hello!
To make event tracing safe for PREEMPT_RT kernels, I have been creating
optimized variants of SRCU readers that use per-CPU atomics. This works
quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
per-CPU atomic operation. This contrasts with a handful of nanoseconds
on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
In theory, I can mess with SRCU's counter protocol, but I figured I
should check first.
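For concreteness, the two operations being compared are roughly of the
following shape.  This is an illustrative sketch only (the function names
and variables are made up, and it is not the actual SRCU reader code),
contrasting the plain read-modify-write with the per-CPU atomic that the
fast readers rely on:

	/* Illustrative sketch only, not the actual SRCU reader code. */
	static atomic_t foo;
	static DEFINE_PER_CPU(unsigned long, bar);

	static void nonatomic_rmw(void)
	{
		/* Plain read-modify-write: a handful of nanoseconds. */
		atomic_set(&foo, atomic_read(&foo) + 1);
	}

	static void percpu_atomic_rmw(void)
	{
		/* One per-CPU atomic RMW: ~50ns on Neoverse V2 with LSE. */
		this_cpu_inc(bar);
	}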
The patch below shows the flavor of the change, but for the simpler
uretprobe case where NMI safety is not required, permitting me to
simply disable interrupts across the non-atomic increment operations.
The overhead of interrupt disabling is not wonderful, but ~13ns is way
better than ~100ns any day of the week.
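That workaround is roughly of the following shape (again a hand-waving
sketch with a made-up per-CPU counter, not the actual patch):

	/* Hand-waving sketch with a made-up counter, not the actual patch. */
	static DEFINE_PER_CPU(unsigned long, sketch_srcu_ctr);

	static void sketch_reader_enter(void)
	{
		unsigned long flags;

		/* No NMI-safety requirement here, so disabling interrupts
		 * suffices to make the non-atomic increment safe against
		 * preemption and interrupt handlers on this CPU. */
		local_irq_save(flags);
		__this_cpu_inc(sketch_srcu_ctr);
		local_irq_restore(flags);
	}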
I have to do something like this for internal use here because we have
real hardware that acts this way. If I don't hear otherwise, I will
also push it to mainline. So if there is a more dainty approach, this
would be a most excellent time to let me know about it. ;-)
Thanx, Paul
------------------------------------------------------------------------
commit 1eee41590d30805ec4f5b4e96c615603b0d058d9
Author: Paul E. McKenney <paulmck@kernel.org>
Date: Thu Oct 30 09:25:09 2025 -0700
refscale: Add SRCU-fast-updown readers
This commit adds refscale readers based on srcu_read_lock_fast_updown()
and srcu_read_unlock_fast_updown() ("refscale.scale_type=srcu-fast-updown").
On my x86 laptop, these are about 2.2ns per pair.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index 7429ec9f0092..07a313782dfd 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -186,6 +186,7 @@ static const struct ref_scale_ops rcu_ops = {
// Definitions for SRCU ref scale testing.
DEFINE_STATIC_SRCU(srcu_refctl_scale);
DEFINE_STATIC_SRCU_FAST(srcu_fast_refctl_scale);
+DEFINE_STATIC_SRCU_FAST_UPDOWN(srcu_fast_updown_refctl_scale);
static struct srcu_struct *srcu_ctlp = &srcu_refctl_scale;
static void srcu_ref_scale_read_section(const int nloops)
@@ -254,6 +255,42 @@ static const struct ref_scale_ops srcu_fast_ops = {
.name = "srcu-fast"
};
+static bool srcu_fast_updown_sync_scale_init(void)
+{
+ srcu_ctlp = &srcu_fast_updown_refctl_scale;
+ return true;
+}
+
+static void srcu_fast_updown_ref_scale_read_section(const int nloops)
+{
+ int i;
+ struct srcu_ctr __percpu *scp;
+
+ for (i = nloops; i >= 0; i--) {
+ scp = srcu_read_lock_fast_updown(srcu_ctlp);
+ srcu_read_unlock_fast_updown(srcu_ctlp, scp);
+ }
+}
+
+static void srcu_fast_updown_ref_scale_delay_section(const int nloops, const int udl, const int ndl)
+{
+ int i;
+ struct srcu_ctr __percpu *scp;
+
+ for (i = nloops; i >= 0; i--) {
+ scp = srcu_read_lock_fast_updown(srcu_ctlp);
+ un_delay(udl, ndl);
+ srcu_read_unlock_fast_updown(srcu_ctlp, scp);
+ }
+}
+
+static const struct ref_scale_ops srcu_fast_updown_ops = {
+ .init = srcu_fast_updown_sync_scale_init,
+ .readsection = srcu_fast_updown_ref_scale_read_section,
+ .delaysection = srcu_fast_updown_ref_scale_delay_section,
+ .name = "srcu-fast-updown"
+};
+
#ifdef CONFIG_TASKS_RCU
// Definitions for RCU Tasks ref scale testing: Empty read markers.
@@ -1479,7 +1516,8 @@ ref_scale_init(void)
long i;
int firsterr = 0;
static const struct ref_scale_ops *scale_ops[] = {
- &rcu_ops, &srcu_ops, &srcu_fast_ops, RCU_TRACE_OPS RCU_TASKS_OPS
+ &rcu_ops, &srcu_ops, &srcu_fast_ops, &srcu_fast_updown_ops,
+ RCU_TRACE_OPS RCU_TASKS_OPS
&refcnt_ops, &percpuinc_ops, &incpercpu_ops, &incpercpupreempt_ops,
&incpercpubh_ops, &incpercpuirqsave_ops,
&rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops, &acqrel_ops,
^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
2025-10-30 22:37 Overhead of arm64 LSE per-CPU atomics? Paul E. McKenney
@ 2025-10-31 18:30 ` Catalin Marinas
2025-10-31 19:39   ` Paul E. McKenney
2025-11-04 15:59   ` Breno Leitao
0 siblings, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-10-31 18:30 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> To make event tracing safe for PREEMPT_RT kernels, I have been creating
> optimized variants of SRCU readers that use per-CPU atomics. This works
> quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> per-CPU atomic operation. This contrasts with a handful of nanoseconds
> on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).

That's quite a difference. Does it get any better if
CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
on the kernel command line.

Depending on the implementation and configuration, the LSE atomics may
skip the L1 cache and be executed closer to the memory (they used to be
called far atomics). The CPUs try to be smarter like doing the operation
"near" if it's in the cache but the heuristics may not always work.

Interestingly, we had this patch recently to force a prefetch before the
atomic:

https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/

We rejected it but I wonder whether it improves the SRCU scenario.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-10-31 18:30 ` Catalin Marinas
@ 2025-10-31 19:39 ` Paul E. McKenney
2025-10-31 22:21   ` Paul E. McKenney
2025-10-31 22:43   ` Catalin Marinas
2025-11-04 15:59 ` Breno Leitao
1 sibling, 2 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-31 19:39 UTC (permalink / raw)
To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > optimized variants of SRCU readers that use per-CPU atomics. This works
> > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
>
> That's quite a difference. Does it get any better if
> CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> on the kernel command line.

In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?

Yes, this gets me more than an order of magnitude, and about 30% better
than my workaround of disabling interrupts around a non-atomic increment
of those counters, thank you!

Given that per-CPU atomics are usually not heavily contended, would it
make sense to avoid LSE in that case?

And I need to figure out whether I should recommend that Meta build
its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  Any advice you
might have would be deeply appreciated!  (I am of course also following
up internally.)

> Depending on the implementation and configuration, the LSE atomics may
> skip the L1 cache and be executed closer to the memory (they used to be
> called far atomics). The CPUs try to be smarter like doing the operation
> "near" if it's in the cache but the heuristics may not always work.

My knowledge-free guess is that it is early days for LSE, and that it
therefore has significant hardware-level optimization work ahead of it.
For example, I well recall being roundly denounced by Intel engineers in
my neighborhood for reporting similar performance results on Pentium 4
back in the day.  The truth might well have set them free, but it sure
didn't make them happy!  ;-)

But what would a non-knowledge-free guess be?

> Interestingly, we had this patch recently to force a prefetch before the
> atomic:
>
> https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
>
> We rejected it but I wonder whether it improves the SRCU scenario.

No statistical difference on my system.  This is a 72-CPU Neoverse V2, in
case that matters.  Here are my results for the underlying this_cpu_inc()
and this_cpu_dec() pair of operations, in nanoseconds per pair:

                                 LSE Atomics Enabled (Stock)   LSE Atomics Disabled

  Without Yicong’s Patch (Stock)           110.786                     9.852

  With Yicong’s Patch                      109.873                     9.853

As you can see, disabling LSE gets about an order of magnitude
and Yicong's patch has no statistically significant effect.

This and more can be found in the "Per-CPU Increment/Decrement"
section of this Google document:

https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing

Full disclosure: Calls to srcu_read_lock_fast() followed by
srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
this_cpu_inc(), but I am not seeing any difference between the two.
And testing the underlying primitives allows my tests to give reproducible
results regardless of what state I have the SRCU code in.  ;-)

Thoughts?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-10-31 19:39 ` Paul E. McKenney
@ 2025-10-31 22:21 ` Paul E. McKenney
2025-10-31 22:43 ` Catalin Marinas
1 sibling, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-31 22:21 UTC (permalink / raw)
To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics. This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
> >
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
>
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
>
> Yes, this gets me more than an order of magnitude, and about 30% better
> than my workaround of disabling interrupts around a non-atomic increment
> of those counters, thank you!
>
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?

For example, how about something like the patch below?

							Thanx, Paul

------------------------------------------------------------------------

commit 0c0b71d19c997915c5ef5fe7e32eb56b4e4a750e
Author: Paul E. McKenney <paulmckrcu@fb.com>
Date:   Fri Oct 31 14:14:13 2025 -0700

    arm64: Separately select LSE for per-CPU atomics

    LSE atomics provide better scalability, but not always better single-CPU
    performance.  In fact, on the ARM Neoverse V2, they degrade single-CPU
    performance by an order of magnitude, from about 5ns per operation to
    about 50ns.

    Now per-CPU atomics are rarely contended, in fact, a given per-CPU
    variable is usually used mostly by the CPU in question.  This means that
    LSE's better scalability does not help, but its degraded single-CPU
    performance does hurt.

    Therefore, provide a new default-n ARM64_USE_LSE_PERCPU_ATOMICS Kconfig
    option that uses LSE for per-CPU atomics.  This means that default
    kernel builds will use non-LSE atomics for this case, but will still
    use LSE atomics for the global atomic variables that are more likely
    to be heavily contended, and thus are more likely to benefit from LSE.

    Signed-off-by: Paul E. McKenney <paulmckrcu@fb.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <linux-arm-kernel@lists.infradead.org>

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 58b782779138..b91b7cbe4569 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1927,6 +1927,21 @@ config ARM64_USE_LSE_ATOMICS
 	  atomic routines. This incurs a small overhead on CPUs that do
 	  not support these instructions.
 
+config ARM64_USE_LSE_PERCPU_ATOMICS
+	bool "LSE for per-CPU atomic instructions"
+	default n
+	help
+	  As part of the Large System Extensions, ARMv8.1 introduces new
+	  atomic instructions that are designed specifically to scale in
+	  very large systems.  However, contention on per-CPU atomics
+	  is usually quite low by design, so these atomics likely benefit
+	  from higher performance, even if this is purchased with reduced
+	  performance under high contention.
+
+	  Say Y here to make use of these instructions for the in-kernel
+	  per-CPU atomic routines.  This incurs a small overhead on CPUs
+	  that do not support these instructions.
+
 endmenu # "ARMv8.1 architectural features"
 
 menu "ARMv8.2 architectural features"
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
index 3129a5819d0e..2d5eff217d63 100644
--- a/arch/arm64/include/asm/lse.h
+++ b/arch/arm64/include/asm/lse.h
@@ -26,12 +26,19 @@
 /* In-line patching at runtime */
 #define ARM64_LSE_ATOMIC_INSN(llsc, lse)				\
 	ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#if IS_ENABLED(CONFIG_ARM64_USE_LSE_PERCPU_ATOMICS)
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse)				\
+	ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#else
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse)	llsc
+#endif
 
 #else	/* CONFIG_ARM64_LSE_ATOMICS */
 
 #define __lse_ll_sc_body(op, ...)	__ll_sc_##op(__VA_ARGS__)
 
 #define ARM64_LSE_ATOMIC_INSN(llsc, lse)	llsc
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse)	llsc
 
 #endif	/* CONFIG_ARM64_LSE_ATOMICS */
 
 #endif	/* __ASM_LSE_H */
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..eaa3c2f87407 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,7 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	unsigned int loop;						\
 	u##sz tmp;							\
 									\
-	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
+	asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN(			\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
@@ -91,7 +91,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
 	unsigned int loop;						\
 	u##sz ret;							\
 									\
-	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
+	asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN(			\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\

^ permalink raw reply related	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-10-31 19:39 ` Paul E. McKenney
2025-10-31 22:21 ` Paul E. McKenney
@ 2025-10-31 22:43 ` Catalin Marinas
2025-10-31 23:38   ` Paul E. McKenney
1 sibling, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-10-31 22:43 UTC (permalink / raw)
To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics. This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
> >
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
>
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?

Yes.

> Yes, this gets me more than an order of magnitude, and about 30% better
> than my workaround of disabling interrupts around a non-atomic increment
> of those counters, thank you!
>
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?

In theory the LSE atomics should be as fast but microarchitecture
decisions likely did not cover all the use-cases. I'll raise this
internally as well, maybe we get some ideas from the hardware people.

> And I need to figure out whether I should recommend that Meta build
> its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  Any advice you
> might have would be deeply appreciated!  (I am of course also following
> up internally.)

I wouldn't advise turning them off just yet, they are beneficial for
other use-cases. But it needs more thinking (and not that late at night ;)).

> > Interestingly, we had this patch recently to force a prefetch before the
> > atomic:
> >
> > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> >
> > We rejected it but I wonder whether it improves the SRCU scenario.
>
> No statistical difference on my system. This is a 72-CPU Neoverse V2, in
> case that matters.

I just realised that patch doesn't touch percpu.h at all. So what about
something like (untested):

-----------------8<------------------------
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..e381034324e1 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	unsigned int loop;						\
 	u##sz tmp;							\
 									\
+	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
 	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
@@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
 	unsigned int loop;						\
 	u##sz ret;							\
 									\
+	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
 	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
-----------------8<------------------------

> Here are my results for the underlying this_cpu_inc()
> and this_cpu_dec() pair of operations:
>
>                                  LSE Atomics Enabled (Stock)   LSE Atomics Disabled
>
>   Without Yicong’s Patch (Stock)           110.786                     9.852
>
>   With Yicong’s Patch                      109.873                     9.853
>
> As you can see, disabling LSE gets about an order of magnitude
> and Yicong's patch has no statistically significant effect.
>
> This and more can be found in the "Per-CPU Increment/Decrement"
> section of this Google document:
>
> https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
>
> Full disclosure: Calls to srcu_read_lock_fast() followed by
> srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> this_cpu_inc(), but I am not seeing any difference between the two.
> And testing the underlying primitives allows my tests to give reproducible
> results regardless of what state I have the SRCU code in. ;-)

Thanks. I'll go through your emails in more detail tomorrow/Monday.

-- 
Catalin

^ permalink raw reply related	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-10-31 22:43 ` Catalin Marinas
@ 2025-10-31 23:38 ` Paul E. McKenney
2025-11-01  3:25   ` Paul E. McKenney
0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-31 23:38 UTC (permalink / raw)
To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > > optimized variants of SRCU readers that use per-CPU atomics. This works
> > > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
> > >
> > > That's quite a difference. Does it get any better if
> > > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > > on the kernel command line.
> >
> > In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
>
> Yes.
>
> > Yes, this gets me more than an order of magnitude, and about 30% better
> > than my workaround of disabling interrupts around a non-atomic increment
> > of those counters, thank you!
> >
> > Given that per-CPU atomics are usually not heavily contended, would it
> > make sense to avoid LSE in that case?
>
> In theory the LSE atomics should be as fast but microarchitecture
> decisions likely did not cover all the use-cases. I'll raise this
> internally as well, maybe we get some ideas from the hardware people.

Understood, and please let me know what you can from the hardware people.

> > And I need to figure out whether I should recommend that Meta build
> > its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  Any advice you
> > might have would be deeply appreciated!  (I am of course also following
> > up internally.)
>
> I wouldn't advise turning them off just yet, they are beneficial for
> other use-cases. But it needs more thinking (and not that late at night ;)).

Fair enough!

> > > Interestingly, we had this patch recently to force a prefetch before the
> > > atomic:
> > >
> > > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> > >
> > > We rejected it but I wonder whether it improves the SRCU scenario.
> >
> > No statistical difference on my system. This is a 72-CPU Neoverse V2, in
> > case that matters.
>
> I just realised that patch doesn't touch percpu.h at all. So what about
> something like (untested):
>
> -----------------8<------------------------
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..e381034324e1 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>  	unsigned int loop;						\
>  	u##sz tmp;							\
>  									\
> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>  	/* LL/SC */							\
>  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
>  	unsigned int loop;						\
>  	u##sz ret;							\
>  									\
> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>  	/* LL/SC */							\
>  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> -----------------8<------------------------

I will give this a shot, thank you!

> > Here are my results for the underlying this_cpu_inc()
> > and this_cpu_dec() pair of operations:
> >
> >                                  LSE Atomics Enabled (Stock)   LSE Atomics Disabled
> >
> >   Without Yicong’s Patch (Stock)           110.786                     9.852
> >
> >   With Yicong’s Patch                      109.873                     9.853
> >
> > As you can see, disabling LSE gets about an order of magnitude
> > and Yicong's patch has no statistically significant effect.
> >
> > This and more can be found in the "Per-CPU Increment/Decrement"
> > section of this Google document:
> >
> > https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
> >
> > Full disclosure: Calls to srcu_read_lock_fast() followed by
> > srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> > this_cpu_inc(), but I am not seeing any difference between the two.
> > And testing the underlying primitives allows my tests to give reproducible
> > results regardless of what state I have the SRCU code in. ;-)
>
> Thanks. I'll go through your emails in more detail tomorrow/Monday.

Thank you!  Not violently urgent, but I do look forward to hearing what
you come up with.  In the meantime, I am testing with the patch I sent
and will let you know if problems arise.  So far, so good...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-10-31 23:38 ` Paul E. McKenney
@ 2025-11-01  3:25 ` Paul E. McKenney
2025-11-01  9:44   ` Willy Tarreau
` (3 more replies)
0 siblings, 4 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-01 3:25 UTC (permalink / raw)
To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > > > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > > > optimized variants of SRCU readers that use per-CPU atomics. This works
> > > > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > > > per-CPU atomic operation. This contrasts with a handful of nanoseconds
> > > > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
> > > >
> > > > That's quite a difference. Does it get any better if
> > > > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > > > on the kernel command line.
> > >
> > > In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
> >
> > Yes.
> >
> > > Yes, this gets me more than an order of magnitude, and about 30% better
> > > than my workaround of disabling interrupts around a non-atomic increment
> > > of those counters, thank you!
> > >
> > > Given that per-CPU atomics are usually not heavily contended, would it
> > > make sense to avoid LSE in that case?
> >
> > In theory the LSE atomics should be as fast but microarchitecture
> > decisions likely did not cover all the use-cases. I'll raise this
> > internally as well, maybe we get some ideas from the hardware people.
>
> Understood, and please let me know what you can from the hardware people.
>
> > > And I need to figure out whether I should recommend that Meta build
> > > its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  Any advice you
> > > might have would be deeply appreciated!  (I am of course also following
> > > up internally.)
> >
> > I wouldn't advise turning them off just yet, they are beneficial for
> > other use-cases. But it needs more thinking (and not that late at night ;)).
>
> Fair enough!
>
> > > > Interestingly, we had this patch recently to force a prefetch before the
> > > > atomic:
> > > >
> > > > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> > > >
> > > > We rejected it but I wonder whether it improves the SRCU scenario.
> > >
> > > No statistical difference on my system. This is a 72-CPU Neoverse V2, in
> > > case that matters.
> >
> > I just realised that patch doesn't touch percpu.h at all. So what about
> > something like (untested):
> >
> > -----------------8<------------------------
> > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > index 9abcc8ef3087..e381034324e1 100644
> > --- a/arch/arm64/include/asm/percpu.h
> > +++ b/arch/arm64/include/asm/percpu.h
> > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> >  	unsigned int loop;						\
> >  	u##sz tmp;							\
> >  									\
> > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> >  	/* LL/SC */							\
> >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> >  	unsigned int loop;						\
> >  	u##sz ret;							\
> >  									\
> > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> >  	/* LL/SC */							\
> >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > -----------------8<------------------------
>
> I will give this a shot, thank you!

Jackpot!!!

This reduces the overhead to 8.427, which is significantly better than
the non-LSE value of 9.853.  Still room for improvement, but much
better than the 100ns values.

I presume that you will send this up the normal path, but in the meantime,
I will pull this in for further local testing, and thank you!

							Thanx, Paul

> > > Here are my results for the underlying this_cpu_inc()
> > > and this_cpu_dec() pair of operations:
> > >
> > >                                  LSE Atomics Enabled (Stock)   LSE Atomics Disabled
> > >
> > >   Without Yicong’s Patch (Stock)           110.786                     9.852
> > >
> > >   With Yicong’s Patch                      109.873                     9.853
> > >
> > > As you can see, disabling LSE gets about an order of magnitude
> > > and Yicong's patch has no statistically significant effect.
> > >
> > > This and more can be found in the "Per-CPU Increment/Decrement"
> > > section of this Google document:
> > >
> > > https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
> > >
> > > Full disclosure: Calls to srcu_read_lock_fast() followed by
> > > srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> > > this_cpu_inc(), but I am not seeing any difference between the two.
> > > And testing the underlying primitives allows my tests to give reproducible
> > > results regardless of what state I have the SRCU code in. ;-)
> >
> > Thanks. I'll go through your emails in more detail tomorrow/Monday.
>
> Thank you!  Not violently urgent, but I do look forward to hearing what
> you come up with.  In the meantime, I am testing with the patch I sent
> and will let you know if problems arise.  So far, so good...
>
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-01  3:25 ` Paul E. McKenney
@ 2025-11-01  9:44 ` Willy Tarreau
2025-11-01 18:07   ` Paul E. McKenney
2025-11-01 11:23 ` Catalin Marinas
` (2 subsequent siblings)
3 siblings, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-01 9:44 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

Hi!

On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > -----------------8<------------------------
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > >  	unsigned int loop;						\
> > >  	u##sz tmp;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > >  	unsigned int loop;						\
> > >  	u##sz ret;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > -----------------8<------------------------
> >
> > I will give this a shot, thank you!
>
> Jackpot!!!
>
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853.  Still room for improvement, but much
> better than the 100ns values.

This is super interesting! I've blindly applied a similar change to all
of our atomics in haproxy and am seeing a consistent 2-7% perf increase
depending on the tests on an 80-core Ampere Altra (neoverse-n1). There
as well we're significantly using atomics to read/update mostly local
variables as we avoid sharing as much as possible. I'm pretty sure it
does hurt in certain cases, and we don't have this distinction of per_cpu
variants like here, however that makes me think about adding a "mostly
local" variant that we can choose from depending on the context. I'll
continue to experiment, thanks for sharing this trick (particularly to
Yicong Yang, the original reporter).

Willy

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-01  9:44 ` Willy Tarreau
@ 2025-11-01 18:07 ` Paul E. McKenney
0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-01 18:07 UTC (permalink / raw)
To: Willy Tarreau
Cc: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

On Sat, Nov 01, 2025 at 10:44:48AM +0100, Willy Tarreau wrote:
> Hi!
>
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > > -----------------8<------------------------
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > >  	unsigned int loop;						\
> > > >  	u##sz tmp;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > >  	unsigned int loop;						\
> > > >  	u##sz ret;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > -----------------8<------------------------
> > >
> > > I will give this a shot, thank you!
> >
> > Jackpot!!!
> >
> > This reduces the overhead to 8.427, which is significantly better than
> > the non-LSE value of 9.853.  Still room for improvement, but much
> > better than the 100ns values.
>
> This is super interesting! I've blindly applied a similar change to all
> of our atomics in haproxy and am seeing a consistent 2-7% perf increase
> depending on the tests on an 80-core Ampere Altra (neoverse-n1). There
> as well we're significantly using atomics to read/update mostly local
> variables as we avoid sharing as much as possible. I'm pretty sure it
> does hurt in certain cases, and we don't have this distinction of per_cpu
> variants like here, however that makes me think about adding a "mostly
> local" variant that we can choose from depending on the context. I'll
> continue to experiment, thanks for sharing this trick (particularly to
> Yicong Yang, the original reporter).

Agreed!  And before I forget (again!):

Tested-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-01  3:25 ` Paul E. McKenney
2025-11-01  9:44 ` Willy Tarreau
@ 2025-11-01 11:23 ` Catalin Marinas
2025-11-01 11:41   ` Yicong Yang
2025-11-03 20:12   ` Palmer Dabbelt
1 sibling, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-11-01 11:23 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Will Deacon, Mark Rutland, linux-arm-kernel, Willy Tarreau, Yicong Yang

On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > I just realised that patch doesn't touch percpu.h at all. So what about
> > > something like (untested):
> > >
> > > -----------------8<------------------------
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > >  	unsigned int loop;						\
> > >  	u##sz tmp;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > >  	unsigned int loop;						\
> > >  	u##sz ret;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > -----------------8<------------------------
> >
> > I will give this a shot, thank you!
>
> Jackpot!!!
>
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853.  Still room for improvement, but much
> better than the 100ns values.
>
> I presume that you will send this up the normal path, but in the meantime,
> I will pull this in for further local testing, and thank you!

I think for this specific case it may work, for the futex as well but
not generally. The Neoverse-V2 TRM lists some controls in the
IMP_CPUECTLR_EL1, bits 29 to 33:

https://developer.arm.com/documentation/102375/0002

These can be configured depending on the system configuration but they
are too big knobs to cover all use-cases within an OS. This register is
typically configured by firmware, we don't touch it in Linux.

I'll dig some more but we may have to do tricks like prefetch if we
can't find a hardware configuration that satisfies all cases.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-01 11:23 ` Catalin Marinas
@ 2025-11-01 11:41 ` Yicong Yang
2025-11-05 13:25   ` Catalin Marinas
1 sibling, 1 reply; 46+ messages in thread
From: Yicong Yang @ 2025-11-01 11:41 UTC (permalink / raw)
To: Catalin Marinas, Paul E. McKenney
Cc: yangyccccc, Will Deacon, Mark Rutland, linux-arm-kernel, Willy Tarreau

On 2025/11/1 19:23, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
>> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
>>> On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
>>>> I just realised that patch doesn't touch percpu.h at all. So what about
>>>> something like (untested):
>>>>
>>>> -----------------8<------------------------
>>>> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
>>>> index 9abcc8ef3087..e381034324e1 100644
>>>> --- a/arch/arm64/include/asm/percpu.h
>>>> +++ b/arch/arm64/include/asm/percpu.h
>>>> @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>>>>  	unsigned int loop;						\
>>>>  	u##sz tmp;							\
>>>>  									\
>>>> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>>>>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>>>>  	/* LL/SC */							\
>>>>  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
>>>> @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
>>>>  	unsigned int loop;						\
>>>>  	u##sz ret;							\
>>>>  									\
>>>> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>>>>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>>>>  	/* LL/SC */							\
>>>>  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
>>>> -----------------8<------------------------
>>> I will give this a shot, thank you!
>> Jackpot!!!
>>
>> This reduces the overhead to 8.427, which is significantly better than
>> the non-LSE value of 9.853.  Still room for improvement, but much
>> better than the 100ns values.
>>
>> I presume that you will send this up the normal path, but in the meantime,
>> I will pull this in for further local testing, and thank you!
> I think for this specific case it may work, for the futex as well but
> not generally. The Neoverse-V2 TRM lists some controls in the
> IMP_CPUECTLR_EL1, bits 29 to 33:
>
> https://developer.arm.com/documentation/102375/0002
>
> These can be configured depending on the system configuration but they
> are too big knobs to cover all use-cases within an OS. This register is
> typically configured by firmware, we don't touch it in Linux.
>
> I'll dig some more but we may have to do tricks like prefetch if we
> can't find a hardware configuration that satisfies all cases.
>
FYI, there's a version that allows a prefetch to be added prior to LSE
operations via one boot option [1]. If we want to reconsider it in this
way, it's more flexible and can be controlled by the OS without touching
the system configurations (may need to update the firmware). But we'd
need to add the prefetch in the per-cpu implementation as you've noticed
above (I didn't add it there since there is no prefetch for the LL/SC
implementation; maybe that's a missing piece?)

[1] https://lore.kernel.org/all/20250919091747.3702-1-yangyicong@huawei.com/

thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-01 11:41 ` Yicong Yang
@ 2025-11-05 13:25 ` Catalin Marinas
2025-11-05 13:42   ` Willy Tarreau
1 sibling, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 13:25 UTC (permalink / raw)
To: Yicong Yang
Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel, Willy Tarreau

On Sat, Nov 01, 2025 at 07:41:02PM +0800, Yicong Yang wrote:
> FYI, there's a version that allows a prefetch to be added prior to LSE
> operations via one boot option [1]. If we want to reconsider it in this
> way, it's more flexible and can be controlled by the OS without touching
> the system configurations (may need to update the firmware).

I'm against adding boot time options for this. We either add them
permanently if beneficial for most microarchitectures or we get back to
the hardware people to ask for improvements (or, potentially, imp def
configurations like we have on a few of the Arm Ltd implementations).

> But we'd need to add the prefetch in the per-cpu implementation as
> you've noticed above (I didn't add it there since there is no prefetch
> for the LL/SC implementation; maybe that's a missing piece?)

Maybe no-one stressed these to notice any difference between LL/SC and
LSE.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-05 13:25 ` Catalin Marinas
@ 2025-11-05 13:42 ` Willy Tarreau
2025-11-05 14:49   ` Catalin Marinas
0 siblings, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-05 13:42 UTC (permalink / raw)
To: Catalin Marinas
Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > But need to add the prefetch in per-cpu implementation as you've
> > noticed above (didn't add it since no prefetch for LL/SC
> > implementation there, maybe a missing?)
>
> Maybe no-one stressed these to notice any difference between LL/SC and
> LSE.

Huh ? I can say for certain that LL/SC is a no-go beyond 16 cores, for
having faced catastrophic performance there on haproxy, while with LSE
it continues to scale almost linearly at least till 64. But that does
not mean that if some possibilities are within reach to recover 90% of
the atomic overhead in uncontended case we shouldn't try to grab it at
a reasonable cost!

I'm definitely adding in my todo list to experiment more on this on
various CPUs now ;-)

Willy

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-05 13:42 ` Willy Tarreau
@ 2025-11-05 14:49 ` Catalin Marinas
2025-11-05 16:21   ` Breno Leitao
2025-11-06  7:44   ` Willy Tarreau
1 sibling, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 14:49 UTC (permalink / raw)
To: Willy Tarreau
Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > But need to add the prefetch in per-cpu implementation as you've
> > > noticed above (didn't add it since no prefetch for LL/SC
> > > implementation there, maybe a missing?)
> >
> > Maybe no-one stressed these to notice any difference between LL/SC and
> > LSE.
>
> Huh ? I can say for certain that LL/SC is a no-go beyond 16 cores, for
> having faced catastrophic performance there on haproxy, while with LSE
> it continues to scale almost linearly at least till 64.

I was referring only to the this_cpu_add() etc. functions (until Paul
started using them). There definitely have been lots of benchmarks on
the scalability of LL/SC. That's one of the reasons Arm added the LSE
atomics years ago.

> But that does
> not mean that if some possibilities are within reach to recover 90% of
> the atomic overhead in uncontended case we shouldn't try to grab it at
> a reasonable cost!

I agree. Even for these cases, I don't think the solution is LL/SC but
rather better use of LSE (and better understanding of the hardware
behaviour; feedback here should go both ways).

> I'm definitely adding in my todo list to experiment more on this on
> various CPUs now ;-)

Thanks for the tests so far, very insightful. I think what's still
good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
Breno's microbenchmarks. I suspect LDADD is still better.

FWIW, Neoverse-N1 has an erratum affecting the far atomics and they are
all forced near, so this explains the consistent results you got with
STADD on this CPU. On other CPUs, STADD would likely be executed far
unless it hits in the L1 cache.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-05 14:49 ` Catalin Marinas
@ 2025-11-05 16:21 ` Breno Leitao
0 siblings, 0 replies; 46+ messages in thread
From: Breno Leitao @ 2025-11-05 16:21 UTC (permalink / raw)
To: Catalin Marinas
Cc: Willy Tarreau, Yicong Yang, Paul E. McKenney, Will Deacon,
	Mark Rutland, linux-arm-kernel, kernel-team, rmikey, palmer

On Wed, Nov 05, 2025 at 02:49:39PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> > On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > > But need to add the prefetch in per-cpu implementation as you've
> > > > noticed above (didn't add it since no prefetch for LL/SC
> > > > implementation there, maybe a missing?)
> > >
> > > Maybe no-one stressed these to notice any difference between LL/SC and
> > > LSE.
> >
> > Huh ? I can say for certain that LL/SC is a no-go beyond 16 cores, for
> > having faced catastrophic performance there on haproxy, while with LSE
> > it continues to scale almost linearly at least till 64.
>
> I was referring only to the this_cpu_add() etc. functions (until Paul
> started using them). There definitely have been lots of benchmarks on
> the scalability of LL/SC. That's one of the reasons Arm added the LSE
> atomics years ago.
>
> > But that does
> > not mean that if some possibilities are within reach to recover 90% of
> > the atomic overhead in uncontended case we shouldn't try to grab it at
> > a reasonable cost!
>
> I agree. Even for these cases, I don't think the solution is LL/SC but
> rather better use of LSE (and better understanding of the hardware
> behaviour; feedback here should go both ways).
>
> > I'm definitely adding in my todo list to experiment more on this on
> > various CPUs now ;-)
>
> Thanks for the tests so far, very insightful. I think what's still
> good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
> Breno's microbenchmarks. I suspect LDADD is still better.

I've hacked my microbenchmark to add these tests Catalin suggested, and
it seems prfm improves the latency variation.

This is what I am measuring now:

	/* LL/SC implementation */
	void __percpu_add_case_64_llsc(void *ptr, unsigned long val)
	{
		asm volatile(
			/* LL/SC */
			"1:	ldxr	%[tmp], %[ptr]\n"
			"	add	%[tmp], %[tmp], %[val]\n"
			"	stxr	%w[loop], %[tmp], %[ptr]\n"
			"	cbnz	%w[loop], 1b"
			: [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}

	/* LSE implementation using stadd */
	void __percpu_add_case_64_lse(void *ptr, unsigned long val)
	{
		asm volatile(
			/* LSE atomics */
			"	stadd	%[val], %[ptr]\n"
			: [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}

	/* LSE implementation using ldadd */
	void __percpu_add_case_64_ldadd(void *ptr, unsigned long val)
	{
		asm volatile(
			/* LSE atomics */
			"	ldadd	%[val], %[tmp], %[ptr]\n"
			: [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}

	/* LSE implementation using PRFM + stadd */
	void __percpu_add_case_64_prfm_stadd(void *ptr, unsigned long val)
	{
		asm volatile(
			/* Prefetch + LSE atomics */
			"	prfm	pstl1keep, %[ptr]\n"
			"	stadd	%[val], %[ptr]\n"
			: [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}

	/* LSE implementation using PRFM STRM + stadd */
	void __percpu_add_case_64_prfm_strm_stadd(void *ptr, unsigned long val)
	{
		asm volatile(
			/* Prefetch streaming + LSE atomics */
			"	prfm	pstl1strm, %[ptr]\n"
			"	stadd	%[val], %[ptr]\n"
			: [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}

And prfm definitely added some stability to STADD, but, in most cases,
it is still a bit behind the regular ldxr/stxr.

CPU: 0 - Latency Percentiles:
====================
LL/SC           : p50:   5.73 ns  p95:   5.90 ns  p99:   7.35 ns
STADD           : p50:  65.99 ns  p95:  68.98 ns  p99:  70.13 ns
LDADD           : p50:   4.33 ns  p95:   4.34 ns  p99:   4.34 ns
PRFM_KEEP+STADD : p50:   7.89 ns  p95:   7.91 ns  p99:   8.82 ns
PRFM_STRM+STADD : p50:   7.89 ns  p95:   8.11 ns  p99:   9.76 ns

CPU: 1 - Latency Percentiles:
====================
LL/SC           : p50:   7.72 ns  p95:  18.00 ns  p99:  31.51 ns
STADD           : p50: 103.81 ns  p95: 127.60 ns  p99: 137.12 ns
LDADD           : p50:   4.35 ns  p95:  22.46 ns  p99:  25.03 ns
PRFM_KEEP+STADD : p50:   7.89 ns  p95:  22.04 ns  p99:  23.66 ns
PRFM_STRM+STADD : p50:   7.89 ns  p95:   8.75 ns  p99:  11.10 ns

CPU: 2 - Latency Percentiles:
====================
LL/SC           : p50:   5.73 ns  p95:   6.87 ns  p99:  23.96 ns
STADD           : p50:  63.30 ns  p95:  63.33 ns  p99:  63.36 ns
LDADD           : p50:   4.34 ns  p95:   4.35 ns  p99:   4.35 ns
PRFM_KEEP+STADD : p50:   7.89 ns  p95:   7.90 ns  p99:   7.91 ns
PRFM_STRM+STADD : p50:   7.89 ns  p95:   7.90 ns  p99:   7.90 ns

CPU: 3 - Latency Percentiles:
====================
LL/SC           : p50:   5.70 ns  p95:   5.71 ns  p99:   5.72 ns
STADD           : p50:  61.94 ns  p95:  62.95 ns  p99:  65.05 ns
LDADD           : p50:   4.32 ns  p95:   4.33 ns  p99:   7.28 ns
PRFM_KEEP+STADD : p50:   7.86 ns  p95:   7.87 ns  p99:   8.08 ns
PRFM_STRM+STADD : p50:   7.86 ns  p95:   7.87 ns  p99:   8.25 ns

CPU: 4 - Latency Percentiles:
====================
LL/SC           : p50:   5.72 ns  p95:   5.73 ns  p99:   5.74 ns
STADD           : p50:  62.04 ns  p95: 122.78 ns  p99: 131.43 ns
LDADD           : p50:   8.08 ns  p95:  11.70 ns  p99:  14.89 ns
PRFM_KEEP+STADD : p50:  13.83 ns  p95:  20.70 ns  p99:  22.54 ns
PRFM_STRM+STADD : p50:  12.80 ns  p95:  19.42 ns  p99:  20.36 ns

CPU: 5 - Latency Percentiles:
====================
LL/SC           : p50:   5.68 ns  p95:   5.70 ns  p99:   5.70 ns
STADD           : p50:  59.30 ns  p95:  60.52 ns  p99:  66.53 ns
LDADD           : p50:   4.30 ns  p95:   4.31 ns  p99:   4.32 ns
PRFM_KEEP+STADD : p50:   7.84 ns  p95:   7.85 ns  p99:   7.85 ns
PRFM_STRM+STADD : p50:   7.84 ns  p95:   7.85 ns  p99:   7.85 ns

CPU: 6 - Latency Percentiles:
====================
LL/SC           : p50:   5.70 ns  p95:   5.71 ns  p99:   5.72 ns
STADD           : p50:  59.37 ns  p95:  59.41 ns  p99:  59.42 ns
LDADD           : p50:   4.32 ns  p95:   4.32 ns  p99:   4.34 ns
PRFM_KEEP+STADD : p50:   7.85 ns  p95:   7.86 ns  p99:   7.88 ns
PRFM_STRM+STADD : p50:   7.85 ns  p95:   7.86 ns  p99:   7.86 ns

CPU: 7 - Latency Percentiles:
====================
LL/SC           : p50:   5.72 ns  p95:   5.74 ns  p99:   6.90 ns
STADD           : p50:  64.46 ns  p95:  74.34 ns  p99:  77.47 ns
LDADD           : p50:   4.35 ns  p95:   7.50 ns  p99:  10.06 ns
PRFM_KEEP+STADD : p50:   8.92 ns  p95:  14.34 ns  p99:  17.31 ns
PRFM_STRM+STADD : p50:   8.88 ns  p95:  13.74 ns  p99:  15.11 ns

As always, the code can be found at
https://github.com/leitao/debug/tree/main/LSE

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-05 14:49 ` Catalin Marinas
2025-11-05 16:21 ` Breno Leitao
@ 2025-11-06  7:44 ` Willy Tarreau
2025-11-06 13:53   ` Catalin Marinas
1 sibling, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-06 7:44 UTC (permalink / raw)
To: Catalin Marinas
Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 02:49:39PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> > On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > > But need to add the prefetch in per-cpu implementation as you've
> > > > noticed above (didn't add it since no prefetch for LL/SC
> > > > implementation there, maybe a missing?)
> > >
> > > Maybe no-one stressed these to notice any difference between LL/SC and
> > > LSE.
> >
> > Huh ? I can say for certain that LL/SC is a no-go beyond 16 cores, for
> > having faced catastrophic performance there on haproxy, while with LSE
> > it continues to scale almost linearly at least till 64.
>
> I was referring only to the this_cpu_add() etc. functions (until Paul
> started using them).

Ah OK thanks for clarifying!

> There definitely have been lots of benchmarks on
> the scalability of LL/SC. That's one of the reasons Arm added the LSE
> atomics years ago.

Yes that's what I thought, which is why your sentence shocked me in the
first place :-)

> > But that does
> > not mean that if some possibilities are within reach to recover 90% of
> > the atomic overhead in uncontended case we shouldn't try to grab it at
> > a reasonable cost!
>
> I agree. Even for these cases, I don't think the solution is LL/SC but
> rather better use of LSE (and better understanding of the hardware
> behaviour; feedback here should go both ways).

I totally agree. I'm happy to have discovered the near vs far distinction
there that I was not aware of because it will make me think differently
in the future when having to design around shared stuff.

> > I'm definitely adding in my todo list to experiment more on this on
> > various CPUs now ;-)
>
> Thanks for the tests so far, very insightful. I think what's still
> good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
> Breno's microbenchmarks. I suspect LDADD is still better.

Yep as confirmed with Breno's last test after your message.

> FWIW, Neoverse-N1 has an erratum affecting the far atomics and they are
> all forced near, so this explains the consistent results you got with
> STADD on this CPU. On other CPUs, STADD would likely be executed far
> unless it hits in the L1 cache.

Ah, thanks for letting me know! This indeed explains the difference.

Do you have pointers to some docs suggesting what instructions to use
when you prefer a near or far operation, like here with stadd vs ldadd ?

Also does this mean that with LSE a pure store will always be far unless
prefetched ? Or should we trick stores using stadd mem,0 / ldadd mem,0
to hint a near vs far store for example ? I'm also wondering about CAS,
if there's a way to perform the usual load+CAS sequence exclusively using
far operations to avoid cache lines bouncing in contended environments,
because there are cases where a constant 50-60ns per CAS would be awesome,
or maybe even a CAS that remains far in case of failure or triggers the
prefetch of the line in case of success, for the typical
CAS(ptr, NULL, mine) used to try to own a shared resource.

Thanks,
Willy

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics?
2025-11-06  7:44 ` Willy Tarreau
@ 2025-11-06 13:53 ` Catalin Marinas
2025-11-06 14:16   ` Willy Tarreau
0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-06 13:53 UTC (permalink / raw)
To: Willy Tarreau
Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel

On Thu, Nov 06, 2025 at 08:44:39AM +0100, Willy Tarreau wrote:
> Do you have pointers to some docs suggesting what instructions to use
> when you prefer a near or far operation, like here with stadd vs ldadd ?

Unfortunately, the architecture spec does not make any distinction
between far or near atomics, that's rather a microarchitecture and
system implementation detail. Some of the information is hidden in
specific CPU TRMs and the behaviour may differ between implementations.

I hope Arm will publish some docs/blogs to give some guidance to
software folk (and other non-Arm Ltd microarchitects; it would be good
if they are all aligned, though some may see this as their value-add).

> Also does this mean that with LSE a pure store will always be far unless
> prefetched ? Or should we trick stores using stadd mem,0 / ldadd mem,0
> to hint a near vs far store for example ?

For the Arm Ltd implementations, _usually_ store-only atomics are
executed far while those returning a value are near. But that's subject
to implementation-defined configurations (e.g. IMP_CPUECTLR_EL1). Also
the hardware may try to be smarter, e.g. detect contention and switch
from one behaviour to another.

> I'm also wondering about CAS,
> if there's a way to perform the usual load+CAS sequence exclusively using
> far operations to avoid cache lines bouncing in contended environments,
> because there are cases where a constant 50-60ns per CAS would be awesome,
> or maybe even a CAS that remains far in case of failure or triggers the
> prefetch of the line in case of success, for the typical
> CAS(ptr, NULL, mine) used to try to own a shared resource.

Talking to other engineers in Arm, I learnt that the architecture even
describes a way the programmer can hint at CAS loops. Instead of an LDR,
use something (informally) called ICAS - a CAS where the Xs and Xt
registers are the same (actual registers, not the value they contain).
The in-memory value comparison with Xs either passes and the written
value would be the same (imp def whether a write actually takes place)
or fails (in theory, hw is allowed to write the same old value back). So
while the value in Xs is less relevant, CAS will return the value in
memory. The hardware detects the ICAS+CAS constructs and aims to make
them faster.

From the C6.2.50 in the Arm ARM (the CAS description):

  For a CAS or CASA instruction, when <Ws> or <Xs> specifies the same
  register as <Wt> or <Xt>, this signals to the memory system that an
  additional subsequent CAS, CASA, CASAL, or CASL access to the
  specified location is likely to occur in the near future. The memory
  system can respond by taking actions that are expected to enable the
  subsequent CAS, CASA, CASAL, or CASL access to succeed when it does
  occur.

I guess something to add to Breno's microbenchmarks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-06 13:53 ` Catalin Marinas @ 2025-11-06 14:16 ` Willy Tarreau 0 siblings, 0 replies; 46+ messages in thread From: Willy Tarreau @ 2025-11-06 14:16 UTC (permalink / raw) To: Catalin Marinas Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel On Thu, Nov 06, 2025 at 01:53:04PM +0000, Catalin Marinas wrote: > On Thu, Nov 06, 2025 at 08:44:39AM +0100, Willy Tarreau wrote: > > Do you have pointers to some docs suggesting what instructions to use > > when you prefer a near or far operation, like here with stadd vs ldadd ? > > Unfortunately, the architecture spec does not make any distinction > between far or near atomics, that's rather a microarchitecture and > system implementation detail. Some of the information is hidden in > specific CPU TRMs and the behaviour may differ between implementations. > > I hope Arm will publish some docs/blogs to give some guidance to > software folk (and other non-Arm Ltd microarchitects; it would be good > if they are all aligned, though some may see this as their value-add). Yes I can definitely understand that it's never easy to place the cursor between how to help developers get the most of your arch and how to keep competitors away. > > Also does this mean that with LSE a pure store will always be far unless > > prefetched ? Or should we trick stores using stadd mem,0 / ldadd mem,0 > > to hint a near vs far store for example ? > > For the Arm Ltd implementations, _usually_ store-only atomics are > executed far while those returning a value are near. But that's subject > to implementation-defined configurations (e.g. IMP_CPUECTLR_EL1). Also > the hardware may try to be smarter, e.g. detect contention and switch > from one behaviour to another. OK, thanks for the explanation. It makes sense and tends to match what one could naturally expect. > > I'm also wondering about CAS, > > if there's a way to perform the usual load+CAS sequence exclusively using > > far operations to avoid cache lines bouncing in contended environments, > > because there are cases where a constant 50-60ns per CAS would be awesome, > > or maybe even a CAS that remains far in case of failure or triggers the > > prefetch of the line in case of success, for the typical > > CAS(ptr, NULL, mine) used to try to own a shared resource. > > Talking to other engineers in Arm, I learnt that the architecture even > describes a way the programmer can hint at CAS loops. Instead of an LDR, > use something (informally) called ICAS - a CAS where the Xs and Xt > registers are the same (actual registers, not the value they contain). > The in-memory value comparison with Xs either passes and the written > value would be the same (imp def whether a write actually takes place) > or fails (in theory, hw is allowed to write the same old value back). This is super interesting, thanks for sharing! > So > while the value in Xs is less relevant, CAS will return the value in > memory. The hardware detects the ICAS+CAS constructs and aims to make > them faster. I had already notice some x86 models being able to often succeed on the second CAS attempt, and suspected that they'd force the L1 to hold the line until the next attempt for this purpose. This could be roughly similar. 
> From the C6.2.50 in the Arm ARM (the CAS description): > > For a CAS or CASA instruction, when <Ws> or <Xs> specifies the same > register as <Wt> or <Xt>, this signals to the memory system that an > additional subsequent CAS, CASA, CASAL, or CASL access to the > specified location is likely to occur in the near future. The memory > system can respond by taking actions that are expected to enable the > subsequent CAS, CASA, CASAL, or CASL access to succeed when it does > occur. > > I guess something to add to Breno's microbenchmarks. I think so as well. Many thanks again for sharing such precious info! Willy ^ permalink raw reply [flat|nested] 46+ messages in thread
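[Editor's note: to make the ICAS+CAS construct discussed above concrete, here is a minimal user-space sketch of the CAS(ptr, 0, mine) "try to own" pattern with the hint applied: the initial read is done with a CAS whose Xs and Xt are the same register, per C6.2.50, instead of a plain LDR. This is only an illustration under stated assumptions (LSE-capable toolchain, e.g. -march=armv8.1-a); icas_read_hint() and try_own() are made-up names, not kernel or library APIs.]

#include <stdint.h>
#include <stdbool.h>

/* Read *addr with CAS Xs==Xt (the informal "ICAS"), which also hints to the
 * memory system that another CAS to this location is likely to follow. */
static inline uint64_t icas_read_hint(uint64_t *addr)
{
	uint64_t val = 0;

	asm volatile("cas %x[val], %x[val], %[mem]"
		     : [val] "+r" (val), [mem] "+Q" (*addr)
		     : : "memory");
	return val;	/* CAS always returns the old memory value in Xs */
}

/* Typical "try to own a shared resource": only attempt the real CAS if the
 * hinting read saw the slot free (0). */
static inline bool try_own(uint64_t *owner, uint64_t mine)
{
	uint64_t expected = icas_read_hint(owner);

	if (expected != 0)
		return false;	/* someone else owns it */
	return __atomic_compare_exchange_n(owner, &expected, mine, false,
					   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}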
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-01 11:23 ` Catalin Marinas 2025-11-01 11:41 ` Yicong Yang @ 2025-11-03 20:12 ` Palmer Dabbelt 1 sibling, 0 replies; 46+ messages in thread From: Palmer Dabbelt @ 2025-11-03 20:12 UTC (permalink / raw) To: Catalin Marinas Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel, w, yangyicong On Sat, 01 Nov 2025 04:23:22 PDT (-0700), Catalin Marinas wrote: > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote: >> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote: >> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote: >> > > I just realised that patch doesn't touch percpu.h at all. So what about >> > > something like (untested): >> > > >> > > -----------------8<------------------------ >> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h >> > > index 9abcc8ef3087..e381034324e1 100644 >> > > --- a/arch/arm64/include/asm/percpu.h >> > > +++ b/arch/arm64/include/asm/percpu.h >> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ >> > > unsigned int loop; \ >> > > u##sz tmp; \ >> > > \ >> > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); >> > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ >> > > /* LL/SC */ \ >> > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \ >> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \ >> > > unsigned int loop; \ >> > > u##sz ret; \ >> > > \ >> > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); >> > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ >> > > /* LL/SC */ \ >> > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \ >> > > -----------------8<------------------------ >> > >> > I will give this a shot, thank you! >> >> Jackpot!!! >> >> This reduces the overhead to 8.427, which is significantly better than >> the non-LSE value of 9.853. Still room for improvement, but much >> better than the 100ns values. >> >> I presume that you will send this up the normal path, but in the meantime, >> I will pull this in for further local testing, and thank you! > > I think for this specific case it may work, for the futex as well but > not generally. The Neoverse-V2 TRM lists some controls in the > IMP_CPUECTLR_EL1, bits 29 to 33: > > https://developer.arm.com/documentation/102375/0002 > > These can be configured depending on the system configuration but they > are too big knobs to cover all use-cases within an OS. This register is > typically configured by firmware, we don't touch it in Linux. Mostly for Paul: I have patch to let you do this from Linux, and I have some firmware for some of these internal systems that lets you set most of these magic bits. I've noticed some unexpected behavior around prefetch distance on an internal workload, but haven't gotten much farther there. There's also some other bits that to wacky things... Just FYI: Marc described trying to set these dynamically as trying to swallow a running chainsaw, but LMK if you're feeling risky and I can try and get you a copy of my setup. They seem to work fine for me ;) > I'll dig some more but we may have to do tricks like prefetch if we > can't find a hardware configuration that satisfies all cases. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-01 3:25 ` Paul E. McKenney 2025-11-01 9:44 ` Willy Tarreau 2025-11-01 11:23 ` Catalin Marinas @ 2025-11-03 21:49 ` Catalin Marinas 2025-11-03 21:56 ` Willy Tarreau 2025-11-04 17:05 ` Catalin Marinas 3 siblings, 1 reply; 46+ messages in thread From: Catalin Marinas @ 2025-11-03 21:49 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote: > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote: > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote: > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h > > > index 9abcc8ef3087..e381034324e1 100644 > > > --- a/arch/arm64/include/asm/percpu.h > > > +++ b/arch/arm64/include/asm/percpu.h > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ > > > unsigned int loop; \ > > > u##sz tmp; \ > > > \ > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > /* LL/SC */ \ > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \ > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \ > > > unsigned int loop; \ > > > u##sz ret; \ > > > \ > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > /* LL/SC */ \ > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \ > > > -----------------8<------------------------ > > > > I will give this a shot, thank you! > > Jackpot!!! > > This reduces the overhead to 8.427, which is significantly better than > the non-LSE value of 9.853. Still room for improvement, but much > better than the 100ns values. Just curious, if you have time, could you try prefetchw() instead of the above asm? That would be a PRFM PSTL1KEEP instead of STRM. Are __srcu_read_lock() and __srcu_read_unlock() usually touching the same cache line? Thanks. -- Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-03 21:49 ` Catalin Marinas @ 2025-11-03 21:56 ` Willy Tarreau 0 siblings, 0 replies; 46+ messages in thread From: Willy Tarreau @ 2025-11-03 21:56 UTC (permalink / raw) To: Catalin Marinas Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel On Mon, Nov 03, 2025 at 09:49:56PM +0000, Catalin Marinas wrote: > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote: > > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote: > > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote: > > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h > > > > index 9abcc8ef3087..e381034324e1 100644 > > > > --- a/arch/arm64/include/asm/percpu.h > > > > +++ b/arch/arm64/include/asm/percpu.h > > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ > > > > unsigned int loop; \ > > > > u##sz tmp; \ > > > > \ > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > > /* LL/SC */ \ > > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \ > > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \ > > > > unsigned int loop; \ > > > > u##sz ret; \ > > > > \ > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > > /* LL/SC */ \ > > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \ > > > > -----------------8<------------------------ > > > > > > I will give this a shot, thank you! > > > > Jackpot!!! > > > > This reduces the overhead to 8.427, which is significantly better than > > the non-LSE value of 9.853. Still room for improvement, but much > > better than the 100ns values. > > Just curious, if you have time, could you try prefetchw() instead of the > above asm? That would be a PRFM PSTL1KEEP instead of STRM. Are > __srcu_read_lock() and __srcu_read_unlock() usually touching the same > cache line? FWIW I tested PRFM PSTL1KEEP this morning on the Altra with haproxy just out of curiosity and didn't notice a difference with PRFM PSTL1STRM. Maybe in the kernel it will be different though. Willy ^ permalink raw reply [flat|nested] 46+ messages in thread
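[Editor's note: for reference, the two hints compared above are both "prefetch for store into L1" and differ only in temporal policy. A minimal sketch of what each expands to follows; prefetch_store_keep()/prefetch_store_strm() are illustrative names, and on arm64 the kernel's prefetchw() corresponds to the KEEP form, as Catalin notes above.]

/* KEEP: allocate the line normally in L1 (temporal reuse expected). */
static inline void prefetch_store_keep(const void *p)
{
	asm volatile("prfm pstl1keep, %a0" : : "p" (p));
}

/* STRM: streaming/non-temporal hint -- the line is expected to be used once. */
static inline void prefetch_store_strm(const void *p)
{
	asm volatile("prfm pstl1strm, %a0" : : "p" (p));
}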
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-01 3:25 ` Paul E. McKenney ` (2 preceding siblings ...) 2025-11-03 21:49 ` Catalin Marinas @ 2025-11-04 17:05 ` Catalin Marinas 2025-11-04 18:43 ` Paul E. McKenney 3 siblings, 1 reply; 46+ messages in thread From: Catalin Marinas @ 2025-11-04 17:05 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote: > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote: > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote: > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h > > > index 9abcc8ef3087..e381034324e1 100644 > > > --- a/arch/arm64/include/asm/percpu.h > > > +++ b/arch/arm64/include/asm/percpu.h > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ > > > unsigned int loop; \ > > > u##sz tmp; \ > > > \ > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > /* LL/SC */ \ > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \ > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \ > > > unsigned int loop; \ > > > u##sz ret; \ > > > \ > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > /* LL/SC */ \ > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \ > > > -----------------8<------------------------ > > > > I will give this a shot, thank you! > > Jackpot!!! > > This reduces the overhead to 8.427, which is significantly better than > the non-LSE value of 9.853. Still room for improvement, but much > better than the 100ns values. > > I presume that you will send this up the normal path, but in the meantime, > I will pull this in for further local testing, and thank you! After an educative discussion with the microarchitects, I think the hardware is behaving as intended, it just doesn't always fit the software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in Linux as a STADD instruction (that's LDADD with XZR as destination; i.e. no need to return the value read from memory). This is typically executed as "far" or posted (unless it hits in the L1 cache) and intended for stat updates. At a quick grep, it matches the majority of the use-cases in Linux. Most other atomics (those with a return) are executed "near", so filling the cache lines (assuming default CPUECTLR configuration). For the SRCU case, STADD especially together with the DMB after lock and before unlock, executing it far does slow things down. A microbenchmark doing this in a loop is a lot worse than it would appear in practice (saturating buses down the path to memory). A quick test to check this theory, if that's the functions you were benchmarking (it generates LDADD instead): ---------------------8<---------------------------------------- diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h index 42098e0fa0b7..5a6f3999883d 100644 --- a/include/linux/srcutree.h +++ b/include/linux/srcutree.h @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp); if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)) - this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader. + this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader. else atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader. 
barrier(); /* Avoid leaking the critical section. */ @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp) { barrier(); /* Avoid leaking the critical section. */ if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)) - this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader. + this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader. else atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader. } diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 1ff94b76d91f..c025d9135689 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp) { struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp); - this_cpu_inc(scp->srcu_locks.counter); + this_cpu_inc_return(scp->srcu_locks.counter); smp_mb(); /* B */ /* Avoid leaking the critical section. */ return __srcu_ptr_to_ctr(ssp, scp); } @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock); void __srcu_read_unlock(struct srcu_struct *ssp, int idx) { smp_mb(); /* C */ /* Avoid leaking the critical section. */ - this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); + this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); } EXPORT_SYMBOL_GPL(__srcu_read_unlock); ---------------------8<---------------------------------------- To make things better for the non-fast variants above, we should add this_cpu_inc_return_acquire() etc. semantics (strangely, this_cpu_inc_return() doesn't have full barrier semantics as atomic_inc_return()). I'm not sure about adding the prefetch since most other uses of this_cpu_add() are meant for stat updates and there's not much point in brining in a cache line. I think we could add release/acquire variants that generate LDADDA/L and maybe a slightly different API for the __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add full barrier semantics to the current _return() variants. -- Catalin ^ permalink raw reply related [flat|nested] 46+ messages in thread
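[Editor's note: the STADD-versus-LDADD distinction explained above can be seen from plain C as well. A hedged user-space illustration, assuming LSE codegen (e.g. -march=armv8.1-a) and a recent GCC/Clang -- exact output is compiler-dependent, so check the disassembly on your toolchain: discarding the old value typically yields STADD (LDADD with XZR as destination, executed far), while consuming it yields LDADD into a real register (executed near).]

#include <stdint.h>

/* Old value discarded: typically compiles to "stadd x1, [x0]" -- a posted,
 * far-executed update, fine for statistics counters. */
void stat_inc(uint64_t *ctr)
{
	__atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
}

/* Old value consumed: typically compiles to "ldadd x1, x2, [x0]" -- executed
 * near, pulling the line into the local cache. */
uint64_t counted_inc(uint64_t *ctr)
{
	return __atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
}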
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 17:05 ` Catalin Marinas @ 2025-11-04 18:43 ` Paul E. McKenney 2025-11-04 20:10 ` Paul E. McKenney 0 siblings, 1 reply; 46+ messages in thread From: Paul E. McKenney @ 2025-11-04 18:43 UTC (permalink / raw) To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote: > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote: > > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote: > > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote: > > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h > > > > index 9abcc8ef3087..e381034324e1 100644 > > > > --- a/arch/arm64/include/asm/percpu.h > > > > +++ b/arch/arm64/include/asm/percpu.h > > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ > > > > unsigned int loop; \ > > > > u##sz tmp; \ > > > > \ > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > > /* LL/SC */ \ > > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \ > > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \ > > > > unsigned int loop; \ > > > > u##sz ret; \ > > > > \ > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > > /* LL/SC */ \ > > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \ > > > > -----------------8<------------------------ > > > > > > I will give this a shot, thank you! > > > > Jackpot!!! > > > > This reduces the overhead to 8.427, which is significantly better than > > the non-LSE value of 9.853. Still room for improvement, but much > > better than the 100ns values. > > > > I presume that you will send this up the normal path, but in the meantime, > > I will pull this in for further local testing, and thank you! > > After an educative discussion with the microarchitects, I think the > hardware is behaving as intended, it just doesn't always fit the > software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in > Linux as a STADD instruction (that's LDADD with XZR as destination; i.e. > no need to return the value read from memory). This is typically > executed as "far" or posted (unless it hits in the L1 cache) and > intended for stat updates. At a quick grep, it matches the majority of > the use-cases in Linux. Most other atomics (those with a return) are > executed "near", so filling the cache lines (assuming default CPUECTLR > configuration). OK... > For the SRCU case, STADD especially together with the DMB after lock and > before unlock, executing it far does slow things down. A microbenchmark > doing this in a loop is a lot worse than it would appear in practice > (saturating buses down the path to memory). In this srcu_read_lock_fast_updown() case, there was no DMB. But for srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB. (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.) > A quick test to check this theory, if that's the functions you were > benchmarking (it generates LDADD instead): Thank you for digging into this! 
> ---------------------8<---------------------------------------- > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > index 42098e0fa0b7..5a6f3999883d 100644 > --- a/include/linux/srcutree.h > +++ b/include/linux/srcutree.h > @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp); > > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)) > - this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader. > + this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader. > else > atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader. > barrier(); /* Avoid leaking the critical section. */ > @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp) > { > barrier(); /* Avoid leaking the critical section. */ > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)) > - this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader. > + this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader. > else > atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader. > } > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > index 1ff94b76d91f..c025d9135689 100644 > --- a/kernel/rcu/srcutree.c > +++ b/kernel/rcu/srcutree.c > @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp) > { > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp); > > - this_cpu_inc(scp->srcu_locks.counter); > + this_cpu_inc_return(scp->srcu_locks.counter); > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > return __srcu_ptr_to_ctr(ssp, scp); > } > @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock); > void __srcu_read_unlock(struct srcu_struct *ssp, int idx) > { > smp_mb(); /* C */ /* Avoid leaking the critical section. */ > - this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); > + this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); > } > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > > ---------------------8<---------------------------------------- > > To make things better for the non-fast variants above, we should add > this_cpu_inc_return_acquire() etc. semantics (strangely, > this_cpu_inc_return() doesn't have full barrier semantics as > atomic_inc_return()). > > I'm not sure about adding the prefetch since most other uses of > this_cpu_add() are meant for stat updates and there's not much point in > brining in a cache line. I think we could add release/acquire variants > that generate LDADDA/L and maybe a slightly different API for the > __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add > full barrier semantics to the current _return() variants. But other architectures might well have this_cpu_inc_return() running more slowly than this_cpu_inc(). So my thought would be to make a this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and this_cpu_inc() elsewhere. I could imagine this_cpu_inc_local() or some such, but it is not clear that the added API explosion is yet justified. Or is there a better way? Thanx, Paul ^ permalink raw reply [flat|nested] 46+ messages in thread
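[Editor's note: a minimal sketch of the hypothetical this_cpu_inc_srcu() wrapper floated above -- this is not an existing kernel API, just the shape of the idea, with arm64 opting in to the value-returning (near-executed) form and everyone else keeping the plain increment. In practice the selection would more likely be an arch-provided override than a direct CONFIG_ARM64 test; the #ifdef is only to keep the sketch short.]

#ifdef CONFIG_ARM64
#define this_cpu_inc_srcu(pcp)	this_cpu_inc_return(pcp)	/* LDADD, executed near */
#else
#define this_cpu_inc_srcu(pcp)	this_cpu_inc(pcp)		/* cheapest local increment */
#endif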
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 18:43 ` Paul E. McKenney @ 2025-11-04 20:10 ` Paul E. McKenney 2025-11-05 15:34 ` Catalin Marinas 0 siblings, 1 reply; 46+ messages in thread From: Paul E. McKenney @ 2025-11-04 20:10 UTC (permalink / raw) To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote: > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote: > > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote: > > > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote: > > > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote: > > > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h > > > > > index 9abcc8ef3087..e381034324e1 100644 > > > > > --- a/arch/arm64/include/asm/percpu.h > > > > > +++ b/arch/arm64/include/asm/percpu.h > > > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ > > > > > unsigned int loop; \ > > > > > u##sz tmp; \ > > > > > \ > > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > > > /* LL/SC */ \ > > > > > "1: ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n" \ > > > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val) \ > > > > > unsigned int loop; \ > > > > > u##sz ret; \ > > > > > \ > > > > > + asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr)); > > > > > asm volatile (ARM64_LSE_ATOMIC_INSN( \ > > > > > /* LL/SC */ \ > > > > > "1: ldxr" #sfx "\t%" #w "[ret], %[ptr]\n" \ > > > > > -----------------8<------------------------ > > > > > > > > I will give this a shot, thank you! > > > > > > Jackpot!!! > > > > > > This reduces the overhead to 8.427, which is significantly better than > > > the non-LSE value of 9.853. Still room for improvement, but much > > > better than the 100ns values. > > > > > > I presume that you will send this up the normal path, but in the meantime, > > > I will pull this in for further local testing, and thank you! > > > > After an educative discussion with the microarchitects, I think the > > hardware is behaving as intended, it just doesn't always fit the > > software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in > > Linux as a STADD instruction (that's LDADD with XZR as destination; i.e. > > no need to return the value read from memory). This is typically > > executed as "far" or posted (unless it hits in the L1 cache) and > > intended for stat updates. At a quick grep, it matches the majority of > > the use-cases in Linux. Most other atomics (those with a return) are > > executed "near", so filling the cache lines (assuming default CPUECTLR > > configuration). > > OK... > > > For the SRCU case, STADD especially together with the DMB after lock and > > before unlock, executing it far does slow things down. A microbenchmark > > doing this in a loop is a lot worse than it would appear in practice > > (saturating buses down the path to memory). > > In this srcu_read_lock_fast_updown() case, there was no DMB. But for > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB. > (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.) > > > A quick test to check this theory, if that's the functions you were > > benchmarking (it generates LDADD instead): > > Thank you for digging into this! 
And this_cpu_inc_return() does speed things up on my hardware to about the same extent as did the prefetch instruction, so thank you again. However, it gets me more than a 4x slowdown on x86, so I cannot make this change in common code. So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via something like this_cpu_inc_srcu(), but not for the upcoming merge window, but the one after that, sticking with my current interrupt-disabling non-atomic approach in the meantime (which gets me most of the benefit). Alternatively, would it work for me to put that cache-prefetch instruction into SRCU for arm64? My guess is "absolutely not!", but I figured that I should ask. But if both of these approaches proves problematic, I might need some way to distinguish between systems having slow LSE and those that do not. Or I can stick with disabling interrupts across non-atomic updates. Thoughts? Thanx, Paul > > ---------------------8<---------------------------------------- > > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h > > index 42098e0fa0b7..5a6f3999883d 100644 > > --- a/include/linux/srcutree.h > > +++ b/include/linux/srcutree.h > > @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src > > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp); > > > > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)) > > - this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader. > > + this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader. > > else > > atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks)); // Y, and implicit RCU reader. > > barrier(); /* Avoid leaking the critical section. */ > > @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp) > > { > > barrier(); /* Avoid leaking the critical section. */ > > if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE)) > > - this_cpu_inc(scp->srcu_unlocks.counter); // Z, and implicit RCU reader. > > + this_cpu_inc_return(scp->srcu_unlocks.counter); // Z, and implicit RCU reader. > > else > > atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks)); // Z, and implicit RCU reader. > > } > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c > > index 1ff94b76d91f..c025d9135689 100644 > > --- a/kernel/rcu/srcutree.c > > +++ b/kernel/rcu/srcutree.c > > @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp) > > { > > struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp); > > > > - this_cpu_inc(scp->srcu_locks.counter); > > + this_cpu_inc_return(scp->srcu_locks.counter); > > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > > return __srcu_ptr_to_ctr(ssp, scp); > > } > > @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock); > > void __srcu_read_unlock(struct srcu_struct *ssp, int idx) > > { > > smp_mb(); /* C */ /* Avoid leaking the critical section. */ > > - this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); > > + this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter); > > } > > EXPORT_SYMBOL_GPL(__srcu_read_unlock); > > > > ---------------------8<---------------------------------------- > > > > To make things better for the non-fast variants above, we should add > > this_cpu_inc_return_acquire() etc. semantics (strangely, > > this_cpu_inc_return() doesn't have full barrier semantics as > > atomic_inc_return()). 
> > > > I'm not sure about adding the prefetch since most other uses of > > this_cpu_add() are meant for stat updates and there's not much point in > > brining in a cache line. I think we could add release/acquire variants > > that generate LDADDA/L and maybe a slightly different API for the > > __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add > > full barrier semantics to the current _return() variants. > > But other architectures might well have this_cpu_inc_return() running > more slowly than this_cpu_inc(). So my thought would be to make a > this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and > this_cpu_inc() elsewhere. > > I could imagine this_cpu_inc_local() or some such, but it is not clear > that the added API explosion is yet justified. > > Or is there a better way? > > Thanx, Paul ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 20:10 ` Paul E. McKenney @ 2025-11-05 15:34 ` Catalin Marinas 2025-11-05 16:25 ` Paul E. McKenney 0 siblings, 1 reply; 46+ messages in thread From: Catalin Marinas @ 2025-11-05 15:34 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Tue, Nov 04, 2025 at 12:10:36PM -0800, Paul E. McKenney wrote: > On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote: > > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote: > > > For the SRCU case, STADD especially together with the DMB after lock and > > > before unlock, executing it far does slow things down. A microbenchmark > > > doing this in a loop is a lot worse than it would appear in practice > > > (saturating buses down the path to memory). > > > > In this srcu_read_lock_fast_updown() case, there was no DMB. But for > > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB. > > (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.) > > > > > A quick test to check this theory, if that's the functions you were > > > benchmarking (it generates LDADD instead): > > > > Thank you for digging into this! > > And this_cpu_inc_return() does speed things up on my hardware to about > the same extent as did the prefetch instruction, so thank you again. > However, it gets me more than a 4x slowdown on x86, so I cannot make > this change in common code. Definitely not suggesting that we use the 'return' variants in the generic code. More likely change the arm64 code to use them for the per-CPU atomics. > So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via > something like this_cpu_inc_srcu(), but not for the upcoming merge window, > but the one after that, sticking with my current interrupt-disabling > non-atomic approach in the meantime (which gets me most of the benefit). > Alternatively, would it work for me to put that cache-prefetch instruction > into SRCU for arm64? My guess is "absolutely not!", but I figured that > I should ask. Given that this_cpu_*() are meant for the local CPU, there's less risk of cache line bouncing between CPUs, so I'm happy to change them to either use PRFM or LDADD (I think I prefer the latter). This would not be a generic change for the other atomics, only the per-CPU ones. > But if both of these approaches proves problematic, I might need some > way to distinguish between systems having slow LSE and those that do not. It's not that systems have slow or fast atomics, more like they are slow or fast for specific use-cases. Their default behaviour may differ and at least in the Arm Ltd cases, this is configurable. An STADD executed in the L1 cache (near) may be better for your case and some microbenchmarks but not necessarily for others. I've heard of results of database use-cases where STADD executed far is better than LDADD executed near when the location is shared between multiple CPUs. In these cases even a PRFM can be problematic as it tends to bring a unique copy of the cacheline invalidating the others (well, again, microarch specific). For the Arm Ltd implementations, I think the behaviour for most of the (recent) CPUs is that load atomics, CAS, SWP are executed near while the store atomics far (subject to configuration, errata, interconnect). Arm should probably provide some guidance here so that other implementers and software people know how/when to use them. -- Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 15:34 ` Catalin Marinas @ 2025-11-05 16:25 ` Paul E. McKenney 2025-11-05 17:15 ` Catalin Marinas 0 siblings, 1 reply; 46+ messages in thread From: Paul E. McKenney @ 2025-11-05 16:25 UTC (permalink / raw) To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote: > On Tue, Nov 04, 2025 at 12:10:36PM -0800, Paul E. McKenney wrote: > > On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote: > > > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote: > > > > For the SRCU case, STADD especially together with the DMB after lock and > > > > before unlock, executing it far does slow things down. A microbenchmark > > > > doing this in a loop is a lot worse than it would appear in practice > > > > (saturating buses down the path to memory). > > > > > > In this srcu_read_lock_fast_updown() case, there was no DMB. But for > > > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB. > > > (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.) > > > > > > > A quick test to check this theory, if that's the functions you were > > > > benchmarking (it generates LDADD instead): > > > > > > Thank you for digging into this! > > > > And this_cpu_inc_return() does speed things up on my hardware to about > > the same extent as did the prefetch instruction, so thank you again. > > However, it gets me more than a 4x slowdown on x86, so I cannot make > > this change in common code. > > Definitely not suggesting that we use the 'return' variants in the > generic code. More likely change the arm64 code to use them for the > per-CPU atomics. Whew!!! ;-) > > So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via > > something like this_cpu_inc_srcu(), but not for the upcoming merge window, > > but the one after that, sticking with my current interrupt-disabling > > non-atomic approach in the meantime (which gets me most of the benefit). > > Alternatively, would it work for me to put that cache-prefetch instruction > > into SRCU for arm64? My guess is "absolutely not!", but I figured that > > I should ask. > > Given that this_cpu_*() are meant for the local CPU, there's less risk > of cache line bouncing between CPUs, so I'm happy to change them to > either use PRFM or LDADD (I think I prefer the latter). This would not > be a generic change for the other atomics, only the per-CPU ones. I have easy access to only the one type of ARM system, and of course the choice must be driven by a wide range of systems. But yes, it would be much better if we can just use this_cpu_inc(). I will use the non-atomics protected by interrupt disabling in the meantime, but look forward to being able to switch back. > > But if both of these approaches proves problematic, I might need some > > way to distinguish between systems having slow LSE and those that do not. > > It's not that systems have slow or fast atomics, more like they are slow > or fast for specific use-cases. Their default behaviour may differ and > at least in the Arm Ltd cases, this is configurable. An STADD executed > in the L1 cache (near) may be better for your case and some > microbenchmarks but not necessarily for others. I've heard of results of > database use-cases where STADD executed far is better than LDADD > executed near when the location is shared between multiple CPUs. 
In > these cases even a PRFM can be problematic as it tends to bring a unique > copy of the cacheline invalidating the others (well, again, microarch > specific). Fair point, and I do need to be careful not to read too much into the results from my one type of system. Plus, to your point elsewhere in this thread, making the hardware better would be quite welcome as well. > For the Arm Ltd implementations, I think the behaviour for most of the > (recent) CPUs is that load atomics, CAS, SWP are executed near while the > store atomics far (subject to configuration, errata, interconnect). Arm > should probably provide some guidance here so that other implementers > and software people know how/when to use them. Or make the hardware figure out what to do automatically for each use case as it executes. Perhaps a bit utopian, but it is nevertheless a good direction to aim for. Thanx, Paul > -- > Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 16:25 ` Paul E. McKenney @ 2025-11-05 17:15 ` Catalin Marinas 2025-11-05 17:40 ` Paul E. McKenney 0 siblings, 1 reply; 46+ messages in thread From: Catalin Marinas @ 2025-11-05 17:15 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote: > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote: > > Given that this_cpu_*() are meant for the local CPU, there's less risk > > of cache line bouncing between CPUs, so I'm happy to change them to > > either use PRFM or LDADD (I think I prefer the latter). This would not > > be a generic change for the other atomics, only the per-CPU ones. > > I have easy access to only the one type of ARM system, and of course > the choice must be driven by a wide range of systems. But yes, it > would be much better if we can just use this_cpu_inc(). I will use the > non-atomics protected by interrupt disabling in the meantime, but look > forward to being able to switch back. BTW, did you find a problem with this_cpu_inc() in normal use with SRCU or just in a microbenchmark hammering them? From what I understand from the hardware folk, doing STADD in a loop saturates some queues in the interconnect and slows down eventually. In normal use, it's just a posted operation not affecting the subsequent instructions (or at least that's the theory). -- Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 17:15 ` Catalin Marinas @ 2025-11-05 17:40 ` Paul E. McKenney 2025-11-05 19:16 ` Catalin Marinas 0 siblings, 1 reply; 46+ messages in thread From: Paul E. McKenney @ 2025-11-05 17:40 UTC (permalink / raw) To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote: > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote: > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote: > > > Given that this_cpu_*() are meant for the local CPU, there's less risk > > > of cache line bouncing between CPUs, so I'm happy to change them to > > > either use PRFM or LDADD (I think I prefer the latter). This would not > > > be a generic change for the other atomics, only the per-CPU ones. > > > > I have easy access to only the one type of ARM system, and of course > > the choice must be driven by a wide range of systems. But yes, it > > would be much better if we can just use this_cpu_inc(). I will use the > > non-atomics protected by interrupt disabling in the meantime, but look > > forward to being able to switch back. > > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU > or just in a microbenchmark hammering them? From what I understand from > the hardware folk, doing STADD in a loop saturates some queues in the > interconnect and slows down eventually. In normal use, it's just a > posted operation not affecting the subsequent instructions (or at least > that's the theory). Only in a microbenchmark, and Breno did not find any issues in larger benchmarks, so good to hear! Now, some non-arm64 systems deal with it just fine, but perhaps I owe everyone an apology for the firedrill. But let me put it this way... Would you ack an SRCU patch that resulted in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on other systems? Thanx, Paul ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 17:40 ` Paul E. McKenney @ 2025-11-05 19:16 ` Catalin Marinas 2025-11-05 19:47 ` Paul E. McKenney 2025-11-05 21:13 ` Palmer Dabbelt 0 siblings, 2 replies; 46+ messages in thread From: Catalin Marinas @ 2025-11-05 19:16 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote: > On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote: > > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote: > > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote: > > > > Given that this_cpu_*() are meant for the local CPU, there's less risk > > > > of cache line bouncing between CPUs, so I'm happy to change them to > > > > either use PRFM or LDADD (I think I prefer the latter). This would not > > > > be a generic change for the other atomics, only the per-CPU ones. > > > > > > I have easy access to only the one type of ARM system, and of course > > > the choice must be driven by a wide range of systems. But yes, it > > > would be much better if we can just use this_cpu_inc(). I will use the > > > non-atomics protected by interrupt disabling in the meantime, but look > > > forward to being able to switch back. > > > > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU > > or just in a microbenchmark hammering them? From what I understand from > > the hardware folk, doing STADD in a loop saturates some queues in the > > interconnect and slows down eventually. In normal use, it's just a > > posted operation not affecting the subsequent instructions (or at least > > that's the theory). > > Only in a microbenchmark, and Breno did not find any issues in larger > benchmarks, so good to hear! > > Now, some non-arm64 systems deal with it just fine, but perhaps I owe > everyone an apology for the firedrill. That was a useful exercise, I learnt more things about the arm atomics. > But let me put it this way... Would you ack an SRCU patch that resulted > in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on > other systems? Only if it's backed by other microbenchmarks showing significant improvements ;). I think we should change the percpu atomics, it makes more sense to do them near, but I'll keep the others as they are. Planning to post a proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings all day). Something like below but with more comments and a commit log: ------------------------8<-------------------------- diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h index 9abcc8ef3087..d4dff4b0cf50 100644 --- a/arch/arm64/include/asm/percpu.h +++ b/arch/arm64/include/asm/percpu.h @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ " stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n" \ " cbnz %w[loop], 1b", \ /* LSE atomics */ \ - #op_lse "\t%" #w "[val], %[ptr]\n" \ + #op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n" \ __nops(3)) \ : [loop] "=&r" (loop), [tmp] "=&r" (tmp), \ [ptr] "+Q"(*(u##sz *)ptr) \ @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8) PERCPU_RW_OPS(16) PERCPU_RW_OPS(32) PERCPU_RW_OPS(64) -PERCPU_OP(add, add, stadd) -PERCPU_OP(andnot, bic, stclr) -PERCPU_OP(or, orr, stset) +PERCPU_OP(add, add, ldadd) +PERCPU_OP(andnot, bic, ldclr) +PERCPU_OP(or, orr, ldset) PERCPU_RET_OP(add, add, ldadd) #undef PERCPU_RW_OPS ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 19:16 ` Catalin Marinas @ 2025-11-05 19:47 ` Paul E. McKenney 2025-11-05 20:17 ` Catalin Marinas 2025-11-05 21:13 ` Palmer Dabbelt 1 sibling, 1 reply; 46+ messages in thread From: Paul E. McKenney @ 2025-11-05 19:47 UTC (permalink / raw) To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 07:16:42PM +0000, Catalin Marinas wrote: > On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote: > > On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote: > > > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote: > > > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote: > > > > > Given that this_cpu_*() are meant for the local CPU, there's less risk > > > > > of cache line bouncing between CPUs, so I'm happy to change them to > > > > > either use PRFM or LDADD (I think I prefer the latter). This would not > > > > > be a generic change for the other atomics, only the per-CPU ones. > > > > > > > > I have easy access to only the one type of ARM system, and of course > > > > the choice must be driven by a wide range of systems. But yes, it > > > > would be much better if we can just use this_cpu_inc(). I will use the > > > > non-atomics protected by interrupt disabling in the meantime, but look > > > > forward to being able to switch back. > > > > > > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU > > > or just in a microbenchmark hammering them? From what I understand from > > > the hardware folk, doing STADD in a loop saturates some queues in the > > > interconnect and slows down eventually. In normal use, it's just a > > > posted operation not affecting the subsequent instructions (or at least > > > that's the theory). > > > > Only in a microbenchmark, and Breno did not find any issues in larger > > benchmarks, so good to hear! > > > > Now, some non-arm64 systems deal with it just fine, but perhaps I owe > > everyone an apology for the firedrill. > > That was a useful exercise, I learnt more things about the arm atomics. I am glad that it had some good effect. ;-) > > But let me put it this way... Would you ack an SRCU patch that resulted > > in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on > > other systems? > > Only if it's backed by other microbenchmarks showing significant > improvements ;). Well, it did reduce from about 140ns with SRCU to about 100ns with SRCU-fast-updown due to removing the full memory barrier, so there is that. > I think we should change the percpu atomics, it makes more sense to do > them near, but I'll keep the others as they are. Planning to post a > proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings > all day). Something like below but with more comments and a commit log: I do like fixing this in the arm64 percpu atomics. 
However, this gets me assembler errors, perhaps because I am using an old version of gcc or perhaps because I am still based off of v6.17-rc1: /tmp/ccYlMkU1.s: Assembler messages: /tmp/ccYlMkU1.s:9292: Error: invalid addressing mode at operand 2 -- `stadd x2,x4,[x0]' /tmp/ccYlMkU1.s:9428: Error: invalid addressing mode at operand 2 -- `stadd x3,x6,[x1]' /tmp/ccYlMkU1.s:9299: Error: attempt to move .org backwards /tmp/ccYlMkU1.s:9435: Error: attempt to move .org backwards Thanx, Paul > ------------------------8<-------------------------- > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h > index 9abcc8ef3087..d4dff4b0cf50 100644 > --- a/arch/arm64/include/asm/percpu.h > +++ b/arch/arm64/include/asm/percpu.h > @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val) \ > " stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n" \ > " cbnz %w[loop], 1b", \ > /* LSE atomics */ \ > - #op_lse "\t%" #w "[val], %[ptr]\n" \ > + #op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n" \ > __nops(3)) \ > : [loop] "=&r" (loop), [tmp] "=&r" (tmp), \ > [ptr] "+Q"(*(u##sz *)ptr) \ > @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8) > PERCPU_RW_OPS(16) > PERCPU_RW_OPS(32) > PERCPU_RW_OPS(64) > -PERCPU_OP(add, add, stadd) > -PERCPU_OP(andnot, bic, stclr) > -PERCPU_OP(or, orr, stset) > +PERCPU_OP(add, add, ldadd) > +PERCPU_OP(andnot, bic, ldclr) > +PERCPU_OP(or, orr, ldset) > PERCPU_RET_OP(add, add, ldadd) > > #undef PERCPU_RW_OPS > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 19:47 ` Paul E. McKenney @ 2025-11-05 20:17 ` Catalin Marinas 2025-11-05 20:45 ` Paul E. McKenney 0 siblings, 1 reply; 46+ messages in thread From: Catalin Marinas @ 2025-11-05 20:17 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 11:47:07AM -0800, Paul E. McKenney wrote: > On Wed, Nov 05, 2025 at 07:16:42PM +0000, Catalin Marinas wrote: > > I think we should change the percpu atomics, it makes more sense to do > > them near, but I'll keep the others as they are. Planning to post a > > proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings > > all day). Something like below but with more comments and a commit log: > > I do like fixing this in the arm64 percpu atomics. However, this gets > me assembler errors, perhaps because I am using an old version of gcc > or perhaps because I am still based off of v6.17-rc1: > > /tmp/ccYlMkU1.s: Assembler messages: > /tmp/ccYlMkU1.s:9292: Error: invalid addressing mode at operand 2 -- `stadd x2,x4,[x0]' > /tmp/ccYlMkU1.s:9428: Error: invalid addressing mode at operand 2 -- `stadd x3,x6,[x1]' > /tmp/ccYlMkU1.s:9299: Error: attempt to move .org backwards > /tmp/ccYlMkU1.s:9435: Error: attempt to move .org backwards Are you sure it is applied correctly? There shouldn't be any trace of stadd in asm/percpu.h. -- Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 20:17 ` Catalin Marinas @ 2025-11-05 20:45 ` Paul E. McKenney 0 siblings, 0 replies; 46+ messages in thread From: Paul E. McKenney @ 2025-11-05 20:45 UTC (permalink / raw) To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel On Wed, Nov 05, 2025 at 08:17:25PM +0000, Catalin Marinas wrote: > On Wed, Nov 05, 2025 at 11:47:07AM -0800, Paul E. McKenney wrote: > > On Wed, Nov 05, 2025 at 07:16:42PM +0000, Catalin Marinas wrote: > > > I think we should change the percpu atomics, it makes more sense to do > > > them near, but I'll keep the others as they are. Planning to post a > > > proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings > > > all day). Something like below but with more comments and a commit log: > > > > I do like fixing this in the arm64 percpu atomics. However, this gets > > me assembler errors, perhaps because I am using an old version of gcc > > or perhaps because I am still based off of v6.17-rc1: > > > > /tmp/ccYlMkU1.s: Assembler messages: > > /tmp/ccYlMkU1.s:9292: Error: invalid addressing mode at operand 2 -- `stadd x2,x4,[x0]' > > /tmp/ccYlMkU1.s:9428: Error: invalid addressing mode at operand 2 -- `stadd x3,x6,[x1]' > > /tmp/ccYlMkU1.s:9299: Error: attempt to move .org backwards > > /tmp/ccYlMkU1.s:9435: Error: attempt to move .org backwards > > Are you sure it is applied correctly? There shouldn't be any trace of > stadd in asm/percpu.h. Right in one, apologies! And just for the record, it is a bad idea to apply such a patch while rebasing and being in a C++ forward-progress discussion. :-/ This gets us to the usual good-case latency, in this case 8.333ns. Tested-by: Paul E. McKenney <paulmck@kernel.org> Thanx, Paul ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-05 19:16 ` Catalin Marinas 2025-11-05 19:47 ` Paul E. McKenney @ 2025-11-05 21:13 ` Palmer Dabbelt 2025-11-06 14:00 ` Catalin Marinas 1 sibling, 1 reply; 46+ messages in thread From: Palmer Dabbelt @ 2025-11-05 21:13 UTC (permalink / raw) To: Catalin Marinas; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel On Wed, 05 Nov 2025 11:16:42 PST (-0800), Catalin Marinas wrote: > On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote: >> On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote: >> > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote: >> > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote: >> > > > Given that this_cpu_*() are meant for the local CPU, there's less risk >> > > > of cache line bouncing between CPUs, so I'm happy to change them to >> > > > either use PRFM or LDADD (I think I prefer the latter). This would not >> > > > be a generic change for the other atomics, only the per-CPU ones. >> > > >> > > I have easy access to only the one type of ARM system, and of course >> > > the choice must be driven by a wide range of systems. But yes, it >> > > would be much better if we can just use this_cpu_inc(). I will use the >> > > non-atomics protected by interrupt disabling in the meantime, but look >> > > forward to being able to switch back. >> > >> > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU >> > or just in a microbenchmark hammering them? From what I understand from >> > the hardware folk, doing STADD in a loop saturates some queues in the >> > interconnect and slows down eventually. In normal use, it's just a >> > posted operation not affecting the subsequent instructions (or at least >> > that's the theory). >> >> Only in a microbenchmark, and Breno did not find any issues in larger >> benchmarks, so good to hear! FWIW, I have a proxy workload where enabling ATOMIC_*_FORCE_NEAR is ~1% better (at application-level throughput). It's supposed to be representative of real workloads and isn't supposed to have contention, but I don't trust these workloads at all so take that with a grain of salt... I still had looking into this on my TODO list, I was planning on doing it all internally as a tuning thing but LMK if folks think it's interesting and I'll try to find some way to talk about it publicly. >> Now, some non-arm64 systems deal with it just fine, but perhaps I owe >> everyone an apology for the firedrill. > > That was a useful exercise, I learnt more things about the arm atomics. > >> But let me put it this way... Would you ack an SRCU patch that resulted >> in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on >> other systems? > > Only if it's backed by other microbenchmarks showing significant > improvements ;). > > I think we should change the percpu atomics, it makes more sense to do > them near, but I'll keep the others as they are. Planning to post a I guess I kind of went down a rabbit hole here, but I think I found some interesting stuff. This is all based on some modifications of Breno's microbenchmark to add two things: * A contending thread, which performs the same operation on the same counter in a loop, with operations separated by a variable-counted loop of NOPs. * Some busy work for the timed thread, which is also just a loop of NOPs. 
Those loops look like for (d = 0; d < duty; d++) __asm__ volatile ("nop"); in the code and get compiled to for (d = 0; d < duty; d++) 41037c: f90007ff str xzr, [sp, #8] 410380: 14000001 b 410384 <run_core_benchmark+0x74> 410384: f94007e8 ldr x8, [sp, #8] 410388: f85e03a9 ldur x9, [x29, #-32] 41038c: eb090108 subs x8, x8, x9 410390: 54000102 b.cs 4103b0 <run_core_benchmark+0xa0> // b.hs, b.nlast 410394: 14000001 b 410398 <run_core_benchmark+0x88> __asm__ volatile ("nop"); 410398: d503201f nop 41039c: 14000001 b 4103a0 <run_core_benchmark+0x90> for (d = 0; d < duty; d++) 4103a0: f94007e8 ldr x8, [sp, #8] 4103a4: 91000508 add x8, x8, #0x1 4103a8: f90007e8 str x8, [sp, #8] 4103ac: 17fffff6 b 410384 <run_core_benchmark+0x74> } which is I guess kind of wacky generated code, but is maybe a reasonable proxy for work -- it's got load/stores/branches, which IIUC is what real code does ;) I ran a bunch of cases with those: CPU: 0 - Latency Percentiles: ==================== LSE (stadd) (c 0, d 0): p50: 063.65 ns p95: 065.02 ns p99: 065.32 ns LSE (stadd) (c 0, d 100): p50: 063.71 ns p95: 064.96 ns p99: 065.68 ns LSE (stadd) (c 0, d 200): p50: 068.07 ns p95: 082.98 ns p99: 083.24 ns LSE (stadd) (c 0, d 300): p50: 098.96 ns p95: 121.14 ns p99: 122.04 ns LSE (stadd) (c 10, d 0): p50: 115.33 ns p95: 117.25 ns p99: 117.35 ns LSE (stadd) (c 10, d 300): p50: 115.30 ns p95: 119.12 ns p99: 121.68 ns LSE (stadd) (c 10, d 500): p50: 162.94 ns p95: 185.24 ns p99: 195.79 ns LSE (stadd) (c 30, d 0): p50: 115.17 ns p95: 117.14 ns p99: 117.84 ns LSE (stadd) (c 100, d 0): p50: 115.17 ns p95: 117.13 ns p99: 117.35 ns LSE (stadd) (c 10000, d 0): p50: 064.81 ns p95: 066.24 ns p99: 067.08 ns LL/SC (c 0, d 0): p50: 005.66 ns p95: 006.45 ns p99: 006.47 ns LL/SC (c 0, d 10): p50: 006.19 ns p95: 006.98 ns p99: 007.01 ns LL/SC (c 0, d 20): p50: 007.35 ns p95: 008.88 ns p99: 009.46 ns LL/SC (c 10, d 0): p50: 164.16 ns p95: 462.97 ns p99: 580.92 ns LL/SC (c 10, d 10): p50: 303.22 ns p95: 575.03 ns p99: 609.62 ns LL/SC (c 10, d 20): p50: 032.24 ns p95: 042.03 ns p99: 048.71 ns LL/SC (c 1000, d 0): p50: 017.37 ns p95: 018.18 ns p99: 018.19 ns LL/SC (c 1000, d 10): p50: 019.54 ns p95: 020.37 ns p99: 021.79 ns LL/SC (c 1000000, d 0): p50: 015.46 ns p95: 017.00 ns p99: 017.25 ns LL/SC (c 1000000, d 10): p50: 017.57 ns p95: 019.16 ns p99: 019.47 ns LDADD (c 0, d 0): p50: 004.33 ns p95: 004.64 ns p99: 005.13 ns LDADD (c 0, d 100): p50: 032.15 ns p95: 040.29 ns p99: 040.69 ns LDADD (c 0, d 200): p50: 067.97 ns p95: 083.04 ns p99: 083.30 ns LDADD (c 0, d 300): p50: 098.93 ns p95: 120.79 ns p99: 122.52 ns LDADD (c 1, d 100): p50: 049.19 ns p95: 072.23 ns p99: 072.38 ns LDADD (c 1, d 200): p50: 143.15 ns p95: 145.34 ns p99: 145.90 ns LDADD (c 1, d 300): p50: 153.91 ns p95: 162.57 ns p99: 163.84 ns LDADD (c 10, d 0): p50: 012.46 ns p95: 013.24 ns p99: 014.33 ns LDADD (c 10, d 100): p50: 049.34 ns p95: 069.35 ns p99: 070.71 ns LDADD (c 10, d 200): p50: 141.66 ns p95: 143.65 ns p99: 144.31 ns LDADD (c 10, d 300): p50: 152.82 ns p95: 163.51 ns p99: 164.03 ns LDADD (c 100, d 0): p50: 012.37 ns p95: 013.23 ns p99: 014.52 ns LDADD (c 100, d 10): p50: 014.32 ns p95: 015.11 ns p99: 015.15 ns PFRM_KEEP+STADD (c 0, d 0): p50: 003.97 ns p95: 005.23 ns p99: 005.49 ns PFRM_KEEP+STADD (c 10, d 0): p50: 126.02 ns p95: 127.72 ns p99: 128.72 ns PFRM_KEEP+STADD (c 1000000, d 0): p50: 021.97 ns p95: 023.93 ns p99: 024.97 ns PFRM_KEEP+STADD (c 1000000, d 100): p50: 076.28 ns p95: 080.88 ns p99: 081.50 ns PFRM_KEEP+STADD (c 1000000, d 200): p50: 089.62 ns p95: 
091.49 ns  p99: 091.89 ns
  PFRM_STRM+STADD (c 0, d 0):         p50: 003.97 ns  p95: 005.23 ns  p99: 005.47 ns
  PFRM_STRM+STADD (c 10, d 0):        p50: 126.75 ns  p95: 128.96 ns  p99: 129.48 ns
  PFRM_STRM+STADD (c 1000000, d 0):   p50: 021.83 ns  p95: 023.75 ns  p99: 023.96 ns
  PFRM_STRM+STADD (c 1000000, d 100): p50: 074.48 ns  p95: 079.56 ns  p99: 080.73 ns
  PFRM_STRM+STADD (c 1000000, d 200): p50: 089.76 ns  p95: 091.14 ns  p99: 092.46 ns

Which I'm interpreting to say the following:

* LL/SC is pretty good for the common cases, but gets really bad under the
  pathological cases.  It still seems always slower than LDADD.
* STADD has latency that blocks other STADDs, but not other CPU-local work.
  I'd bet there's a bunch of interactions with caches and memory ordering
  here, but those would all just make STADD look worse so I'm just ignoring
  them.
* LDADD is better than STADD even under pathologically highly contended
  cases.  I was actually kind of surprised about this one, I thought the far
  atomics would be better there.
* The prefetches help STADD, but they don't seem to make it better than
  LDADD in any case.
* The LDADD latency also happens concurrently with other CPU operations
  like the STADD latency does.  It has less latency to hide, so the latency
  starts to go up with less extra work, but it's never worse than STADD.

So I think at least on this system, LDADD is just always better.

[My code's up in a PR to Breno's repo: https://github.com/leitao/debug/pull/2]

> proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
> all day). Something like below but with more comments and a commit log:
>
> ------------------------8<--------------------------
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..d4dff4b0cf50 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)	\
>  "	stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n"			\
>  "	cbnz	%w[loop], 1b",						\
>  	/* LSE atomics */						\
> -	#op_lse "\t%" #w "[val], %[ptr]\n"				\
> +	#op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n"		\
>  	__nops(3))							\
>  	: [loop] "=&r" (loop), [tmp] "=&r" (tmp),			\
>  	  [ptr] "+Q"(*(u##sz *)ptr)					\
> @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8)
>  PERCPU_RW_OPS(16)
>  PERCPU_RW_OPS(32)
>  PERCPU_RW_OPS(64)
> -PERCPU_OP(add, add, stadd)
> -PERCPU_OP(andnot, bic, stclr)
> -PERCPU_OP(or, orr, stset)
> +PERCPU_OP(add, add, ldadd)
> +PERCPU_OP(andnot, bic, ldclr)
> +PERCPU_OP(or, orr, ldset)
>  PERCPU_RET_OP(add, add, ldadd)
>
>  #undef PERCPU_RW_OPS

^ permalink raw reply	[flat|nested] 46+ messages in thread
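For readers who want the shape of this experiment without reading the PR
above, here is a rough, hedged sketch of the two knobs described: a contender
thread spaced by `c` NOP iterations, plus `d` NOP iterations of busy work
around each timed operation. Everything here is illustrative (names, the use
of STADD, reporting an average instead of percentiles, no CPU pinning);
Palmer's actual code is in the PR and differs in detail. Build for arm64 with
something like gcc -O2 -pthread -march=armv8-a+lse.

/*
 * Sketch of the contended microbenchmark structure: a contender thread
 * hammers the same counter, spaced by `c` NOPs, while the timed thread
 * performs the atomic op plus `d` NOPs of busy work per iteration.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static volatile int stop;
static uint64_t counter;

static inline void atomic_add_lse(uint64_t *p, uint64_t v)
{
	asm volatile("stadd %[v], %[p]" : [p] "+Q" (*p) : [v] "r" (v) : "memory");
}

static inline void spin_nops(unsigned long n)
{
	for (unsigned long i = 0; i < n; i++)
		asm volatile("nop");
}

static void *contender(void *arg)
{
	unsigned long c = (unsigned long)(uintptr_t)arg;

	while (!stop) {
		atomic_add_lse(&counter, 1);
		spin_nops(c);		/* spacing between contending atomics */
	}
	return NULL;
}

int main(void)
{
	const unsigned long c = 10, d = 100, iters = 1000000;
	struct timespec t0, t1;
	pthread_t tid;

	pthread_create(&tid, NULL, contender, (void *)(uintptr_t)c);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned long i = 0; i < iters; i++) {
		atomic_add_lse(&counter, 1);
		spin_nops(d);		/* busy work between timed atomics */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	stop = 1;
	pthread_join(tid, NULL);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("avg %.2f ns per (atomic + %lu nops)\n", ns / iters, d);
	return 0;
}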
* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 21:13 ` Palmer Dabbelt
@ 2025-11-06 14:00   ` Catalin Marinas
  2025-11-06 16:30     ` Palmer Dabbelt
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-06 14:00 UTC (permalink / raw)
  To: Palmer Dabbelt; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
> I ran a bunch of cases with those:
[...]
> Which I'm interpreting to say the following:
>
> * LL/SC is pretty good for the common cases, but gets really bad under the
>   pathological cases.  It still seems always slower than LDADD.
> * STADD has latency that blocks other STADDs, but not other CPU-local work.
>   I'd bet there's a bunch of interactions with caches and memory ordering
>   here, but those would all just make STADD look worse so I'm just ignoring
>   them.
> * LDADD is better than STADD even under pathologically highly contended
>   cases.  I was actually kind of surprised about this one, I thought the far
>   atomics would be better there.
> * The prefetches help STADD, but they don't seem to make it better than
>   LDADD in any case.
> * The LDADD latency also happens concurrently with other CPU operations
>   like the STADD latency does.  It has less latency to hide, so the latency
>   starts to go up with less extra work, but it's never worse than STADD.
>
> So I think at least on this system, LDADD is just always better.

Thanks for this, very useful. I guess that's expected in the light of
what I learnt from the other Arm engineers in the past couple of days.

-- 
Catalin

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-06 14:00 ` Catalin Marinas @ 2025-11-06 16:30 ` Palmer Dabbelt 2025-11-06 17:54 ` Catalin Marinas 0 siblings, 1 reply; 46+ messages in thread From: Palmer Dabbelt @ 2025-11-06 16:30 UTC (permalink / raw) To: Catalin Marinas; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote: > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote: >> I ran a bunch of cases with those: > [...] >> Which I'm interpreting to say the following: >> >> * LL/SC is pretty good for the common cases, but gets really bad under the >> pathological cases. It still seems always slower that LDADD. >> * STADD has latency that blocks other STADDs, but not other CPU-local work. >> I'd bet there's a bunch of interactions with caches and memory ordering >> here, but those would all juts make STADD look worse so I'm just ignoring >> them. >> * LDADD is better than STADD even under pathologically highly contended >> cases. I was actually kind of surprised about this one, I thought the far >> atomics would be better there. >> * The prefetches help STADD, but they don't seem to make it better that >> LDADD in any case. >> * The LDADD latency also happens concurrently with other CPU operations >> like the STADD latency does. It has less latency to hide, so the latency >> starts to go up with less extra work, but it's never worse that STADD. >> >> So I think at least on this system, LDADD is just always better. > > Thanks for this, very useful. I guess that's expected in the light of I > learnt from the other Arm engineers in the past couple of days. OK, sorry if I misunderstood you earlier. From reading your posts I thought there would be some mode in which STADD was better -- probably high contention and enough extra work to hide the latency. So I was kind of surprised to find these results. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-06 16:30 ` Palmer Dabbelt @ 2025-11-06 17:54 ` Catalin Marinas 2025-11-06 18:23 ` Palmer Dabbelt 0 siblings, 1 reply; 46+ messages in thread From: Catalin Marinas @ 2025-11-06 17:54 UTC (permalink / raw) To: Palmer Dabbelt; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel On Thu, Nov 06, 2025 at 08:30:05AM -0800, Palmer Dabbelt wrote: > On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote: > > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote: > > > I ran a bunch of cases with those: > > [...] > > > Which I'm interpreting to say the following: > > > > > > * LL/SC is pretty good for the common cases, but gets really bad under the > > > pathological cases. It still seems always slower that LDADD. > > > * STADD has latency that blocks other STADDs, but not other CPU-local work. > > > I'd bet there's a bunch of interactions with caches and memory ordering > > > here, but those would all juts make STADD look worse so I'm just ignoring > > > them. > > > * LDADD is better than STADD even under pathologically highly contended > > > cases. I was actually kind of surprised about this one, I thought the far > > > atomics would be better there. > > > * The prefetches help STADD, but they don't seem to make it better that > > > LDADD in any case. > > > * The LDADD latency also happens concurrently with other CPU operations > > > like the STADD latency does. It has less latency to hide, so the latency > > > starts to go up with less extra work, but it's never worse that STADD. > > > > > > So I think at least on this system, LDADD is just always better. > > > > Thanks for this, very useful. I guess that's expected in the light of I > > learnt from the other Arm engineers in the past couple of days. > > OK, sorry if I misunderstood you earlier. From reading your posts I thought > there would be some mode in which STADD was better -- probably high > contention and enough extra work to hide the latency. So I was kind of > surprised to find these results. I think STADD is better for cases where you update some stat counters but you do a lot of work in between. In your microbenchmark, just lots of STADDs back to back with NOPs in between (rather than lots of other memory transactions) are likely to be slower. If these are real use-cases, at some point the hardware may evolve to behave differently (or more dynamically). BTW, I've been pointed by Ola Liljedahl @ Arm at this collection of routines: https://github.com/ARM-software/progress64/tree/master. Building it with ATOMICS=yes makes the compiler generate LSE atomics for intrinsics like __atomic_fetch_add(). It won't generate STADD because of some aspects of the C consistency models (DMB LD wouldn't guarantee ordering with a prior STADD). -- Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
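For concreteness, the intrinsic Catalin refers to looks like the snippet
below in C. Whether the compiler emits LDADD, STADD, or an LL/SC loop depends
on the target flags (for example -march=armv8-a+lse or -moutline-atomics),
the memory order, and the C consistency-model points mentioned above, so the
codegen comment is a rough guide rather than a guarantee:

/*
 * A stats-counter update with plenty of unrelated work around it, the
 * pattern where a posted ("far") atomic can be cheapest.  Generated code
 * varies by compiler and flags, as noted above.
 */
static unsigned long stats_counter;

void account_event(void)
{
	__atomic_fetch_add(&stats_counter, 1, __ATOMIC_RELAXED);
	/* ... lots of other work between updates ... */
}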
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-06 17:54 ` Catalin Marinas @ 2025-11-06 18:23 ` Palmer Dabbelt 0 siblings, 0 replies; 46+ messages in thread From: Palmer Dabbelt @ 2025-11-06 18:23 UTC (permalink / raw) To: Catalin Marinas; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel On Thu, 06 Nov 2025 09:54:31 PST (-0800), Catalin Marinas wrote: > On Thu, Nov 06, 2025 at 08:30:05AM -0800, Palmer Dabbelt wrote: >> On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote: >> > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote: >> > > I ran a bunch of cases with those: >> > [...] >> > > Which I'm interpreting to say the following: >> > > >> > > * LL/SC is pretty good for the common cases, but gets really bad under the >> > > pathological cases. It still seems always slower that LDADD. >> > > * STADD has latency that blocks other STADDs, but not other CPU-local work. >> > > I'd bet there's a bunch of interactions with caches and memory ordering >> > > here, but those would all juts make STADD look worse so I'm just ignoring >> > > them. >> > > * LDADD is better than STADD even under pathologically highly contended >> > > cases. I was actually kind of surprised about this one, I thought the far >> > > atomics would be better there. >> > > * The prefetches help STADD, but they don't seem to make it better that >> > > LDADD in any case. >> > > * The LDADD latency also happens concurrently with other CPU operations >> > > like the STADD latency does. It has less latency to hide, so the latency >> > > starts to go up with less extra work, but it's never worse that STADD. >> > > >> > > So I think at least on this system, LDADD is just always better. >> > >> > Thanks for this, very useful. I guess that's expected in the light of I >> > learnt from the other Arm engineers in the past couple of days. >> >> OK, sorry if I misunderstood you earlier. From reading your posts I thought >> there would be some mode in which STADD was better -- probably high >> contention and enough extra work to hide the latency. So I was kind of >> surprised to find these results. > > I think STADD is better for cases where you update some stat counters > but you do a lot of work in between. In your microbenchmark, just lots > of STADDs back to back with NOPs in between (rather than lots of other > memory transactions) are likely to be slower. If these are real > use-cases, at some point the hardware may evolve to behave differently > (or more dynamically). OK, that's kind of what I was trying to demonstrate when putting together those new microbenchmark parameters. So I think at least I understood what you were saying, now I just need to figure out what's up... FWIW: there's actually a bunch of memory traffic, the compiler is doing something weird with that NOP loop and generating a bunch of load/stores/branches. I was kind of surprised, but I figured it's actually better that way. Also, I found there's a bug in the microbenchmarks: "tmp" is a global, so the LDADD code generates 00000000004102b0 <__percpu_add_case_64_ldadd>: 4102b0: 90000189 adrp x9, 440000 <memcpy@GLIBC_2.17> 4102b4: f8210008 ldadd x1, x8, [x0] 4102b8: f9005528 str x8, [x9, #168] 4102bc: d65f03c0 ret as opposed to the STADD code, which generates 00000000004102a8 <__percpu_add_case_64_lse>: 4102a8: f821001f stadd x1, [x0] 4102ac: d65f03c0 ret It doesn't seem to change my results any, but figured I'd say something in case anyone else tries to run this stuff (there's a fix up, too). 
> BTW, I've been pointed by Ola Liljedahl @ Arm at this collection of > routines: https://github.com/ARM-software/progress64/tree/master. > Building it with ATOMICS=yes makes the compiler generate LSE atomics for > intrinsics like __atomic_fetch_add(). It won't generate STADD because of > some aspects of the C consistency models (DMB LD wouldn't guarantee > ordering with a prior STADD). Awesome, thanks. I'll go take a look -- I'm trying to figure out enough of what's going on to figure out what we should do here, but that's mostly outside of kernel space now so I think it's just going to be a discussion for somewhere else... ^ permalink raw reply [flat|nested] 46+ messages in thread
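To spell out the "tmp is a global" issue for anyone else running the
benchmark (a hedged sketch, not the actual code or the fix in the repo): when
the LDADD destination operand is bound to a global, the compiler has to write
the returned old value back to memory, which is exactly the extra adrp/str
around the ldadd in the disassembly above. Binding it to a never-used local
instead leaves just the ldadd.

unsigned long tmp;	/* hypothetical global, standing in for the bug above */

/*
 * Buggy shape: LDADD's old-value result lands in the global, so the
 * compiler must emit a store after the atomic on every call.  Declaring
 * a local temporary instead (as in the other ldadd sketches in this
 * thread) makes the result dead and the extra store disappears.
 */
static inline void add_ldadd_global_tmp(unsigned long *p, unsigned long v)
{
	asm volatile("ldadd %[v], %[old], %[p]"
		     : [p] "+Q" (*p), [old] "=&r" (tmp)
		     : [v] "r" (v)
		     : "memory");
}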
* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 18:30 ` Catalin Marinas
  2025-10-31 19:39   ` Paul E. McKenney
@ 2025-11-04 15:59   ` Breno Leitao
  2025-11-04 17:06     ` Catalin Marinas
                       ` (3 more replies)
  1 sibling, 4 replies; 46+ messages in thread
From: Breno Leitao @ 2025-11-04 15:59 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel,
	kernel-team, rmikey

Hello Catalin,

On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
>
> That's quite a difference. Does it get any better if
> CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> on the kernel command line.
>
> Depending on the implementation and configuration, the LSE atomics may
> skip the L1 cache and be executed closer to the memory (they used to be
> called far atomics). The CPUs try to be smarter like doing the operation
> "near" if it's in the cache but the heuristics may not always work.

I am trying to play with LSE latency and compare it with the LL/SC usecase.
I _think_ I have a reproducer in userspace.

I've created a simple userspace program to compare the latency of an atomic
add using LL/SC and LSE, basically comparing the following two functions
while executing without any contention (single thread doing the atomic
operation - no atomic contention):

static inline void __percpu_add_case_64_llsc(void *ptr, unsigned long val)
{
	asm volatile(
	/* LL/SC */
	"1:	ldxr	%[tmp], %[ptr]\n"
	"	add	%[tmp], %[tmp], %[val]\n"
	"	stxr	%w[loop], %[tmp], %[ptr]\n"
	"	cbnz	%w[loop], 1b"
	: [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
	: [val] "r"((u64)(val))
	: "memory");
}

and

/* LSE implementation */
static inline void __percpu_add_case_64_lse(void *ptr, unsigned long val)
{
	asm volatile(
	/* LSE atomics */
	"	stadd	%[val], %[ptr]\n"
	: [ptr] "+Q"(*(u64 *)ptr)
	: [val] "r"((u64)(val))
	: "memory");
}

I found that the LSE case (__percpu_add_case_64_lse) has a huge variation,
while the LL/SC case is stable.

In some cases, the LSE function runs at the same latency as the LL/SC
function (and slightly faster at p50), but then something happens on the
system and LSE operations start to take way longer than LL/SC.

Here is some interesting output showing the latency of the functions above:

CPU: 47 - Latency Percentiles:
====================
  LL/SC: p50: 5.69 ns  p95: 5.71 ns  p99: 5.80 ns
  LSE :  p50: 45.53 ns  p95: 54.06 ns  p99: 55.18 ns

CPU: 48 - Latency Percentiles:
====================
  LL/SC: p50: 5.70 ns  p95: 5.72 ns  p99: 6.10 ns
  LSE :  p50: 4.02 ns  p95: 45.55 ns  p99: 54.93 ns

CPU: 49 - Latency Percentiles:
====================
  LL/SC: p50: 5.74 ns  p95: 5.75 ns  p99: 5.78 ns
  LSE :  p50: 4.04 ns  p95: 50.32 ns  p99: 53.04 ns

At this stage, it is unclear what is causing these variations.

The code above could be run with:

# git clone https://github.com/leitao/debug.git
# cd debug/LSE
# make && ./percpu_bench

^ permalink raw reply	[flat|nested] 46+ messages in thread
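One detail implied by the per-CPU breakdown above: the measuring thread is
pinned to each CPU in turn before the samples are taken. A minimal sketch of
that pinning follows; this is a guess at the harness structure only, the
actual code is in the repo linked above.

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU before measuring on it. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);	/* 0 == this thread */
}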
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 15:59 ` Breno Leitao @ 2025-11-04 17:06 ` Catalin Marinas 2025-11-04 18:08 ` Willy Tarreau ` (2 subsequent siblings) 3 siblings, 0 replies; 46+ messages in thread From: Catalin Marinas @ 2025-11-04 17:06 UTC (permalink / raw) To: Breno Leitao Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel, kernel-team, rmikey Hi Breno, On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote: > On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote: > > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote: > > > To make event tracing safe for PREEMPT_RT kernels, I have been creating > > > optimized variants of SRCU readers that use per-CPU atomics. This works > > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a > > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single > > > per-CPU atomic operation. This contrasts with a handful of nanoseconds > > > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1). > > > > That's quite a difference. Does it get any better if > > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it > > on the kernel command line. > > > > Depending on the implementation and configuration, the LSE atomics may > > skip the L1 cache and be executed closer to the memory (they used to be > > called far atomics). The CPUs try to be smarter like doing the operation > > "near" if it's in the cache but the heuristics may not always work. > > I am trying to play with LSE latency and compare it with LL/SC usecase. I > _think_ I have a reproducer in userspace > > I've create a simple userspace program to compare the latency of a atomic add > using LL/SC and LSE, basically comparing the following two functions while > executing without any contention (single thread doing the atomic operation - > no atomic contention): > > static inline void __percpu_add_case_64_llsc(void *ptr, unsigned long val) > { > asm volatile( > /* LL/SC */ > "1: ldxr %[tmp], %[ptr]\n" > " add %[tmp], %[tmp], %[val]\n" > " stxr %w[loop], %[tmp], %[ptr]\n" > " cbnz %w[loop], 1b" > : [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr) > : [val] "r"((u64)(val)) > : "memory"); > } > > and > > /* LSE implementation */ > static inline void __percpu_add_case_64_lse(void *ptr, unsigned long val) > { > asm volatile( > /* LSE atomics */ > " stadd %[val], %[ptr]\n" > : [ptr] "+Q"(*(u64 *)ptr) > : [val] "r"((u64)(val)) > : "memory"); > } Could you try with an ldadd instead? See my reply to Paul a few minutes ago. Thanks. -- Catalin ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 15:59 ` Breno Leitao 2025-11-04 17:06 ` Catalin Marinas @ 2025-11-04 18:08 ` Willy Tarreau 2025-11-04 18:22 ` Breno Leitao 2025-11-04 20:13 ` Paul E. McKenney 2025-11-04 20:57 ` Puranjay Mohan 2025-11-27 12:29 ` Wentao Guan 3 siblings, 2 replies; 46+ messages in thread From: Willy Tarreau @ 2025-11-04 18:08 UTC (permalink / raw) To: Breno Leitao Cc: Catalin Marinas, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel, kernel-team, rmikey Hello Breno, On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote: > I found that the LSE case (__percpu_add_case_64_lse) has a huge variation, > while LL/SC case is stable. > In some case, LSE function runs at the same latency as LL/SC function and > slightly faster on p50, but, something happen to the system and LSE operations > start to take way longer than LL/SC. > > Here are some interesting output coming from the latency of the functions above> > > CPU: 47 - Latency Percentiles: > ==================== > LL/SC: p50: 5.69 ns p95: 5.71 ns p99: 5.80 ns > LSE : p50: 45.53 ns p95: 54.06 ns p99: 55.18 ns (...) Very interesting. I've run them here on a 80-core Ampere Altra made of Neoverse-N1 (armv8.2) and am getting very consistently better timings with LSE than LL/SC: CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 7.32 ns p95: 7.32 ns p99: 7.33 ns LSE : p50: 5.01 ns p95: 5.01 ns p99: 5.03 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 7.32 ns p95: 7.32 ns p99: 7.33 ns LSE : p50: 5.01 ns p95: 5.01 ns p99: 5.03 ns CPU: 2 - Latency Percentiles: ==================== LL/SC: p50: 7.32 ns p95: 7.32 ns p99: 7.33 ns LSE : p50: 5.01 ns p95: 5.01 ns p99: 5.02 ns (...) They're *all* like this, between 7.32 and 7.36 for LL/SC p99, and 5.01 to 5.03 for LSE p99. However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it's what you've observed, i.e. a lot of variations that do not even depend on big vs little cores: CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 6.56 ns p95: 7.13 ns p99: 8.81 ns LSE : p50: 45.79 ns p95: 45.80 ns p99: 45.86 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 6.38 ns p95: 6.39 ns p99: 6.39 ns LSE : p50: 67.72 ns p95: 67.78 ns p99: 67.80 ns CPU: 2 - Latency Percentiles: ==================== LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 5.60 ns LSE : p50: 59.19 ns p95: 59.23 ns p99: 59.25 ns (...) 
I tried the same on a RK3588 which has 4 Cortex A55 and 4 Cortex A76 (the latter being very close to Neoverse-N1), and the A76 (the 4 latest ones) show the same pattern as the Altra above and are consistently much better than the LL/SC one: CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 9.39 ns p95: 9.40 ns p99: 9.41 ns LSE : p50: 4.43 ns p95: 28.60 ns p99: 30.29 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 9.39 ns p95: 9.40 ns p99: 9.59 ns LSE : p50: 4.42 ns p95: 27.51 ns p99: 29.46 ns CPU: 2 - Latency Percentiles: ==================== LL/SC: p50: 9.40 ns p95: 9.40 ns p99: 9.40 ns LSE : p50: 4.42 ns p95: 27.00 ns p99: 29.60 ns CPU: 3 - Latency Percentiles: ==================== LL/SC: p50: 9.39 ns p95: 9.40 ns p99: 10.43 ns LSE : p50: 8.02 ns p95: 29.72 ns p99: 31.05 ns CPU: 4 - Latency Percentiles: ==================== LL/SC: p50: 8.85 ns p95: 8.86 ns p99: 8.86 ns LSE : p50: 5.75 ns p95: 5.75 ns p99: 5.75 ns CPU: 5 - Latency Percentiles: ==================== LL/SC: p50: 8.85 ns p95: 8.85 ns p99: 9.28 ns LSE : p50: 5.75 ns p95: 5.75 ns p99: 8.29 ns CPU: 6 - Latency Percentiles: ==================== LL/SC: p50: 8.79 ns p95: 8.80 ns p99: 8.80 ns LSE : p50: 5.71 ns p95: 5.71 ns p99: 5.71 ns CPU: 7 - Latency Percentiles: ==================== LL/SC: p50: 8.80 ns p95: 8.80 ns p99: 9.30 ns LSE : p50: 5.71 ns p95: 5.72 ns p99: 5.72 ns Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something between the two (and the governor is in performance mode): ./percpu_bench ARM64 Per-CPU Atomic Add Benchmark =================================== Running percentile measurements (100 iterations)... Detected 8 CPUs CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 8.23 ns p95: 8.24 ns p99: 8.28 ns LSE : p50: 4.63 ns p95: 4.64 ns p99: 19.48 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 8.23 ns p95: 8.24 ns p99: 8.26 ns LSE : p50: 4.63 ns p95: 4.64 ns p99: 16.30 ns CPU: 2 - Latency Percentiles: ==================== LL/SC: p50: 8.23 ns p95: 8.25 ns p99: 8.25 ns LSE : p50: 4.63 ns p95: 4.64 ns p99: 4.65 ns CPU: 3 - Latency Percentiles: ==================== LL/SC: p50: 8.23 ns p95: 8.25 ns p99: 8.36 ns LSE : p50: 4.63 ns p95: 19.01 ns p99: 32.15 ns CPU: 4 - Latency Percentiles: ==================== LL/SC: p50: 6.27 ns p95: 6.28 ns p99: 6.29 ns LSE : p50: 5.44 ns p95: 5.44 ns p99: 5.44 ns CPU: 5 - Latency Percentiles: ==================== LL/SC: p50: 6.27 ns p95: 6.28 ns p99: 6.29 ns LSE : p50: 5.44 ns p95: 5.44 ns p99: 5.44 ns CPU: 6 - Latency Percentiles: ==================== LL/SC: p50: 6.27 ns p95: 6.28 ns p99: 6.28 ns LSE : p50: 5.44 ns p95: 5.44 ns p99: 5.45 ns CPU: 7 - Latency Percentiles: ==================== LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 5.58 ns LSE : p50: 4.82 ns p95: 4.82 ns p99: 4.83 ns So it seems at first glance that LL/SC is generally slower but can be more consistent on modern machines, that LSE is stable on older machines and can be stable sometimes even on some modern machines. 
@Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in the Xt register (to be honest I've never understood Arm's docs regarding instructions, even the pseudo language is super cryptic to me), and I came up with this: asm volatile( /* LSE atomics */ " ldadd %[val], %[out], %[ptr]\n" : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val) : [val] "r"((u64)(val)) : "memory"); which assembles like this: ab8: f8200040 ldadd x0, x0, [x2] It now gives me much better LSE performance on the ARMv9: CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 6.56 ns p95: 7.32 ns p99: 8.72 ns LSE : p50: 2.76 ns p95: 2.76 ns p99: 2.77 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 6.38 ns p95: 6.39 ns p99: 6.39 ns LSE : p50: 5.09 ns p95: 5.11 ns p99: 5.11 ns CPU: 2 - Latency Percentiles: ==================== LL/SC: p50: 5.56 ns p95: 5.58 ns p99: 9.07 ns LSE : p50: 4.45 ns p95: 4.46 ns p99: 4.46 ns CPU: 3 - Latency Percentiles: ==================== LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 7.42 ns LSE : p50: 4.45 ns p95: 4.46 ns p99: 4.46 ns CPU: 4 - Latency Percentiles: ==================== LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 5.60 ns LSE : p50: 4.45 ns p95: 4.46 ns p99: 4.47 ns CPU: 5 - Latency Percentiles: ==================== LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.40 ns LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns CPU: 6 - Latency Percentiles: ==================== LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.42 ns LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns CPU: 7 - Latency Percentiles: ==================== LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.40 ns LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns CPU: 8 - Latency Percentiles: ==================== LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.40 ns LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns CPU: 9 - Latency Percentiles: ==================== LL/SC: p50: 7.05 ns p95: 7.06 ns p99: 7.07 ns LSE : p50: 2.96 ns p95: 2.97 ns p99: 2.97 ns CPU: 10 - Latency Percentiles: ==================== LL/SC: p50: 7.05 ns p95: 7.05 ns p99: 7.06 ns LSE : p50: 2.96 ns p95: 2.96 ns p99: 2.97 ns CPU: 11 - Latency Percentiles: ==================== LL/SC: p50: 6.56 ns p95: 6.56 ns p99: 6.57 ns LSE : p50: 2.76 ns p95: 2.76 ns p99: 2.76 ns (cores 0,5-11 are A720, cores 1-4 are A520). I'd just like a confirmation that my change is correct and that I'm not just doing something ignored that tries to add zero :-/ If that's OK, then it's indeed way better! Willy PS: thanks Breno for sharing your test code, that's super useful! ^ permalink raw reply [flat|nested] 46+ messages in thread
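On the instruction form being asked about here: architecturally, LDADD Xs,
Xt, [Xn] adds Xs to the doubleword at [Xn] and writes the old memory value to
Xt, and STADD Xs, [Xn] is defined as the alias LDADD Xs, XZR, [Xn]. So the
snippet above is not adding zero; it adds val and simply overwrites val's
register with the old counter value. Below is a sketch with a separate
destination register for anyone who finds the same-register form confusing
(a hedged example, not a review of the patch in the repo):

/*
 * Same operation, but keeping the addend and the returned old value in
 * distinct registers so the two roles are explicit.
 */
static inline unsigned long fetch_add_ldadd(unsigned long *p, unsigned long v)
{
	unsigned long old;

	asm volatile("ldadd %[v], %[old], %[p]"
		     : [p] "+Q" (*p), [old] "=&r" (old)
		     : [v] "r" (v)
		     : "memory");
	return old;	/* value of *p before the add */
}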
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 18:08 ` Willy Tarreau @ 2025-11-04 18:22 ` Breno Leitao 2025-11-04 20:13 ` Paul E. McKenney 1 sibling, 0 replies; 46+ messages in thread From: Breno Leitao @ 2025-11-04 18:22 UTC (permalink / raw) To: Willy Tarreau Cc: Catalin Marinas, Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel, kernel-team, rmikey On Tue, Nov 04, 2025 at 07:08:19PM +0100, Willy Tarreau wrote: > Hello Breno, > > On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote: > > I found that the LSE case (__percpu_add_case_64_lse) has a huge variation, > > while LL/SC case is stable. > > In some case, LSE function runs at the same latency as LL/SC function and > > slightly faster on p50, but, something happen to the system and LSE operations > > start to take way longer than LL/SC. > > > > Here are some interesting output coming from the latency of the functions above> > > > > CPU: 47 - Latency Percentiles: > > ==================== > > LL/SC: p50: 5.69 ns p95: 5.71 ns p99: 5.80 ns > > LSE : p50: 45.53 ns p95: 54.06 ns p99: 55.18 ns > (...) > > Very interesting. I've run them here on a 80-core Ampere Altra made > of Neoverse-N1 (armv8.2) and am getting very consistently better timings > with LSE than LL/SC: <snip> > It now gives me much better LSE performance on the ARMv9: I also see a stable latency for ldadd in my test case, also, better than LL/SC. CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 5.74 ns p95: 5.81 ns p99: 7.13 ns LSE : p50: 4.34 ns p95: 4.36 ns p99: 4.40 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 5.74 ns p95: 5.77 ns p99: 5.82 ns LSE : p50: 4.35 ns p95: 4.37 ns p99: 4.42 ns CPU: 2 - Latency Percentiles: ==================== LL/SC: p50: 5.74 ns p95: 5.81 ns p99: 6.76 ns LSE : p50: 4.35 ns p95: 4.80 ns p99: 5.55 ns ... CPU: 71 - Latency Percentiles: ==================== LL/SC: p50: 5.72 ns p95: 5.75 ns p99: 5.91 ns LSE : p50: 4.33 ns p95: 4.35 ns p99: 4.38 ns > PS: thanks Breno for sharing your test code, that's super useful! Glad you liked it. I tried to narrow down the problem as much as I could, so, I could could follow up the discussion. :-) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 18:08 ` Willy Tarreau 2025-11-04 18:22 ` Breno Leitao @ 2025-11-04 20:13 ` Paul E. McKenney 2025-11-04 20:35 ` Willy Tarreau 1 sibling, 1 reply; 46+ messages in thread From: Paul E. McKenney @ 2025-11-04 20:13 UTC (permalink / raw) To: Willy Tarreau Cc: Breno Leitao, Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel, kernel-team, rmikey On Tue, Nov 04, 2025 at 07:08:19PM +0100, Willy Tarreau wrote: > Hello Breno, > > On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote: > > I found that the LSE case (__percpu_add_case_64_lse) has a huge variation, > > while LL/SC case is stable. > > In some case, LSE function runs at the same latency as LL/SC function and > > slightly faster on p50, but, something happen to the system and LSE operations > > start to take way longer than LL/SC. > > > > Here are some interesting output coming from the latency of the functions above> > > > > CPU: 47 - Latency Percentiles: > > ==================== > > LL/SC: p50: 5.69 ns p95: 5.71 ns p99: 5.80 ns > > LSE : p50: 45.53 ns p95: 54.06 ns p99: 55.18 ns > (...) Thank you very much for the detailed testing on a variety of hardware platforms!!! > Very interesting. I've run them here on a 80-core Ampere Altra made > of Neoverse-N1 (armv8.2) and am getting very consistently better timings > with LSE than LL/SC: > > CPU: 0 - Latency Percentiles: > ==================== > LL/SC: p50: 7.32 ns p95: 7.32 ns p99: 7.33 ns > LSE : p50: 5.01 ns p95: 5.01 ns p99: 5.03 ns > > CPU: 1 - Latency Percentiles: > ==================== > LL/SC: p50: 7.32 ns p95: 7.32 ns p99: 7.33 ns > LSE : p50: 5.01 ns p95: 5.01 ns p99: 5.03 ns > > CPU: 2 - Latency Percentiles: > ==================== > LL/SC: p50: 7.32 ns p95: 7.32 ns p99: 7.33 ns > LSE : p50: 5.01 ns p95: 5.01 ns p99: 5.02 ns > (...) > > They're *all* like this, between 7.32 and 7.36 for LL/SC p99, > and 5.01 to 5.03 for LSE p99. > > However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it's what you've > observed, i.e. a lot of variations that do not even depend on big > vs little cores: > > CPU: 0 - Latency Percentiles: > ==================== > LL/SC: p50: 6.56 ns p95: 7.13 ns p99: 8.81 ns > LSE : p50: 45.79 ns p95: 45.80 ns p99: 45.86 ns > > CPU: 1 - Latency Percentiles: > ==================== > LL/SC: p50: 6.38 ns p95: 6.39 ns p99: 6.39 ns > LSE : p50: 67.72 ns p95: 67.78 ns p99: 67.80 ns > > CPU: 2 - Latency Percentiles: > ==================== > LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 5.60 ns > LSE : p50: 59.19 ns p95: 59.23 ns p99: 59.25 ns > (...) 
> > I tried the same on a RK3588 which has 4 Cortex A55 and 4 Cortex A76 > (the latter being very close to Neoverse-N1), and the A76 (the 4 latest > ones) show the same pattern as the Altra above and are consistently much > better than the LL/SC one: > > CPU: 0 - Latency Percentiles: > ==================== > LL/SC: p50: 9.39 ns p95: 9.40 ns p99: 9.41 ns > LSE : p50: 4.43 ns p95: 28.60 ns p99: 30.29 ns > > CPU: 1 - Latency Percentiles: > ==================== > LL/SC: p50: 9.39 ns p95: 9.40 ns p99: 9.59 ns > LSE : p50: 4.42 ns p95: 27.51 ns p99: 29.46 ns > > CPU: 2 - Latency Percentiles: > ==================== > LL/SC: p50: 9.40 ns p95: 9.40 ns p99: 9.40 ns > LSE : p50: 4.42 ns p95: 27.00 ns p99: 29.60 ns > > CPU: 3 - Latency Percentiles: > ==================== > LL/SC: p50: 9.39 ns p95: 9.40 ns p99: 10.43 ns > LSE : p50: 8.02 ns p95: 29.72 ns p99: 31.05 ns > > CPU: 4 - Latency Percentiles: > ==================== > LL/SC: p50: 8.85 ns p95: 8.86 ns p99: 8.86 ns > LSE : p50: 5.75 ns p95: 5.75 ns p99: 5.75 ns > > CPU: 5 - Latency Percentiles: > ==================== > LL/SC: p50: 8.85 ns p95: 8.85 ns p99: 9.28 ns > LSE : p50: 5.75 ns p95: 5.75 ns p99: 8.29 ns > > CPU: 6 - Latency Percentiles: > ==================== > LL/SC: p50: 8.79 ns p95: 8.80 ns p99: 8.80 ns > LSE : p50: 5.71 ns p95: 5.71 ns p99: 5.71 ns > > CPU: 7 - Latency Percentiles: > ==================== > LL/SC: p50: 8.80 ns p95: 8.80 ns p99: 9.30 ns > LSE : p50: 5.71 ns p95: 5.72 ns p99: 5.72 ns > > Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something > between the two (and the governor is in performance mode): > > ./percpu_bench > ARM64 Per-CPU Atomic Add Benchmark > =================================== > Running percentile measurements (100 iterations)... > Detected 8 CPUs > > CPU: 0 - Latency Percentiles: > ==================== > LL/SC: p50: 8.23 ns p95: 8.24 ns p99: 8.28 ns > LSE : p50: 4.63 ns p95: 4.64 ns p99: 19.48 ns > > CPU: 1 - Latency Percentiles: > ==================== > LL/SC: p50: 8.23 ns p95: 8.24 ns p99: 8.26 ns > LSE : p50: 4.63 ns p95: 4.64 ns p99: 16.30 ns > > CPU: 2 - Latency Percentiles: > ==================== > LL/SC: p50: 8.23 ns p95: 8.25 ns p99: 8.25 ns > LSE : p50: 4.63 ns p95: 4.64 ns p99: 4.65 ns > > CPU: 3 - Latency Percentiles: > ==================== > LL/SC: p50: 8.23 ns p95: 8.25 ns p99: 8.36 ns > LSE : p50: 4.63 ns p95: 19.01 ns p99: 32.15 ns > > CPU: 4 - Latency Percentiles: > ==================== > LL/SC: p50: 6.27 ns p95: 6.28 ns p99: 6.29 ns > LSE : p50: 5.44 ns p95: 5.44 ns p99: 5.44 ns > > CPU: 5 - Latency Percentiles: > ==================== > LL/SC: p50: 6.27 ns p95: 6.28 ns p99: 6.29 ns > LSE : p50: 5.44 ns p95: 5.44 ns p99: 5.44 ns > > CPU: 6 - Latency Percentiles: > ==================== > LL/SC: p50: 6.27 ns p95: 6.28 ns p99: 6.28 ns > LSE : p50: 5.44 ns p95: 5.44 ns p99: 5.45 ns > > CPU: 7 - Latency Percentiles: > ==================== > LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 5.58 ns > LSE : p50: 4.82 ns p95: 4.82 ns p99: 4.83 ns > > So it seems at first glance that LL/SC is generally slower but can be > more consistent on modern machines, that LSE is stable on older machines > and can be stable sometimes even on some modern machines. I guess that I am glad that I am not alone? ;-) I am guessing that there is no reasonable way to check for whether a given system has slow LSE, as would be needed to use ALTERNATIVE(), but please let me know if I am mistaken. 
Thanx, Paul > @Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in > the Xt register (to be honest I've never understood Arm's docs regarding > instructions, even the pseudo language is super cryptic to me), and I came > up with this: > > asm volatile( > /* LSE atomics */ > " ldadd %[val], %[out], %[ptr]\n" > : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val) > : [val] "r"((u64)(val)) > : "memory"); > > which assembles like this: > > ab8: f8200040 ldadd x0, x0, [x2] > > It now gives me much better LSE performance on the ARMv9: > > CPU: 0 - Latency Percentiles: > ==================== > LL/SC: p50: 6.56 ns p95: 7.32 ns p99: 8.72 ns > LSE : p50: 2.76 ns p95: 2.76 ns p99: 2.77 ns > > CPU: 1 - Latency Percentiles: > ==================== > LL/SC: p50: 6.38 ns p95: 6.39 ns p99: 6.39 ns > LSE : p50: 5.09 ns p95: 5.11 ns p99: 5.11 ns > > CPU: 2 - Latency Percentiles: > ==================== > LL/SC: p50: 5.56 ns p95: 5.58 ns p99: 9.07 ns > LSE : p50: 4.45 ns p95: 4.46 ns p99: 4.46 ns > > CPU: 3 - Latency Percentiles: > ==================== > LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 7.42 ns > LSE : p50: 4.45 ns p95: 4.46 ns p99: 4.46 ns > > CPU: 4 - Latency Percentiles: > ==================== > LL/SC: p50: 5.56 ns p95: 5.57 ns p99: 5.60 ns > LSE : p50: 4.45 ns p95: 4.46 ns p99: 4.47 ns > > CPU: 5 - Latency Percentiles: > ==================== > LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.40 ns > LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns > > CPU: 6 - Latency Percentiles: > ==================== > LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.42 ns > LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns > > CPU: 7 - Latency Percentiles: > ==================== > LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.40 ns > LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns > > CPU: 8 - Latency Percentiles: > ==================== > LL/SC: p50: 7.40 ns p95: 7.40 ns p99: 7.40 ns > LSE : p50: 3.08 ns p95: 3.08 ns p99: 3.08 ns > > CPU: 9 - Latency Percentiles: > ==================== > LL/SC: p50: 7.05 ns p95: 7.06 ns p99: 7.07 ns > LSE : p50: 2.96 ns p95: 2.97 ns p99: 2.97 ns > > CPU: 10 - Latency Percentiles: > ==================== > LL/SC: p50: 7.05 ns p95: 7.05 ns p99: 7.06 ns > LSE : p50: 2.96 ns p95: 2.96 ns p99: 2.97 ns > > CPU: 11 - Latency Percentiles: > ==================== > LL/SC: p50: 6.56 ns p95: 6.56 ns p99: 6.57 ns > LSE : p50: 2.76 ns p95: 2.76 ns p99: 2.76 ns > > (cores 0,5-11 are A720, cores 1-4 are A520). I'd just like a > confirmation that my change is correct and that I'm not just doing > something ignored that tries to add zero :-/ > > If that's OK, then it's indeed way better! > > Willy > > PS: thanks Breno for sharing your test code, that's super useful! ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 20:13 ` Paul E. McKenney @ 2025-11-04 20:35 ` Willy Tarreau 2025-11-04 21:25 ` Paul E. McKenney 0 siblings, 1 reply; 46+ messages in thread From: Willy Tarreau @ 2025-11-04 20:35 UTC (permalink / raw) To: Paul E. McKenney Cc: Breno Leitao, Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel, kernel-team, rmikey On Tue, Nov 04, 2025 at 12:13:53PM -0800, Paul E. McKenney wrote: > > So it seems at first glance that LL/SC is generally slower but can be > > more consistent on modern machines, that LSE is stable on older machines > > and can be stable sometimes even on some modern machines. > > I guess that I am glad that I am not alone? ;-) > > I am guessing that there is no reasonable way to check for whether a > given system has slow LSE, as would be needed to use ALTERNATIVE(), > but please let me know if I am mistaken. I don't know either, and we've only tested additions (for which ldadd seems to do a better job than stadd for local values). I have no idea what happens with a CAS for example, that could be useful to set a max value for a metric and which can be quite inefficient using LL/SC, especially if the absolute value is stored in the same cache line as the max since every thread touching it would probably invalidate the update attempt. With a SWP instruction I don't see how it would be handled directly in SLC, since we need to know the previous value, hence load it into L1 (and hope nobody changes it between the load and the write attempt). But overall there seems to be a lot of unexplored possibilities here which I find quite interesting! Willy ^ permalink raw reply [flat|nested] 46+ messages in thread
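As a concrete version of the "max metric" case mentioned above, here is the
generic CAS-loop shape written with the GCC/Clang builtins. This is only the
standard pattern, with no claim about how it performs on any of the machines
discussed; with LSE enabled it typically compiles to a CAS instruction,
otherwise to an LDXR/STXR loop.

#include <stdint.h>

/* Atomically raise *max to at least val. */
static inline void atomic_max(uint64_t *max, uint64_t val)
{
	uint64_t cur = __atomic_load_n(max, __ATOMIC_RELAXED);

	while (cur < val &&
	       !__atomic_compare_exchange_n(max, &cur, val, 1 /* weak */,
					    __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
		/* cur now holds the value seen by the failed CAS; retry only
		 * while it is still smaller than val. */
	}
}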
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 20:35 ` Willy Tarreau @ 2025-11-04 21:25 ` Paul E. McKenney 0 siblings, 0 replies; 46+ messages in thread From: Paul E. McKenney @ 2025-11-04 21:25 UTC (permalink / raw) To: Willy Tarreau Cc: Breno Leitao, Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel, kernel-team, rmikey On Tue, Nov 04, 2025 at 09:35:48PM +0100, Willy Tarreau wrote: > On Tue, Nov 04, 2025 at 12:13:53PM -0800, Paul E. McKenney wrote: > > > So it seems at first glance that LL/SC is generally slower but can be > > > more consistent on modern machines, that LSE is stable on older machines > > > and can be stable sometimes even on some modern machines. > > > > I guess that I am glad that I am not alone? ;-) > > > > I am guessing that there is no reasonable way to check for whether a > > given system has slow LSE, as would be needed to use ALTERNATIVE(), > > but please let me know if I am mistaken. > > I don't know either, and we've only tested additions (for which ldadd > seems to do a better job than stadd for local values). I have no idea > what happens with a CAS for example, that could be useful to set a max > value for a metric and which can be quite inefficient using LL/SC, > especially if the absolute value is stored in the same cache line as > the max since every thread touching it would probably invalidate the > update attempt. With a SWP instruction I don't see how it would be > handled directly in SLC, since we need to know the previous value, > hence load it into L1 (and hope nobody changes it between the load > and the write attempt). But overall there seems to be a lot of > unexplored possibilities here which I find quite interesting! I must admit that this is a fun one. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 15:59 ` Breno Leitao 2025-11-04 17:06 ` Catalin Marinas 2025-11-04 18:08 ` Willy Tarreau @ 2025-11-04 20:57 ` Puranjay Mohan 2025-11-27 12:29 ` Wentao Guan 3 siblings, 0 replies; 46+ messages in thread From: Puranjay Mohan @ 2025-11-04 20:57 UTC (permalink / raw) To: leitao Cc: catalin.marinas, kernel-team, linux-arm-kernel, mark.rutland, paulmck, rmikey, will, Puranjay Mohan Hi Breno, I tried your benchmark on AWS graviton platforms: On EC2 c8g.metal-24xl (96 cpus Neoverse-V2) (AWS Graviton 4): With ldadd, it was stable and LSE is always better than LL/SC But with stadd, I saw some spikes in p95 and p99: CPU: 28 - Latency Percentiles: ==================== LL/SC: p50: 6.61 ns p95: 6.61 ns p99: 6.62 ns LSE : p50: 4.64 ns p95: 4.65 ns p99: 4.65 ns CPU: 30 - Latency Percentiles: ==================== LL/SC: p50: 6.61 ns p95: 6.61 ns p99: 6.62 ns LSE : p50: 4.64 ns p95: 14.24 ns ***p99: 27.74 ns*** On EC2 m6g.metal (64 cpus Neoverse-N1) (AWS Graviton 2): Here both stadd and ldadd were stable and LSE was always better than LL/SC with ldadd: ARM64 Per-CPU Atomic Add Benchmark =================================== Running percentile measurements (100 iterations)... Detected 64 CPUs CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 8.40 ns p95: 8.40 ns p99: 8.42 ns LSE : p50: 5.60 ns p95: 5.60 ns p99: 5.61 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 8.40 ns p95: 8.40 ns p99: 8.41 ns LSE : p50: 5.60 ns p95: 5.60 ns p99: 5.61 ns [....] CPU: 62 - Latency Percentiles: ==================== LL/SC: p50: 8.40 ns p95: 8.40 ns p99: 8.40 ns LSE : p50: 5.60 ns p95: 5.60 ns p99: 5.60 ns CPU: 63 - Latency Percentiles: ==================== LL/SC: p50: 8.40 ns p95: 8.40 ns p99: 8.41 ns LSE : p50: 5.60 ns p95: 5.60 ns p99: 5.60 ns === Benchmark Complete === With stadd: ARM64 Per-CPU Atomic Add Benchmark =================================== Running percentile measurements (100 iterations)... Detected 64 CPUs CPU: 0 - Latency Percentiles: ==================== LL/SC: p50: 8.00 ns p95: 8.01 ns p99: 8.02 ns LSE : p50: 5.20 ns p95: 5.21 ns p99: 5.21 ns CPU: 1 - Latency Percentiles: ==================== LL/SC: p50: 8.00 ns p95: 8.01 ns p99: 8.01 ns LSE : p50: 5.20 ns p95: 5.21 ns p99: 5.22 ns [.....] CPU: 62 - Latency Percentiles: ==================== LL/SC: p50: 8.00 ns p95: 8.01 ns p99: 8.14 ns LSE : p50: 5.20 ns p95: 5.21 ns p99: 5.21 ns CPU: 63 - Latency Percentiles: ==================== LL/SC: p50: 8.00 ns p95: 8.01 ns p99: 8.01 ns LSE : p50: 5.20 ns p95: 5.20 ns p99: 5.20 ns === Benchmark Complete === ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: Overhead of arm64 LSE per-CPU atomics? 2025-11-04 15:59 ` Breno Leitao ` (2 preceding siblings ...) 2025-11-04 20:57 ` Puranjay Mohan @ 2025-11-27 12:29 ` Wentao Guan 3 siblings, 0 replies; 46+ messages in thread From: Wentao Guan @ 2025-11-27 12:29 UTC (permalink / raw) To: leitao Cc: catalin.marinas, kernel-team, linux-arm-kernel, mark.rutland, paulmck, rmikey, will Hello All, Here is my result with HUAWEI HUAWEIPGU-WBY0, which has 24c TSV110 core.(kunpeng920), little strange --- stadd sometimes very fast, ldadd always slow, llsc always stable... I thought change detect cap ARM64_HAS_LSE_ATOMICS in arm64_features is little dirty, any good idea? Best Regrads Wentao Guan ARM64 Per-CPU Atomic Add Benchmark =================================== Running percentile measurements (100 iterations)... Detected 24 CPUs CPU: 0 - Latency Percentiles: ==================== LSE (stadd) (c 0, d 0): p50: 029.00 ns p95: 029.09 ns p99: 029.12 ns LSE (stadd) (c 0, d 100): p50: 048.31 ns p95: 048.74 ns p99: 048.95 ns LSE (stadd) (c 0, d 200): p50: 086.32 ns p95: 086.60 ns p99: 086.75 ns sched_setaffinity: Invalid argument LSE (stadd) (c 10, d 0): p50: 058.10 ns p95: 058.32 ns p99: 058.49 ns sched_setaffinity: Invalid argument LSE (stadd) (c 10, d 300): p50: 248.03 ns p95: 248.31 ns p99: 248.55 ns sched_setaffinity: Invalid argument LSE (stadd) (c 10, d 500): p50: 402.60 ns p95: 403.10 ns p99: 403.24 ns sched_setaffinity: Invalid argument LSE (stadd) (c 30, d 0): p50: 058.02 ns p95: 058.31 ns p99: 058.33 ns LL/SC (c 0, d 0): p50: 006.57 ns p95: 006.57 ns p99: 006.57 ns LL/SC (c 0, d 10): p50: 008.50 ns p95: 008.50 ns p99: 008.51 ns LL/SC (c 0, d 20): p50: 011.67 ns p95: 011.67 ns p99: 011.68 ns sched_setaffinity: Invalid argument LL/SC (c 10, d 0): p50: 013.16 ns p95: 013.46 ns p99: 013.47 ns sched_setaffinity: Invalid argument LL/SC (c 10, d 10): p50: 017.00 ns p95: 017.20 ns p99: 017.29 ns sched_setaffinity: Invalid argument LL/SC (c 10, d 20): p50: 023.37 ns p95: 023.57 ns p99: 023.67 ns sched_setaffinity: Invalid argument LL/SC (c 1000, d 0): p50: 013.16 ns p95: 013.37 ns p99: 013.47 ns sched_setaffinity: Invalid argument LL/SC (c 1000, d 10): p50: 017.00 ns p95: 017.21 ns p99: 017.40 ns sched_setaffinity: Invalid argument LL/SC (c 1000000, d 0): p50: 013.17 ns p95: 013.37 ns p99: 013.37 ns sched_setaffinity: Invalid argument LL/SC (c 1000000, d 10): p50: 017.00 ns p95: 017.20 ns p99: 017.30 ns LDADD (c 0, d 0): p50: 069.55 ns p95: 069.57 ns p99: 069.71 ns LDADD (c 0, d 100): p50: 107.71 ns p95: 108.11 ns p99: 108.20 ns LDADD (c 0, d 200): p50: 152.85 ns p95: 152.91 ns p99: 152.93 ns LDADD (c 0, d 300): p50: 193.50 ns p95: 193.54 ns p99: 193.62 ns sched_setaffinity: Invalid argument LDADD (c 1, d 10): p50: 139.04 ns p95: 139.34 ns p99: 139.43 ns sched_setaffinity: Invalid argument LDADD (c 10, d 0): p50: 139.04 ns p95: 139.44 ns p99: 139.68 ns sched_setaffinity: Invalid argument LDADD (c 10, d 10): p50: 139.04 ns p95: 139.33 ns p99: 139.34 ns sched_setaffinity: Invalid argument LDADD (c 100, d 0): p50: 139.04 ns p95: 139.40 ns p99: 139.43 ns PFRM_KEEP+STADD (c 0, d 0): p50: 005.79 ns p95: 005.80 ns p99: 006.06 ns sched_setaffinity: Invalid argument PFRM_KEEP+STADD (c 10, d 0): p50: 011.59 ns p95: 011.89 ns p99: 011.99 ns sched_setaffinity: Invalid argument PFRM_KEEP+STADD (c 1000, d 0): p50: 011.59 ns p95: 011.79 ns p99: 011.89 ns sched_setaffinity: Invalid argument PFRM_KEEP+STADD (c 1000000, d 0): p50: 011.59 ns p95: 011.79 ns p99: 011.80 ns PFRM_STRM+STADD (c 0, d 0): p50: 005.79 ns p95: 005.80 ns 
p99: 007.82 ns sched_setaffinity: Invalid argument PFRM_STRM+STADD (c 10, d 0): p50: 011.59 ns p95: 011.80 ns p99: 011.89 ns sched_setaffinity: Invalid argument PFRM_STRM+STADD (c 1000, d 0): p50: 011.59 ns p95: 011.79 ns p99: 011.89 ns sched_setaffinity: Invalid argument PFRM_STRM+STADD (c 1000000, d 0): p50: 011.59 ns p95: 011.80 ns p99: 013.47 ns CPU: 1 - Latency Percentiles: ==================== LSE (stadd) (c 0, d 0): p50: 005.79 ns p95: 029.00 ns p99: 029.01 ns LSE (stadd) (c 0, d 100): p50: 048.53 ns p95: 048.96 ns p99: 049.05 ns LSE (stadd) (c 0, d 200): p50: 086.26 ns p95: 087.25 ns p99: 087.37 ns LSE (stadd) (c 10, d 0): p50: 038.29 ns p95: 038.31 ns p99: 038.52 ns LSE (stadd) (c 10, d 300): p50: 123.81 ns p95: 123.92 ns p99: 124.56 ns LSE (stadd) (c 10, d 500): p50: 201.16 ns p95: 201.20 ns p99: 201.22 ns LSE (stadd) (c 30, d 0): p50: 038.30 ns p95: 038.31 ns p99: 038.32 ns LL/SC (c 0, d 0): p50: 006.56 ns p95: 006.58 ns p99: 006.58 ns LL/SC (c 0, d 10): p50: 008.50 ns p95: 008.51 ns p99: 008.51 ns LL/SC (c 0, d 20): p50: 011.67 ns p95: 011.68 ns p99: 011.68 ns LL/SC (c 10, d 0): p50: 012.97 ns p95: 013.04 ns p99: 013.06 ns LL/SC (c 10, d 10): p50: 020.93 ns p95: 021.06 ns p99: 021.16 ns LL/SC (c 10, d 20): p50: 051.95 ns p95: 064.63 ns p99: 076.34 ns LL/SC (c 1000, d 0): p50: 012.81 ns p95: 012.83 ns p99: 012.84 ns LL/SC (c 1000, d 10): p50: 020.72 ns p95: 020.73 ns p99: 020.74 ns LL/SC (c 1000000, d 0): p50: 008.65 ns p95: 009.03 ns p99: 009.11 ns LL/SC (c 1000000, d 10): p50: 012.03 ns p95: 012.74 ns p99: 013.09 ns LDADD (c 0, d 0): p50: 010.04 ns p95: 010.06 ns p99: 010.06 ns LDADD (c 0, d 100): p50: 049.48 ns p95: 107.89 ns p99: 108.48 ns LDADD (c 0, d 200): p50: 152.75 ns p95: 152.89 ns p99: 152.90 ns LDADD (c 0, d 300): p50: 193.52 ns p95: 193.54 ns p99: 193.58 ns LDADD (c 1, d 10): p50: 069.49 ns p95: 069.51 ns p99: 069.51 ns LDADD (c 10, d 0): p50: 069.67 ns p95: 069.69 ns p99: 069.69 ns LDADD (c 10, d 10): p50: 069.67 ns p95: 069.69 ns p99: 069.70 ns LDADD (c 100, d 0): p50: 070.91 ns p95: 070.95 ns p99: 071.00 ns PFRM_KEEP+STADD (c 0, d 0): p50: 005.79 ns p95: 005.79 ns p99: 005.81 ns PFRM_KEEP+STADD (c 10, d 0): p50: 082.54 ns p95: 082.62 ns p99: 082.68 ns PFRM_KEEP+STADD (c 1000, d 0): p50: 065.04 ns p95: 065.39 ns p99: 065.62 ns PFRM_KEEP+STADD (c 1000000, d 0): p50: 019.55 ns p95: 020.03 ns p99: 020.15 ns PFRM_STRM+STADD (c 0, d 0): p50: 005.79 ns p95: 005.79 ns p99: 005.81 ns PFRM_STRM+STADD (c 10, d 0): p50: 082.51 ns p95: 082.61 ns p99: 082.63 ns PFRM_STRM+STADD (c 1000, d 0): p50: 064.35 ns p95: 064.81 ns p99: 065.25 ns PFRM_STRM+STADD (c 1000000, d 0): p50: 019.52 ns p95: 020.08 ns p99: 020.27 ns CPU: 2 - Latency Percentiles: ==================== LSE (stadd) (c 0, d 0): p50: 005.79 ns p95: 005.79 ns p99: 005.79 ns LSE (stadd) (c 0, d 100): p50: 048.68 ns p95: 049.06 ns p99: 049.11 ns LSE (stadd) (c 0, d 200): p50: 087.28 ns p95: 087.39 ns p99: 087.45 ns LSE (stadd) (c 10, d 0): p50: 038.28 ns p95: 038.29 ns p99: 038.31 ns LSE (stadd) (c 10, d 300): p50: 123.80 ns p95: 123.85 ns p99: 123.93 ns LSE (stadd) (c 10, d 500): p50: 201.18 ns p95: 203.31 ns p99: 203.39 ns LSE (stadd) (c 30, d 0): p50: 038.31 ns p95: 038.35 ns p99: 038.52 ns LL/SC (c 0, d 0): p50: 006.57 ns p95: 006.57 ns p99: 006.57 ns LL/SC (c 0, d 10): p50: 008.50 ns p95: 008.50 ns p99: 008.50 ns LL/SC (c 0, d 20): p50: 011.67 ns p95: 011.67 ns p99: 011.67 ns LL/SC (c 10, d 0): p50: 012.97 ns p95: 013.05 ns p99: 013.06 ns LL/SC (c 10, d 10): p50: 020.97 ns p95: 021.06 ns p99: 021.10 ns LL/SC (c 10, 
d 20): p50: 032.91 ns  p95: 037.12 ns  p99: 037.74 ns

[ The listing continues in the same per-line format for the rest of this CPU's section
  and for CPUs 3 through 23; the per-CPU sections are broadly similar.  Representative
  p50 values across those CPUs:

    LSE (stadd)        (c 0, d 0):        ~5.8 ns on most CPUs, 10-26 ns on a few
                       (c 10 or 30, d 0): ~38-41 ns
    LL/SC              (c 0, d 0):        ~6.6 ns
                       (c 10, d 0):       ~13-51 ns (CPU-dependent)
                       (c 1000000, d 0):  ~6.6-8.4 ns
    LDADD              (c 0, d 0):        ~10-74 ns (CPU-dependent)
                       (c 10 or 100, d 0): ~67-75 ns
    PFRM_KEEP+STADD    (c 0, d 0):        ~5.8 ns
                       (c 10, d 0):       ~83-125 ns
                       (c 1000000, d 0):  ~18-20 ns
    PFRM_STRM+STADD    (c 0, d 0):        ~5.8 ns
                       (c 10, d 0):       ~83-125 ns
                       (c 1000000, d 0):  ~19-21 ns

  Non-zero delay values "d" add latency on top of these figures, for example LSE (stadd)
  rises from ~5.8 ns at (c 0, d 0) to ~48.5 ns at (c 0, d 100) and to ~201-203 ns at
  (c 10, d 500). ]

=== Benchmark Complete ===
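The harness that produced the listing above is not included in this message, so the
sketch below is only an illustration (mine, not the original benchmark) of what the
four labels compare: a plain LSE STADD, an LDXR/STXR ("LL/SC") retry loop, an LDADD
that also returns the old value, and a PRFM PSTL1KEEP hint followed by STADD, each
timed via the generic counter (cntvct_el0) and reported as p50/p95/p99 per operation.
The c and d sweep parameters of the runs above are not modeled, the counter is
uncontended, and the indirect call adds a few nanoseconds, so treat the output as a
relative comparison only.  Build with something like gcc -O2 -march=armv8.1-a.

/* Hypothetical sketch, not the benchmark used for the listing above. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH	1024	/* increments timed per sample */
#define SAMPLES	2048	/* timed samples per variant */

static uint64_t counter;	/* single, uncontended counter */

static inline uint64_t arch_timer_read(void)
{
	uint64_t t;

	asm volatile("isb; mrs %0, cntvct_el0" : "=r"(t) : : "memory");
	return t;
}

static inline uint64_t arch_timer_freq(void)
{
	uint64_t f;

	asm volatile("mrs %0, cntfrq_el0" : "=r"(f));
	return f;
}

/* "LSE (stadd)": atomic add with no return value. */
static void add_stadd(uint64_t *p)
{
	asm volatile("stadd %[one], %[ctr]"
		     : [ctr] "+Q" (*p) : [one] "r" (1UL) : "memory");
}

/* "LL/SC": the same increment via an LDXR/STXR retry loop. */
static void add_llsc(uint64_t *p)
{
	uint64_t tmp;
	uint32_t fail;

	asm volatile("1:	ldxr	%[tmp], %[ctr]\n"
		     "	add	%[tmp], %[tmp], #1\n"
		     "	stxr	%w[fail], %[tmp], %[ctr]\n"
		     "	cbnz	%w[fail], 1b"
		     : [tmp] "=&r" (tmp), [fail] "=&r" (fail), [ctr] "+Q" (*p)
		     : : "memory");
}

/* "LDADD": atomic add that also returns the old value. */
static void add_ldadd(uint64_t *p)
{
	uint64_t old;

	asm volatile("ldadd %[one], %[old], %[ctr]"
		     : [old] "=&r" (old), [ctr] "+Q" (*p)
		     : [one] "r" (1UL) : "memory");
}

/* "PFRM_KEEP+STADD": PRFM PSTL1KEEP hint, then STADD. */
static void add_prfm_stadd(uint64_t *p)
{
	asm volatile("prfm pstl1keep, %[ctr]\n"
		     "	stadd %[one], %[ctr]"
		     : [ctr] "+Q" (*p) : [one] "r" (1UL) : "memory");
}

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

static void bench(const char *name, void (*op)(uint64_t *))
{
	static double ns_per_op[SAMPLES];
	double tick_ns = 1e9 / (double)arch_timer_freq();
	int s, i;

	for (s = 0; s < SAMPLES; s++) {
		uint64_t t0 = arch_timer_read();

		for (i = 0; i < BATCH; i++)
			op(&counter);
		ns_per_op[s] = (arch_timer_read() - t0) * tick_ns / BATCH;
	}
	qsort(ns_per_op, SAMPLES, sizeof(ns_per_op[0]), cmp_double);
	printf("%-18s p50: %06.2f ns  p95: %06.2f ns  p99: %06.2f ns\n", name,
	       ns_per_op[SAMPLES / 2], ns_per_op[SAMPLES * 95 / 100],
	       ns_per_op[SAMPLES * 99 / 100]);
}

int main(void)
{
	bench("LSE (stadd)", add_stadd);
	bench("LL/SC", add_llsc);
	bench("LDADD", add_ldadd);
	bench("PFRM_KEEP+STADD", add_prfm_stadd);
	return 0;
}

Batching BATCH increments per timer read keeps the coarse resolution of cntvct_el0
(tens of nanoseconds on some systems) from dominating the per-operation numbers.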
Thread overview: 46+ messages

2025-10-30 22:37 Overhead of arm64 LSE per-CPU atomics? Paul E. McKenney
2025-10-31 18:30 ` Catalin Marinas
2025-10-31 19:39 ` Paul E. McKenney
2025-10-31 22:21 ` Paul E. McKenney
2025-10-31 22:43 ` Catalin Marinas
2025-10-31 23:38 ` Paul E. McKenney
2025-11-01  3:25 ` Paul E. McKenney
2025-11-01  9:44 ` Willy Tarreau
2025-11-01 18:07 ` Paul E. McKenney
2025-11-01 11:23 ` Catalin Marinas
2025-11-01 11:41 ` Yicong Yang
2025-11-05 13:25 ` Catalin Marinas
2025-11-05 13:42 ` Willy Tarreau
2025-11-05 14:49 ` Catalin Marinas
2025-11-05 16:21 ` Breno Leitao
2025-11-06  7:44 ` Willy Tarreau
2025-11-06 13:53 ` Catalin Marinas
2025-11-06 14:16 ` Willy Tarreau
2025-11-03 20:12 ` Palmer Dabbelt
2025-11-03 21:49 ` Catalin Marinas
2025-11-03 21:56 ` Willy Tarreau
2025-11-04 17:05 ` Catalin Marinas
2025-11-04 18:43 ` Paul E. McKenney
2025-11-04 20:10 ` Paul E. McKenney
2025-11-05 15:34 ` Catalin Marinas
2025-11-05 16:25 ` Paul E. McKenney
2025-11-05 17:15 ` Catalin Marinas
2025-11-05 17:40 ` Paul E. McKenney
2025-11-05 19:16 ` Catalin Marinas
2025-11-05 19:47 ` Paul E. McKenney
2025-11-05 20:17 ` Catalin Marinas
2025-11-05 20:45 ` Paul E. McKenney
2025-11-05 21:13 ` Palmer Dabbelt
2025-11-06 14:00 ` Catalin Marinas
2025-11-06 16:30 ` Palmer Dabbelt
2025-11-06 17:54 ` Catalin Marinas
2025-11-06 18:23 ` Palmer Dabbelt
2025-11-04 15:59 ` Breno Leitao
2025-11-04 17:06 ` Catalin Marinas
2025-11-04 18:08 ` Willy Tarreau
2025-11-04 18:22 ` Breno Leitao
2025-11-04 20:13 ` Paul E. McKenney
2025-11-04 20:35 ` Willy Tarreau
2025-11-04 21:25 ` Paul E. McKenney
2025-11-04 20:57 ` Puranjay Mohan
2025-11-27 12:29 ` Wentao Guan