linux-arm-kernel.lists.infradead.org archive mirror
* Overhead of arm64 LSE per-CPU atomics?
@ 2025-10-30 22:37 Paul E. McKenney
  2025-10-31 18:30 ` Catalin Marinas
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-30 22:37 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

Hello!

To make event tracing safe for PREEMPT_RT kernels, I have been creating
optimized variants of SRCU readers that use per-CPU atomics.  This works
quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
per-CPU atomic operation.  This contrasts with a handful of nanoseconds
on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
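
For concreteness, the pair being timed is roughly the following (a
sketch only, assuming an srcu_struct named my_srcu, not the actual
test harness):

	int idx;

	idx = srcu_read_lock(&my_srcu);		/* per-CPU atomic inc, then full barrier */
	srcu_read_unlock(&my_srcu, idx);	/* full barrier, then per-CPU atomic inc */

versus the non-atomic baseline:

	atomic_set(&foo, atomic_read(&foo) + 1); /* plain load plus plain store */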

In theory, I can mess with SRCU's counter protocol, but I figured I
should check first.

The patch below shows the flavor of the change, but for the simpler
uretprobe case where NMI safety is not required, permitting me to
simply disable interrupts across the non-atomic increment operations.
The overhead of interrupt disabling is not wonderful, but ~13ns is way
better than ~100ns any day of the week.
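
In sketch form, each counter update in that case becomes something like
the following (illustrative only, reusing the per-CPU counter names from
the fast variants):

	unsigned long flags;

	local_irq_save(flags);
	__this_cpu_inc(scp->srcu_locks.counter);	/* plain non-atomic increment */
	local_irq_restore(flags);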

I have to do something like this for internal use here because we have
real hardware that acts this way.  If I don't hear otherwise, I will
also push it to mainline.  So if there is a more dainty approach, this
would be a most excellent time to let me know about it.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

commit 1eee41590d30805ec4f5b4e96c615603b0d058d9
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Oct 30 09:25:09 2025 -0700

    refscale: Add SRCU-fast-updown readers
    
    This commit adds refscale readers based on srcu_read_lock_fast_updown()
    and srcu_read_unlock_fast_updown() ("refscale.scale_type=srcu-fast-updown").
    On my x86 laptop, these are about 2.2ns per pair.
    
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index 7429ec9f0092..07a313782dfd 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -186,6 +186,7 @@ static const struct ref_scale_ops rcu_ops = {
 // Definitions for SRCU ref scale testing.
 DEFINE_STATIC_SRCU(srcu_refctl_scale);
 DEFINE_STATIC_SRCU_FAST(srcu_fast_refctl_scale);
+DEFINE_STATIC_SRCU_FAST_UPDOWN(srcu_fast_updown_refctl_scale);
 static struct srcu_struct *srcu_ctlp = &srcu_refctl_scale;
 
 static void srcu_ref_scale_read_section(const int nloops)
@@ -254,6 +255,42 @@ static const struct ref_scale_ops srcu_fast_ops = {
 	.name		= "srcu-fast"
 };
 
+static bool srcu_fast_updown_sync_scale_init(void)
+{
+	srcu_ctlp = &srcu_fast_updown_refctl_scale;
+	return true;
+}
+
+static void srcu_fast_updown_ref_scale_read_section(const int nloops)
+{
+	int i;
+	struct srcu_ctr __percpu *scp;
+
+	for (i = nloops; i >= 0; i--) {
+		scp = srcu_read_lock_fast_updown(srcu_ctlp);
+		srcu_read_unlock_fast_updown(srcu_ctlp, scp);
+	}
+}
+
+static void srcu_fast_updown_ref_scale_delay_section(const int nloops, const int udl, const int ndl)
+{
+	int i;
+	struct srcu_ctr __percpu *scp;
+
+	for (i = nloops; i >= 0; i--) {
+		scp = srcu_read_lock_fast_updown(srcu_ctlp);
+		un_delay(udl, ndl);
+		srcu_read_unlock_fast_updown(srcu_ctlp, scp);
+	}
+}
+
+static const struct ref_scale_ops srcu_fast_updown_ops = {
+	.init		= srcu_fast_updown_sync_scale_init,
+	.readsection	= srcu_fast_updown_ref_scale_read_section,
+	.delaysection	= srcu_fast_updown_ref_scale_delay_section,
+	.name		= "srcu-fast-updown"
+};
+
 #ifdef CONFIG_TASKS_RCU
 
 // Definitions for RCU Tasks ref scale testing: Empty read markers.
@@ -1479,7 +1516,8 @@ ref_scale_init(void)
 	long i;
 	int firsterr = 0;
 	static const struct ref_scale_ops *scale_ops[] = {
-		&rcu_ops, &srcu_ops, &srcu_fast_ops, RCU_TRACE_OPS RCU_TASKS_OPS
+		&rcu_ops, &srcu_ops, &srcu_fast_ops, &srcu_fast_updown_ops,
+		RCU_TRACE_OPS RCU_TASKS_OPS
 		&refcnt_ops, &percpuinc_ops, &incpercpu_ops, &incpercpupreempt_ops,
 		&incpercpubh_ops, &incpercpuirqsave_ops,
 		&rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops, &acqrel_ops,


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-30 22:37 Overhead of arm64 LSE per-CPU atomics? Paul E. McKenney
@ 2025-10-31 18:30 ` Catalin Marinas
  2025-10-31 19:39   ` Paul E. McKenney
  2025-11-04 15:59   ` Breno Leitao
  0 siblings, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-10-31 18:30 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> To make event tracing safe for PREEMPT_RT kernels, I have been creating
> optimized variants of SRCU readers that use per-CPU atomics.  This works
> quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).

That's quite a difference. Does it get any better if
CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
on the kernel command line.

Depending on the implementation and configuration, the LSE atomics may
skip the L1 cache and be executed closer to the memory (they used to be
called far atomics). The CPUs try to be smarter like doing the operation
"near" if it's in the cache but the heuristics may not always work.

Interestingly, we had this patch recently to force a prefetch before the
atomic:

https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/

We rejected it but I wonder whether it improves the SRCU scenario.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 18:30 ` Catalin Marinas
@ 2025-10-31 19:39   ` Paul E. McKenney
  2025-10-31 22:21     ` Paul E. McKenney
  2025-10-31 22:43     ` Catalin Marinas
  2025-11-04 15:59   ` Breno Leitao
  1 sibling, 2 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-31 19:39 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> 
> That's quite a difference. Does it get any better if
> CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> on the kernel command line.

In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?

Yes, this gets me more than an order of magnitude, and about 30% better
than my workaround of disabling interrupts around a non-atomic increment
of those counters, thank you!

Given that per-CPU atomics are usually not heavily contended, would it
make sense to avoid LSE in that case?

And I need to figure out whether I should recommend that Meta build
its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  Any advice you
might have would be deeply appreciated!  (I am of course also following
up internally.)

> Depending on the implementation and configuration, the LSE atomics may
> skip the L1 cache and be executed closer to the memory (they used to be
> called far atomics). The CPUs try to be smarter like doing the operation
> "near" if it's in the cache but the heuristics may not always work.

My knowledge-free guess is that it is early days for LSE, and that it
therefore has significant hardware-level optimization work ahead of it.
For example, I well recall being roundly denounced by Intel engineers in
my neighborhood for reporting similar performance results on Pentium 4
back in the day.  The truth might well have set them free, but it sure
didn't make them happy!  ;-)

But what would a non-knowledge-free guess be?

> Interestingly, we had this patch recently to force a prefetch before the
> atomic:
> 
> https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> 
> We rejected it but I wonder whether it improves the SRCU scenario.

No statistical difference on my system.  This is a 72-CPU Neoverse V2, in
case that matters.  Here are my results for the underlying this_cpu_inc()
and this_cpu_dec() pair of operations:

				LSE Atomics Enabled (Stock)	LSE Atomics Disabled

Without Yicong’s Patch (Stock)		110.786				9.852
With Yicong’s Patch			109.873				9.853

(All times in ns per this_cpu_inc()/this_cpu_dec() pair.)

As you can see, disabling LSE gains about an order of magnitude,
and Yicong's patch has no statistically significant effect.

This and more can be found in the "Per-CPU Increment/Decrement"
section of this Google document:

https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing

Full disclosure: Calls to srcu_read_lock_fast() followed by
srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
this_cpu_inc(), but I am not seeing any difference between the two.
And testing the underlying primitives allows my tests to give reproducible
results regardless of what state I have the SRCU code in.  ;-)
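
For anyone wanting to reproduce, the refscale module from my first
message can drive comparable runs along these lines (parameter values
illustrative):

	modprobe refscale scale_type=srcu-fast-updown nreaders=1 loops=10000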

Thoughts?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 19:39   ` Paul E. McKenney
@ 2025-10-31 22:21     ` Paul E. McKenney
  2025-10-31 22:43     ` Catalin Marinas
  1 sibling, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-31 22:21 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> > 
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
> 
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
> 
> Yes, this gets me more than an order of magnitude, and about 30% better
> than my workaround of disabling interrupts around a non-atomic increment
> of those counters, thank you!
> 
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?

For example, how about something like the patch below?

							Thanx, Paul

------------------------------------------------------------------------

commit 0c0b71d19c997915c5ef5fe7e32eb56b4e4a750e
Author: Paul E. McKenney <paulmckrcu@fb.com>
Date:   Fri Oct 31 14:14:13 2025 -0700

    arm64: Separately select LSE for per-CPU atomics
    
    LSE atomics provide better scalability, but not always better single-CPU
    performance.  In fact, on the ARM Neoverse V2, they degrade single-CPU
    performance by an order of magnitude, from about 5ns per operation to
    about 50ns.
    
    Now, per-CPU atomics are rarely contended; in fact, a given per-CPU
    variable is usually used mostly by the CPU in question.  This means
    that LSE's better scalability does not help, but its degraded single-CPU
    performance does hurt.
    
    Therefore, provide a new default-n ARM64_USE_LSE_PERCPU_ATOMICS
    Kconfig option that, when enabled, uses LSE for per-CPU atomics.
    This means that default kernel
    builds will use non-LSE atomics for this case, but will still use LSE
    atomics for the global atomic variables that are more likely to be
    heavily contended, and thus are more likely to benefit from LSE.
    
    Signed-off-by: Paul E. McKenney <paulmckrcu@fb.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Cc: <linux-arm-kernel@lists.infradead.org>

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 58b782779138..b91b7cbe4569 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1927,6 +1927,21 @@ config ARM64_USE_LSE_ATOMICS
 	  atomic routines. This incurs a small overhead on CPUs that do
 	  not support these instructions.
 
+config ARM64_USE_LSE_PERCPU_ATOMICS
+	bool "LSE for per-CPU atomic instructions"
+	default n
+	help
+	  As part of the Large System Extensions, ARMv8.1 introduces new
+	  atomic instructions that are designed specifically to scale in
+	  very large systems.  However, contention on per-CPU atomics
+	  is usually quite low by design, so per-CPU atomics typically
+	  benefit more from better low-contention performance, even if
+	  that is purchased with reduced performance under high contention.
+
+	  Say Y here to make use of these instructions for the in-kernel
+	  per-CPU atomic routines. This incurs a small overhead on CPUs
+	  that do not support these instructions.
+
 endmenu # "ARMv8.1 architectural features"
 
 menu "ARMv8.2 architectural features"
diff --git a/arch/arm64/include/asm/lse.h b/arch/arm64/include/asm/lse.h
index 3129a5819d0e..2d5eff217d63 100644
--- a/arch/arm64/include/asm/lse.h
+++ b/arch/arm64/include/asm/lse.h
@@ -26,12 +26,19 @@
 /* In-line patching at runtime */
 #define ARM64_LSE_ATOMIC_INSN(llsc, lse)				\
 	ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#if IS_ENABLED(CONFIG_ARM64_USE_LSE_PERCPU_ATOMICS)
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse)				\
+	ALTERNATIVE(llsc, __LSE_PREAMBLE lse, ARM64_HAS_LSE_ATOMICS)
+#else
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse)	llsc
+#endif
 
 #else	/* CONFIG_ARM64_LSE_ATOMICS */
 
 #define __lse_ll_sc_body(op, ...)		__ll_sc_##op(__VA_ARGS__)
 
 #define ARM64_LSE_ATOMIC_INSN(llsc, lse)	llsc
+#define ARM64_LSE_PERCPU_ATOMIC_INSN(llsc, lse)	llsc
 
 #endif	/* CONFIG_ARM64_LSE_ATOMICS */
 #endif	/* __ASM_LSE_H */
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..eaa3c2f87407 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,7 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	unsigned int loop;						\
 	u##sz tmp;							\
 									\
-	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
+	asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN(			\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
 		#op_llsc "\t%" #w "[tmp], %" #w "[tmp], %" #w "[val]\n"	\
@@ -91,7 +91,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
 	unsigned int loop;						\
 	u##sz ret;							\
 									\
-	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
+	asm volatile (ARM64_LSE_PERCPU_ATOMIC_INSN(			\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
 		#op_llsc "\t%" #w "[ret], %" #w "[ret], %" #w "[val]\n"	\


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 19:39   ` Paul E. McKenney
  2025-10-31 22:21     ` Paul E. McKenney
@ 2025-10-31 22:43     ` Catalin Marinas
  2025-10-31 23:38       ` Paul E. McKenney
  1 sibling, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-10-31 22:43 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> > 
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
> 
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?

Yes.

> Yes, this gets me more than an order of magnitude, and about 30% better
> than my workaround of disabling interrupts around a non-atomic increment
> of those counters, thank you!
> 
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?

In theory the LSE atomics should be as fast, but the microarchitecture
decisions likely did not cover all the use-cases. I'll raise this
internally as well; maybe we get some ideas from the hardware people.

> And I need to figure out whether I should recommend that Meta build
> its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  And advice you
> might have would be deeply appreciated!  (I am of course also following
> up internally.)

I wouldn't advise turning them off just yet; they are beneficial for
other use-cases. But it needs more thinking (and not that late at night ;)).

> > Interestingly, we had this patch recently to force a prefetch before the
> > atomic:
> > 
> > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> > 
> > We rejected it but I wonder whether it improves the SRCU scenario.
> 
> No statistical difference on my system.  This is a 72-CPU Neoverse V2, in
> case that matters.

I just realised that patch doesn't touch percpu.h at all. So what about
something like (untested):

-----------------8<------------------------
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..e381034324e1 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	unsigned int loop;						\
 	u##sz tmp;							\
 									\
+	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
 	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
@@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
 	unsigned int loop;						\
 	u##sz ret;							\
 									\
+	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
 	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
-----------------8<------------------------

> Here are my results for the underlying this_cpu_inc()
> and this_cpu_dec() pair of operations:
> 
> 	LSE Atomics Enabled (Stock)	LSE Atomics Disabled
> 
> Without Yicong’s Patch (Stock)
> 
> 			    110.786		       9.852
> 
> With Yicong’s Patch
> 
> 			    109.873		       9.853
> 
> As you can see, disabling LSE gets about an order of magnitude
> and Yicong's patch has no statistically significant effect.
> 
> This and more can be found in the "Per-CPU Increment/Decrement"
> section of this Google document:
> 
> https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
> 
> Full disclosure: Calls to srcu_read_lock_fast() followed by
> srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> this_cpu_inc(), but I am not seeing any difference between the two.
> And testing the underlying primitives allows my tests to give reproducible
> results regardless of what state I have the SRCU code in.  ;-)

Thanks. I'll go through your emails in more detail tomorrow/Monday.

-- 
Catalin


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 22:43     ` Catalin Marinas
@ 2025-10-31 23:38       ` Paul E. McKenney
  2025-11-01  3:25         ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-10-31 23:38 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > > > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> > > 
> > > That's quite a difference. Does it get any better if
> > > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > > on the kernel command line.
> > 
> > In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
> 
> Yes.
> 
> > Yes, this gets me more than an order of magnitude, and about 30% better
> > than my workaround of disabling interrupts around a non-atomic increment
> > of those counters, thank you!
> > 
> > Given that per-CPU atomics are usually not heavily contended, would it
> > make sense to avoid LSE in that case?
> 
> In theory the LSE atomics should be as fast but microarchitecture
> decisions likely did not cover all the use-cases. I'll raise this
> internally as well, maybe we get some ideas from the hardware people.

Understood, and please let me know what you can share from the hardware people.

> > And I need to figure out whether I should recommend that Meta build
> > its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  And advice you
> > might have would be deeply appreciated!  (I am of course also following
> > up internally.)
> 
> I wouldn't advise turning them off just yet, they are beneficial for
> other use-cases. But it needs more thinking (and not that late at night ;)).

Fair enough!

> > > Interestingly, we had this patch recently to force a prefetch before the
> > > atomic:
> > > 
> > > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> > > 
> > > We rejected it but I wonder whether it improves the SRCU scenario.
> > 
> > No statistical difference on my system.  This is a 72-CPU Neoverse V2, in
> > case that matters.
> 
> I just realised that patch doesn't touch percpu.h at all. So what about
> something like (untested):
> 
> -----------------8<------------------------
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..e381034324e1 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>  	unsigned int loop;						\
>  	u##sz tmp;							\
>  									\
> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>  	/* LL/SC */							\
>  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
>  	unsigned int loop;						\
>  	u##sz ret;							\
>  									\
> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>  	/* LL/SC */							\
>  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> -----------------8<------------------------

I will give this a shot, thank you!

> > Here are my results for the underlying this_cpu_inc()
> > and this_cpu_dec() pair of operations:
> > 
> > 	LSE Atomics Enabled (Stock)	LSE Atomics Disabled
> > 
> > Without Yicong’s Patch (Stock)
> > 
> > 			    110.786		       9.852
> > 
> > With Yicong’s Patch
> > 
> > 			    109.873		       9.853
> > 
> > As you can see, disabling LSE gets about an order of magnitude
> > and Yicong's patch has no statistically significant effect.
> > 
> > This and more can be found in the "Per-CPU Increment/Decrement"
> > section of this Google document:
> > 
> > https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
> > 
> > Full disclosure: Calls to srcu_read_lock_fast() followed by
> > srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> > this_cpu_inc(), but I am not seeing any difference between the two.
> > And testing the underlying primitives allows my tests to give reproducible
> > results regardless of what state I have the SRCU code in.  ;-)
> 
> Thanks. I'll go through your emails in more detail tomorrow/Monday.

Thank you!  Not violently urgent, but I do look forward to hearing what
you come up with.  In the meantime, I am testing with the patch I sent
and will let you know if problems arise.  So far, so good...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 23:38       ` Paul E. McKenney
@ 2025-11-01  3:25         ` Paul E. McKenney
  2025-11-01  9:44           ` Willy Tarreau
                             ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-01  3:25 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > > > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > > > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > > > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > > > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > > > > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> > > > 
> > > > That's quite a difference. Does it get any better if
> > > > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > > > on the kernel command line.
> > > 
> > > In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?
> > 
> > Yes.
> > 
> > > Yes, this gets me more than an order of magnitude, and about 30% better
> > > than my workaround of disabling interrupts around a non-atomic increment
> > > of those counters, thank you!
> > > 
> > > Given that per-CPU atomics are usually not heavily contended, would it
> > > make sense to avoid LSE in that case?
> > 
> > In theory the LSE atomics should be as fast but microarchitecture
> > decisions likely did not cover all the use-cases. I'll raise this
> > internally as well, maybe we get some ideas from the hardware people.
> 
> Understood, and please let me know what you can from the hardware people.
> 
> > > And I need to figure out whether I should recommend that Meta build
> > > its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  And advice you
> > > might have would be deeply appreciated!  (I am of course also following
> > > up internally.)
> > 
> > I wouldn't advise turning them off just yet, they are beneficial for
> > other use-cases. But it needs more thinking (and not that late at night ;)).
> 
> Fair enough!
> 
> > > > Interestingly, we had this patch recently to force a prefetch before the
> > > > atomic:
> > > > 
> > > > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> > > > 
> > > > We rejected it but I wonder whether it improves the SRCU scenario.
> > > 
> > > No statistical difference on my system.  This is a 72-CPU Neoverse V2, in
> > > case that matters.
> > 
> > I just realised that patch doesn't touch percpu.h at all. So what about
> > something like (untested):
> > 
> > -----------------8<------------------------
> > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > index 9abcc8ef3087..e381034324e1 100644
> > --- a/arch/arm64/include/asm/percpu.h
> > +++ b/arch/arm64/include/asm/percpu.h
> > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> >  	unsigned int loop;						\
> >  	u##sz tmp;							\
> >  									\
> > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> >  	/* LL/SC */							\
> >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> >  	unsigned int loop;						\
> >  	u##sz ret;							\
> >  									\
> > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> >  	/* LL/SC */							\
> >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > -----------------8<------------------------
> 
> I will give this a shot, thank you!

Jackpot!!!

This reduces the overhead to 8.427 ns, which is significantly better
than the non-LSE value of 9.853 ns.  Still room for improvement, but much
better than the 100ns values.

I presume that you will send this up the normal path, but in the meantime,
I will pull this in for further local testing, and thank you!

							Thanx, Paul

> > > Here are my results for the underlying this_cpu_inc()
> > > and this_cpu_dec() pair of operations:
> > > 
> > > 	LSE Atomics Enabled (Stock)	LSE Atomics Disabled
> > > 
> > > Without Yicong’s Patch (Stock)
> > > 
> > > 			    110.786		       9.852
> > > 
> > > With Yicong’s Patch
> > > 
> > > 			    109.873		       9.853
> > > 
> > > As you can see, disabling LSE gets about an order of magnitude
> > > and Yicong's patch has no statistically significant effect.
> > > 
> > > This and more can be found in the "Per-CPU Increment/Decrement"
> > > section of this Google document:
> > > 
> > > https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
> > > 
> > > Full disclosure: Calls to srcu_read_lock_fast() followed by
> > > srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> > > this_cpu_inc(), but I am not seeing any difference between the two.
> > > And testing the underlying primitives allows my tests to give reproducible
> > > results regardless of what state I have the SRCU code in.  ;-)
> > 
> > Thanks. I'll go through your emails in more detail tomorrow/Monday.
> 
> Thank you!  Not violently urgent, but I do look forward to hearing what
> you come up with.  In the meantime, I am testing with the patch I sent
> and will let you know if problems arise.  So far, so good...
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01  3:25         ` Paul E. McKenney
@ 2025-11-01  9:44           ` Willy Tarreau
  2025-11-01 18:07             ` Paul E. McKenney
  2025-11-01 11:23           ` Catalin Marinas
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-01  9:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

Hi!

On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > -----------------8<------------------------
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > >  	unsigned int loop;						\
> > >  	u##sz tmp;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > >  	unsigned int loop;						\
> > >  	u##sz ret;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > -----------------8<------------------------
> > 
> > I will give this a shot, thank you!
> 
> Jackpot!!!
> 
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853.  Still room for improvement, but much
> better than the 100ns values.

This is super interesting! I've blindly applied a similar change to all
of our atomics in haproxy and am seeing a consistent 2-7% perf increase
depending on the tests on an 80-core Ampere Altra (Neoverse-N1). There
as well, we're significantly using atomics to read/update mostly local
variables, as we avoid sharing as much as possible. I'm pretty sure it
does hurt in certain cases, and we don't have this distinction of per_cpu
variants like here; however, that makes me think about adding a "mostly
local" variant that we can choose from depending on the context (see the
sketch below). I'll continue to experiment; thanks for sharing this trick
(particularly to Yicong Yang, the original reporter).
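
The variant I have in mind is basically the same trick in userland,
along these lines (a sketch; the helper name is invented):

	/* For variables known to be mostly thread-local: prefetch the
	 * line for store before the LSE atomic, as in the kernel patch
	 * quoted above.
	 */
	static inline void ha_atomic_add_local(unsigned long *ptr, unsigned long val)
	{
		__asm__ volatile("prfm pstl1strm, %a0" : : "p" (ptr));
		__atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
	}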

Willy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01  3:25         ` Paul E. McKenney
  2025-11-01  9:44           ` Willy Tarreau
@ 2025-11-01 11:23           ` Catalin Marinas
  2025-11-01 11:41             ` Yicong Yang
  2025-11-03 20:12             ` Palmer Dabbelt
  2025-11-03 21:49           ` Catalin Marinas
  2025-11-04 17:05           ` Catalin Marinas
  3 siblings, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-11-01 11:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, Mark Rutland, linux-arm-kernel, Willy Tarreau,
	Yicong Yang

On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > I just realised that patch doesn't touch percpu.h at all. So what about
> > > something like (untested):
> > > 
> > > -----------------8<------------------------
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > >  	unsigned int loop;						\
> > >  	u##sz tmp;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > >  	unsigned int loop;						\
> > >  	u##sz ret;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > -----------------8<------------------------
> > 
> > I will give this a shot, thank you!
> 
> Jackpot!!!
> 
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853.  Still room for improvement, but much
> better than the 100ns values.
> 
> I presume that you will send this up the normal path, but in the meantime,
> I will pull this in for further local testing, and thank you!

I think for this specific case it may work, and for the futex as well,
but not generally. The Neoverse-V2 TRM lists some controls in the
IMP_CPUECTLR_EL1 register, bits 29 to 33:

https://developer.arm.com/documentation/102375/0002

These can be configured depending on the system configuration, but they
are too coarse a knob to cover all use-cases within an OS. This register
is typically configured by firmware; we don't touch it in Linux.

I'll dig some more but we may have to do tricks like prefetch if we
can't find a hardware configuration that satisfies all cases.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01 11:23           ` Catalin Marinas
@ 2025-11-01 11:41             ` Yicong Yang
  2025-11-05 13:25               ` Catalin Marinas
  2025-11-03 20:12             ` Palmer Dabbelt
  1 sibling, 1 reply; 46+ messages in thread
From: Yicong Yang @ 2025-11-01 11:41 UTC (permalink / raw)
  To: Catalin Marinas, Paul E. McKenney
  Cc: yangyccccc, Will Deacon, Mark Rutland, linux-arm-kernel,
	Willy Tarreau

On 2025/11/1 19:23, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
>> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
>>> On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
>>>> I just realised that patch doesn't touch percpu.h at all. So what about
>>>> something like (untested):
>>>>
>>>> -----------------8<------------------------
>>>> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
>>>> index 9abcc8ef3087..e381034324e1 100644
>>>> --- a/arch/arm64/include/asm/percpu.h
>>>> +++ b/arch/arm64/include/asm/percpu.h
>>>> @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>>>>  	unsigned int loop;						\
>>>>  	u##sz tmp;							\
>>>>  									\
>>>> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>>>>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>>>>  	/* LL/SC */							\
>>>>  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
>>>> @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
>>>>  	unsigned int loop;						\
>>>>  	u##sz ret;							\
>>>>  									\
>>>> +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>>>>  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>>>>  	/* LL/SC */							\
>>>>  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
>>>> -----------------8<------------------------
>>> I will give this a shot, thank you!
>> Jackpot!!!
>>
>> This reduces the overhead to 8.427, which is significantly better than
>> the non-LSE value of 9.853.  Still room for improvement, but much
>> better than the 100ns values.
>>
>> I presume that you will send this up the normal path, but in the meantime,
>> I will pull this in for further local testing, and thank you!
> I think for this specific case it may work, for the futex as well but
> not generally. The Neoverse-V2 TRM lists some controls in the
> IMP_CPUECTLR_EL1, bits 29 to 33:
>
> https://developer.arm.com/documentation/102375/0002
>
> These can be configured depending on the system configuration but they
> are too big knobs to cover all use-cases within an OS. This register is
> typically configured by firmware, we don't touch it in Linux.
>
> I'll dig some more but we may have to do tricks like prefetch if we
> can't find a hardware configuration that satisfies all cases.
>

FYI, there's a version that allows a prefetch to be added prior to LSE
operations via a boot option [1]. If we want to reconsider that approach,
it's more flexible and can be controlled by the OS without touching the
system configuration (though it may need a firmware update). But the
prefetch would need to be added to the per-CPU implementation as you've
noticed above (I didn't add it there since the LL/SC implementation has
no prefetch either; maybe that was an omission?).

[1] https://lore.kernel.org/all/20250919091747.3702-1-yangyicong@huawei.com/

thanks.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01  9:44           ` Willy Tarreau
@ 2025-11-01 18:07             ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-01 18:07 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Catalin Marinas, Will Deacon, Mark Rutland, linux-arm-kernel

On Sat, Nov 01, 2025 at 10:44:48AM +0100, Willy Tarreau wrote:
> Hi!
> 
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > > -----------------8<------------------------
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > >  	unsigned int loop;						\
> > > >  	u##sz tmp;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > >  	unsigned int loop;						\
> > > >  	u##sz ret;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > -----------------8<------------------------
> > > 
> > > I will give this a shot, thank you!
> > 
> > Jackpot!!!
> > 
> > This reduces the overhead to 8.427, which is significantly better than
> > the non-LSE value of 9.853.  Still room for improvement, but much
> > better than the 100ns values.
> 
> This is super interesting! I've blindly applied a similar change to all
> of our atomics in haproxy and am seeing a consistent 2-7% perf increase
> depending on the tests on a 80-core Ampere Altra (neoverse-n1). There
> as well we're significantly using atomics to read/update mostly local
> variables as we avoid sharing as much as possible. I'm pretty sure it
> does hurt in certain cases, and we don't have this distinction of per_cpu
> variants like here, however that makes me think about adding a "mostly
> local" variant that we can choose from depending on the context. I'll
> continue to experiment, thanks for sharing this trick (particularly to
> Yicong Yang, the original reporter).

Agreed!

And before I forget (again!):

Tested-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01 11:23           ` Catalin Marinas
  2025-11-01 11:41             ` Yicong Yang
@ 2025-11-03 20:12             ` Palmer Dabbelt
  1 sibling, 0 replies; 46+ messages in thread
From: Palmer Dabbelt @ 2025-11-03 20:12 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel, w,
	yangyicong

On Sat, 01 Nov 2025 04:23:22 PDT (-0700), Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
>> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
>> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
>> > > I just realised that patch doesn't touch percpu.h at all. So what about
>> > > something like (untested):
>> > >
>> > > -----------------8<------------------------
>> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
>> > > index 9abcc8ef3087..e381034324e1 100644
>> > > --- a/arch/arm64/include/asm/percpu.h
>> > > +++ b/arch/arm64/include/asm/percpu.h
>> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>> > >  	unsigned int loop;						\
>> > >  	u##sz tmp;							\
>> > >  									\
>> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>> > >  	/* LL/SC */							\
>> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
>> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
>> > >  	unsigned int loop;						\
>> > >  	u##sz ret;							\
>> > >  									\
>> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
>> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
>> > >  	/* LL/SC */							\
>> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
>> > > -----------------8<------------------------
>> >
>> > I will give this a shot, thank you!
>>
>> Jackpot!!!
>>
>> This reduces the overhead to 8.427, which is significantly better than
>> the non-LSE value of 9.853.  Still room for improvement, but much
>> better than the 100ns values.
>>
>> I presume that you will send this up the normal path, but in the meantime,
>> I will pull this in for further local testing, and thank you!
>
> I think for this specific case it may work, for the futex as well but
> not generally. The Neoverse-V2 TRM lists some controls in the
> IMP_CPUECTLR_EL1, bits 29 to 33:
>
> https://developer.arm.com/documentation/102375/0002
>
> These can be configured depending on the system configuration but they
> are too big knobs to cover all use-cases within an OS. This register is
> typically configured by firmware, we don't touch it in Linux.

Mostly for Paul:

I have a patch to let you do this from Linux, and I have some 
firmware for some of these internal systems that lets you set most of 
these magic bits.  I've noticed some unexpected behavior around prefetch 
distance on an internal workload, but haven't gotten much farther there.  
There are also some other bits that do wacky things...

Just FYI: Marc described trying to set these dynamically as trying to 
swallow a running chainsaw, but LMK if you're feeling risky and I can 
try and get you a copy of my setup.  They seem to work fine for me ;)

> I'll dig some more but we may have to do tricks like prefetch if we
> can't find a hardware configuration that satisfies all cases.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01  3:25         ` Paul E. McKenney
  2025-11-01  9:44           ` Willy Tarreau
  2025-11-01 11:23           ` Catalin Marinas
@ 2025-11-03 21:49           ` Catalin Marinas
  2025-11-03 21:56             ` Willy Tarreau
  2025-11-04 17:05           ` Catalin Marinas
  3 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-03 21:49 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > >  	unsigned int loop;						\
> > >  	u##sz tmp;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > >  	unsigned int loop;						\
> > >  	u##sz ret;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > -----------------8<------------------------
> > 
> > I will give this a shot, thank you!
> 
> Jackpot!!!
> 
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853.  Still room for improvement, but much
> better than the 100ns values.

Just curious, if you have time, could you try prefetchw() instead of the
above asm? That would be a PRFM PSTL1KEEP instead of STRM. Are
__srcu_read_lock() and __srcu_read_unlock() usually touching the same
cache line?
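
I.e., something like this at the start of __srcu_read_lock_fast()
(untested):

	prefetchw(raw_cpu_ptr(&scp->srcu_locks));	/* PRFM PSTL1KEEP on arm64 */
	this_cpu_inc(scp->srcu_locks.counter);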

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-03 21:49           ` Catalin Marinas
@ 2025-11-03 21:56             ` Willy Tarreau
  0 siblings, 0 replies; 46+ messages in thread
From: Willy Tarreau @ 2025-11-03 21:56 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel

On Mon, Nov 03, 2025 at 09:49:56PM +0000, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > >  	unsigned int loop;						\
> > > >  	u##sz tmp;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > >  	unsigned int loop;						\
> > > >  	u##sz ret;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > -----------------8<------------------------
> > > 
> > > I will give this a shot, thank you!
> > 
> > Jackpot!!!
> > 
> > This reduces the overhead to 8.427, which is significantly better than
> > the non-LSE value of 9.853.  Still room for improvement, but much
> > better than the 100ns values.
> 
> Just curious, if you have time, could you try prefetchw() instead of the
> above asm? That would be a PRFM PSTL1KEEP instead of STRM. Are
> __srcu_read_lock() and __srcu_read_unlock() usually touching the same
> cache line?

FWIW I tested PRFM PSTL1KEEP this morning on the Altra with haproxy just
out of curiosity and didn't notice a difference with PRFM PSTL1STRM. Maybe
in the kernel it will be different though.

Willy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-10-31 18:30 ` Catalin Marinas
  2025-10-31 19:39   ` Paul E. McKenney
@ 2025-11-04 15:59   ` Breno Leitao
  2025-11-04 17:06     ` Catalin Marinas
                       ` (3 more replies)
  1 sibling, 4 replies; 46+ messages in thread
From: Breno Leitao @ 2025-11-04 15:59 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel,
	kernel-team, rmikey

Hello Catalin,

On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> 
> That's quite a difference. Does it get any better if
> CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> on the kernel command line.
> 
> Depending on the implementation and configuration, the LSE atomics may
> skip the L1 cache and be executed closer to the memory (they used to be
> called far atomics). The CPUs try to be smarter like doing the operation
> "near" if it's in the cache but the heuristics may not always work.

I am trying to play with LSE latency and compare it with the LL/SC use
case. I _think_ I have a reproducer in userspace.

I've created a simple userspace program to compare the latency of an
atomic add using LL/SC and LSE, basically comparing the following two
functions while executing without any contention (a single thread doing
the atomic operation, so no atomic contention):

	static inline void __percpu_add_case_64_llsc(void *ptr, unsigned long val)
	{
		unsigned int loop;
		u64 tmp;

		asm volatile(
			/* LL/SC */
			"1:  ldxr    %[tmp], %[ptr]\n"
			"    add     %[tmp], %[tmp], %[val]\n"
			"    stxr    %w[loop], %[tmp], %[ptr]\n"
			"    cbnz    %w[loop], 1b"
			: [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}

and

	/* LSE implementation */
	static inline void __percpu_add_case_64_lse(void *ptr, unsigned long val)
	{
		asm volatile(
			/* LSE atomics */
			"    stadd    %[val], %[ptr]\n"
			: [ptr] "+Q"(*(u64 *)ptr)
			: [val] "r"((u64)(val))
			: "memory");
	}
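
For reference, a minimal timing loop around them looks like this (a
sketch; the full harness, with CPU pinning and percentiles, is in the
repo below):

	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	typedef uint64_t u64;

	/* __percpu_add_case_64_llsc()/__percpu_add_case_64_lse() as above;
	 * build with -march=armv8.1-a so the assembler accepts STADD.
	 */

	static double bench_ns(void (*op)(void *, unsigned long), long iters)
	{
		static u64 counter;
		struct timespec t0, t1;
		long i;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++)
			op(&counter, 1);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		return ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / iters;
	}

	int main(void)
	{
		printf("LL/SC: %.2f ns/op\n", bench_ns(__percpu_add_case_64_llsc, 1000000));
		printf("LSE  : %.2f ns/op\n", bench_ns(__percpu_add_case_64_lse, 1000000));
		return 0;
	}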

I found that the LSE case (__percpu_add_case_64_lse) has huge variation,
while the LL/SC case is stable.  In some cases, the LSE function runs at
the same latency as the LL/SC function, and slightly faster at p50, but
then something happens on the system and the LSE operations start to
take way longer than LL/SC.

Here is some interesting output showing the latency of the functions above:

	CPU: 47 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns

	CPU: 48 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.70 ns      p95: 5.72 ns      p99: 6.10 ns
	LSE  :   p50: 4.02 ns      p95: 45.55 ns     p99: 54.93 ns

	CPU: 49 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.74 ns      p95: 5.75 ns      p99: 5.78 ns
	LSE  :   p50: 4.04 ns      p95: 50.32 ns     p99: 53.04 ns


At this stage, it is unclear what is causing these variations.

The code above can be run with:

 # git clone https://github.com/leitao/debug.git
 # cd debug/LSE
 # make && ./percpu_bench


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01  3:25         ` Paul E. McKenney
                             ` (2 preceding siblings ...)
  2025-11-03 21:49           ` Catalin Marinas
@ 2025-11-04 17:05           ` Catalin Marinas
  2025-11-04 18:43             ` Paul E. McKenney
  3 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-04 17:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > index 9abcc8ef3087..e381034324e1 100644
> > > --- a/arch/arm64/include/asm/percpu.h
> > > +++ b/arch/arm64/include/asm/percpu.h
> > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > >  	unsigned int loop;						\
> > >  	u##sz tmp;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > >  	unsigned int loop;						\
> > >  	u##sz ret;							\
> > >  									\
> > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > >  	/* LL/SC */							\
> > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > -----------------8<------------------------
> > 
> > I will give this a shot, thank you!
> 
> Jackpot!!!
> 
> This reduces the overhead to 8.427, which is significantly better than
> the non-LSE value of 9.853.  Still room for improvement, but much
> better than the 100ns values.
> 
> I presume that you will send this up the normal path, but in the meantime,
> I will pull this in for further local testing, and thank you!

After an educational discussion with the microarchitects, I think the
hardware is behaving as intended; it just doesn't always fit the
software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
Linux as a STADD instruction (that's LDADD with XZR as destination; i.e.
no need to return the value read from memory). This is typically
executed as "far" or posted (unless it hits in the L1 cache) and
intended for stat updates. At a quick grep, it matches the majority of
the use-cases in Linux. Most other atomics (those with a return) are
executed "near", so filling the cache lines (assuming default CPUECTLR
configuration).
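
To make that concrete (an illustration of the two encodings only, not
taken from the kernel sources):

	stadd	x1, [x0]	// alias of "ldadd x1, xzr, [x0]"; typically posted "far"
	ldadd	x1, x2, [x0]	// old value wanted in x2, typically executed "near"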

For the SRCU case, executing STADD far, especially together with the
DMB after lock and before unlock, does slow things down. A
microbenchmark doing this in a loop makes it look a lot worse than it
would be in practice (saturating the buses down the path to memory).

A quick test to check this theory, if those are the functions you were
benchmarking (it generates LDADD instead):

---------------------8<----------------------------------------
diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index 42098e0fa0b7..5a6f3999883d 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
 	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
 
 	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
-		this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
+		this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
 	else
 		atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks));  // Y, and implicit RCU reader.
 	barrier(); /* Avoid leaking the critical section. */
@@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
 {
 	barrier();  /* Avoid leaking the critical section. */
 	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
-		this_cpu_inc(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
+		this_cpu_inc_return(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
 	else
 		atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks));  // Z, and implicit RCU reader.
 }
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 1ff94b76d91f..c025d9135689 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
 {
 	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
 
-	this_cpu_inc(scp->srcu_locks.counter);
+	this_cpu_inc_return(scp->srcu_locks.counter);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return __srcu_ptr_to_ctr(ssp, scp);
 }
@@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
 void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
 {
 	smp_mb(); /* C */  /* Avoid leaking the critical section. */
-	this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
+	this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
 
---------------------8<----------------------------------------

To make things better for the non-fast variants above, we should add
this_cpu_inc_return_acquire() etc. semantics (strangely,
this_cpu_inc_return() doesn't have the full barrier semantics of
atomic_inc_return()).

I'm not sure about adding the prefetch since most other uses of
this_cpu_add() are meant for stat updates and there's not much point in
bringing in a cache line. I think we could add release/acquire variants
that generate LDADDA/L and maybe a slightly different API for the
__srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
full barrier semantics to the current _return() variants.

-- 
Catalin


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 15:59   ` Breno Leitao
@ 2025-11-04 17:06     ` Catalin Marinas
  2025-11-04 18:08     ` Willy Tarreau
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-11-04 17:06 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel,
	kernel-team, rmikey

Hi Breno,

On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for a atomic_set(&foo, atomic_read(&foo) + 1).
> > 
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
> > 
> > Depending on the implementation and configuration, the LSE atomics may
> > skip the L1 cache and be executed closer to the memory (they used to be
> > called far atomics). The CPUs try to be smarter like doing the operation
> > "near" if it's in the cache but the heuristics may not always work.
> 
> I am trying to play with LSE latency and compare it with LL/SC usecase. I
> _think_ I have a reproducer in userspace
> 
> I've create a simple userspace program to compare the latency of a atomic add
> using LL/SC and LSE, basically comparing the following two functions while
> executing without any contention (single thread doing the atomic operation -
> no atomic contention):
> 
> 	static inline void __percpu_add_case_64_llsc(void *ptr, unsigned long val)
> 	{
> 		asm volatile(
> 			/* LL/SC */
> 			"1:  ldxr    %[tmp], %[ptr]\n"
> 			"    add     %[tmp], %[tmp], %[val]\n"
> 			"    stxr    %w[loop], %[tmp], %[ptr]\n"
> 			"    cbnz    %w[loop], 1b"
> 			: [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
> 			: [val] "r"((u64)(val))
> 			: "memory");
> 	}
> 
> and
> 
> 	/* LSE implementation */
> 	static inline void __percpu_add_case_64_lse(void *ptr, unsigned long val)
> 	{
> 		asm volatile(
> 			/* LSE atomics */
> 			"    stadd    %[val], %[ptr]\n"
> 			: [ptr] "+Q"(*(u64 *)ptr)
> 			: [val] "r"((u64)(val))
> 			: "memory");
> 	}

Could you try with an ldadd instead? See my reply to Paul a few minutes
ago.

Thanks.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 15:59   ` Breno Leitao
  2025-11-04 17:06     ` Catalin Marinas
@ 2025-11-04 18:08     ` Willy Tarreau
  2025-11-04 18:22       ` Breno Leitao
  2025-11-04 20:13       ` Paul E. McKenney
  2025-11-04 20:57     ` Puranjay Mohan
  2025-11-27 12:29     ` Wentao Guan
  3 siblings, 2 replies; 46+ messages in thread
From: Willy Tarreau @ 2025-11-04 18:08 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Catalin Marinas, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel, kernel-team, rmikey

Hello Breno,

On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> I found that the LSE case (__percpu_add_case_64_lse) has huge variation,
> while the LL/SC case is stable.
> In some cases, the LSE function runs at the same latency as the LL/SC
> function, or slightly faster at p50, but then something happens on the
> system and LSE operations start to take far longer than LL/SC.
> 
> Here is some interesting output showing the latency of the functions above:
> 
> 	CPU: 47 - Latency Percentiles:
> 	====================
> 	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
> 	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns
(...)

Very interesting. I've run them here on a 80-core Ampere Altra made
of Neoverse-N1 (armv8.2) and am getting very consistently better timings
with LSE than LL/SC:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
  LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.02 ns
  (...)

They're *all* like this, between 7.32 and 7.36 for LL/SC p99,
and 5.01 to 5.03 for LSE p99.

However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it's what you've
observed, i.e. a lot of variations that do not even depend on big
vs little cores:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 7.13 ns    p99: 8.81 ns
  LSE  :   p50: 45.79 ns    p95: 45.80 ns   p99: 45.86 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
  LSE  :   p50: 67.72 ns    p95: 67.78 ns   p99: 67.80 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
  LSE  :   p50: 59.19 ns    p95: 59.23 ns   p99: 59.25 ns
  (...)

I tried the same on a RK3588 which has 4 Cortex A55 and 4 Cortex A76
(the latter being very close to Neoverse-N1), and the A76 (the 4 latest
ones) show the same pattern as the Altra above and are consistently much
better than the LL/SC one:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.41 ns
  LSE  :   p50: 4.43 ns     p95: 28.60 ns   p99: 30.29 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.59 ns
  LSE  :   p50: 4.42 ns     p95: 27.51 ns   p99: 29.46 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.40 ns     p95: 9.40 ns    p99: 9.40 ns
  LSE  :   p50: 4.42 ns     p95: 27.00 ns   p99: 29.60 ns
  
   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 10.43 ns
  LSE  :   p50: 8.02 ns     p95: 29.72 ns   p99: 31.05 ns
  
   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.85 ns     p95: 8.86 ns    p99: 8.86 ns
  LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 5.75 ns
  
   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.85 ns     p95: 8.85 ns    p99: 9.28 ns
  LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 8.29 ns
  
   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.79 ns     p95: 8.80 ns    p99: 8.80 ns
  LSE  :   p50: 5.71 ns     p95: 5.71 ns    p99: 5.71 ns
  
   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.80 ns     p95: 8.80 ns    p99: 9.30 ns
  LSE  :   p50: 5.71 ns     p95: 5.72 ns    p99: 5.72 ns

Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something
between the two (and the governor is in performance mode):

 ./percpu_bench 
ARM64 Per-CPU Atomic Add Benchmark
===================================
Running percentile measurements (100 iterations)...
Detected 8 CPUs

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.28 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 19.48 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.26 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 16.30 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.25 ns
  LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 4.65 ns
  
   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.36 ns
  LSE  :   p50: 4.63 ns     p95: 19.01 ns   p99: 32.15 ns
  
   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
  
   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
  
   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.28 ns
  LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.45 ns
  
   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.58 ns
  LSE  :   p50: 4.82 ns     p95: 4.82 ns    p99: 4.83 ns

So at first glance it seems that LL/SC is generally slower but can be
more consistent on modern machines, while LSE is stable on older
machines and can sometimes be stable even on modern ones.

@Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in
the Xt register (to be honest I've never understood Arm's docs regarding
instructions, even the pseudo language is super cryptic to me), and I came
up with this:

        asm volatile(
                /* LSE atomics */
                "    ldadd    %[val], %[out], %[ptr]\n"
                : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val)
                : [val] "r"((u64)(val))
                : "memory");

which assembles like this:

 ab8:   f8200040        ldadd   x0, x0, [x2]

It now gives me much better LSE performance on the ARMv9:

   CPU: 0 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 7.32 ns    p99: 8.72 ns
  LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.77 ns
  
   CPU: 1 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
  LSE  :   p50: 5.09 ns     p95: 5.11 ns    p99: 5.11 ns
  
   CPU: 2 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.58 ns    p99: 9.07 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
  
   CPU: 3 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 7.42 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
  
   CPU: 4 - Latency Percentiles:
  ====================
  LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
  LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.47 ns
  
   CPU: 5 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 6 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.42 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 7 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 8 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
  LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
  
   CPU: 9 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.05 ns     p95: 7.06 ns    p99: 7.07 ns
  LSE  :   p50: 2.96 ns     p95: 2.97 ns    p99: 2.97 ns
  
   CPU: 10 - Latency Percentiles:
  ====================
  LL/SC:   p50: 7.05 ns     p95: 7.05 ns    p99: 7.06 ns
  LSE  :   p50: 2.96 ns     p95: 2.96 ns    p99: 2.97 ns
  
   CPU: 11 - Latency Percentiles:
  ====================
  LL/SC:   p50: 6.56 ns     p95: 6.56 ns    p99: 6.57 ns
  LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.76 ns

(cores 0,5-11 are A720, cores 1-4 are A520). I'd just like a
confirmation that my change is correct and that I'm not just doing
something that gets ignored or merely adds zero :-/

If that's OK, then it's indeed way better!

Willy

PS: thanks Breno for sharing your test code, that's super useful!


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 18:08     ` Willy Tarreau
@ 2025-11-04 18:22       ` Breno Leitao
  2025-11-04 20:13       ` Paul E. McKenney
  1 sibling, 0 replies; 46+ messages in thread
From: Breno Leitao @ 2025-11-04 18:22 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Catalin Marinas, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel, kernel-team, rmikey

On Tue, Nov 04, 2025 at 07:08:19PM +0100, Willy Tarreau wrote:
> Hello Breno,
> 
> On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> > I found that the LSE case (__percpu_add_case_64_lse) has huge variation,
> > while the LL/SC case is stable.
> > In some cases, the LSE function runs at the same latency as the LL/SC
> > function, or slightly faster at p50, but then something happens on the
> > system and LSE operations start to take far longer than LL/SC.
> > 
> > Here is some interesting output showing the latency of the functions above:
> > 
> > 	CPU: 47 - Latency Percentiles:
> > 	====================
> > 	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
> > 	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns
> (...)
> 
> Very interesting. I've run them here on a 80-core Ampere Altra made
> of Neoverse-N1 (armv8.2) and am getting very consistently better timings
> with LSE than LL/SC:

<snip>

> It now gives me much better LSE performance on the ARMv9:

I also see stable latency for ldadd in my test case, and it's better than LL/SC.

	CPU: 0 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.74 ns      p95: 5.81 ns      p99: 7.13 ns
	LSE  :   p50: 4.34 ns      p95: 4.36 ns      p99: 4.40 ns

	CPU: 1 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.74 ns      p95: 5.77 ns      p99: 5.82 ns
	LSE  :   p50: 4.35 ns      p95: 4.37 ns      p99: 4.42 ns

	CPU: 2 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.74 ns      p95: 5.81 ns      p99: 6.76 ns
	LSE  :   p50: 4.35 ns      p95: 4.80 ns      p99: 5.55 ns

	...

	CPU: 71 - Latency Percentiles:
	====================
	LL/SC:   p50: 5.72 ns      p95: 5.75 ns      p99: 5.91 ns
	LSE  :   p50: 4.33 ns      p95: 4.35 ns      p99: 4.38 ns

> PS: thanks Breno for sharing your test code, that's super useful!

Glad you liked it. I tried to narrow down the problem as much as I could, so I
could follow up on the discussion. :-)



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 17:05           ` Catalin Marinas
@ 2025-11-04 18:43             ` Paul E. McKenney
  2025-11-04 20:10               ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-04 18:43 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > index 9abcc8ef3087..e381034324e1 100644
> > > > --- a/arch/arm64/include/asm/percpu.h
> > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > >  	unsigned int loop;						\
> > > >  	u##sz tmp;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > >  	unsigned int loop;						\
> > > >  	u##sz ret;							\
> > > >  									\
> > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > >  	/* LL/SC */							\
> > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > -----------------8<------------------------
> > > 
> > > I will give this a shot, thank you!
> > 
> > Jackpot!!!
> > 
> > This reduces the overhead to 8.427, which is significantly better than
> > the non-LSE value of 9.853.  Still room for improvement, but much
> > better than the 100ns values.
> > 
> > I presume that you will send this up the normal path, but in the meantime,
> > I will pull this in for further local testing, and thank you!
> 
> After an educative discussion with the microarchitects, I think the
> hardware is behaving as intended, it just doesn't always fit the
> software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
> Linux as a STADD instruction (that's LDADD with XZR as destination; i.e.
> no need to return the value read from memory). This is typically
> executed as "far" or posted (unless it hits in the L1 cache) and
> intended for stat updates. At a quick grep, it matches the majority of
> the use-cases in Linux. Most other atomics (those with a return) are
> executed "near", so filling the cache lines (assuming default CPUECTLR
> configuration).

OK...

> For the SRCU case, executing STADD far, especially together with the
> DMB after lock and before unlock, does slow things down. A
> microbenchmark doing this in a loop makes it look a lot worse than it
> would be in practice (saturating the buses down the path to memory).

In this srcu_read_lock_fast_updown() case, there was no DMB.  But for
srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
(The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)

> A quick test to check this theory, if those are the functions you were
> benchmarking (it generates LDADD instead):

Thank you for digging into this!

> ---------------------8<----------------------------------------
> diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> index 42098e0fa0b7..5a6f3999883d 100644
> --- a/include/linux/srcutree.h
> +++ b/include/linux/srcutree.h
> @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
>  	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
>  
>  	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> -		this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> +		this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
>  	else
>  		atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks));  // Y, and implicit RCU reader.
>  	barrier(); /* Avoid leaking the critical section. */
> @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
>  {
>  	barrier();  /* Avoid leaking the critical section. */
>  	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> -		this_cpu_inc(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
> +		this_cpu_inc_return(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
>  	else
>  		atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks));  // Z, and implicit RCU reader.
>  }
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 1ff94b76d91f..c025d9135689 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
>  {
>  	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
>  
> -	this_cpu_inc(scp->srcu_locks.counter);
> +	this_cpu_inc_return(scp->srcu_locks.counter);
>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>  	return __srcu_ptr_to_ctr(ssp, scp);
>  }
> @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
>  void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
>  {
>  	smp_mb(); /* C */  /* Avoid leaking the critical section. */
> -	this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> +	this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
>  }
>  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>  
> ---------------------8<----------------------------------------
> 
> To make things better for the non-fast variants above, we should add
> this_cpu_inc_return_acquire() etc. semantics (strangely,
> this_cpu_inc_return() doesn't have the full barrier semantics of
> atomic_inc_return()).
> 
> I'm not sure about adding the prefetch since most other uses of
> this_cpu_add() are meant for stat updates and there's not much point in
> bringing in a cache line. I think we could add release/acquire variants
> that generate LDADDA/L and maybe a slightly different API for the
> __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
> full barrier semantics to the current _return() variants.

But other architectures might well have this_cpu_inc_return() running
more slowly than this_cpu_inc().  So my thought would be to make a
this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and
this_cpu_inc() elsewhere.
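
Something like the following, to be concrete (this_cpu_inc_srcu() being
a made-up name rather than an existing API):

	/* Sketch: arm64 would prefer the value-returning increment. */
	#ifdef CONFIG_ARM64
	#define this_cpu_inc_srcu(pcp)	this_cpu_inc_return(pcp)
	#else
	#define this_cpu_inc_srcu(pcp)	this_cpu_inc(pcp)
	#endif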

I could imagine this_cpu_inc_local() or some such, but it is not clear
that the added API explosion is yet justified.

Or is there a better way?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 18:43             ` Paul E. McKenney
@ 2025-11-04 20:10               ` Paul E. McKenney
  2025-11-05 15:34                 ` Catalin Marinas
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-04 20:10 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > On Fri, Oct 31, 2025 at 08:25:07PM -0700, Paul E. McKenney wrote:
> > > On Fri, Oct 31, 2025 at 04:38:57PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Oct 31, 2025 at 10:43:35PM +0000, Catalin Marinas wrote:
> > > > > diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> > > > > index 9abcc8ef3087..e381034324e1 100644
> > > > > --- a/arch/arm64/include/asm/percpu.h
> > > > > +++ b/arch/arm64/include/asm/percpu.h
> > > > > @@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
> > > > >  	unsigned int loop;						\
> > > > >  	u##sz tmp;							\
> > > > >  									\
> > > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > > >  	/* LL/SC */							\
> > > > >  	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
> > > > > @@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
> > > > >  	unsigned int loop;						\
> > > > >  	u##sz ret;							\
> > > > >  									\
> > > > > +	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));
> > > > >  	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
> > > > >  	/* LL/SC */							\
> > > > >  	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
> > > > > -----------------8<------------------------
> > > > 
> > > > I will give this a shot, thank you!
> > > 
> > > Jackpot!!!
> > > 
> > > This reduces the overhead to 8.427, which is significantly better than
> > > the non-LSE value of 9.853.  Still room for improvement, but much
> > > better than the 100ns values.
> > > 
> > > I presume that you will send this up the normal path, but in the meantime,
> > > I will pull this in for further local testing, and thank you!
> > 
> > After an educative discussion with the microarchitects, I think the
> > hardware is behaving as intended, it just doesn't always fit the
> > software use-cases ;). this_cpu_add() etc. (and atomic_add()) end up in
> > Linux as a STADD instruction (that's LDADD with XZR as destination; i.e.
> > no need to return the value read from memory). This is typically
> > executed as "far" or posted (unless it hits in the L1 cache) and
> > intended for stat updates. At a quick grep, it matches the majority of
> > the use-cases in Linux. Most other atomics (those with a return) are
> > executed "near", so filling the cache lines (assuming default CPUECTLR
> > configuration).
> 
> OK...
> 
> > For the SRCU case, executing STADD far, especially together with the
> > DMB after lock and before unlock, does slow things down. A
> > microbenchmark doing this in a loop makes it look a lot worse than it
> > would be in practice (saturating the buses down the path to memory).
> 
> In this srcu_read_lock_fast_updown() case, there was no DMB.  But for
> srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
> 
> > A quick test to check this theory, if those are the functions you were
> > benchmarking (it generates LDADD instead):
> 
> Thank you for digging into this!

And this_cpu_inc_return() does speed things up on my hardware to about
the same extent as did the prefetch instruction, so thank you again.
However, it gets me more than a 4x slowdown on x86, so I cannot make
this change in common code.

So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via
something like this_cpu_inc_srcu(), not for the upcoming merge window but
for the one after that, sticking with my current interrupt-disabling
non-atomic approach in the meantime (which gets me most of the benefit).
Alternatively, would it work for me to put that cache-prefetch instruction
into SRCU for arm64?  My guess is "absolutely not!", but I figured that
I should ask.
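
(Concretely, I am imagining something like this at the start of
__srcu_read_lock_fast(), purely as a sketch:

	#ifdef CONFIG_ARM64
		/* Pull the counters into L1 so the STADD executes "near". */
		asm volatile("prfm pstl1strm, %a0" : : "p" (raw_cpu_ptr(&scp->srcu_locks)));
	#endif

with all the arch-specific ugliness in common code that this implies.)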

But if both of these approaches prove problematic, I might need some
way to distinguish between systems having slow LSE and those that do not.
Or I can stick with disabling interrupts across non-atomic updates.

Thoughts?

							Thanx, Paul

> > ---------------------8<----------------------------------------
> > diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
> > index 42098e0fa0b7..5a6f3999883d 100644
> > --- a/include/linux/srcutree.h
> > +++ b/include/linux/srcutree.h
> > @@ -263,7 +263,7 @@ static inline struct srcu_ctr __percpu notrace *__srcu_read_lock_fast(struct src
> >  	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
> >  
> >  	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> > -		this_cpu_inc(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> > +		this_cpu_inc_return(scp->srcu_locks.counter); // Y, and implicit RCU reader.
> >  	else
> >  		atomic_long_inc(raw_cpu_ptr(&scp->srcu_locks));  // Y, and implicit RCU reader.
> >  	barrier(); /* Avoid leaking the critical section. */
> > @@ -284,7 +284,7 @@ __srcu_read_unlock_fast(struct srcu_struct *ssp, struct srcu_ctr __percpu *scp)
> >  {
> >  	barrier();  /* Avoid leaking the critical section. */
> >  	if (!IS_ENABLED(CONFIG_NEED_SRCU_NMI_SAFE))
> > -		this_cpu_inc(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
> > +		this_cpu_inc_return(scp->srcu_unlocks.counter);  // Z, and implicit RCU reader.
> >  	else
> >  		atomic_long_inc(raw_cpu_ptr(&scp->srcu_unlocks));  // Z, and implicit RCU reader.
> >  }
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 1ff94b76d91f..c025d9135689 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -753,7 +753,7 @@ int __srcu_read_lock(struct srcu_struct *ssp)
> >  {
> >  	struct srcu_ctr __percpu *scp = READ_ONCE(ssp->srcu_ctrp);
> >  
> > -	this_cpu_inc(scp->srcu_locks.counter);
> > +	this_cpu_inc_return(scp->srcu_locks.counter);
> >  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >  	return __srcu_ptr_to_ctr(ssp, scp);
> >  }
> > @@ -767,7 +767,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
> >  void __srcu_read_unlock(struct srcu_struct *ssp, int idx)
> >  {
> >  	smp_mb(); /* C */  /* Avoid leaking the critical section. */
> > -	this_cpu_inc(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> > +	this_cpu_inc_return(__srcu_ctr_to_ptr(ssp, idx)->srcu_unlocks.counter);
> >  }
> >  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >  
> > ---------------------8<----------------------------------------
> > 
> > To make things better for the non-fast variants above, we should add
> > this_cpu_inc_return_acquire() etc. semantics (strangely,
> > this_cpu_inc_return() doesn't have the full barrier semantics of
> > atomic_inc_return()).
> > 
> > I'm not sure about adding the prefetch since most other uses of
> > this_cpu_add() are meant for stat updates and there's not much point in
> > bringing in a cache line. I think we could add release/acquire variants
> > that generate LDADDA/L and maybe a slightly different API for the
> > __srcu_*_fast() - or use a new this_cpu_add_return_relaxed() if we add
> > full barrier semantics to the current _return() variants.
> 
> But other architectures might well have this_cpu_inc_return() running
> more slowly than this_cpu_inc().  So my thought would be to make a
> this_cpu_inc_srcu() that mapped to this_cpu_inc_return() on arm64 and
> this_cpu_inc() elsewhere.
> 
> I could imagine this_cpu_inc_local() or some such, but it is not clear
> that the added API explosion is yet justified.
> 
> Or is there a better way?
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 18:08     ` Willy Tarreau
  2025-11-04 18:22       ` Breno Leitao
@ 2025-11-04 20:13       ` Paul E. McKenney
  2025-11-04 20:35         ` Willy Tarreau
  1 sibling, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-04 20:13 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Breno Leitao, Catalin Marinas, Will Deacon, Mark Rutland,
	linux-arm-kernel, kernel-team, rmikey

On Tue, Nov 04, 2025 at 07:08:19PM +0100, Willy Tarreau wrote:
> Hello Breno,
> 
> On Tue, Nov 04, 2025 at 07:59:38AM -0800, Breno Leitao wrote:
> > I found that the LSE case (__percpu_add_case_64_lse) has huge variation,
> > while the LL/SC case is stable.
> > In some cases, the LSE function runs at the same latency as the LL/SC
> > function, or slightly faster at p50, but then something happens on the
> > system and LSE operations start to take far longer than LL/SC.
> > 
> > Here is some interesting output showing the latency of the functions above:
> > 
> > 	CPU: 47 - Latency Percentiles:
> > 	====================
> > 	LL/SC:   p50: 5.69 ns      p95: 5.71 ns      p99: 5.80 ns
> > 	LSE  :   p50: 45.53 ns     p95: 54.06 ns     p99: 55.18 ns
> (...)

Thank you very much for the detailed testing on a variety of hardware
platforms!!!

> Very interesting. I've run them here on a 80-core Ampere Altra made
> of Neoverse-N1 (armv8.2) and am getting very consistently better timings
> with LSE than LL/SC:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
>   LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
>   LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.03 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.32 ns     p95: 7.32 ns    p99: 7.33 ns
>   LSE  :   p50: 5.01 ns     p95: 5.01 ns    p99: 5.02 ns
>   (...)
> 
> They're *all* like this, between 7.32 and 7.36 for LL/SC p99,
> and 5.01 to 5.03 for LSE p99.
> 
> However, on a CIX-P1 (armv9.2, 8xA720 + 4xA520), it's what you've
> observed, i.e. a lot of variations that do not even depend on big
> vs little cores:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.56 ns     p95: 7.13 ns    p99: 8.81 ns
>   LSE  :   p50: 45.79 ns    p95: 45.80 ns   p99: 45.86 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
>   LSE  :   p50: 67.72 ns    p95: 67.78 ns   p99: 67.80 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
>   LSE  :   p50: 59.19 ns    p95: 59.23 ns   p99: 59.25 ns
>   (...)
> 
> I tried the same on a RK3588 which has 4 Cortex A55 and 4 Cortex A76
> (the latter being very close to Neoverse-N1), and the A76 (the 4 latest
> ones) show the same pattern as the Altra above and are consistently much
> better than the LL/SC one:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.41 ns
>   LSE  :   p50: 4.43 ns     p95: 28.60 ns   p99: 30.29 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 9.59 ns
>   LSE  :   p50: 4.42 ns     p95: 27.51 ns   p99: 29.46 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.40 ns     p95: 9.40 ns    p99: 9.40 ns
>   LSE  :   p50: 4.42 ns     p95: 27.00 ns   p99: 29.60 ns
>   
>    CPU: 3 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 9.39 ns     p95: 9.40 ns    p99: 10.43 ns
>   LSE  :   p50: 8.02 ns     p95: 29.72 ns   p99: 31.05 ns
>   
>    CPU: 4 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.85 ns     p95: 8.86 ns    p99: 8.86 ns
>   LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 5.75 ns
>   
>    CPU: 5 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.85 ns     p95: 8.85 ns    p99: 9.28 ns
>   LSE  :   p50: 5.75 ns     p95: 5.75 ns    p99: 8.29 ns
>   
>    CPU: 6 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.79 ns     p95: 8.80 ns    p99: 8.80 ns
>   LSE  :   p50: 5.71 ns     p95: 5.71 ns    p99: 5.71 ns
>   
>    CPU: 7 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.80 ns     p95: 8.80 ns    p99: 9.30 ns
>   LSE  :   p50: 5.71 ns     p95: 5.72 ns    p99: 5.72 ns
> 
> Finally, on a Qualcomm QC6490 with 4xA55 + 4xA78, I'm getting something
> between the two (and the governor is in performance mode):
> 
>  ./percpu_bench 
> ARM64 Per-CPU Atomic Add Benchmark
> ===================================
> Running percentile measurements (100 iterations)...
> Detected 8 CPUs
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.28 ns
>   LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 19.48 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.24 ns    p99: 8.26 ns
>   LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 16.30 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.25 ns
>   LSE  :   p50: 4.63 ns     p95: 4.64 ns    p99: 4.65 ns
>   
>    CPU: 3 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 8.23 ns     p95: 8.25 ns    p99: 8.36 ns
>   LSE  :   p50: 4.63 ns     p95: 19.01 ns   p99: 32.15 ns
>   
>    CPU: 4 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
>   LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
>   
>    CPU: 5 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.29 ns
>   LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.44 ns
>   
>    CPU: 6 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.27 ns     p95: 6.28 ns    p99: 6.28 ns
>   LSE  :   p50: 5.44 ns     p95: 5.44 ns    p99: 5.45 ns
>   
>    CPU: 7 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.58 ns
>   LSE  :   p50: 4.82 ns     p95: 4.82 ns    p99: 4.83 ns
> 
> So at first glance it seems that LL/SC is generally slower but can be
> more consistent on modern machines, while LSE is stable on older
> machines and can sometimes be stable even on modern ones.

I guess that I am glad that I am not alone?  ;-)

I am guessing that there is no reasonable way to check for whether a
given system has slow LSE, as would be needed to use ALTERNATIVE(),
but please let me know if I am mistaken.

							Thanx, Paul

> @Catalin, I *tried* to do the ldadd test but I wasn't sure what to put in
> the Xt register (to be honest I've never understood Arm's docs regarding
> instructions, even the pseudo language is super cryptic to me), and I came
> up with this:
> 
>         asm volatile(
>                 /* LSE atomics */
>                 "    ldadd    %[val], %[out], %[ptr]\n"
>                 : [ptr] "+Q"(*(u64 *)ptr), [out] "+r" (val)
>                 : [val] "r"((u64)(val))
>                 : "memory");
> 
> which assembles like this:
> 
>  ab8:   f8200040        ldadd   x0, x0, [x2]
> 
> It now gives me much better LSE performance on the ARMv9:
> 
>    CPU: 0 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.56 ns     p95: 7.32 ns    p99: 8.72 ns
>   LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.77 ns
>   
>    CPU: 1 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.38 ns     p95: 6.39 ns    p99: 6.39 ns
>   LSE  :   p50: 5.09 ns     p95: 5.11 ns    p99: 5.11 ns
>   
>    CPU: 2 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.58 ns    p99: 9.07 ns
>   LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
>   
>    CPU: 3 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 7.42 ns
>   LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.46 ns
>   
>    CPU: 4 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 5.56 ns     p95: 5.57 ns    p99: 5.60 ns
>   LSE  :   p50: 4.45 ns     p95: 4.46 ns    p99: 4.47 ns
>   
>    CPU: 5 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 6 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.42 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 7 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 8 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.40 ns     p95: 7.40 ns    p99: 7.40 ns
>   LSE  :   p50: 3.08 ns     p95: 3.08 ns    p99: 3.08 ns
>   
>    CPU: 9 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.05 ns     p95: 7.06 ns    p99: 7.07 ns
>   LSE  :   p50: 2.96 ns     p95: 2.97 ns    p99: 2.97 ns
>   
>    CPU: 10 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 7.05 ns     p95: 7.05 ns    p99: 7.06 ns
>   LSE  :   p50: 2.96 ns     p95: 2.96 ns    p99: 2.97 ns
>   
>    CPU: 11 - Latency Percentiles:
>   ====================
>   LL/SC:   p50: 6.56 ns     p95: 6.56 ns    p99: 6.57 ns
>   LSE  :   p50: 2.76 ns     p95: 2.76 ns    p99: 2.76 ns
> 
> (cores 0,5-11 are A720, cores 1-4 are A520). I'd just like a
> confirmation that my change is correct and that I'm not just doing
> something that gets ignored or merely adds zero :-/
> 
> If that's OK, then it's indeed way better!
> 
> Willy
> 
> PS: thanks Breno for sharing your test code, that's super useful!


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 20:13       ` Paul E. McKenney
@ 2025-11-04 20:35         ` Willy Tarreau
  2025-11-04 21:25           ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-04 20:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Breno Leitao, Catalin Marinas, Will Deacon, Mark Rutland,
	linux-arm-kernel, kernel-team, rmikey

On Tue, Nov 04, 2025 at 12:13:53PM -0800, Paul E. McKenney wrote:
> > So at first glance it seems that LL/SC is generally slower but can be
> > more consistent on modern machines, while LSE is stable on older
> > machines and can sometimes be stable even on modern ones.
> 
> I guess that I am glad that I am not alone?  ;-)
> 
> I am guessing that there is no reasonable way to check for whether a
> given system has slow LSE, as would be needed to use ALTERNATIVE(),
> but please let me know if I am mistaken.

I don't know either, and we've only tested additions (for which ldadd
seems to do a better job than stadd for local values). I have no idea
what happens with a CAS for example, that could be useful to set a max
value for a metric and which can be quite inefficient using LL/SC,
especially if the absolute value is stored in the same cache line as
the max since every thread touching it would probably invalidate the
update attempt. With a SWP instruction I don't see how it would be
handled directly in SLC, since we need to know the previous value,
hence load it into L1 (and hope nobody changes it between the load
and the write attempt). But overall there seems to be a lot of
unexplored possibilities here which I find quite interesting!
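
For the record, the max-update pattern I have in mind is something like
this (a sketch using the GCC __atomic builtins, which should compile to
CAS when LSE is enabled):

	static inline void store_max(unsigned long *max, unsigned long v)
	{
		unsigned long cur = __atomic_load_n(max, __ATOMIC_RELAXED);

		/*
		 * Retry until *max is already >= v or our CAS succeeds;
		 * on failure the builtin reloads the current value into
		 * cur before we loop.
		 */
		while (cur < v &&
		       !__atomic_compare_exchange_n(max, &cur, v, false,
						    __ATOMIC_RELAXED,
						    __ATOMIC_RELAXED))
			;
	}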

Willy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 15:59   ` Breno Leitao
  2025-11-04 17:06     ` Catalin Marinas
  2025-11-04 18:08     ` Willy Tarreau
@ 2025-11-04 20:57     ` Puranjay Mohan
  2025-11-27 12:29     ` Wentao Guan
  3 siblings, 0 replies; 46+ messages in thread
From: Puranjay Mohan @ 2025-11-04 20:57 UTC (permalink / raw)
  To: leitao
  Cc: catalin.marinas, kernel-team, linux-arm-kernel, mark.rutland,
	paulmck, rmikey, will, Puranjay Mohan

Hi Breno,

I tried your benchmark on AWS graviton platforms:

On EC2 c8g.metal-24xl (96 cpus Neoverse-V2) (AWS Graviton 4):

With ldadd, it was stable and LSE was always better than LL/SC.

But with stadd, I saw some spikes in p95 and p99:

 CPU: 28 - Latency Percentiles:
====================
LL/SC:   p50: 6.61 ns     p95: 6.61 ns    p99: 6.62 ns
LSE  :   p50: 4.64 ns     p95: 4.65 ns    p99: 4.65 ns

 CPU: 30 - Latency Percentiles:
====================
LL/SC:   p50: 6.61 ns     p95: 6.61 ns    p99: 6.62 ns
LSE  :   p50: 4.64 ns     p95: 14.24 ns  ***p99: 27.74 ns***


On EC2 m6g.metal (64 cpus Neoverse-N1) (AWS Graviton 2):

Here both stadd and ldadd were stable, and LSE was always better than LL/SC.

with ldadd:

ARM64 Per-CPU Atomic Add Benchmark
===================================
Running percentile measurements (100 iterations)...
Detected 64 CPUs

 CPU: 0 - Latency Percentiles:
====================
LL/SC:   p50: 8.40 ns     p95: 8.40 ns    p99: 8.42 ns
LSE  :   p50: 5.60 ns     p95: 5.60 ns    p99: 5.61 ns

 CPU: 1 - Latency Percentiles:
====================
LL/SC:   p50: 8.40 ns     p95: 8.40 ns    p99: 8.41 ns
LSE  :   p50: 5.60 ns     p95: 5.60 ns    p99: 5.61 ns


[....]

 CPU: 62 - Latency Percentiles:
====================
LL/SC:   p50: 8.40 ns     p95: 8.40 ns    p99: 8.40 ns
LSE  :   p50: 5.60 ns     p95: 5.60 ns    p99: 5.60 ns

 CPU: 63 - Latency Percentiles:
====================
LL/SC:   p50: 8.40 ns     p95: 8.40 ns    p99: 8.41 ns
LSE  :   p50: 5.60 ns     p95: 5.60 ns    p99: 5.60 ns

=== Benchmark Complete ===

With stadd:

ARM64 Per-CPU Atomic Add Benchmark
===================================
Running percentile measurements (100 iterations)...
Detected 64 CPUs

 CPU: 0 - Latency Percentiles:
====================
LL/SC:   p50: 8.00 ns     p95: 8.01 ns    p99: 8.02 ns
LSE  :   p50: 5.20 ns     p95: 5.21 ns    p99: 5.21 ns

 CPU: 1 - Latency Percentiles:
====================
LL/SC:   p50: 8.00 ns     p95: 8.01 ns    p99: 8.01 ns
LSE  :   p50: 5.20 ns     p95: 5.21 ns    p99: 5.22 ns


[.....]

 CPU: 62 - Latency Percentiles:
====================
LL/SC:   p50: 8.00 ns     p95: 8.01 ns    p99: 8.14 ns
LSE  :   p50: 5.20 ns     p95: 5.21 ns    p99: 5.21 ns

 CPU: 63 - Latency Percentiles:
====================
LL/SC:   p50: 8.00 ns     p95: 8.01 ns    p99: 8.01 ns
LSE  :   p50: 5.20 ns     p95: 5.20 ns    p99: 5.20 ns

=== Benchmark Complete ===


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 20:35         ` Willy Tarreau
@ 2025-11-04 21:25           ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-04 21:25 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Breno Leitao, Catalin Marinas, Will Deacon, Mark Rutland,
	linux-arm-kernel, kernel-team, rmikey

On Tue, Nov 04, 2025 at 09:35:48PM +0100, Willy Tarreau wrote:
> On Tue, Nov 04, 2025 at 12:13:53PM -0800, Paul E. McKenney wrote:
> > > So at first glance it seems that LL/SC is generally slower but can be
> > > more consistent on modern machines, while LSE is stable on older
> > > machines and can sometimes be stable even on modern ones.
> > 
> > I guess that I am glad that I am not alone?  ;-)
> > 
> > I am guessing that there is no reasonable way to check for whether a
> > given system has slow LSE, as would be needed to use ALTERNATIVE(),
> > but please let me know if I am mistaken.
> 
> I don't know either, and we've only tested additions (for which ldadd
> seems to do a better job than stadd for local values). I have no idea
> what happens with a CAS, for example, which could be useful to set a max
> value for a metric and which can be quite inefficient using LL/SC,
> especially if the absolute value is stored in the same cache line as
> the max since every thread touching it would probably invalidate the
> update attempt. With a SWP instruction I don't see how it would be
> handled directly in SLC, since we need to know the previous value,
> hence load it into L1 (and hope nobody changes it between the load
> and the write attempt). But overall there seems to be a lot of
> unexplored possibilities here which I find quite interesting!

I must admit that this is a fun one.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-01 11:41             ` Yicong Yang
@ 2025-11-05 13:25               ` Catalin Marinas
  2025-11-05 13:42                 ` Willy Tarreau
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 13:25 UTC (permalink / raw)
  To: Yicong Yang
  Cc: Paul E. McKenney, Will Deacon, Mark Rutland, linux-arm-kernel,
	Willy Tarreau

On Sat, Nov 01, 2025 at 07:41:02PM +0800, Yicong Yang wrote:
> FYI, there's a version that allows a prefetch to be added prior to LSE
> operations via a boot option [1]; if we want to reconsider it this
> way, it's more flexible and can be controlled by the OS without
> touching the system configuration (which may need a firmware update).

I'm against adding boot time options for this. We either add them
permanently if beneficial for most microarchitectures or we get back to
the hardware people to ask for improvements (or, potentially, imp def
configurations like we have on a few of the Arm Ltd implementations).

> But we need to add the prefetch in the per-cpu implementation as you've
> noticed above (it wasn't added since there's no prefetch in the LL/SC
> implementation there; maybe an omission?)

Maybe no-one stressed these enough to notice any difference between LL/SC and
LSE.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 13:25               ` Catalin Marinas
@ 2025-11-05 13:42                 ` Willy Tarreau
  2025-11-05 14:49                   ` Catalin Marinas
  0 siblings, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-05 13:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > But we need to add the prefetch in the per-cpu implementation as you've
> > noticed above (it wasn't added since there's no prefetch in the LL/SC
> > implementation there; maybe an omission?)
> 
> Maybe no-one stressed these enough to notice any difference between LL/SC and
> LSE.

Huh? I can say for certain that LL/SC is a no-go beyond 16 cores,
having faced catastrophic performance with it in haproxy, while with
LSE it continues to scale almost linearly at least up to 64. But that
does not mean that, if some possibilities are within reach to recover
90% of the atomic overhead in the uncontended case, we shouldn't try
to grab them at a reasonable cost!

I'm definitely adding in my todo list to experiment more on this on
various CPUs now ;-)

Willy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 13:42                 ` Willy Tarreau
@ 2025-11-05 14:49                   ` Catalin Marinas
  2025-11-05 16:21                     ` Breno Leitao
  2025-11-06  7:44                     ` Willy Tarreau
  0 siblings, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 14:49 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > But we need to add the prefetch in the per-cpu implementation as you've
> > > noticed above (it wasn't added since there's no prefetch in the LL/SC
> > > implementation there; maybe an omission?)
> > 
> > Maybe no-one stressed these enough to notice any difference between LL/SC and
> > LSE.
> 
> Huh? I can say for certain that LL/SC is a no-go beyond 16 cores,
> having faced catastrophic performance with it in haproxy, while with
> LSE it continues to scale almost linearly at least up to 64.

I was referring only to the this_cpu_add() etc. functions (until Paul
started using them). There definitely have been lots of benchmarks on
the scalability of LL/SC. That's one of the reasons Arm added the LSE
atomics years ago.

> But that does not mean that, if some possibilities are within reach
> to recover 90% of the atomic overhead in the uncontended case, we
> shouldn't try to grab them at a reasonable cost!

I agree. Even for these cases, I don't think the solution is LL/SC but
rather better use of LSE (and better understanding of the hardware
behaviour; feedback here should go both ways).

> I'm definitely adding in my todo list to experiment more on this on
> various CPUs now ;-)

Thanks for the tests so far, very insightful. I think what's still
good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
Breno's microbenchmarks. I suspect LDADD is still better.

FWIW, Neoverse-N1 has an erratum affecting the far atomics and they are
all forced near, so this explains the consistent results you got with
STADD on this CPU. On other CPUs, STADD would likely be executed far
unless it hits in the L1 cache.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 20:10               ` Paul E. McKenney
@ 2025-11-05 15:34                 ` Catalin Marinas
  2025-11-05 16:25                   ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 15:34 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Tue, Nov 04, 2025 at 12:10:36PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > > For the SRCU case, STADD especially together with the DMB after lock and
> > > before unlock, executing it far does slow things down. A microbenchmark
> > > doing this in a loop is a lot worse than it would appear in practice
> > > (saturating buses down the path to memory).
> > 
> > In this srcu_read_lock_fast_updown() case, there was no DMB.  But for
> > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> > (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
> > 
> > > A quick test to check this theory, if that's the functions you were
> > > benchmarking (it generates LDADD instead):
> > 
> > Thank you for digging into this!
> 
> And this_cpu_inc_return() does speed things up on my hardware to about
> the same extent as did the prefetch instruction, so thank you again.
> However, it gets me more than a 4x slowdown on x86, so I cannot make
> this change in common code.

Definitely not suggesting that we use the 'return' variants in the
generic code. More likely change the arm64 code to use them for the
per-CPU atomics.

> So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via
> something like this_cpu_inc_srcu(), but not for the upcoming merge window,
> but the one after that, sticking with my current interrupt-disabling
> non-atomic approach in the meantime (which gets me most of the benefit).
> Alternatively, would it work for me to put that cache-prefetch instruction
> into SRCU for arm64?  My guess is "absolutely not!", but I figured that
> I should ask.

Given that this_cpu_*() are meant for the local CPU, there's less risk
of cache line bouncing between CPUs, so I'm happy to change them to
either use PRFM or LDADD (I think I prefer the latter). This would not
be a generic change for the other atomics, only the per-CPU ones.
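
For the LDADD option, the 64-bit per-CPU add would become something
along these lines (an untested sketch, outside the real percpu.h
macros):

	static inline void __percpu_add_64_ldadd(void *ptr, unsigned long val)
	{
		unsigned long old;

		asm volatile(
		/* a destination register makes this LDADD, typically "near" */
		"	ldadd	%[val], %[old], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr), [old] "=&r"(old)
		: [val] "r"((u64)val)
		: "memory");
	}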

> But if both of these approaches proves problematic, I might need some
> way to distinguish between systems having slow LSE and those that do not.

It's not that systems have slow or fast atomics, more like they are slow
or fast for specific use-cases. Their default behaviour may differ and
at least in the Arm Ltd cases, this is configurable. An STADD executed
in the L1 cache (near) may be better for your case and some
microbenchmarks but not necessarily for others. I've heard of results of
database use-cases where STADD executed far is better than LDADD
executed near when the location is shared between multiple CPUs. In
these cases even a PRFM can be problematic as it tends to bring a unique
copy of the cacheline invalidating the others (well, again, microarch
specific).

For the Arm Ltd implementations, I think the behaviour for most of the
(recent) CPUs is that load atomics, CAS, SWP are executed near while the
store atomics far (subject to configuration, errata, interconnect). Arm
should probably provide some guidance here so that other implementers
and software people know how/when to use them.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 14:49                   ` Catalin Marinas
@ 2025-11-05 16:21                     ` Breno Leitao
  2025-11-06  7:44                     ` Willy Tarreau
  1 sibling, 0 replies; 46+ messages in thread
From: Breno Leitao @ 2025-11-05 16:21 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Willy Tarreau, Yicong Yang, Paul E. McKenney, Will Deacon,
	Mark Rutland, linux-arm-kernel, kernel-team, rmikey, palmer

On Wed, Nov 05, 2025 at 02:49:39PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> > On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > > But we need to add the prefetch in the per-cpu implementation as
> > > > you've noticed above (didn't add it since there's no prefetch for
> > > > the LL/SC implementation there, maybe an omission?)
> > > 
> > > Maybe no-one stressed these to notice any difference between LL/SC and
> > > LSE.
> > 
> > Huh? I can say for certain that LL/SC is a no-go beyond 16 cores,
> > having faced catastrophic performance there on haproxy, while with
> > LSE it continues to scale almost linearly at least up to 64.
> 
> I was referring only to the this_cpu_add() etc. functions (until Paul
> started using them). There definitely have been lots of benchmarks on
> the scalability of LL/SC. That's one of the reasons Arm added the LSE
> atomics years ago.
> 
> > But that does
> > not mean that if some possibilities are within reach to recover 90% of
> > the atomic overhead in the uncontended case we shouldn't try to grab it
> > at a reasonable cost!
> 
> I agree. Even for these cases, I don't think the solution is LL/SC but
> rather better use of LSE (and better understanding of the hardware
> behaviour; feedback here should go both ways).
> 
> > I'm definitely adding in my todo list to experiment more on this on
> > various CPUs now ;-)
> 
> Thanks for the tests so far, very insightful. I think what's still
> good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
> Breno's microbenchmarks. I suspect LDADD is still better.

I've hacked my microbenchmark to add these tests Catalin suggested, and
it seems prfm improves the latency variation.

This is what I am measuring now:

	/* LL/SC implementation */
	void __percpu_add_case_64_llsc(void *ptr, unsigned long val)
	{
		u64 tmp;		/* old value read by ldxr */
		unsigned int loop;	/* stxr status flag */

		asm volatile(
		/* LL/SC */
		"1:  ldxr    %[tmp], %[ptr]\n"
		"    add     %[tmp], %[tmp], %[val]\n"
		"    stxr    %w[loop], %[tmp], %[ptr]\n"
		"    cbnz    %w[loop], 1b"
		: [loop] "=&r"(loop), [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using stadd */
	void __percpu_add_case_64_lse(void *ptr, unsigned long val)
	{
		asm volatile(
		/* LSE atomics */
		"    stadd    %[val], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using ldadd */
	void __percpu_add_case_64_ldadd(void *ptr, unsigned long val)
	{
		u64 tmp;		/* value returned by ldadd, discarded */

		asm volatile(
		/* LSE atomics */
		"    ldadd    %[val], %[tmp], %[ptr]\n"
		: [tmp] "=&r"(tmp), [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using PRFM + stadd */
	void __percpu_add_case_64_prfm_stadd(void *ptr, unsigned long val)
	{
		asm volatile(
		/* Prefetch + LSE atomics */
		"    prfm    pstl1keep, %[ptr]\n"
		"    stadd   %[val], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

	/* LSE implementation using PRFM STRM + stadd */
	void __percpu_add_case_64_prfm_strm_stadd(void *ptr, unsigned long val)
	{
		asm volatile(
		/* Prefetch streaming + LSE atomics */
		"    prfm    pstl1strm, %[ptr]\n"
		"    stadd   %[val], %[ptr]\n"
		: [ptr] "+Q"(*(u64 *)ptr)
		: [val] "r"((u64)(val))
		: "memory");
	}

And prfm definitely added some stability to STADD but, in most cases, it is
still a bit behind the regular ldxr/stxr.

	CPU: 0 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.73 ns      p95: 5.90 ns      p99: 7.35 ns
	STADD           :   p50: 65.99 ns     p95: 68.98 ns     p99: 70.13 ns
	LDADD           :   p50: 4.33 ns      p95: 4.34 ns      p99: 4.34 ns
	PRFM_KEEP+STADD :   p50: 7.89 ns      p95: 7.91 ns      p99: 8.82 ns
	PRFM_STRM+STADD :   p50: 7.89 ns      p95: 8.11 ns      p99: 9.76 ns

	CPU: 1 - Latency Percentiles:
	====================
	LL/SC           :   p50: 7.72 ns      p95: 18.00 ns      p99: 31.51 ns
	STADD           :   p50: 103.81 ns    p95: 127.60 ns     p99: 137.12 ns
	LDADD           :   p50: 4.35 ns      p95: 22.46 ns      p99: 25.03 ns
	PRFM_KEEP+STADD :   p50: 7.89 ns      p95: 22.04 ns      p99: 23.66 ns
	PRFM_STRM+STADD :   p50: 7.89 ns      p95: 8.75 ns       p99: 11.10 ns

	CPU: 2 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.73 ns      p95: 6.87 ns      p99: 23.96 ns
	STADD           :   p50: 63.30 ns      p95: 63.33 ns    p99: 63.36 ns
	LDADD           :   p50: 4.34 ns      p95: 4.35 ns      p99: 4.35 ns
	PRFM_KEEP+STADD :   p50: 7.89 ns      p95: 7.90 ns      p99: 7.91 ns
	PRFM_STRM+STADD :   p50: 7.89 ns      p95: 7.90 ns      p99: 7.90 ns

	CPU: 3 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.70 ns      p95: 5.71 ns      p99: 5.72 ns
	STADD           :   p50: 61.94 ns     p95: 62.95 ns     p99: 65.05 ns
	LDADD           :   p50: 4.32 ns      p95: 4.33 ns      p99: 7.28 ns
	PRFM_KEEP+STADD :   p50: 7.86 ns      p95: 7.87 ns      p99: 8.08 ns
	PRFM_STRM+STADD :   p50: 7.86 ns      p95: 7.87 ns      p99: 8.25 ns

	CPU: 4 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.72 ns      p95: 5.73 ns      p99: 5.74 ns
	STADD           :   p50: 62.04 ns     p95: 122.78 ns    p99: 131.43 ns
	LDADD           :   p50: 8.08 ns      p95: 11.70 ns     p99: 14.89 ns
	PRFM_KEEP+STADD :   p50: 13.83 ns     p95: 20.70 ns     p99: 22.54 ns
	PRFM_STRM+STADD :   p50: 12.80 ns     p95: 19.42 ns     p99: 20.36 ns

	CPU: 5 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.68 ns      p95: 5.70 ns      p99: 5.70 ns
	STADD           :   p50: 59.30 ns     p95: 60.52 ns     p99: 66.53 ns
	LDADD           :   p50: 4.30 ns      p95: 4.31 ns      p99: 4.32 ns
	PRFM_KEEP+STADD :   p50: 7.84 ns      p95: 7.85 ns      p99: 7.85 ns
	PRFM_STRM+STADD :   p50: 7.84 ns      p95: 7.85 ns      p99: 7.85 ns

	CPU: 6 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.70 ns      p95: 5.71 ns      p99: 5.72 ns
	STADD           :   p50: 59.37 ns     p95: 59.41 ns     p99: 59.42 ns
	LDADD           :   p50: 4.32 ns      p95: 4.32 ns      p99: 4.34 ns
	PRFM_KEEP+STADD :   p50: 7.85 ns      p95: 7.86 ns      p99: 7.88 ns
	PRFM_STRM+STADD :   p50: 7.85 ns      p95: 7.86 ns      p99: 7.86 ns

	CPU: 7 - Latency Percentiles:
	====================
	LL/SC           :   p50: 5.72 ns      p95: 5.74 ns      p99: 6.90 ns
	STADD           :   p50: 64.46 ns     p95: 74.34 ns     p99: 77.47 ns
	LDADD           :   p50: 4.35 ns      p95: 7.50 ns      p99: 10.06 ns
	PRFM_KEEP+STADD :   p50: 8.92 ns      p95: 14.34 ns     p99: 17.31 ns
	PRFM_STRM+STADD :   p50: 8.88 ns      p95: 13.74 ns     p99: 15.11 ns

As always, the code can be found at
https://github.com/leitao/debug/tree/main/LSE
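
In case it helps with interpreting the numbers: each sample is taken
with the generic timer, roughly as below (a simplified sketch, not the
exact harness; helper names are made up):

	#include <stdint.h>

	/* Read the virtual counter; the ISB keeps the read from drifting
	 * across the operation being timed. */
	static inline uint64_t rdcnt(void)
	{
		uint64_t t;

		__asm__ volatile ("isb; mrs %0, cntvct_el0" : "=r" (t) :: "memory");
		return t;
	}

	/* Counter frequency in Hz, used to convert ticks to ns. */
	static inline uint64_t rdfrq(void)
	{
		uint64_t f;

		__asm__ volatile ("mrs %0, cntfrq_el0" : "=r" (f));
		return f;
	}

One sample is then the delta of rdcnt() around the operation, scaled by
1e9 / rdfrq().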




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 15:34                 ` Catalin Marinas
@ 2025-11-05 16:25                   ` Paul E. McKenney
  2025-11-05 17:15                     ` Catalin Marinas
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-05 16:25 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> On Tue, Nov 04, 2025 at 12:10:36PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 04, 2025 at 10:43:02AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 04, 2025 at 05:05:02PM +0000, Catalin Marinas wrote:
> > > > For the SRCU case, STADD especially together with the DMB after lock and
> > > > before unlock, executing it far does slow things down. A microbenchmark
> > > > doing this in a loop is a lot worse than it would appear in practice
> > > > (saturating buses down the path to memory).
> > > 
> > > In this srcu_read_lock_fast_updown() case, there was no DMB.  But for
> > > srcu_read_lock() and srcu_read_lock_nmisafe(), yes, there would be a DMB.
> > > (The srcu_read_lock_fast_updown() is new and is in my -rcu tree.)
> > > 
> > > > A quick test to check this theory, if that's the functions you were
> > > > benchmarking (it generates LDADD instead):
> > > 
> > > Thank you for digging into this!
> > 
> > And this_cpu_inc_return() does speed things up on my hardware to about
> > the same extent as did the prefetch instruction, so thank you again.
> > However, it gets me more than a 4x slowdown on x86, so I cannot make
> > this change in common code.
> 
> Definitely not suggesting that we use the 'return' variants in the
> generic code. More likely change the arm64 code to use them for the
> per-CPU atomics.

Whew!!!  ;-)

> > So, my thought is to push arm64-only this_cpu_inc_return() into SRCU via
> > something like this_cpu_inc_srcu(), not for the upcoming merge window
> > but for the one after that, sticking with my current interrupt-disabling
> > non-atomic approach in the meantime (which gets me most of the benefit).
> > Alternatively, would it work for me to put that cache-prefetch instruction
> > into SRCU for arm64?  My guess is "absolutely not!", but I figured that
> > I should ask.
> 
> Given that this_cpu_*() are meant for the local CPU, there's less risk
> of cache line bouncing between CPUs, so I'm happy to change them to
> either use PRFM or LDADD (I think I prefer the latter). This would not
> be a generic change for the other atomics, only the per-CPU ones.

I have easy access to only the one type of ARM system, and of course
the choice must be driven by a wide range of systems.  But yes, it
would be much better if we can just use this_cpu_inc().  I will use the
non-atomics protected by interrupt disabling in the meantime, but look
forward to being able to switch back.
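
For concreteness, the interim approach is shaped like this (a sketch
only; the helper name is made up, and this is interrupt-safe but not
NMI-safe):

	#include <linux/irqflags.h>
	#include <linux/percpu.h>

	static inline void srcu_fast_ctr_inc(unsigned long __percpu *ctr)
	{
		unsigned long flags;

		local_irq_save(flags);
		__this_cpu_inc(*ctr);	/* plain load/add/store, no LSE atomic */
		local_irq_restore(flags);
	}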

> > But if both of these approaches prove problematic, I might need some
> > way to distinguish between systems having slow LSE and those that do not.
> 
> It's not that systems have slow or fast atomics, more like they are slow
> or fast for specific use-cases. Their default behaviour may differ and
> at least in the Arm Ltd cases, this is configurable. An STADD executed
> in the L1 cache (near) may be better for your case and some
> microbenchmarks but not necessarily for others. I've heard of results of
> database use-cases where STADD executed far is better than LDADD
> executed near when the location is shared between multiple CPUs. In
> these cases even a PRFM can be problematic as it tends to bring a unique
> copy of the cacheline invalidating the others (well, again, microarch
> specific).

Fair point, and I do need to be careful not to read too much into the
results from my one type of system.  Plus, to your point elsewhere in
this thread, making the hardware better would be quite welcome as well.

> For the Arm Ltd implementations, I think the behaviour for most of the
> (recent) CPUs is that load atomics, CAS, SWP are executed near while the
> store atomics far (subject to configuration, errata, interconnect). Arm
> should probably provide some guidance here so that other implementers
> and software people know how/when to use them.

Or make the hardware figure out what to do automatically for each use
case as it executes.  Perhaps a bit utopian, but it is nevertheless a
good direction to aim for.

							Thanx, Paul

> -- 
> Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 16:25                   ` Paul E. McKenney
@ 2025-11-05 17:15                     ` Catalin Marinas
  2025-11-05 17:40                       ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 17:15 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> > Given that this_cpu_*() are meant for the local CPU, there's less risk
> > of cache line bouncing between CPUs, so I'm happy to change them to
> > either use PRFM or LDADD (I think I prefer the latter). This would not
> > be a generic change for the other atomics, only the per-CPU ones.
> 
> I have easy access to only the one type of ARM system, and of course
> the choice must be driven by a wide range of systems.  But yes, it
> would be much better if we can just use this_cpu_inc().  I will use the
> non-atomics protected by interrupt disabling in the meantime, but look
> forward to being able to switch back.

BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
or just in a microbenchmark hammering them? From what I understand from
the hardware folk, doing STADD in a loop saturates some queues in the
interconnect and slows down eventually. In normal use, it's just a
posted operation not affecting the subsequent instructions (or at least
that's the theory).

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 17:15                     ` Catalin Marinas
@ 2025-11-05 17:40                       ` Paul E. McKenney
  2025-11-05 19:16                         ` Catalin Marinas
  0 siblings, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-05 17:40 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> > > Given that this_cpu_*() are meant for the local CPU, there's less risk
> > > of cache line bouncing between CPUs, so I'm happy to change them to
> > > either use PRFM or LDADD (I think I prefer the latter). This would not
> > > be a generic change for the other atomics, only the per-CPU ones.
> > 
> > I have easy access to only the one type of ARM system, and of course
> > the choice must be driven by a wide range of systems.  But yes, it
> > would be much better if we can just use this_cpu_inc().  I will use the
> > non-atomics protected by interrupt disabling in the meantime, but look
> > forward to being able to switch back.
> 
> BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
> or just in a microbenchmark hammering them? From what I understand from
> the hardware folk, doing STADD in a loop saturates some queues in the
> interconnect and slows down eventually. In normal use, it's just a
> posted operation not affecting the subsequent instructions (or at least
> that's the theory).

Only in a microbenchmark, and Breno did not find any issues in larger
benchmarks, so good to hear!

Now, some non-arm64 systems deal with it just fine, but perhaps I owe
everyone an apology for the firedrill.

But let me put it this way...  Would you ack an SRCU patch that resulted
in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on
other systems?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 17:40                       ` Paul E. McKenney
@ 2025-11-05 19:16                         ` Catalin Marinas
  2025-11-05 19:47                           ` Paul E. McKenney
  2025-11-05 21:13                           ` Palmer Dabbelt
  0 siblings, 2 replies; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 19:16 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote:
> > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
> > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> > > > Given that this_cpu_*() are meant for the local CPU, there's less risk
> > > > of cache line bouncing between CPUs, so I'm happy to change them to
> > > > either use PRFM or LDADD (I think I prefer the latter). This would not
> > > > be a generic change for the other atomics, only the per-CPU ones.
> > > 
> > > I have easy access to only the one type of ARM system, and of course
> > > the choice must be driven by a wide range of systems.  But yes, it
> > > would be much better if we can just use this_cpu_inc().  I will use the
> > > non-atomics protected by interrupt disabling in the meantime, but look
> > > forward to being able to switch back.
> > 
> > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
> > or just in a microbenchmark hammering them? From what I understand from
> > the hardware folk, doing STADD in a loop saturates some queues in the
> > interconnect and slows down eventually. In normal use, it's just a
> > posted operation not affecting the subsequent instructions (or at least
> > that's the theory).
> 
> Only in a microbenchmark, and Breno did not find any issues in larger
> benchmarks, so good to hear!
> 
> Now, some non-arm64 systems deal with it just fine, but perhaps I owe
> everyone an apology for the firedrill.

That was a useful exercise, I learnt more things about the arm atomics.

> But let me put it this way...  Would you ack an SRCU patch that resulted
> in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on
> other systems?

Only if it's backed by other microbenchmarks showing significant
improvements ;).

I think we should change the percpu atomics, it makes more sense to do
them near, but I'll keep the others as they are. Planning to post a
proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
all day). Something like below but with more comments and a commit log:

------------------------8<--------------------------
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..d4dff4b0cf50 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	"	stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n"		\
 	"	cbnz	%w[loop], 1b",					\
 	/* LSE atomics */						\
-		#op_lse "\t%" #w "[val], %[ptr]\n"			\
+		#op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n"	\
 		__nops(3))						\
 	: [loop] "=&r" (loop), [tmp] "=&r" (tmp),			\
 	  [ptr] "+Q"(*(u##sz *)ptr)					\
@@ -124,9 +124,9 @@ PERCPU_RW_OPS(8)
 PERCPU_RW_OPS(16)
 PERCPU_RW_OPS(32)
 PERCPU_RW_OPS(64)
-PERCPU_OP(add, add, stadd)
-PERCPU_OP(andnot, bic, stclr)
-PERCPU_OP(or, orr, stset)
+PERCPU_OP(add, add, ldadd)
+PERCPU_OP(andnot, bic, ldclr)
+PERCPU_OP(or, orr, ldset)
 PERCPU_RET_OP(add, add, ldadd)
 
 #undef PERCPU_RW_OPS



^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 19:16                         ` Catalin Marinas
@ 2025-11-05 19:47                           ` Paul E. McKenney
  2025-11-05 20:17                             ` Catalin Marinas
  2025-11-05 21:13                           ` Palmer Dabbelt
  1 sibling, 1 reply; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-05 19:47 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 07:16:42PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote:
> > > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
> > > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
> > > > > Given that this_cpu_*() are meant for the local CPU, there's less risk
> > > > > of cache line bouncing between CPUs, so I'm happy to change them to
> > > > > either use PRFM or LDADD (I think I prefer the latter). This would not
> > > > > be a generic change for the other atomics, only the per-CPU ones.
> > > > 
> > > > I have easy access to only the one type of ARM system, and of course
> > > > the choice must be driven by a wide range of systems.  But yes, it
> > > > would be much better if we can just use this_cpu_inc().  I will use the
> > > > non-atomics protected by interrupt disabling in the meantime, but look
> > > > forward to being able to switch back.
> > > 
> > > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
> > > or just in a microbenchmark hammering them? From what I understand from
> > > the hardware folk, doing STADD in a loop saturates some queues in the
> > > interconnect and slows down eventually. In normal use, it's just a
> > > posted operation not affecting the subsequent instructions (or at least
> > > that's the theory).
> > 
> > Only in a microbenchmark, and Breno did not find any issues in larger
> > benchmarks, so good to hear!
> > 
> > Now, some non-arm64 systems deal with it just fine, but perhaps I owe
> > everyone an apology for the firedrill.
> 
> That was a useful exercise, I learnt more things about the arm atomics.

I am glad that it had some good effect.  ;-)

> > But let me put it this way...  Would you ack an SRCU patch that resulted
> > in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on
> > other systems?
> 
> Only if it's backed by other microbenchmarks showing significant
> improvements ;).

Well, it did reduce from about 140ns with SRCU to about 100ns with
SRCU-fast-updown due to removing the full memory barrier, so there
is that.

> I think we should change the percpu atomics, it makes more sense to do
> them near, but I'll keep the others as they are. Planning to post a
> proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
> all day). Something like below but with more comments and a commit log:

I do like fixing this in the arm64 percpu atomics.  However, this gets
me assembler errors, perhaps because I am using an old version of gcc
or perhaps because I am still based off of v6.17-rc1:

	/tmp/ccYlMkU1.s: Assembler messages:
	/tmp/ccYlMkU1.s:9292: Error: invalid addressing mode at operand 2 -- `stadd x2,x4,[x0]'
	/tmp/ccYlMkU1.s:9428: Error: invalid addressing mode at operand 2 -- `stadd x3,x6,[x1]'
	/tmp/ccYlMkU1.s:9299: Error: attempt to move .org backwards
	/tmp/ccYlMkU1.s:9435: Error: attempt to move .org backwards

							Thanx, Paul

> ------------------------8<--------------------------
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..d4dff4b0cf50 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>  	"	stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n"		\
>  	"	cbnz	%w[loop], 1b",					\
>  	/* LSE atomics */						\
> -		#op_lse "\t%" #w "[val], %[ptr]\n"			\
> +		#op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n"	\
>  		__nops(3))						\
>  	: [loop] "=&r" (loop), [tmp] "=&r" (tmp),			\
>  	  [ptr] "+Q"(*(u##sz *)ptr)					\
> @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8)
>  PERCPU_RW_OPS(16)
>  PERCPU_RW_OPS(32)
>  PERCPU_RW_OPS(64)
> -PERCPU_OP(add, add, stadd)
> -PERCPU_OP(andnot, bic, stclr)
> -PERCPU_OP(or, orr, stset)
> +PERCPU_OP(add, add, ldadd)
> +PERCPU_OP(andnot, bic, ldclr)
> +PERCPU_OP(or, orr, ldset)
>  PERCPU_RET_OP(add, add, ldadd)
>  
>  #undef PERCPU_RW_OPS
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 19:47                           ` Paul E. McKenney
@ 2025-11-05 20:17                             ` Catalin Marinas
  2025-11-05 20:45                               ` Paul E. McKenney
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-05 20:17 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 11:47:07AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 05, 2025 at 07:16:42PM +0000, Catalin Marinas wrote:
> > I think we should change the percpu atomics, it makes more sense to do
> > them near, but I'll keep the others as they are. Planning to post a
> > proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
> > all day). Something like below but with more comments and a commit log:
> 
> I do like fixing this in the arm64 percpu atomics.  However, this gets
> me assembler errors, perhaps because I am using an old version of gcc
> or perhaps because I am still based off of v6.17-rc1:
> 
> 	/tmp/ccYlMkU1.s: Assembler messages:
> 	/tmp/ccYlMkU1.s:9292: Error: invalid addressing mode at operand 2 -- `stadd x2,x4,[x0]'
> 	/tmp/ccYlMkU1.s:9428: Error: invalid addressing mode at operand 2 -- `stadd x3,x6,[x1]'
> 	/tmp/ccYlMkU1.s:9299: Error: attempt to move .org backwards
> 	/tmp/ccYlMkU1.s:9435: Error: attempt to move .org backwards

Are you sure it is applied correctly? There shouldn't be any trace of
stadd in asm/percpu.h.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 20:17                             ` Catalin Marinas
@ 2025-11-05 20:45                               ` Paul E. McKenney
  0 siblings, 0 replies; 46+ messages in thread
From: Paul E. McKenney @ 2025-11-05 20:45 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 08:17:25PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 11:47:07AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 05, 2025 at 07:16:42PM +0000, Catalin Marinas wrote:
> > > I think we should change the percpu atomics, it makes more sense to do
> > > them near, but I'll keep the others as they are. Planning to post a
> > > proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
> > > all day). Something like below but with more comments and a commit log:
> > 
> > I do like fixing this in the arm64 percpu atomics.  However, this gets
> > me assembler errors, perhaps because I am using an old version of gcc
> > or perhaps because I am still based off of v6.17-rc1:
> > 
> > 	/tmp/ccYlMkU1.s: Assembler messages:
> > 	/tmp/ccYlMkU1.s:9292: Error: invalid addressing mode at operand 2 -- `stadd x2,x4,[x0]'
> > 	/tmp/ccYlMkU1.s:9428: Error: invalid addressing mode at operand 2 -- `stadd x3,x6,[x1]'
> > 	/tmp/ccYlMkU1.s:9299: Error: attempt to move .org backwards
> > 	/tmp/ccYlMkU1.s:9435: Error: attempt to move .org backwards
> 
> Are you sure it is applied correctly? There shouldn't be any trace of
> stadd in asm/percpu.h.

Right in one, apologies!  And just for the record, it is a bad idea to
apply such a patch while rebasing and being in a C++ forward-progress
discussion.  :-/

This gets us to the usual good-case latency, in this case 8.333ns.

Tested-by: Paul E. McKenney <paulmck@kernel.org>

							Thanx, Paul


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 19:16                         ` Catalin Marinas
  2025-11-05 19:47                           ` Paul E. McKenney
@ 2025-11-05 21:13                           ` Palmer Dabbelt
  2025-11-06 14:00                             ` Catalin Marinas
  1 sibling, 1 reply; 46+ messages in thread
From: Palmer Dabbelt @ 2025-11-05 21:13 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, 05 Nov 2025 11:16:42 PST (-0800), Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 09:40:32AM -0800, Paul E. McKenney wrote:
>> On Wed, Nov 05, 2025 at 05:15:51PM +0000, Catalin Marinas wrote:
>> > On Wed, Nov 05, 2025 at 08:25:51AM -0800, Paul E. McKenney wrote:
>> > > On Wed, Nov 05, 2025 at 03:34:21PM +0000, Catalin Marinas wrote:
>> > > > Given that this_cpu_*() are meant for the local CPU, there's less risk
>> > > > of cache line bouncing between CPUs, so I'm happy to change them to
>> > > > either use PRFM or LDADD (I think I prefer the latter). This would not
>> > > > be a generic change for the other atomics, only the per-CPU ones.
>> > >
>> > > I have easy access to only the one type of ARM system, and of course
>> > > the choice must be driven by a wide range of systems.  But yes, it
>> > > would be much better if we can just use this_cpu_inc().  I will use the
>> > > non-atomics protected by interrupt disabling in the meantime, but look
>> > > forward to being able to switch back.
>> >
>> > BTW, did you find a problem with this_cpu_inc() in normal use with SRCU
>> > or just in a microbenchmark hammering them? From what I understand from
>> > the hardware folk, doing STADD in a loop saturates some queues in the
>> > interconnect and slows down eventually. In normal use, it's just a
>> > posted operation not affecting the subsequent instructions (or at least
>> > that's the theory).
>>
>> Only in a microbenchmark, and Breno did not find any issues in larger
>> benchmarks, so good to hear!

FWIW, I have a proxy workload where enabling ATOMIC_*_FORCE_NEAR is ~1% 
better (at application-level throughput).  It's supposed to be 
representative of real workloads and isn't supposed to have contention, 
but I don't trust these workloads at all so take that with a grain of 
salt...

Looking into this was still on my TODO list; I was planning on doing 
it all internally as a tuning thing, but LMK if folks think it's 
interesting and I'll try to find some way to talk about it publicly.

>> Now, some non-arm64 systems deal with it just fine, but perhaps I owe
>> everyone an apology for the firedrill.
>
> That was a useful exercise, I learnt more things about the arm atomics.
>
>> But let me put it this way...  Would you ack an SRCU patch that resulted
>> in 100ns microbenchmark numbers on arm64 compared to <2ns numbers on
>> other systems?
>
> Only if it's backed by other microbenchmarks showing significant
> improvements ;).
>
> I think we should change the percpu atomics, it makes more sense to do
> them near, but I'll keep the others as they are. Planning to post a

I guess I kind of went down a rabbit hole here, but I think I found some 
interesting stuff.  This is all based on some modifications of Breno's 
microbenchmark to add two things:

* A contending thread, which performs the same operation on the same 
  counter in a loop, with operations separated by a variable-counted 
  loop of NOPs.
* Some busy work for the timed thread, which is also just a loop of 
  NOPs.

Those loops look like

    for (d = 0; d < duty; d++)
        __asm__ volatile ("nop");

in the code and get compiled to

                            for (d = 0; d < duty; d++)
      41037c:       f90007ff        str     xzr, [sp, #8]
      410380:       14000001        b       410384 <run_core_benchmark+0x74>
      410384:       f94007e8        ldr     x8, [sp, #8]
      410388:       f85e03a9        ldur    x9, [x29, #-32]
      41038c:       eb090108        subs    x8, x8, x9
      410390:       54000102        b.cs    4103b0 <run_core_benchmark+0xa0>  // b.hs, b.nlast
      410394:       14000001        b       410398 <run_core_benchmark+0x88>
                                    __asm__ volatile ("nop");
      410398:       d503201f        nop
      41039c:       14000001        b       4103a0 <run_core_benchmark+0x90>
                            for (d = 0; d < duty; d++)
      4103a0:       f94007e8        ldr     x8, [sp, #8]
      4103a4:       91000508        add     x8, x8, #0x1
      4103a8:       f90007e8        str     x8, [sp, #8]
      4103ac:       17fffff6        b       410384 <run_core_benchmark+0x74>
                    }

which is I guess kind of wacky generated code, but is maybe a reasonable 
proxy for work -- it's got loads/stores/branches, which IIUC is what real 
code does ;)
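
The contending thread is the same idea with the atomic op in the loop,
roughly like this (a sketch with made-up globals, reusing Breno's
__percpu_add_case_64_lse() from earlier in the thread; the real version
is in the PR linked below):

	#include <stdint.h>

	static volatile int stop;	/* set by the timed thread when done */
	static uint64_t counter;	/* the shared counter */

	static void *contender(void *arg)
	{
		unsigned long c = (unsigned long)arg;	/* NOPs between ops */
		unsigned long d;

		while (!stop) {
			__percpu_add_case_64_lse(&counter, 1);
			for (d = 0; d < c; d++)
				__asm__ volatile ("nop");
		}
		return NULL;
	}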

I ran a bunch of cases with those:

 CPU: 0 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 063.65 ns        p95: 065.02 ns          p99: 065.32 ns
LSE (stadd)     (c                0, d              100):   p50: 063.71 ns        p95: 064.96 ns          p99: 065.68 ns
LSE (stadd)     (c                0, d              200):   p50: 068.07 ns        p95: 082.98 ns          p99: 083.24 ns
LSE (stadd)     (c                0, d              300):   p50: 098.96 ns        p95: 121.14 ns          p99: 122.04 ns
LSE (stadd)     (c               10, d                0):   p50: 115.33 ns        p95: 117.25 ns          p99: 117.35 ns
LSE (stadd)     (c               10, d              300):   p50: 115.30 ns        p95: 119.12 ns          p99: 121.68 ns
LSE (stadd)     (c               10, d              500):   p50: 162.94 ns        p95: 185.24 ns          p99: 195.79 ns
LSE (stadd)     (c               30, d                0):   p50: 115.17 ns        p95: 117.14 ns          p99: 117.84 ns
LSE (stadd)     (c              100, d                0):   p50: 115.17 ns        p95: 117.13 ns          p99: 117.35 ns
LSE (stadd)     (c            10000, d                0):   p50: 064.81 ns        p95: 066.24 ns          p99: 067.08 ns
LL/SC           (c                0, d                0):   p50: 005.66 ns        p95: 006.45 ns          p99: 006.47 ns
LL/SC           (c                0, d               10):   p50: 006.19 ns        p95: 006.98 ns          p99: 007.01 ns
LL/SC           (c                0, d               20):   p50: 007.35 ns        p95: 008.88 ns          p99: 009.46 ns
LL/SC           (c               10, d                0):   p50: 164.16 ns        p95: 462.97 ns          p99: 580.92 ns
LL/SC           (c               10, d               10):   p50: 303.22 ns        p95: 575.03 ns          p99: 609.62 ns
LL/SC           (c               10, d               20):   p50: 032.24 ns        p95: 042.03 ns          p99: 048.71 ns
LL/SC           (c             1000, d                0):   p50: 017.37 ns        p95: 018.18 ns          p99: 018.19 ns
LL/SC           (c             1000, d               10):   p50: 019.54 ns        p95: 020.37 ns          p99: 021.79 ns
LL/SC           (c          1000000, d                0):   p50: 015.46 ns        p95: 017.00 ns          p99: 017.25 ns
LL/SC           (c          1000000, d               10):   p50: 017.57 ns        p95: 019.16 ns          p99: 019.47 ns
LDADD           (c                0, d                0):   p50: 004.33 ns        p95: 004.64 ns          p99: 005.13 ns
LDADD           (c                0, d              100):   p50: 032.15 ns        p95: 040.29 ns          p99: 040.69 ns
LDADD           (c                0, d              200):   p50: 067.97 ns        p95: 083.04 ns          p99: 083.30 ns
LDADD           (c                0, d              300):   p50: 098.93 ns        p95: 120.79 ns          p99: 122.52 ns
LDADD           (c                1, d              100):   p50: 049.19 ns        p95: 072.23 ns          p99: 072.38 ns
LDADD           (c                1, d              200):   p50: 143.15 ns        p95: 145.34 ns          p99: 145.90 ns
LDADD           (c                1, d              300):   p50: 153.91 ns        p95: 162.57 ns          p99: 163.84 ns
LDADD           (c               10, d                0):   p50: 012.46 ns        p95: 013.24 ns          p99: 014.33 ns
LDADD           (c               10, d              100):   p50: 049.34 ns        p95: 069.35 ns          p99: 070.71 ns
LDADD           (c               10, d              200):   p50: 141.66 ns        p95: 143.65 ns          p99: 144.31 ns
LDADD           (c               10, d              300):   p50: 152.82 ns        p95: 163.51 ns          p99: 164.03 ns
LDADD           (c              100, d                0):   p50: 012.37 ns        p95: 013.23 ns          p99: 014.52 ns
LDADD           (c              100, d               10):   p50: 014.32 ns        p95: 015.11 ns          p99: 015.15 ns
PRFM_KEEP+STADD (c                0, d                0):   p50: 003.97 ns        p95: 005.23 ns          p99: 005.49 ns
PRFM_KEEP+STADD (c               10, d                0):   p50: 126.02 ns        p95: 127.72 ns          p99: 128.72 ns
PRFM_KEEP+STADD (c          1000000, d                0):   p50: 021.97 ns        p95: 023.93 ns          p99: 024.97 ns
PRFM_KEEP+STADD (c          1000000, d              100):   p50: 076.28 ns        p95: 080.88 ns          p99: 081.50 ns
PRFM_KEEP+STADD (c          1000000, d              200):   p50: 089.62 ns        p95: 091.49 ns          p99: 091.89 ns
PRFM_STRM+STADD (c                0, d                0):   p50: 003.97 ns        p95: 005.23 ns          p99: 005.47 ns
PRFM_STRM+STADD (c               10, d                0):   p50: 126.75 ns        p95: 128.96 ns          p99: 129.48 ns
PRFM_STRM+STADD (c          1000000, d                0):   p50: 021.83 ns        p95: 023.75 ns          p99: 023.96 ns
PRFM_STRM+STADD (c          1000000, d              100):   p50: 074.48 ns        p95: 079.56 ns          p99: 080.73 ns
PRFM_STRM+STADD (c          1000000, d              200):   p50: 089.76 ns        p95: 091.14 ns          p99: 092.46 ns

Which I'm interpreting to say the following:

* LL/SC is pretty good for the common cases, but gets really bad under 
  the pathological cases.  It still seems always slower than LDADD.
* STADD has latency that blocks other STADDs, but not other CPU-local 
  work.  I'd bet there's a bunch of interactions with caches and memory 
  ordering here, but those would all just make STADD look worse so I'm 
  just ignoring them.
* LDADD is better than STADD even under pathologically highly contended 
  cases.  I was actually kind of surprised about this one, I thought the 
  far atomics would be better there.
* The prefetches help STADD, but they don't seem to make it better than 
  LDADD in any case.
* The LDADD latency also happens concurrently with other CPU operations 
  like the STADD latency does.  It has less latency to hide, so the 
  latency starts to go up with less extra work, but it's never worse 
  than STADD.

So I think at least on this system, LDADD is just always better.

[My code's up in a PR to Breno's repo: 
https://github.com/leitao/debug/pull/2]

> proper patch tomorrow and see if Will NAKs it ;) (I've been in meetings
> all day). Something like below but with more comments and a commit log:
>
> ------------------------8<--------------------------
> diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
> index 9abcc8ef3087..d4dff4b0cf50 100644
> --- a/arch/arm64/include/asm/percpu.h
> +++ b/arch/arm64/include/asm/percpu.h
> @@ -77,7 +77,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
>  	"	stxr" #sfx "\t%w[loop], %" #w "[tmp], %[ptr]\n"		\
>  	"	cbnz	%w[loop], 1b",					\
>  	/* LSE atomics */						\
> -		#op_lse "\t%" #w "[val], %[ptr]\n"			\
> +		#op_lse "\t%" #w "[val], %" #w "[tmp], %[ptr]\n"	\
>  		__nops(3))						\
>  	: [loop] "=&r" (loop), [tmp] "=&r" (tmp),			\
>  	  [ptr] "+Q"(*(u##sz *)ptr)					\
> @@ -124,9 +124,9 @@ PERCPU_RW_OPS(8)
>  PERCPU_RW_OPS(16)
>  PERCPU_RW_OPS(32)
>  PERCPU_RW_OPS(64)
> -PERCPU_OP(add, add, stadd)
> -PERCPU_OP(andnot, bic, stclr)
> -PERCPU_OP(or, orr, stset)
> +PERCPU_OP(add, add, ldadd)
> +PERCPU_OP(andnot, bic, ldclr)
> +PERCPU_OP(or, orr, ldset)
>  PERCPU_RET_OP(add, add, ldadd)
>
>  #undef PERCPU_RW_OPS


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 14:49                   ` Catalin Marinas
  2025-11-05 16:21                     ` Breno Leitao
@ 2025-11-06  7:44                     ` Willy Tarreau
  2025-11-06 13:53                       ` Catalin Marinas
  1 sibling, 1 reply; 46+ messages in thread
From: Willy Tarreau @ 2025-11-06  7:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Wed, Nov 05, 2025 at 02:49:39PM +0000, Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 02:42:31PM +0100, Willy Tarreau wrote:
> > On Wed, Nov 05, 2025 at 01:25:25PM +0000, Catalin Marinas wrote:
> > > > But we need to add the prefetch in the per-cpu implementation as
> > > > you've noticed above (didn't add it since there's no prefetch for
> > > > the LL/SC implementation there, maybe an omission?)
> > > 
> > > Maybe no-one stressed these to notice any difference between LL/SC and
> > > LSE.
> > 
> > Huh? I can say for certain that LL/SC is a no-go beyond 16 cores,
> > having faced catastrophic performance there on haproxy, while with
> > LSE it continues to scale almost linearly at least up to 64.
> 
> I was referring only to the this_cpu_add() etc. functions (until Paul
> started using them).

Ah OK thanks for clarifying!

> There definitely have been lots of benchmarks on
> the scalability of LL/SC. That's one of the reasons Arm added the LSE
> atomics years ago.

Yes that's what I thought, which is why your sentence shocked me in the
first place :-)

> > But that does
> > not mean that if some possibilities are within reach to recover 90% of
> > the atomic overhead in the uncontended case we shouldn't try to grab it
> > at a reasonable cost!
> 
> I agree. Even for these cases, I don't think the solution is LL/SC but
> rather better use of LSE (and better understanding of the hardware
> behaviour; feedback here should go both ways).

I totally agree. I'm happy to have discovered the near vs far distinction
there, which I was not aware of; it will make me think differently in the
future when having to design around shared stuff.

> > I'm definitely adding in my todo list to experiment more on this on
> > various CPUs now ;-)
> 
> Thanks for the tests so far, very insightful. I think what's still
> good to assess is how PRFM+STADD compares to LDADD (without PRFM) in
> Breno's microbenchmarks. I suspect LDADD is still better.

Yep as confirmed with Breno's last test after your message.

> FWIW, Neoverse-N1 has an erratum affecting the far atomics and they are
> all forced near, so this explains the consistent results you got with
> STADD on this CPU. On other CPUs, STADD would likely be executed far
> unless it hits in the L1 cache.

Ah, thanks for letting me know! This indeed explains the difference.
Do you have pointers to some docs suggesting what instructions to use
when you prefer a near or far operation, like here with stadd vs ldadd ?
Also does this mean that with LSE a pure store will always be far unless
prefetched ? Or should we trick stores using stadd mem,0 / ldadd mem,0
to hint a near vs far store for example ? I'm also wondering about CAS,
if there's a way to perform the usual load+CAS sequence exclusively using
far operations to avoid cache lines bouncing in contended environments,
because there are cases where a constant 50-60ns per CAS would be awesome,
or maybe even a CAS that remains far in case of failure or triggers the
prefetch of the line in case of success, for the typical
CAS(ptr, NULL, mine) used to try to own a shared resource.

Thanks,
Willy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-06  7:44                     ` Willy Tarreau
@ 2025-11-06 13:53                       ` Catalin Marinas
  2025-11-06 14:16                         ` Willy Tarreau
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-06 13:53 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Thu, Nov 06, 2025 at 08:44:39AM +0100, Willy Tarreau wrote:
> Do you have pointers to some docs suggesting what instructions to use
> when you prefer a near or far operation, like here with stadd vs ldadd?

Unfortunately, the architecture spec does not make any distinction
between far or near atomics, that's rather a microarchitecture and
system implementation detail. Some of the information is hidden in
specific CPU TRMs and the behaviour may differ between implementations.

I hope Arm will publish some docs/blogs to give some guidance to
software folk (and other non-Arm Ltd microarchitects; it would be good
if they are all aligned, though some may see this as their value-add).

> Also does this mean that with LSE a pure store will always be far unless
> prefetched? Or should we trick stores using stadd mem,0 / ldadd mem,0
> to hint a near vs far store for example?

For the Arm Ltd implementations, _usually_ store-only atomics are
executed far while those returning a value are near. But that's subject
to implementation-defined configurations (e.g. IMP_CPUECTLR_EL1). Also
the hardware may try to be smarter, e.g. detect contention and switch
from one behaviour to another.
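
This is visible from userspace too. With -march=armv8.1-a, recent
GCC/Clang typically pick the instruction based on whether the result is
consumed (relaxed ordering; function names made up):

	/* Result discarded: compilers typically emit STADD. */
	void stat_add(unsigned long *p, unsigned long v)
	{
		__atomic_fetch_add(p, v, __ATOMIC_RELAXED);
	}

	/* Result consumed: compilers emit LDADD. */
	unsigned long stat_add_return(unsigned long *p, unsigned long v)
	{
		return __atomic_fetch_add(p, v, __ATOMIC_RELAXED) + v;
	}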

> I'm also wondering about CAS,
> if there's a way to perform the usual load+CAS sequence exclusively using
> far operations to avoid cache lines bouncing in contended environments,
> because there are cases where a constant 50-60ns per CAS would be awesome,
> or maybe even a CAS that remains far in case of failure or triggers the
> prefetch of the line in case of success, for the typical
> CAS(ptr, NULL, mine) used to try to own a shared resource.

Talking to other engineers in Arm, I learnt that the architecture even
describes a way the programmer can hint at CAS loops. Instead of an LDR,
use something (informally) called ICAS - a CAS where the Xs and Xt
registers are the same (actual registers, not the value they contain).
The in-memory value comparison with Xs either passes and the written
value would be the same (imp def whether a write actually takes place)
or fails (in theory, hw is allowed to write the same old value back). So
while the value in Xs is less relevant, CAS will return the value in
memory. The hardware detects the ICAS+CAS constructs and aims to make
them faster.

From C6.2.50 in the Arm ARM (the CAS description):

  For a CAS or CASA instruction, when <Ws> or <Xs> specifies the same
  register as <Wt> or <Xt>, this signals to the memory system that an
  additional subsequent CAS, CASA, CASAL, or CASL access to the
  specified location is likely to occur in the near future. The memory
  system can respond by taking actions that are expected to enable the
  subsequent CAS, CASA, CASAL, or CASL access to succeed when it does
  occur.

I guess something to add to Breno's microbenchmarks.
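
For Willy's CAS(ptr, NULL, mine) case, the construct might look like the
following (untested sketch, register allocation illustrative; x2 = ptr,
x3 = mine, assumed non-zero):

	1:	mov	x0, xzr			// expect NULL
		cas	x0, x0, [x2]		// ICAS: Xs == Xt, observe + hint
		cbnz	x0, 1b			// still owned; real code would back off
		mov	x1, x3
		casa	x0, x1, [x2]		// x0 is 0 here: try NULL -> mine
		cbnz	x0, 1b			// lost the race, retry
						// owned; release with stlr xzr, [x2]

The idea being that the ICAS, unlike a plain LDR, tells the memory
system to prepare the line so that the CASA that follows is more likely
to succeed.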

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-05 21:13                           ` Palmer Dabbelt
@ 2025-11-06 14:00                             ` Catalin Marinas
  2025-11-06 16:30                               ` Palmer Dabbelt
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-06 14:00 UTC (permalink / raw)
  To: Palmer Dabbelt; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel

On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
> I ran a bunch of cases with those:
[...]
> Which I'm interpreting to say the following:
> 
> * LL/SC is pretty good for the common cases, but gets really bad under  the
> pathological cases.  It still seems always slower than LDADD.
> * STADD has latency that blocks other STADDs, but not other CPU-local  work.
> I'd bet there's a bunch of interactions with caches and memory  ordering
> here, but those would all just make STADD look worse so I'm  just ignoring
> them.
> * LDADD is better than STADD even under pathologically highly contended
> cases.  I was actually kind of surprised about this one, I thought the  far
> atomics would be better there.
> * The prefetches help STADD, but they don't seem to make it better than
> LDADD in any case.
> * The LDADD latency also happens concurrently with other CPU operations
> like the STADD latency does.  It has less latency to hide, so the  latency
> starts to go up with less extra work, but it's never worse  than STADD.
> 
> So I think at least on this system, LDADD is just always better.

Thanks for this, very useful. I guess that's expected in the light of
what I learnt from the other Arm engineers in the past couple of days.

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-06 13:53                       ` Catalin Marinas
@ 2025-11-06 14:16                         ` Willy Tarreau
  0 siblings, 0 replies; 46+ messages in thread
From: Willy Tarreau @ 2025-11-06 14:16 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Yicong Yang, Paul E. McKenney, Will Deacon, Mark Rutland,
	linux-arm-kernel

On Thu, Nov 06, 2025 at 01:53:04PM +0000, Catalin Marinas wrote:
> On Thu, Nov 06, 2025 at 08:44:39AM +0100, Willy Tarreau wrote:
> > Do you have pointers to some docs suggesting what instructions to use
> > when you prefer a near or far operation, like here with stadd vs ldadd?
> 
> Unfortunately, the architecture spec does not make any distinction
> between far or near atomics, that's rather a microarchitecture and
> system implementation detail. Some of the information is hidden in
> specific CPU TRMs and the behaviour may differ between implementations.
> 
> I hope Arm will publish some docs/blogs to give some guidance to
> software folk (and other non-Arm Ltd microarchitects; it would be good
> if they are all aligned, though some may see this as their value-add).

Yes, I can definitely understand that it's never easy to strike a balance
between helping developers get the most out of your arch and keeping
competitors away.

> > Also does this mean that with LSE a pure store will always be far unless
> > prefetched? Or should we trick stores using stadd mem,0 / ldadd mem,0
> > to hint a near vs far store for example?
> 
> For the Arm Ltd implementations, _usually_ store-only atomics are
> executed far while those returning a value are near. But that's subject
> to implementation-defined configurations (e.g. IMP_CPUECTLR_EL1). Also
> the hardware may try to be smarter, e.g. detect contention and switch
> from one behaviour to another.

OK, thanks for the explanation. It makes sense and tends to match what
one could naturally expect.

> > I'm also wondering about CAS,
> > if there's a way to perform the usual load+CAS sequence exclusively using
> > far operations to avoid cache lines bouncing in contended environments,
> > because there are cases where a constant 50-60ns per CAS would be awesome,
> > or maybe even a CAS that remains far in case of failure or triggers the
> > prefetch of the line in case of success, for the typical
> > CAS(ptr, NULL, mine) used to try to own a shared resource.
> 
> Talking to other engineers in Arm, I learnt that the architecture even
> describes a way the programmer can hint at CAS loops. Instead of an LDR,
> use something (informally) called ICAS - a CAS where the Xs and Xt
> registers are the same (actual registers, not the value they contain).
> The in-memory value comparison with Xs either passes and the written
> value would be the same (imp def whether a write actually takes place)
> or fails (in theory, hw is allowed to write the same old value back).

This is super interesting, thanks for sharing!

> So
> while the value in Xs is less relevant, CAS will return the value in
> memory. The hardware detects the ICAS+CAS constructs and aims to make
> them faster.

I had already noticed some x86 models being able to often succeed on the
second CAS attempt, and suspected that they'd force the L1 to hold the
line until the next attempt for this purpose. This could be roughly
similar.

> From C6.2.50 in the Arm ARM (the CAS description):
> 
>   For a CAS or CASA instruction, when <Ws> or <Xs> specifies the same
>   register as <Wt> or <Xt>, this signals to the memory system that an
>   additional subsequent CAS, CASA, CASAL, or CASL access to the
>   specified location is likely to occur in the near future. The memory
>   system can respond by taking actions that are expected to enable the
>   subsequent CAS, CASA, CASAL, or CASL access to succeed when it does
>   occur.
> 
> I guess something to add to Breno's microbenchmarks.

I think so as well. Many thanks again for sharing such precious info!

Willy


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-06 14:00                             ` Catalin Marinas
@ 2025-11-06 16:30                               ` Palmer Dabbelt
  2025-11-06 17:54                                 ` Catalin Marinas
  0 siblings, 1 reply; 46+ messages in thread
From: Palmer Dabbelt @ 2025-11-06 16:30 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel

On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote:
> On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
>> I ran a bunch of cases with those:
> [...]
>> Which I'm interpreting to say the following:
>>
>> * LL/SC is pretty good for the common cases, but gets really bad under  the
>> pathological cases.  It still seems always slower than LDADD.
>> * STADD has latency that blocks other STADDs, but not other CPU-local  work.
>> I'd bet there's a bunch of interactions with caches and memory  ordering
>> here, but those would all just make STADD look worse so I'm  just ignoring
>> them.
>> * LDADD is better than STADD even under pathologically highly contended
>> cases.  I was actually kind of surprised about this one, I thought the  far
>> atomics would be better there.
>> * The prefetches help STADD, but they don't seem to make it better than
>> LDADD in any case.
>> * The LDADD latency also happens concurrently with other CPU operations
>> like the STADD latency does.  It has less latency to hide, so the  latency
>> starts to go up with less extra work, but it's never worse  than STADD.
>>
>> So I think at least on this system, LDADD is just always better.
>
> Thanks for this, very useful. I guess that's expected in the light of
> what I learnt from the other Arm engineers in the past couple of days.

OK, sorry if I misunderstood you earlier.  From reading your posts I 
thought there would be some mode in which STADD was better -- probably 
high contention and enough extra work to hide the latency.  So I was 
kind of surprised to find these results.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-06 16:30                               ` Palmer Dabbelt
@ 2025-11-06 17:54                                 ` Catalin Marinas
  2025-11-06 18:23                                   ` Palmer Dabbelt
  0 siblings, 1 reply; 46+ messages in thread
From: Catalin Marinas @ 2025-11-06 17:54 UTC (permalink / raw)
  To: Palmer Dabbelt; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel

On Thu, Nov 06, 2025 at 08:30:05AM -0800, Palmer Dabbelt wrote:
> On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote:
> > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
> > > I ran a bunch of cases with those:
> > [...]
> > > Which I'm interpreting to say the following:
> > > 
> > > * LL/SC is pretty good for the common cases, but gets really bad under  the
> > > pathological cases.  It still seems always slower than LDADD.
> > > * STADD has latency that blocks other STADDs, but not other CPU-local  work.
> > > I'd bet there's a bunch of interactions with caches and memory  ordering
> > > here, but those would all just make STADD look worse so I'm  just ignoring
> > > them.
> > > * LDADD is better than STADD even under pathologically highly contended
> > > cases.  I was actually kind of surprised about this one, I thought the  far
> > > atomics would be better there.
> > > * The prefetches help STADD, but they don't seem to make it better than
> > > LDADD in any case.
> > > * The LDADD latency also happens concurrently with other CPU operations
> > > like the STADD latency does.  It has less latency to hide, so the  latency
> > > starts to go up with less extra work, but it's never worse  that STADD.
> > > 
> > > So I think at least on this system, LDADD is just always better.
> > 
> > Thanks for this, very useful. I guess that's expected in light of what I
> > learnt from the other Arm engineers in the past couple of days.
> 
> OK, sorry if I misunderstood you earlier.  From reading your posts I thought
> there would be some mode in which STADD was better -- probably high
> contention and enough extra work to hide the latency.  So I was kind of
> surprised to find these results.

I think STADD is better for cases where you update some stat counters
but you do a lot of work in between. In your microbenchmark, just lots
of STADDs back to back with NOPs in between (rather than lots of other
memory transactions) are likely to be slower. If these are real
use-cases, at some point the hardware may evolve to behave differently
(or more dynamically).
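
To make that concrete, here is a minimal sketch (hypothetical names, and
assuming GCC or Clang with -march=armv8.1-a): a relaxed fetch-add whose
result is discarded is lowered to a single STADD, and the surrounding
work is what hides its latency:

    /* Hypothetical stat-counter pattern, not from the benchmark. */
    extern void do_real_work(void);
    static unsigned long stats[16];

    void handle_event(int i)
    {
            /* result unused + relaxed ordering => compilers emit stadd */
            (void)__atomic_fetch_add(&stats[i], 1, __ATOMIC_RELAXED);
            do_real_work();         /* the stadd latency hides in here */
    }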

BTW, Ola Liljedahl @ Arm pointed me at this collection of routines:
https://github.com/ARM-software/progress64/tree/master.
Building it with ATOMICS=yes makes the compiler generate LSE atomics for
intrinsics like __atomic_fetch_add(). It won't generate STADD because of
some aspects of the C consistency models (DMB LD wouldn't guarantee
ordering with a prior STADD).
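
To illustrate that last point (a sketch only, again assuming
-march=armv8.1-a): the ST* atomics have no acquire variants, and a
trailing DMB LD has no load to attach to, so once the ordering is
acquire or stronger the compiler must emit the load form even when the
result is thrown away:

    /* Sketch: relaxed + unused result lowers to stadd, but acquire
     * ordering forces ldadda even though the return value is discarded,
     * because stadd + dmb ishld would not give the required ordering. */
    void count_relaxed(long *p)
    {
            (void)__atomic_fetch_add(p, 1, __ATOMIC_RELAXED);  /* stadd  */
    }

    void count_acquire(long *p)
    {
            (void)__atomic_fetch_add(p, 1, __ATOMIC_ACQUIRE);  /* ldadda */
    }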

-- 
Catalin


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-06 17:54                                 ` Catalin Marinas
@ 2025-11-06 18:23                                   ` Palmer Dabbelt
  0 siblings, 0 replies; 46+ messages in thread
From: Palmer Dabbelt @ 2025-11-06 18:23 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: paulmck, Will Deacon, Mark Rutland, linux-arm-kernel

On Thu, 06 Nov 2025 09:54:31 PST (-0800), Catalin Marinas wrote:
> On Thu, Nov 06, 2025 at 08:30:05AM -0800, Palmer Dabbelt wrote:
>> On Thu, 06 Nov 2025 06:00:59 PST (-0800), Catalin Marinas wrote:
>> > On Wed, Nov 05, 2025 at 01:13:10PM -0800, Palmer Dabbelt wrote:
>> > > I ran a bunch of cases with those:
>> > [...]
>> > > Which I'm interpreting to say the following:
>> > >
>> > > * LL/SC is pretty good for the common cases, but gets really bad under the
>> > > pathological cases.  It still seems always slower than LDADD.
>> > > * STADD has latency that blocks other STADDs, but not other CPU-local work.
>> > > I'd bet there's a bunch of interactions with caches and memory ordering
>> > > here, but those would all just make STADD look worse, so I'm just ignoring
>> > > them.
>> > > * LDADD is better than STADD even under pathologically highly contended
>> > > cases.  I was actually kind of surprised about this one; I thought the far
>> > > atomics would be better there.
>> > > * The prefetches help STADD, but they don't seem to make it better than
>> > > LDADD in any case.
>> > > * The LDADD latency also happens concurrently with other CPU operations,
>> > > like the STADD latency does.  It has less latency to hide, so the latency
>> > > starts to go up with less extra work, but it's never worse than STADD.
>> > >
>> > > So I think at least on this system, LDADD is just always better.
>> >
>> > Thanks for this, very useful. I guess that's expected in light of what I
>> > learnt from the other Arm engineers in the past couple of days.
>>
>> OK, sorry if I misunderstood you earlier.  From reading your posts I thought
>> there would be some mode in which STADD was better -- probably high
>> contention and enough extra work to hide the latency.  So I was kind of
>> surprised to find these results.
>
> I think STADD is better for cases where you update some stat counters
> but you do a lot of work in between. In your microbenchmark, just lots
> of STADDs back to back with NOPs in between (rather than lots of other
> memory transactions) are likely to be slower. If these are real
> use-cases, at some point the hardware may evolve to behave differently
> (or more dynamically).

OK, that's kind of what I was trying to demonstrate when putting
together those new microbenchmark parameters.  So I think I at least
understood what you were saying; now I just need to figure out what's
up...

FWIW: there's actually a bunch of memory traffic; the compiler is doing
something weird with that NOP loop and generating a bunch of
loads/stores/branches.  I was kind of surprised, but I figured it's
actually better that way.

Also, I found there's a bug in the microbenchmarks: "tmp" is a global, 
so the LDADD code generates

    00000000004102b0 <__percpu_add_case_64_ldadd>:
      4102b0:       90000189        adrp    x9, 440000 <memcpy@GLIBC_2.17>
      4102b4:       f8210008        ldadd   x1, x8, [x0]
      4102b8:       f9005528        str     x8, [x9, #168]
      4102bc:       d65f03c0        ret

as opposed to the STADD code, which generates

    00000000004102a8 <__percpu_add_case_64_lse>:
      4102a8:       f821001f        stadd   x1, [x0]
      4102ac:       d65f03c0        ret

It doesn't seem to change my results any, but I figured I'd say something
in case anyone else tries to run this stuff (there's a fix up, too).
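
In case it helps anyone reading along, a minimal reduction of the bug
(hypothetical code, not the actual benchmark source): the global sink is
what forces the extra adrp/str pair after the ldadd, and consuming the
result in a register-only sink keeps the ldadd without the extra store:

    unsigned long tmp;      /* global sink: forces the adrp/str pair */

    void percpu_add_global(unsigned long *p, unsigned long v)
    {
            tmp = __atomic_fetch_add(p, v, __ATOMIC_RELAXED);  /* ldadd + str */
    }

    void percpu_add_local(unsigned long *p, unsigned long v)
    {
            unsigned long t = __atomic_fetch_add(p, v, __ATOMIC_RELAXED);

            /* consume t so the ldadd is kept (not folded into a stadd),
             * but it is never written back to memory */
            __asm__ volatile("" : : "r" (t));
    }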

> BTW, I've been pointed by Ola Liljedahl @ Arm at this collection of
> routines: https://github.com/ARM-software/progress64/tree/master.
> Building it with ATOMICS=yes makes the compiler generate LSE atomics for
> intrinsics like __atomic_fetch_add(). It won't generate STADD because of
> some aspects of the C consistency models (DMB LD wouldn't guarantee
> ordering with a prior STADD).

Awesome, thanks.  I'll go take a look -- I'm trying to understand enough
of what's going on to figure out what we should do here, but that's
mostly outside of kernel space now, so I think it's just going to be a
discussion for somewhere else...


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Overhead of arm64 LSE per-CPU atomics?
  2025-11-04 15:59   ` Breno Leitao
                       ` (2 preceding siblings ...)
  2025-11-04 20:57     ` Puranjay Mohan
@ 2025-11-27 12:29     ` Wentao Guan
  3 siblings, 0 replies; 46+ messages in thread
From: Wentao Guan @ 2025-11-27 12:29 UTC (permalink / raw)
  To: leitao
  Cc: catalin.marinas, kernel-team, linux-arm-kernel, mark.rutland,
	paulmck, rmikey, will

Hello All,

Here are my results on a HUAWEI HUAWEIPGU-WBY0, which has 24 TSV110 cores
(Kunpeng 920).  A little strange: STADD is sometimes very fast, LDADD is
always slow, and LL/SC is always stable...
I think changing the detection of the ARM64_HAS_LSE_ATOMICS capability in
arm64_features is a little dirty; any better ideas?
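
For context (from memory, so kernel version dependent), the reason
flipping that capability is awkward: the LL/SC vs LSE choice is patched
in at boot by the alternatives framework, roughly like the helper in
arch/arm64/include/asm/lse.h:

    /* Sketch from memory: each atomic carries both encodings, and the
     * LSE one is patched in at boot once ARM64_HAS_LSE_ATOMICS is
     * detected, so there is no per-call-site knob to flip afterwards. */
    #define ARM64_LSE_ATOMIC_INSN(llsc, lse)                            \
            ALTERNATIVE(llsc, lse, ARM64_HAS_LSE_ATOMICS)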

Best Regards
Wentao Guan

ARM64 Per-CPU Atomic Add Benchmark
===================================
Running percentile measurements (100 iterations)...
Detected 24 CPUs

 CPU: 0 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 029.00 ns        p95: 029.09 ns          p99: 029.12 ns
LSE (stadd)     (c                0, d              100):   p50: 048.31 ns        p95: 048.74 ns          p99: 048.95 ns
LSE (stadd)     (c                0, d              200):   p50: 086.32 ns        p95: 086.60 ns          p99: 086.75 ns
sched_setaffinity: Invalid argument
LSE (stadd)     (c               10, d                0):   p50: 058.10 ns        p95: 058.32 ns          p99: 058.49 ns
sched_setaffinity: Invalid argument
LSE (stadd)     (c               10, d              300):   p50: 248.03 ns        p95: 248.31 ns          p99: 248.55 ns
sched_setaffinity: Invalid argument
LSE (stadd)     (c               10, d              500):   p50: 402.60 ns        p95: 403.10 ns          p99: 403.24 ns
sched_setaffinity: Invalid argument
LSE (stadd)     (c               30, d                0):   p50: 058.02 ns        p95: 058.31 ns          p99: 058.33 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.51 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.68 ns
sched_setaffinity: Invalid argument
LL/SC           (c               10, d                0):   p50: 013.16 ns        p95: 013.46 ns          p99: 013.47 ns
sched_setaffinity: Invalid argument
LL/SC           (c               10, d               10):   p50: 017.00 ns        p95: 017.20 ns          p99: 017.29 ns
sched_setaffinity: Invalid argument
LL/SC           (c               10, d               20):   p50: 023.37 ns        p95: 023.57 ns          p99: 023.67 ns
sched_setaffinity: Invalid argument
LL/SC           (c             1000, d                0):   p50: 013.16 ns        p95: 013.37 ns          p99: 013.47 ns
sched_setaffinity: Invalid argument
LL/SC           (c             1000, d               10):   p50: 017.00 ns        p95: 017.21 ns          p99: 017.40 ns
sched_setaffinity: Invalid argument
LL/SC           (c          1000000, d                0):   p50: 013.17 ns        p95: 013.37 ns          p99: 013.37 ns
sched_setaffinity: Invalid argument
LL/SC           (c          1000000, d               10):   p50: 017.00 ns        p95: 017.20 ns          p99: 017.30 ns
LDADD           (c                0, d                0):   p50: 069.55 ns        p95: 069.57 ns          p99: 069.71 ns
LDADD           (c                0, d              100):   p50: 107.71 ns        p95: 108.11 ns          p99: 108.20 ns
LDADD           (c                0, d              200):   p50: 152.85 ns        p95: 152.91 ns          p99: 152.93 ns
LDADD           (c                0, d              300):   p50: 193.50 ns        p95: 193.54 ns          p99: 193.62 ns
sched_setaffinity: Invalid argument
LDADD           (c                1, d               10):   p50: 139.04 ns        p95: 139.34 ns          p99: 139.43 ns
sched_setaffinity: Invalid argument
LDADD           (c               10, d                0):   p50: 139.04 ns        p95: 139.44 ns          p99: 139.68 ns
sched_setaffinity: Invalid argument
LDADD           (c               10, d               10):   p50: 139.04 ns        p95: 139.33 ns          p99: 139.34 ns
sched_setaffinity: Invalid argument
LDADD           (c              100, d                0):   p50: 139.04 ns        p95: 139.40 ns          p99: 139.43 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.80 ns          p99: 006.06 ns
sched_setaffinity: Invalid argument
PFRM_KEEP+STADD (c               10, d                0):   p50: 011.59 ns        p95: 011.89 ns          p99: 011.99 ns
sched_setaffinity: Invalid argument
PFRM_KEEP+STADD (c             1000, d                0):   p50: 011.59 ns        p95: 011.79 ns          p99: 011.89 ns
sched_setaffinity: Invalid argument
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 011.59 ns        p95: 011.79 ns          p99: 011.80 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.80 ns          p99: 007.82 ns
sched_setaffinity: Invalid argument
PFRM_STRM+STADD (c               10, d                0):   p50: 011.59 ns        p95: 011.80 ns          p99: 011.89 ns
sched_setaffinity: Invalid argument
PFRM_STRM+STADD (c             1000, d                0):   p50: 011.59 ns        p95: 011.79 ns          p99: 011.89 ns
sched_setaffinity: Invalid argument
PFRM_STRM+STADD (c          1000000, d                0):   p50: 011.59 ns        p95: 011.80 ns          p99: 013.47 ns

 CPU: 1 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 029.00 ns          p99: 029.01 ns
LSE (stadd)     (c                0, d              100):   p50: 048.53 ns        p95: 048.96 ns          p99: 049.05 ns
LSE (stadd)     (c                0, d              200):   p50: 086.26 ns        p95: 087.25 ns          p99: 087.37 ns
LSE (stadd)     (c               10, d                0):   p50: 038.29 ns        p95: 038.31 ns          p99: 038.52 ns
LSE (stadd)     (c               10, d              300):   p50: 123.81 ns        p95: 123.92 ns          p99: 124.56 ns
LSE (stadd)     (c               10, d              500):   p50: 201.16 ns        p95: 201.20 ns          p99: 201.22 ns
LSE (stadd)     (c               30, d                0):   p50: 038.30 ns        p95: 038.31 ns          p99: 038.32 ns
LL/SC           (c                0, d                0):   p50: 006.56 ns        p95: 006.58 ns          p99: 006.58 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.51 ns          p99: 008.51 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.68 ns          p99: 011.68 ns
LL/SC           (c               10, d                0):   p50: 012.97 ns        p95: 013.04 ns          p99: 013.06 ns
LL/SC           (c               10, d               10):   p50: 020.93 ns        p95: 021.06 ns          p99: 021.16 ns
LL/SC           (c               10, d               20):   p50: 051.95 ns        p95: 064.63 ns          p99: 076.34 ns
LL/SC           (c             1000, d                0):   p50: 012.81 ns        p95: 012.83 ns          p99: 012.84 ns
LL/SC           (c             1000, d               10):   p50: 020.72 ns        p95: 020.73 ns          p99: 020.74 ns
LL/SC           (c          1000000, d                0):   p50: 008.65 ns        p95: 009.03 ns          p99: 009.11 ns
LL/SC           (c          1000000, d               10):   p50: 012.03 ns        p95: 012.74 ns          p99: 013.09 ns
LDADD           (c                0, d                0):   p50: 010.04 ns        p95: 010.06 ns          p99: 010.06 ns
LDADD           (c                0, d              100):   p50: 049.48 ns        p95: 107.89 ns          p99: 108.48 ns
LDADD           (c                0, d              200):   p50: 152.75 ns        p95: 152.89 ns          p99: 152.90 ns
LDADD           (c                0, d              300):   p50: 193.52 ns        p95: 193.54 ns          p99: 193.58 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.51 ns          p99: 069.51 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.69 ns
LDADD           (c               10, d               10):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.70 ns
LDADD           (c              100, d                0):   p50: 070.91 ns        p95: 070.95 ns          p99: 071.00 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.81 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 082.54 ns        p95: 082.62 ns          p99: 082.68 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 065.04 ns        p95: 065.39 ns          p99: 065.62 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.55 ns        p95: 020.03 ns          p99: 020.15 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.81 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 082.51 ns        p95: 082.61 ns          p99: 082.63 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 064.35 ns        p95: 064.81 ns          p99: 065.25 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.52 ns        p95: 020.08 ns          p99: 020.27 ns

 CPU: 2 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.68 ns        p95: 049.06 ns          p99: 049.11 ns
LSE (stadd)     (c                0, d              200):   p50: 087.28 ns        p95: 087.39 ns          p99: 087.45 ns
LSE (stadd)     (c               10, d                0):   p50: 038.28 ns        p95: 038.29 ns          p99: 038.31 ns
LSE (stadd)     (c               10, d              300):   p50: 123.80 ns        p95: 123.85 ns          p99: 123.93 ns
LSE (stadd)     (c               10, d              500):   p50: 201.18 ns        p95: 203.31 ns          p99: 203.39 ns
LSE (stadd)     (c               30, d                0):   p50: 038.31 ns        p95: 038.35 ns          p99: 038.52 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 012.97 ns        p95: 013.05 ns          p99: 013.06 ns
LL/SC           (c               10, d               10):   p50: 020.97 ns        p95: 021.06 ns          p99: 021.10 ns
LL/SC           (c               10, d               20):   p50: 032.91 ns        p95: 037.12 ns          p99: 037.74 ns
LL/SC           (c             1000, d                0):   p50: 012.82 ns        p95: 012.83 ns          p99: 012.83 ns
LL/SC           (c             1000, d               10):   p50: 020.71 ns        p95: 020.73 ns          p99: 020.74 ns
LL/SC           (c          1000000, d                0):   p50: 008.40 ns        p95: 008.96 ns          p99: 009.07 ns
LL/SC           (c          1000000, d               10):   p50: 011.66 ns        p95: 012.39 ns          p99: 012.59 ns
LDADD           (c                0, d                0):   p50: 069.53 ns        p95: 069.56 ns          p99: 069.58 ns
LDADD           (c                0, d              100):   p50: 107.80 ns        p95: 108.25 ns          p99: 108.35 ns
LDADD           (c                0, d              200):   p50: 152.45 ns        p95: 152.50 ns          p99: 152.74 ns
LDADD           (c                0, d              300):   p50: 193.48 ns        p95: 193.49 ns          p99: 193.52 ns
LDADD           (c                1, d               10):   p50: 069.47 ns        p95: 069.50 ns          p99: 069.51 ns
LDADD           (c               10, d                0):   p50: 069.65 ns        p95: 069.67 ns          p99: 069.69 ns
LDADD           (c               10, d               10):   p50: 069.65 ns        p95: 069.67 ns          p99: 069.68 ns
LDADD           (c              100, d                0):   p50: 070.90 ns        p95: 070.94 ns          p99: 071.01 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 082.57 ns        p95: 082.69 ns          p99: 082.75 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 065.07 ns        p95: 065.41 ns          p99: 065.53 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.52 ns        p95: 020.06 ns          p99: 020.25 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 082.54 ns        p95: 082.64 ns          p99: 082.68 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 064.39 ns        p95: 065.05 ns          p99: 075.52 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.57 ns        p95: 020.00 ns          p99: 020.24 ns

 CPU: 3 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.55 ns        p95: 049.01 ns          p99: 049.13 ns
LSE (stadd)     (c                0, d              200):   p50: 086.33 ns        p95: 087.29 ns          p99: 087.36 ns
LSE (stadd)     (c               10, d                0):   p50: 038.28 ns        p95: 038.29 ns          p99: 038.30 ns
LSE (stadd)     (c               10, d              300):   p50: 123.79 ns        p95: 124.84 ns          p99: 125.07 ns
LSE (stadd)     (c               10, d              500):   p50: 202.45 ns        p95: 202.92 ns          p99: 203.01 ns
LSE (stadd)     (c               30, d                0):   p50: 038.29 ns        p95: 038.35 ns          p99: 038.46 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 012.98 ns        p95: 013.05 ns          p99: 013.08 ns
LL/SC           (c               10, d               10):   p50: 020.95 ns        p95: 021.05 ns          p99: 021.07 ns
LL/SC           (c               10, d               20):   p50: 037.82 ns        p95: 041.67 ns          p99: 045.06 ns
LL/SC           (c             1000, d                0):   p50: 012.82 ns        p95: 012.84 ns          p99: 012.84 ns
LL/SC           (c             1000, d               10):   p50: 020.72 ns        p95: 020.74 ns          p99: 020.74 ns
LL/SC           (c          1000000, d                0):   p50: 008.43 ns        p95: 008.95 ns          p99: 009.07 ns
LL/SC           (c          1000000, d               10):   p50: 011.70 ns        p95: 012.22 ns          p99: 012.43 ns
LDADD           (c                0, d                0):   p50: 010.04 ns        p95: 010.04 ns          p99: 010.04 ns
LDADD           (c                0, d              100):   p50: 107.34 ns        p95: 107.98 ns          p99: 108.17 ns
LDADD           (c                0, d              200):   p50: 152.46 ns        p95: 152.82 ns          p99: 152.84 ns
LDADD           (c                0, d              300):   p50: 193.48 ns        p95: 193.49 ns          p99: 193.49 ns
LDADD           (c                1, d               10):   p50: 069.47 ns        p95: 069.51 ns          p99: 069.62 ns
LDADD           (c               10, d                0):   p50: 069.64 ns        p95: 069.67 ns          p99: 069.68 ns
LDADD           (c               10, d               10):   p50: 069.65 ns        p95: 069.67 ns          p99: 069.68 ns
LDADD           (c              100, d                0):   p50: 070.90 ns        p95: 070.92 ns          p99: 070.95 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 082.62 ns        p95: 082.73 ns          p99: 083.09 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 065.16 ns        p95: 065.46 ns          p99: 065.57 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.52 ns        p95: 020.09 ns          p99: 020.39 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 082.59 ns        p95: 082.68 ns          p99: 082.69 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 064.32 ns        p95: 064.77 ns          p99: 064.94 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.41 ns        p95: 019.96 ns          p99: 020.30 ns

 CPU: 4 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.69 ns        p95: 049.16 ns          p99: 049.22 ns
LSE (stadd)     (c                0, d              200):   p50: 087.26 ns        p95: 087.41 ns          p99: 087.47 ns
LSE (stadd)     (c               10, d                0):   p50: 038.14 ns        p95: 038.16 ns          p99: 038.16 ns
LSE (stadd)     (c               10, d              300):   p50: 125.32 ns        p95: 125.77 ns          p99: 125.88 ns
LSE (stadd)     (c               10, d              500):   p50: 202.66 ns        p95: 203.15 ns          p99: 203.23 ns
LSE (stadd)     (c               30, d                0):   p50: 038.14 ns        p95: 038.16 ns          p99: 038.16 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 043.72 ns        p95: 044.52 ns          p99: 044.79 ns
LL/SC           (c               10, d               10):   p50: 051.37 ns        p95: 051.43 ns          p99: 051.87 ns
LL/SC           (c               10, d               20):   p50: 061.52 ns        p95: 061.56 ns          p99: 061.58 ns
LL/SC           (c             1000, d                0):   p50: 014.23 ns        p95: 014.47 ns          p99: 014.54 ns
LL/SC           (c             1000, d               10):   p50: 019.74 ns        p95: 020.47 ns          p99: 020.79 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.63 ns          p99: 006.63 ns
LL/SC           (c          1000000, d               10):   p50: 008.64 ns        p95: 008.67 ns          p99: 008.67 ns
LDADD           (c                0, d                0):   p50: 057.96 ns        p95: 058.04 ns          p99: 058.06 ns
LDADD           (c                0, d              100):   p50: 049.59 ns        p95: 096.48 ns          p99: 096.69 ns
LDADD           (c                0, d              200):   p50: 140.93 ns        p95: 141.00 ns          p99: 141.02 ns
LDADD           (c                0, d              300):   p50: 181.95 ns        p95: 181.98 ns          p99: 182.00 ns
LDADD           (c                1, d               10):   p50: 069.52 ns        p95: 069.54 ns          p99: 069.54 ns
LDADD           (c               10, d                0):   p50: 070.01 ns        p95: 070.02 ns          p99: 070.02 ns
LDADD           (c               10, d               10):   p50: 070.02 ns        p95: 070.04 ns          p99: 070.05 ns
LDADD           (c              100, d                0):   p50: 067.33 ns        p95: 067.55 ns          p99: 067.63 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 115.30 ns        p95: 116.98 ns          p99: 117.09 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 068.71 ns        p95: 068.88 ns          p99: 068.98 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.23 ns        p95: 019.89 ns          p99: 019.93 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 114.81 ns        p95: 116.05 ns          p99: 116.11 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 068.68 ns        p95: 068.89 ns          p99: 068.96 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.31 ns        p95: 019.81 ns          p99: 019.98 ns

 CPU: 5 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.54 ns        p95: 048.93 ns          p99: 049.19 ns
LSE (stadd)     (c                0, d              200):   p50: 086.28 ns        p95: 087.13 ns          p99: 087.26 ns
LSE (stadd)     (c               10, d                0):   p50: 038.14 ns        p95: 038.15 ns          p99: 038.16 ns
LSE (stadd)     (c               10, d              300):   p50: 125.04 ns        p95: 125.59 ns          p99: 125.65 ns
LSE (stadd)     (c               10, d              500):   p50: 200.96 ns        p95: 201.00 ns          p99: 201.02 ns
LSE (stadd)     (c               30, d                0):   p50: 038.15 ns        p95: 038.17 ns          p99: 038.17 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 043.73 ns        p95: 044.24 ns          p99: 044.26 ns
LL/SC           (c               10, d               10):   p50: 051.38 ns        p95: 051.42 ns          p99: 051.43 ns
LL/SC           (c               10, d               20):   p50: 061.49 ns        p95: 061.58 ns          p99: 061.58 ns
LL/SC           (c             1000, d                0):   p50: 014.21 ns        p95: 014.47 ns          p99: 014.61 ns
LL/SC           (c             1000, d               10):   p50: 019.83 ns        p95: 020.69 ns          p99: 020.92 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.63 ns          p99: 006.63 ns
LL/SC           (c          1000000, d               10):   p50: 008.63 ns        p95: 008.66 ns          p99: 008.67 ns
LDADD           (c                0, d                0):   p50: 058.00 ns        p95: 058.07 ns          p99: 058.12 ns
LDADD           (c                0, d              100):   p50: 096.16 ns        p95: 096.58 ns          p99: 096.71 ns
LDADD           (c                0, d              200):   p50: 140.94 ns        p95: 141.01 ns          p99: 141.14 ns
LDADD           (c                0, d              300):   p50: 182.00 ns        p95: 182.20 ns          p99: 182.25 ns
LDADD           (c                1, d               10):   p50: 069.52 ns        p95: 069.54 ns          p99: 069.55 ns
LDADD           (c               10, d                0):   p50: 070.00 ns        p95: 070.01 ns          p99: 070.02 ns
LDADD           (c               10, d               10):   p50: 070.01 ns        p95: 070.03 ns          p99: 070.09 ns
LDADD           (c              100, d                0):   p50: 067.22 ns        p95: 067.37 ns          p99: 067.39 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 115.40 ns        p95: 117.17 ns          p99: 117.25 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 068.72 ns        p95: 068.89 ns          p99: 068.94 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.10 ns        p95: 019.79 ns          p99: 019.95 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 114.82 ns        p95: 114.97 ns          p99: 114.99 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 065.97 ns        p95: 066.10 ns          p99: 066.17 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.18 ns        p95: 019.69 ns          p99: 019.91 ns

 CPU: 6 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 024.33 ns        p95: 024.33 ns          p99: 024.33 ns
LSE (stadd)     (c                0, d              100):   p50: 048.69 ns        p95: 049.12 ns          p99: 049.30 ns
LSE (stadd)     (c                0, d              200):   p50: 087.26 ns        p95: 087.45 ns          p99: 087.68 ns
LSE (stadd)     (c               10, d                0):   p50: 038.16 ns        p95: 038.17 ns          p99: 038.18 ns
LSE (stadd)     (c               10, d              300):   p50: 123.87 ns        p95: 124.02 ns          p99: 125.13 ns
LSE (stadd)     (c               10, d              500):   p50: 201.00 ns        p95: 201.23 ns          p99: 201.36 ns
LSE (stadd)     (c               30, d                0):   p50: 038.14 ns        p95: 038.16 ns          p99: 038.17 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 043.71 ns        p95: 044.23 ns          p99: 044.71 ns
LL/SC           (c               10, d               10):   p50: 051.36 ns        p95: 051.44 ns          p99: 051.90 ns
LL/SC           (c               10, d               20):   p50: 061.50 ns        p95: 061.56 ns          p99: 061.57 ns
LL/SC           (c             1000, d                0):   p50: 014.20 ns        p95: 014.50 ns          p99: 014.63 ns
LL/SC           (c             1000, d               10):   p50: 019.64 ns        p95: 020.58 ns          p99: 020.98 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.63 ns          p99: 006.64 ns
LL/SC           (c          1000000, d               10):   p50: 008.64 ns        p95: 008.67 ns          p99: 008.70 ns
LDADD           (c                0, d                0):   p50: 057.97 ns        p95: 058.02 ns          p99: 058.05 ns
LDADD           (c                0, d              100):   p50: 095.93 ns        p95: 096.45 ns          p99: 096.57 ns
LDADD           (c                0, d              200):   p50: 141.19 ns        p95: 141.39 ns          p99: 141.41 ns
LDADD           (c                0, d              300):   p50: 181.97 ns        p95: 182.07 ns          p99: 182.12 ns
LDADD           (c                1, d               10):   p50: 069.52 ns        p95: 069.54 ns          p99: 069.54 ns
LDADD           (c               10, d                0):   p50: 070.00 ns        p95: 070.06 ns          p99: 070.08 ns
LDADD           (c               10, d               10):   p50: 069.99 ns        p95: 070.00 ns          p99: 070.00 ns
LDADD           (c              100, d                0):   p50: 067.25 ns        p95: 067.32 ns          p99: 067.36 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 115.46 ns        p95: 117.05 ns          p99: 117.10 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 066.10 ns        p95: 066.25 ns          p99: 066.30 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.30 ns        p95: 019.87 ns          p99: 020.07 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 114.77 ns        p95: 114.97 ns          p99: 115.71 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 066.07 ns        p95: 068.72 ns          p99: 068.80 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.29 ns        p95: 019.96 ns          p99: 020.11 ns

 CPU: 7 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.80 ns
LSE (stadd)     (c                0, d              100):   p50: 048.44 ns        p95: 048.83 ns          p99: 048.89 ns
LSE (stadd)     (c                0, d              200):   p50: 086.35 ns        p95: 087.11 ns          p99: 087.19 ns
LSE (stadd)     (c               10, d                0):   p50: 038.16 ns        p95: 038.18 ns          p99: 038.18 ns
LSE (stadd)     (c               10, d              300):   p50: 123.87 ns        p95: 124.31 ns          p99: 125.36 ns
LSE (stadd)     (c               10, d              500):   p50: 201.00 ns        p95: 201.04 ns          p99: 201.08 ns
LSE (stadd)     (c               30, d                0):   p50: 038.15 ns        p95: 038.16 ns          p99: 038.16 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 043.72 ns        p95: 044.24 ns          p99: 045.35 ns
LL/SC           (c               10, d               10):   p50: 051.37 ns        p95: 051.44 ns          p99: 051.99 ns
LL/SC           (c               10, d               20):   p50: 061.54 ns        p95: 061.57 ns          p99: 061.58 ns
LL/SC           (c             1000, d                0):   p50: 014.18 ns        p95: 014.48 ns          p99: 014.59 ns
LL/SC           (c             1000, d               10):   p50: 019.64 ns        p95: 020.38 ns          p99: 020.71 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.63 ns          p99: 006.63 ns
LL/SC           (c          1000000, d               10):   p50: 008.64 ns        p95: 008.67 ns          p99: 008.67 ns
LDADD           (c                0, d                0):   p50: 058.00 ns        p95: 058.05 ns          p99: 058.07 ns
LDADD           (c                0, d              100):   p50: 049.28 ns        p95: 049.81 ns          p99: 049.93 ns
LDADD           (c                0, d              200):   p50: 140.93 ns        p95: 141.31 ns          p99: 141.36 ns
LDADD           (c                0, d              300):   p50: 181.93 ns        p95: 181.98 ns          p99: 182.00 ns
LDADD           (c                1, d               10):   p50: 069.52 ns        p95: 069.62 ns          p99: 069.68 ns
LDADD           (c               10, d                0):   p50: 069.99 ns        p95: 070.00 ns          p99: 070.01 ns
LDADD           (c               10, d               10):   p50: 070.00 ns        p95: 070.01 ns          p99: 070.01 ns
LDADD           (c              100, d                0):   p50: 067.24 ns        p95: 067.31 ns          p99: 067.33 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 115.22 ns        p95: 116.94 ns          p99: 117.03 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 068.76 ns        p95: 068.89 ns          p99: 068.95 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.15 ns        p95: 019.78 ns          p99: 020.08 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 115.86 ns        p95: 116.02 ns          p99: 116.06 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 068.69 ns        p95: 068.92 ns          p99: 069.31 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.32 ns        p95: 019.77 ns          p99: 019.91 ns

 CPU: 8 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.73 ns        p95: 049.11 ns          p99: 049.16 ns
LSE (stadd)     (c                0, d              200):   p50: 087.27 ns        p95: 087.45 ns          p99: 087.51 ns
LSE (stadd)     (c               10, d                0):   p50: 038.15 ns        p95: 038.16 ns          p99: 038.16 ns
LSE (stadd)     (c               10, d              300):   p50: 124.75 ns        p95: 125.94 ns          p99: 126.04 ns
LSE (stadd)     (c               10, d              500):   p50: 202.94 ns        p95: 203.22 ns          p99: 203.27 ns
LSE (stadd)     (c               30, d                0):   p50: 038.18 ns        p95: 038.20 ns          p99: 038.33 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 040.89 ns        p95: 041.37 ns          p99: 041.38 ns
LL/SC           (c               10, d               10):   p50: 047.73 ns        p95: 048.20 ns          p99: 048.63 ns
LL/SC           (c               10, d               20):   p50: 057.68 ns        p95: 057.74 ns          p99: 057.75 ns
LL/SC           (c             1000, d                0):   p50: 013.79 ns        p95: 014.37 ns          p99: 014.50 ns
LL/SC           (c             1000, d               10):   p50: 013.02 ns        p95: 013.47 ns          p99: 013.56 ns
LL/SC           (c          1000000, d                0):   p50: 006.63 ns        p95: 006.65 ns          p99: 006.66 ns
LL/SC           (c          1000000, d               10):   p50: 008.54 ns        p95: 008.56 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 066.45 ns        p95: 066.47 ns          p99: 066.48 ns
LDADD           (c                0, d              100):   p50: 104.69 ns        p95: 105.07 ns          p99: 105.19 ns
LDADD           (c                0, d              200):   p50: 149.37 ns        p95: 149.43 ns          p99: 149.77 ns
LDADD           (c                0, d              300):   p50: 190.40 ns        p95: 190.42 ns          p99: 190.49 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.52 ns          p99: 069.53 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.70 ns          p99: 069.71 ns
LDADD           (c               10, d               10):   p50: 069.68 ns        p95: 069.72 ns          p99: 069.75 ns
LDADD           (c              100, d                0):   p50: 070.93 ns        p95: 070.95 ns          p99: 071.00 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.70 ns        p95: 120.06 ns          p99: 120.20 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 069.04 ns        p95: 069.28 ns          p99: 069.42 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 018.98 ns        p95: 019.60 ns          p99: 019.66 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 120.95 ns        p95: 121.22 ns          p99: 121.27 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 077.95 ns        p95: 078.32 ns          p99: 078.42 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 020.04 ns        p95: 020.73 ns          p99: 020.89 ns

 CPU: 9 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 027.54 ns          p99: 027.61 ns
LSE (stadd)     (c                0, d              100):   p50: 048.42 ns        p95: 048.91 ns          p99: 049.00 ns
LSE (stadd)     (c                0, d              200):   p50: 087.26 ns        p95: 087.45 ns          p99: 087.54 ns
LSE (stadd)     (c               10, d                0):   p50: 038.18 ns        p95: 038.19 ns          p99: 038.19 ns
LSE (stadd)     (c               10, d              300):   p50: 123.95 ns        p95: 125.95 ns          p99: 126.02 ns
LSE (stadd)     (c               10, d              500):   p50: 201.11 ns        p95: 201.20 ns          p99: 201.23 ns
LSE (stadd)     (c               30, d                0):   p50: 038.18 ns        p95: 038.19 ns          p99: 038.20 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 046.94 ns        p95: 049.33 ns          p99: 050.68 ns
LL/SC           (c               10, d               10):   p50: 048.15 ns        p95: 048.21 ns          p99: 050.57 ns
LL/SC           (c               10, d               20):   p50: 057.73 ns        p95: 057.77 ns          p99: 058.22 ns
LL/SC           (c             1000, d                0):   p50: 013.96 ns        p95: 014.56 ns          p99: 014.70 ns
LL/SC           (c             1000, d               10):   p50: 012.99 ns        p95: 013.43 ns          p99: 013.70 ns
LL/SC           (c          1000000, d                0):   p50: 006.64 ns        p95: 006.65 ns          p99: 006.66 ns
LL/SC           (c          1000000, d               10):   p50: 008.55 ns        p95: 008.56 ns          p99: 008.57 ns
LDADD           (c                0, d                0):   p50: 066.44 ns        p95: 066.45 ns          p99: 066.46 ns
LDADD           (c                0, d              100):   p50: 049.38 ns        p95: 049.76 ns          p99: 049.93 ns
LDADD           (c                0, d              200):   p50: 149.39 ns        p95: 149.77 ns          p99: 149.79 ns
LDADD           (c                0, d              300):   p50: 190.42 ns        p95: 190.43 ns          p99: 190.44 ns
LDADD           (c                1, d               10):   p50: 069.48 ns        p95: 069.52 ns          p99: 069.61 ns
LDADD           (c               10, d                0):   p50: 069.68 ns        p95: 069.71 ns          p99: 069.73 ns
LDADD           (c               10, d               10):   p50: 069.68 ns        p95: 069.71 ns          p99: 069.72 ns
LDADD           (c              100, d                0):   p50: 070.91 ns        p95: 070.94 ns          p99: 070.98 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 120.35 ns        p95: 121.25 ns          p99: 121.38 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 069.06 ns        p95: 069.28 ns          p99: 069.39 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.26 ns        p95: 019.96 ns          p99: 020.13 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 121.02 ns        p95: 121.28 ns          p99: 121.48 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 077.56 ns        p95: 078.35 ns          p99: 078.83 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.18 ns        p95: 019.84 ns          p99: 020.04 ns

 CPU: 10 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.80 ns
LSE (stadd)     (c                0, d              100):   p50: 048.69 ns        p95: 049.10 ns          p99: 049.35 ns
LSE (stadd)     (c                0, d              200):   p50: 086.24 ns        p95: 086.49 ns          p99: 087.15 ns
LSE (stadd)     (c               10, d                0):   p50: 038.17 ns        p95: 038.19 ns          p99: 038.19 ns
LSE (stadd)     (c               10, d              300):   p50: 123.87 ns        p95: 125.86 ns          p99: 125.99 ns
LSE (stadd)     (c               10, d              500):   p50: 202.59 ns        p95: 203.18 ns          p99: 203.32 ns
LSE (stadd)     (c               30, d                0):   p50: 038.20 ns        p95: 038.22 ns          p99: 038.22 ns
LL/SC           (c                0, d                0):   p50: 006.56 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 040.89 ns        p95: 041.36 ns          p99: 041.38 ns
LL/SC           (c               10, d               10):   p50: 048.16 ns        p95: 048.20 ns          p99: 048.21 ns
LL/SC           (c               10, d               20):   p50: 057.72 ns        p95: 058.19 ns          p99: 058.22 ns
LL/SC           (c             1000, d                0):   p50: 013.55 ns        p95: 014.22 ns          p99: 014.44 ns
LL/SC           (c             1000, d               10):   p50: 013.19 ns        p95: 013.60 ns          p99: 013.79 ns
LL/SC           (c          1000000, d                0):   p50: 006.64 ns        p95: 006.66 ns          p99: 006.66 ns
LL/SC           (c          1000000, d               10):   p50: 008.55 ns        p95: 008.56 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 010.04 ns        p95: 066.46 ns          p99: 066.48 ns
LDADD           (c                0, d              100):   p50: 104.62 ns        p95: 105.05 ns          p99: 105.17 ns
LDADD           (c                0, d              200):   p50: 149.38 ns        p95: 149.42 ns          p99: 149.62 ns
LDADD           (c                0, d              300):   p50: 190.40 ns        p95: 190.42 ns          p99: 190.48 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.51 ns          p99: 069.52 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.70 ns          p99: 069.70 ns
LDADD           (c               10, d               10):   p50: 069.68 ns        p95: 069.70 ns          p99: 069.72 ns
LDADD           (c              100, d                0):   p50: 070.93 ns        p95: 070.96 ns          p99: 071.01 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.69 ns        p95: 120.18 ns          p99: 120.30 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 068.96 ns        p95: 069.20 ns          p99: 069.26 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.28 ns        p95: 019.85 ns          p99: 019.95 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 006.32 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 121.20 ns        p95: 121.47 ns          p99: 121.51 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 077.93 ns        p95: 078.36 ns          p99: 078.60 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 020.10 ns        p95: 020.70 ns          p99: 020.95 ns

 CPU: 11 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 010.43 ns        p95: 027.56 ns          p99: 027.57 ns
LSE (stadd)     (c                0, d              100):   p50: 048.64 ns        p95: 049.06 ns          p99: 049.12 ns
LSE (stadd)     (c                0, d              200):   p50: 087.28 ns        p95: 087.49 ns          p99: 087.53 ns
LSE (stadd)     (c               10, d                0):   p50: 038.18 ns        p95: 038.19 ns          p99: 038.20 ns
LSE (stadd)     (c               10, d              300):   p50: 123.88 ns        p95: 123.93 ns          p99: 123.96 ns
LSE (stadd)     (c               10, d              500):   p50: 201.13 ns        p95: 201.19 ns          p99: 201.27 ns
LSE (stadd)     (c               30, d                0):   p50: 038.18 ns        p95: 038.19 ns          p99: 038.20 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 040.89 ns        p95: 041.37 ns          p99: 041.38 ns
LL/SC           (c               10, d               10):   p50: 048.16 ns        p95: 048.21 ns          p99: 048.23 ns
LL/SC           (c               10, d               20):   p50: 057.73 ns        p95: 057.77 ns          p99: 058.28 ns
LL/SC           (c             1000, d                0):   p50: 013.59 ns        p95: 014.21 ns          p99: 014.60 ns
LL/SC           (c             1000, d               10):   p50: 013.06 ns        p95: 013.38 ns          p99: 013.60 ns
LL/SC           (c          1000000, d                0):   p50: 006.64 ns        p95: 006.66 ns          p99: 006.67 ns
LL/SC           (c          1000000, d               10):   p50: 008.55 ns        p95: 008.55 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 066.45 ns        p95: 066.47 ns          p99: 066.48 ns
LDADD           (c                0, d              100):   p50: 104.62 ns        p95: 105.09 ns          p99: 105.30 ns
LDADD           (c                0, d              200):   p50: 149.38 ns        p95: 149.73 ns          p99: 149.91 ns
LDADD           (c                0, d              300):   p50: 190.39 ns        p95: 190.41 ns          p99: 190.41 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.51 ns          p99: 069.52 ns
LDADD           (c               10, d                0):   p50: 069.68 ns        p95: 069.74 ns          p99: 069.77 ns
LDADD           (c               10, d               10):   p50: 069.67 ns        p95: 069.70 ns          p99: 069.71 ns
LDADD           (c              100, d                0):   p50: 070.93 ns        p95: 070.94 ns          p99: 070.99 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 010.08 ns          p99: 010.32 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.88 ns        p95: 120.38 ns          p99: 120.56 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 069.03 ns        p95: 069.26 ns          p99: 069.33 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.33 ns        p95: 019.90 ns          p99: 019.98 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 009.76 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 121.10 ns        p95: 121.30 ns          p99: 121.40 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 068.90 ns        p95: 078.29 ns          p99: 078.66 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.32 ns        p95: 019.84 ns          p99: 020.14 ns

 CPU: 12 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 026.02 ns        p95: 026.03 ns          p99: 026.03 ns
LSE (stadd)     (c                0, d              100):   p50: 048.45 ns        p95: 048.85 ns          p99: 049.00 ns
LSE (stadd)     (c                0, d              200):   p50: 087.27 ns        p95: 087.52 ns          p99: 087.59 ns
LSE (stadd)     (c               10, d                0):   p50: 038.17 ns        p95: 038.18 ns          p99: 038.30 ns
LSE (stadd)     (c               10, d              300):   p50: 125.28 ns        p95: 125.78 ns          p99: 125.90 ns
LSE (stadd)     (c               10, d              500):   p50: 200.99 ns        p95: 202.74 ns          p99: 202.87 ns
LSE (stadd)     (c               30, d                0):   p50: 038.16 ns        p95: 038.34 ns          p99: 038.62 ns
LL/SC           (c                0, d                0):   p50: 006.56 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 048.25 ns        p95: 051.04 ns          p99: 052.49 ns
LL/SC           (c               10, d               10):   p50: 049.65 ns        p95: 057.90 ns          p99: 060.36 ns
LL/SC           (c               10, d               20):   p50: 059.60 ns        p95: 059.65 ns          p99: 059.66 ns
LL/SC           (c             1000, d                0):   p50: 016.79 ns        p95: 016.95 ns          p99: 016.97 ns
LL/SC           (c             1000, d               10):   p50: 018.73 ns        p95: 019.36 ns          p99: 019.60 ns
LL/SC           (c          1000000, d                0):   p50: 006.68 ns        p95: 006.70 ns          p99: 006.71 ns
LL/SC           (c          1000000, d               10):   p50: 008.60 ns        p95: 008.62 ns          p99: 008.63 ns
LDADD           (c                0, d                0):   p50: 017.77 ns        p95: 063.38 ns          p99: 063.39 ns
LDADD           (c                0, d              100):   p50: 101.71 ns        p95: 102.20 ns          p99: 102.32 ns
LDADD           (c                0, d              200):   p50: 146.30 ns        p95: 146.38 ns          p99: 146.40 ns
LDADD           (c                0, d              300):   p50: 187.31 ns        p95: 187.33 ns          p99: 187.37 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.51 ns          p99: 069.52 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.78 ns
LDADD           (c               10, d               10):   p50: 069.66 ns        p95: 069.68 ns          p99: 069.69 ns
LDADD           (c              100, d                0):   p50: 070.91 ns        p95: 070.94 ns          p99: 070.98 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.80 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.07 ns        p95: 119.29 ns          p99: 119.35 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 066.09 ns        p95: 066.35 ns          p99: 066.47 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 018.46 ns        p95: 019.26 ns          p99: 019.66 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 119.36 ns        p95: 119.71 ns          p99: 119.76 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 065.43 ns        p95: 066.03 ns          p99: 066.11 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.02 ns        p95: 019.62 ns          p99: 020.02 ns

 CPU: 13 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.52 ns        p95: 049.01 ns          p99: 049.08 ns
LSE (stadd)     (c                0, d              200):   p50: 086.40 ns        p95: 087.37 ns          p99: 087.41 ns
LSE (stadd)     (c               10, d                0):   p50: 038.17 ns        p95: 038.18 ns          p99: 038.18 ns
LSE (stadd)     (c               10, d              300):   p50: 125.61 ns        p95: 125.94 ns          p99: 126.08 ns
LSE (stadd)     (c               10, d              500):   p50: 201.14 ns        p95: 203.16 ns          p99: 203.20 ns
LSE (stadd)     (c               30, d                0):   p50: 038.15 ns        p95: 038.16 ns          p99: 038.17 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 048.28 ns        p95: 052.02 ns          p99: 053.32 ns
LL/SC           (c               10, d               10):   p50: 049.67 ns        p95: 060.85 ns          p99: 062.39 ns
LL/SC           (c               10, d               20):   p50: 059.43 ns        p95: 059.65 ns          p99: 060.10 ns
LL/SC           (c             1000, d                0):   p50: 016.75 ns        p95: 016.93 ns          p99: 017.01 ns
LL/SC           (c             1000, d               10):   p50: 018.90 ns        p95: 019.80 ns          p99: 019.99 ns
LL/SC           (c          1000000, d                0):   p50: 006.68 ns        p95: 006.71 ns          p99: 006.72 ns
LL/SC           (c          1000000, d               10):   p50: 008.60 ns        p95: 008.62 ns          p99: 008.62 ns
LDADD           (c                0, d                0):   p50: 063.35 ns        p95: 063.37 ns          p99: 063.37 ns
LDADD           (c                0, d              100):   p50: 101.69 ns        p95: 102.10 ns          p99: 102.26 ns
LDADD           (c                0, d              200):   p50: 146.67 ns        p95: 146.70 ns          p99: 146.77 ns
LDADD           (c                0, d              300):   p50: 187.32 ns        p95: 187.33 ns          p99: 187.34 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.52 ns          p99: 069.53 ns
LDADD           (c               10, d                0):   p50: 069.66 ns        p95: 069.72 ns          p99: 069.79 ns
LDADD           (c               10, d               10):   p50: 069.67 ns        p95: 069.70 ns          p99: 069.71 ns
LDADD           (c              100, d                0):   p50: 070.90 ns        p95: 070.93 ns          p99: 070.94 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.09 ns        p95: 119.46 ns          p99: 119.51 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 066.19 ns        p95: 066.49 ns          p99: 066.53 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.20 ns        p95: 019.66 ns          p99: 019.78 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 119.03 ns        p95: 119.64 ns          p99: 119.69 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 065.49 ns        p95: 066.52 ns          p99: 067.66 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.12 ns        p95: 019.68 ns          p99: 019.94 ns

 CPU: 14 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.53 ns        p95: 048.88 ns          p99: 049.02 ns
LSE (stadd)     (c                0, d              200):   p50: 087.29 ns        p95: 087.44 ns          p99: 087.47 ns
LSE (stadd)     (c               10, d                0):   p50: 038.15 ns        p95: 038.17 ns          p99: 038.24 ns
LSE (stadd)     (c               10, d              300):   p50: 123.85 ns        p95: 125.01 ns          p99: 125.28 ns
LSE (stadd)     (c               10, d              500):   p50: 202.79 ns        p95: 203.27 ns          p99: 203.33 ns
LSE (stadd)     (c               30, d                0):   p50: 038.15 ns        p95: 038.17 ns          p99: 038.25 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.77 ns
LL/SC           (c               10, d                0):   p50: 042.23 ns        p95: 042.74 ns          p99: 042.78 ns
LL/SC           (c               10, d               10):   p50: 049.30 ns        p95: 049.71 ns          p99: 049.73 ns
LL/SC           (c               10, d               20):   p50: 059.19 ns        p95: 059.62 ns          p99: 059.66 ns
LL/SC           (c             1000, d                0):   p50: 016.75 ns        p95: 016.92 ns          p99: 016.95 ns
LL/SC           (c             1000, d               10):   p50: 018.88 ns        p95: 019.55 ns          p99: 019.72 ns
LL/SC           (c          1000000, d                0):   p50: 006.67 ns        p95: 006.69 ns          p99: 006.70 ns
LL/SC           (c          1000000, d               10):   p50: 008.59 ns        p95: 008.62 ns          p99: 008.70 ns
LDADD           (c                0, d                0):   p50: 063.35 ns        p95: 063.39 ns          p99: 063.46 ns
LDADD           (c                0, d              100):   p50: 049.45 ns        p95: 101.79 ns          p99: 101.92 ns
LDADD           (c                0, d              200):   p50: 146.30 ns        p95: 146.49 ns          p99: 146.68 ns
LDADD           (c                0, d              300):   p50: 187.32 ns        p95: 187.42 ns          p99: 187.43 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.59 ns          p99: 069.60 ns
LDADD           (c               10, d                0):   p50: 069.68 ns        p95: 069.77 ns          p99: 069.79 ns
LDADD           (c               10, d               10):   p50: 069.68 ns        p95: 069.78 ns          p99: 069.79 ns
LDADD           (c              100, d                0):   p50: 070.92 ns        p95: 071.02 ns          p99: 071.05 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.80 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.52 ns        p95: 119.73 ns          p99: 119.84 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 065.78 ns        p95: 066.00 ns          p99: 066.17 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.15 ns        p95: 019.65 ns          p99: 019.87 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 119.62 ns        p95: 119.89 ns          p99: 119.92 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 065.29 ns        p95: 065.57 ns          p99: 065.71 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.17 ns        p95: 019.69 ns          p99: 019.78 ns

 CPU: 15 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.76 ns        p95: 049.11 ns          p99: 049.19 ns
LSE (stadd)     (c                0, d              200):   p50: 087.27 ns        p95: 087.41 ns          p99: 087.48 ns
LSE (stadd)     (c               10, d                0):   p50: 038.14 ns        p95: 038.18 ns          p99: 038.30 ns
LSE (stadd)     (c               10, d              300):   p50: 123.83 ns        p95: 124.09 ns          p99: 124.45 ns
LSE (stadd)     (c               10, d              500):   p50: 200.99 ns        p95: 201.05 ns          p99: 201.59 ns
LSE (stadd)     (c               30, d                0):   p50: 038.16 ns        p95: 038.17 ns          p99: 038.18 ns
LL/SC           (c                0, d                0):   p50: 006.56 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 048.21 ns        p95: 051.07 ns          p99: 052.42 ns
LL/SC           (c               10, d               10):   p50: 058.72 ns        p95: 061.74 ns          p99: 063.14 ns
LL/SC           (c               10, d               20):   p50: 058.09 ns        p95: 060.97 ns          p99: 061.95 ns
LL/SC           (c             1000, d                0):   p50: 013.06 ns        p95: 013.43 ns          p99: 013.72 ns
LL/SC           (c             1000, d               10):   p50: 019.01 ns        p95: 019.47 ns          p99: 019.62 ns
LL/SC           (c          1000000, d                0):   p50: 006.67 ns        p95: 006.69 ns          p99: 006.70 ns
LL/SC           (c          1000000, d               10):   p50: 008.60 ns        p95: 008.61 ns          p99: 008.62 ns
LDADD           (c                0, d                0):   p50: 063.35 ns        p95: 063.38 ns          p99: 063.38 ns
LDADD           (c                0, d              100):   p50: 049.22 ns        p95: 101.66 ns          p99: 102.02 ns
LDADD           (c                0, d              200):   p50: 146.29 ns        p95: 146.48 ns          p99: 146.68 ns
LDADD           (c                0, d              300):   p50: 187.31 ns        p95: 187.32 ns          p99: 187.32 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.51 ns          p99: 069.51 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.70 ns
LDADD           (c               10, d               10):   p50: 069.67 ns        p95: 069.70 ns          p99: 069.92 ns
LDADD           (c              100, d                0):   p50: 070.91 ns        p95: 070.92 ns          p99: 070.93 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 119.36 ns        p95: 119.68 ns          p99: 119.78 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 066.18 ns        p95: 066.54 ns          p99: 066.63 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.12 ns        p95: 019.70 ns          p99: 019.87 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 118.91 ns        p95: 119.18 ns          p99: 119.32 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 066.00 ns        p95: 066.39 ns          p99: 066.48 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.14 ns        p95: 019.73 ns          p99: 019.87 ns

 CPU: 16 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 010.43 ns        p95: 010.43 ns          p99: 010.44 ns
LSE (stadd)     (c                0, d              100):   p50: 048.54 ns        p95: 048.96 ns          p99: 049.08 ns
LSE (stadd)     (c                0, d              200):   p50: 086.26 ns        p95: 086.77 ns          p99: 087.02 ns
LSE (stadd)     (c               10, d                0):   p50: 040.49 ns        p95: 040.51 ns          p99: 040.51 ns
LSE (stadd)     (c               10, d              300):   p50: 123.88 ns        p95: 124.55 ns          p99: 125.15 ns
LSE (stadd)     (c               10, d              500):   p50: 201.12 ns        p95: 202.73 ns          p99: 202.98 ns
LSE (stadd)     (c               30, d                0):   p50: 040.47 ns        p95: 040.49 ns          p99: 040.49 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 050.75 ns        p95: 052.20 ns          p99: 052.57 ns
LL/SC           (c               10, d               10):   p50: 057.55 ns        p95: 059.27 ns          p99: 060.80 ns
LL/SC           (c               10, d               20):   p50: 058.58 ns        p95: 059.07 ns          p99: 059.08 ns
LL/SC           (c             1000, d                0):   p50: 012.37 ns        p95: 012.78 ns          p99: 012.83 ns
LL/SC           (c             1000, d               10):   p50: 014.40 ns        p95: 015.00 ns          p99: 015.16 ns
LL/SC           (c          1000000, d                0):   p50: 006.60 ns        p95: 006.61 ns          p99: 006.61 ns
LL/SC           (c          1000000, d               10):   p50: 008.55 ns        p95: 008.56 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 017.77 ns        p95: 074.18 ns          p99: 074.19 ns
LDADD           (c                0, d              100):   p50: 056.67 ns        p95: 112.40 ns          p99: 112.60 ns
LDADD           (c                0, d              200):   p50: 157.10 ns        p95: 157.15 ns          p99: 157.30 ns
LDADD           (c                0, d              300):   p50: 198.13 ns        p95: 198.16 ns          p99: 198.16 ns
LDADD           (c                1, d               10):   p50: 074.15 ns        p95: 074.17 ns          p99: 074.28 ns
LDADD           (c               10, d                0):   p50: 074.49 ns        p95: 074.50 ns          p99: 074.51 ns
LDADD           (c               10, d               10):   p50: 074.48 ns        p95: 074.50 ns          p99: 074.54 ns
LDADD           (c              100, d                0):   p50: 074.10 ns        p95: 074.12 ns          p99: 074.12 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 006.02 ns          p99: 006.15 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 123.62 ns        p95: 124.06 ns          p99: 124.37 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 077.18 ns        p95: 077.43 ns          p99: 077.64 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 018.93 ns        p95: 019.81 ns          p99: 020.00 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 006.40 ns        p95: 009.75 ns          p99: 010.36 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 124.78 ns        p95: 125.10 ns          p99: 125.23 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 088.70 ns        p95: 089.01 ns          p99: 089.12 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 020.20 ns        p95: 020.95 ns          p99: 021.07 ns

 CPU: 17 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.75 ns        p95: 049.08 ns          p99: 049.19 ns
LSE (stadd)     (c                0, d              200):   p50: 086.21 ns        p95: 086.89 ns          p99: 087.07 ns
LSE (stadd)     (c               10, d                0):   p50: 040.48 ns        p95: 040.50 ns          p99: 040.58 ns
LSE (stadd)     (c               10, d              300):   p50: 123.84 ns        p95: 125.56 ns          p99: 125.76 ns
LSE (stadd)     (c               10, d              500):   p50: 200.99 ns        p95: 201.53 ns          p99: 201.84 ns
LSE (stadd)     (c               30, d                0):   p50: 040.47 ns        p95: 040.48 ns          p99: 040.49 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 041.36 ns        p95: 042.29 ns          p99: 042.31 ns
LL/SC           (c               10, d               10):   p50: 048.59 ns        p95: 049.10 ns          p99: 049.58 ns
LL/SC           (c               10, d               20):   p50: 058.60 ns        p95: 059.53 ns          p99: 060.02 ns
LL/SC           (c             1000, d                0):   p50: 012.33 ns        p95: 012.65 ns          p99: 012.81 ns
LL/SC           (c             1000, d               10):   p50: 014.34 ns        p95: 014.91 ns          p99: 014.96 ns
LL/SC           (c          1000000, d                0):   p50: 006.60 ns        p95: 006.61 ns          p99: 006.62 ns
LL/SC           (c          1000000, d               10):   p50: 008.54 ns        p95: 008.56 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 074.16 ns        p95: 074.19 ns          p99: 074.22 ns
LDADD           (c                0, d              100):   p50: 112.12 ns        p95: 112.65 ns          p99: 113.08 ns
LDADD           (c                0, d              200):   p50: 157.11 ns        p95: 157.50 ns          p99: 157.52 ns
LDADD           (c                0, d              300):   p50: 198.12 ns        p95: 198.13 ns          p99: 198.14 ns
LDADD           (c                1, d               10):   p50: 074.14 ns        p95: 074.15 ns          p99: 074.16 ns
LDADD           (c               10, d                0):   p50: 074.48 ns        p95: 074.50 ns          p99: 074.60 ns
LDADD           (c               10, d               10):   p50: 074.48 ns        p95: 074.50 ns          p99: 074.52 ns
LDADD           (c              100, d                0):   p50: 074.09 ns        p95: 074.12 ns          p99: 074.13 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 123.87 ns        p95: 124.64 ns          p99: 124.82 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 077.16 ns        p95: 077.49 ns          p99: 077.56 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.28 ns        p95: 019.99 ns          p99: 020.10 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 124.58 ns        p95: 124.91 ns          p99: 125.03 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 089.03 ns        p95: 089.41 ns          p99: 089.52 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.93 ns        p95: 020.85 ns          p99: 021.25 ns

 CPU: 18 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.62 ns        p95: 049.08 ns          p99: 049.14 ns
LSE (stadd)     (c                0, d              200):   p50: 086.36 ns        p95: 087.28 ns          p99: 087.45 ns
LSE (stadd)     (c               10, d                0):   p50: 040.46 ns        p95: 040.49 ns          p99: 040.49 ns
LSE (stadd)     (c               10, d              300):   p50: 123.84 ns        p95: 123.89 ns          p99: 123.95 ns
LSE (stadd)     (c               10, d              500):   p50: 202.59 ns        p95: 202.96 ns          p99: 203.02 ns
LSE (stadd)     (c               30, d                0):   p50: 040.46 ns        p95: 040.49 ns          p99: 040.62 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 050.70 ns        p95: 052.22 ns          p99: 052.64 ns
LL/SC           (c               10, d               10):   p50: 057.53 ns        p95: 059.50 ns          p99: 059.82 ns
LL/SC           (c               10, d               20):   p50: 058.31 ns        p95: 059.81 ns          p99: 060.28 ns
LL/SC           (c             1000, d                0):   p50: 011.99 ns        p95: 012.66 ns          p99: 012.82 ns
LL/SC           (c             1000, d               10):   p50: 016.97 ns        p95: 017.73 ns          p99: 018.05 ns
LL/SC           (c          1000000, d                0):   p50: 006.60 ns        p95: 006.61 ns          p99: 006.61 ns
LL/SC           (c          1000000, d               10):   p50: 008.55 ns        p95: 008.56 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 010.04 ns        p95: 074.17 ns          p99: 074.17 ns
LDADD           (c                0, d              100):   p50: 111.99 ns        p95: 112.57 ns          p99: 112.70 ns
LDADD           (c                0, d              200):   p50: 157.37 ns        p95: 157.49 ns          p99: 157.49 ns
LDADD           (c                0, d              300):   p50: 198.12 ns        p95: 198.13 ns          p99: 198.18 ns
LDADD           (c                1, d               10):   p50: 074.14 ns        p95: 074.17 ns          p99: 074.20 ns
LDADD           (c               10, d                0):   p50: 074.49 ns        p95: 074.54 ns          p99: 074.58 ns
LDADD           (c               10, d               10):   p50: 074.48 ns        p95: 074.52 ns          p99: 074.59 ns
LDADD           (c              100, d                0):   p50: 074.10 ns        p95: 074.13 ns          p99: 074.15 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.80 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 124.63 ns        p95: 124.92 ns          p99: 125.07 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 089.19 ns        p95: 089.47 ns          p99: 089.56 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.29 ns        p95: 020.18 ns          p99: 020.37 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 124.55 ns        p95: 124.83 ns          p99: 124.97 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 088.71 ns        p95: 089.11 ns          p99: 089.17 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 020.39 ns        p95: 020.98 ns          p99: 021.19 ns

 CPU: 19 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.63 ns        p95: 048.97 ns          p99: 049.24 ns
LSE (stadd)     (c                0, d              200):   p50: 087.30 ns        p95: 087.45 ns          p99: 087.50 ns
LSE (stadd)     (c               10, d                0):   p50: 040.49 ns        p95: 040.50 ns          p99: 040.51 ns
LSE (stadd)     (c               10, d              300):   p50: 123.89 ns        p95: 124.31 ns          p99: 124.79 ns
LSE (stadd)     (c               10, d              500):   p50: 202.37 ns        p95: 203.12 ns          p99: 203.23 ns
LSE (stadd)     (c               30, d                0):   p50: 040.47 ns        p95: 040.48 ns          p99: 040.48 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 041.34 ns        p95: 041.83 ns          p99: 042.27 ns
LL/SC           (c               10, d               10):   p50: 048.47 ns        p95: 048.64 ns          p99: 050.53 ns
LL/SC           (c               10, d               20):   p50: 058.56 ns        p95: 059.05 ns          p99: 059.52 ns
LL/SC           (c             1000, d                0):   p50: 012.16 ns        p95: 012.57 ns          p99: 012.67 ns
LL/SC           (c             1000, d               10):   p50: 014.43 ns        p95: 014.92 ns          p99: 014.97 ns
LL/SC           (c          1000000, d                0):   p50: 006.60 ns        p95: 006.61 ns          p99: 006.62 ns
LL/SC           (c          1000000, d               10):   p50: 008.54 ns        p95: 008.56 ns          p99: 008.56 ns
LDADD           (c                0, d                0):   p50: 010.04 ns        p95: 010.04 ns          p99: 010.04 ns
LDADD           (c                0, d              100):   p50: 112.14 ns        p95: 112.61 ns          p99: 112.74 ns
LDADD           (c                0, d              200):   p50: 157.39 ns        p95: 157.51 ns          p99: 157.51 ns
LDADD           (c                0, d              300):   p50: 198.12 ns        p95: 198.16 ns          p99: 198.56 ns
LDADD           (c                1, d               10):   p50: 074.14 ns        p95: 074.16 ns          p99: 074.32 ns
LDADD           (c               10, d                0):   p50: 074.47 ns        p95: 074.49 ns          p99: 074.49 ns
LDADD           (c               10, d               10):   p50: 074.48 ns        p95: 074.51 ns          p99: 074.72 ns
LDADD           (c              100, d                0):   p50: 074.10 ns        p95: 074.11 ns          p99: 074.12 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 123.66 ns        p95: 124.13 ns          p99: 124.46 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 077.11 ns        p95: 077.35 ns          p99: 077.46 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.62 ns        p95: 020.17 ns          p99: 020.38 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 124.71 ns        p95: 124.94 ns          p99: 125.07 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 088.79 ns        p95: 089.13 ns          p99: 089.18 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 020.42 ns        p95: 021.01 ns          p99: 021.54 ns

 CPU: 20 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.72 ns        p95: 049.11 ns          p99: 049.25 ns
LSE (stadd)     (c                0, d              200):   p50: 087.31 ns        p95: 087.48 ns          p99: 087.61 ns
LSE (stadd)     (c               10, d                0):   p50: 038.15 ns        p95: 038.20 ns          p99: 038.41 ns
LSE (stadd)     (c               10, d              300):   p50: 123.87 ns        p95: 124.01 ns          p99: 124.91 ns
LSE (stadd)     (c               10, d              500):   p50: 201.04 ns        p95: 202.93 ns          p99: 203.03 ns
LSE (stadd)     (c               30, d                0):   p50: 038.15 ns        p95: 038.17 ns          p99: 038.17 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 045.37 ns        p95: 046.08 ns          p99: 046.46 ns
LL/SC           (c               10, d               10):   p50: 053.40 ns        p95: 053.46 ns          p99: 053.49 ns
LL/SC           (c               10, d               20):   p50: 064.12 ns        p95: 064.63 ns          p99: 064.65 ns
LL/SC           (c             1000, d                0):   p50: 010.25 ns        p95: 010.64 ns          p99: 010.70 ns
LL/SC           (c             1000, d               10):   p50: 014.76 ns        p95: 015.15 ns          p99: 015.33 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.62 ns          p99: 006.63 ns
LL/SC           (c          1000000, d               10):   p50: 008.57 ns        p95: 008.58 ns          p99: 008.59 ns
LDADD           (c                0, d                0):   p50: 064.90 ns        p95: 064.92 ns          p99: 065.00 ns
LDADD           (c                0, d              100):   p50: 102.83 ns        p95: 103.27 ns          p99: 103.39 ns
LDADD           (c                0, d              200):   p50: 101.11 ns        p95: 148.21 ns          p99: 148.23 ns
LDADD           (c                0, d              300):   p50: 188.86 ns        p95: 188.87 ns          p99: 188.96 ns
LDADD           (c                1, d               10):   p50: 069.50 ns        p95: 069.52 ns          p99: 069.53 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.70 ns
LDADD           (c               10, d               10):   p50: 069.69 ns        p95: 069.72 ns          p99: 069.81 ns
LDADD           (c              100, d                0):   p50: 070.94 ns        p95: 070.96 ns          p99: 070.97 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 118.46 ns        p95: 119.09 ns          p99: 119.23 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 067.08 ns        p95: 067.40 ns          p99: 068.06 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 018.78 ns        p95: 019.44 ns          p99: 019.59 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 116.67 ns        p95: 120.14 ns          p99: 120.18 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 066.58 ns        p95: 066.84 ns          p99: 067.00 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.21 ns        p95: 019.72 ns          p99: 019.98 ns

 CPU: 21 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.46 ns        p95: 048.82 ns          p99: 048.88 ns
LSE (stadd)     (c                0, d              200):   p50: 086.27 ns        p95: 086.43 ns          p99: 087.07 ns
LSE (stadd)     (c               10, d                0):   p50: 038.17 ns        p95: 038.19 ns          p99: 038.19 ns
LSE (stadd)     (c               10, d              300):   p50: 125.71 ns        p95: 126.00 ns          p99: 126.08 ns
LSE (stadd)     (c               10, d              500):   p50: 202.44 ns        p95: 203.04 ns          p99: 203.15 ns
LSE (stadd)     (c               30, d                0):   p50: 038.17 ns        p95: 038.18 ns          p99: 038.18 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 045.37 ns        p95: 045.91 ns          p99: 045.92 ns
LL/SC           (c               10, d               10):   p50: 052.97 ns        p95: 053.51 ns          p99: 054.67 ns
LL/SC           (c               10, d               20):   p50: 064.10 ns        p95: 064.63 ns          p99: 064.66 ns
LL/SC           (c             1000, d                0):   p50: 010.17 ns        p95: 010.49 ns          p99: 010.62 ns
LL/SC           (c             1000, d               10):   p50: 014.86 ns        p95: 015.25 ns          p99: 015.40 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.62 ns          p99: 006.63 ns
LL/SC           (c          1000000, d               10):   p50: 008.56 ns        p95: 008.58 ns          p99: 008.58 ns
LDADD           (c                0, d                0):   p50: 064.90 ns        p95: 064.93 ns          p99: 065.00 ns
LDADD           (c                0, d              100):   p50: 049.28 ns        p95: 049.82 ns          p99: 049.99 ns
LDADD           (c                0, d              200):   p50: 148.22 ns        p95: 148.25 ns          p99: 148.27 ns
LDADD           (c                0, d              300):   p50: 188.86 ns        p95: 188.89 ns          p99: 188.89 ns
LDADD           (c                1, d               10):   p50: 069.49 ns        p95: 069.53 ns          p99: 069.67 ns
LDADD           (c               10, d                0):   p50: 069.69 ns        p95: 069.73 ns          p99: 069.73 ns
LDADD           (c               10, d               10):   p50: 069.69 ns        p95: 069.73 ns          p99: 069.75 ns
LDADD           (c              100, d                0):   p50: 070.93 ns        p95: 070.95 ns          p99: 070.96 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 120.29 ns        p95: 120.64 ns          p99: 120.83 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 066.97 ns        p95: 067.27 ns          p99: 067.35 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.23 ns        p95: 019.75 ns          p99: 019.89 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 116.26 ns        p95: 116.67 ns          p99: 116.71 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 067.20 ns        p95: 067.54 ns          p99: 067.89 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.17 ns        p95: 019.76 ns          p99: 019.91 ns

 CPU: 22 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 010.43 ns        p95: 026.99 ns          p99: 026.99 ns
LSE (stadd)     (c                0, d              100):   p50: 048.80 ns        p95: 049.18 ns          p99: 049.31 ns
LSE (stadd)     (c                0, d              200):   p50: 087.29 ns        p95: 087.45 ns          p99: 087.49 ns
LSE (stadd)     (c               10, d                0):   p50: 038.15 ns        p95: 038.16 ns          p99: 038.17 ns
LSE (stadd)     (c               10, d              300):   p50: 123.84 ns        p95: 123.88 ns          p99: 123.89 ns
LSE (stadd)     (c               10, d              500):   p50: 202.65 ns        p95: 203.00 ns          p99: 203.05 ns
LSE (stadd)     (c               30, d                0):   p50: 038.17 ns        p95: 038.19 ns          p99: 038.37 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 045.36 ns        p95: 045.90 ns          p99: 045.93 ns
LL/SC           (c               10, d               10):   p50: 052.90 ns        p95: 053.44 ns          p99: 053.49 ns
LL/SC           (c               10, d               20):   p50: 064.10 ns        p95: 064.62 ns          p99: 065.14 ns
LL/SC           (c             1000, d                0):   p50: 010.18 ns        p95: 010.52 ns          p99: 010.66 ns
LL/SC           (c             1000, d               10):   p50: 014.70 ns        p95: 015.15 ns          p99: 015.37 ns
LL/SC           (c          1000000, d                0):   p50: 006.60 ns        p95: 006.61 ns          p99: 006.62 ns
LL/SC           (c          1000000, d               10):   p50: 008.57 ns        p95: 008.59 ns          p99: 008.61 ns
LDADD           (c                0, d                0):   p50: 064.91 ns        p95: 064.93 ns          p99: 065.10 ns
LDADD           (c                0, d              100):   p50: 103.09 ns        p95: 103.53 ns          p99: 103.63 ns
LDADD           (c                0, d              200):   p50: 147.83 ns        p95: 147.96 ns          p99: 148.18 ns
LDADD           (c                0, d              300):   p50: 188.86 ns        p95: 188.88 ns          p99: 188.88 ns
LDADD           (c                1, d               10):   p50: 069.50 ns        p95: 069.51 ns          p99: 069.52 ns
LDADD           (c               10, d                0):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.70 ns
LDADD           (c               10, d               10):   p50: 069.68 ns        p95: 069.71 ns          p99: 069.84 ns
LDADD           (c              100, d                0):   p50: 070.91 ns        p95: 070.93 ns          p99: 070.95 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 007.04 ns        p95: 007.71 ns          p99: 007.88 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 118.50 ns        p95: 118.99 ns          p99: 119.34 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 067.13 ns        p95: 067.52 ns          p99: 067.95 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 018.40 ns        p95: 019.36 ns          p99: 019.67 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 009.59 ns        p95: 010.03 ns          p99: 010.16 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 116.54 ns        p95: 117.08 ns          p99: 117.22 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 066.99 ns        p95: 067.24 ns          p99: 067.35 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.27 ns        p95: 019.77 ns          p99: 020.07 ns

 CPU: 23 - Latency Percentiles:
====================
LSE (stadd)     (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
LSE (stadd)     (c                0, d              100):   p50: 048.60 ns        p95: 048.97 ns          p99: 049.01 ns
LSE (stadd)     (c                0, d              200):   p50: 087.26 ns        p95: 087.42 ns          p99: 087.48 ns
LSE (stadd)     (c               10, d                0):   p50: 038.17 ns        p95: 038.19 ns          p99: 038.29 ns
LSE (stadd)     (c               10, d              300):   p50: 123.87 ns        p95: 123.93 ns          p99: 123.99 ns
LSE (stadd)     (c               10, d              500):   p50: 202.76 ns        p95: 203.10 ns          p99: 203.16 ns
LSE (stadd)     (c               30, d                0):   p50: 038.17 ns        p95: 038.39 ns          p99: 038.68 ns
LL/SC           (c                0, d                0):   p50: 006.57 ns        p95: 006.57 ns          p99: 006.57 ns
LL/SC           (c                0, d               10):   p50: 008.50 ns        p95: 008.50 ns          p99: 008.50 ns
LL/SC           (c                0, d               20):   p50: 011.67 ns        p95: 011.67 ns          p99: 011.67 ns
LL/SC           (c               10, d                0):   p50: 045.37 ns        p95: 045.91 ns          p99: 045.91 ns
LL/SC           (c               10, d               10):   p50: 052.91 ns        p95: 053.50 ns          p99: 055.37 ns
LL/SC           (c               10, d               20):   p50: 064.11 ns        p95: 064.65 ns          p99: 065.10 ns
LL/SC           (c             1000, d                0):   p50: 010.27 ns        p95: 010.62 ns          p99: 010.71 ns
LL/SC           (c             1000, d               10):   p50: 014.77 ns        p95: 015.15 ns          p99: 015.26 ns
LL/SC           (c          1000000, d                0):   p50: 006.61 ns        p95: 006.62 ns          p99: 006.62 ns
LL/SC           (c          1000000, d               10):   p50: 008.56 ns        p95: 008.58 ns          p99: 008.58 ns
LDADD           (c                0, d                0):   p50: 064.93 ns        p95: 064.95 ns          p99: 064.97 ns
LDADD           (c                0, d              100):   p50: 049.43 ns        p95: 103.16 ns          p99: 103.40 ns
LDADD           (c                0, d              200):   p50: 147.83 ns        p95: 148.03 ns          p99: 148.13 ns
LDADD           (c                0, d              300):   p50: 188.86 ns        p95: 188.90 ns          p99: 188.92 ns
LDADD           (c                1, d               10):   p50: 069.50 ns        p95: 069.52 ns          p99: 069.53 ns
LDADD           (c               10, d                0):   p50: 069.68 ns        p95: 069.70 ns          p99: 069.85 ns
LDADD           (c               10, d               10):   p50: 069.67 ns        p95: 069.69 ns          p99: 069.69 ns
LDADD           (c              100, d                0):   p50: 070.92 ns        p95: 070.94 ns          p99: 070.95 ns
PFRM_KEEP+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_KEEP+STADD (c               10, d                0):   p50: 120.24 ns        p95: 120.51 ns          p99: 120.61 ns
PFRM_KEEP+STADD (c             1000, d                0):   p50: 066.93 ns        p95: 067.12 ns          p99: 067.21 ns
PFRM_KEEP+STADD (c          1000000, d                0):   p50: 019.19 ns        p95: 019.77 ns          p99: 020.00 ns
PFRM_STRM+STADD (c                0, d                0):   p50: 005.79 ns        p95: 005.79 ns          p99: 005.79 ns
PFRM_STRM+STADD (c               10, d                0):   p50: 120.13 ns        p95: 120.35 ns          p99: 120.48 ns
PFRM_STRM+STADD (c             1000, d                0):   p50: 066.58 ns        p95: 067.48 ns          p99: 069.22 ns
PFRM_STRM+STADD (c          1000000, d                0):   p50: 019.19 ns        p95: 019.70 ns          p99: 019.92 ns

=== Benchmark Complete ===
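For reference, here is a minimal sketch of the instruction sequences that the
five labels in the output above presumably correspond to (taking "PFRM" to
mean the arm64 PRFM prefetch instruction).  This is only an illustration of
the operations being timed, not the benchmark program that produced this
output; the helper names are invented, and the LSE forms assume FEAT_LSE
(e.g. -march=armv8.1-a):

#include <stdint.h>

/* "LSE (stadd)": store-only atomic add, no value returned. */
static inline void inc_stadd(uint64_t *p)
{
	asm volatile("stadd %[one], %[v]"
		     : [v] "+Q" (*p)
		     : [one] "r" (1UL));
}

/* "LL/SC": classic load-exclusive/store-exclusive retry loop. */
static inline void inc_llsc(uint64_t *p)
{
	uint64_t tmp;
	uint32_t fail;

	asm volatile("1:	ldxr	%[t], %[v]\n"
		     "	add	%[t], %[t], #1\n"
		     "	stxr	%w[f], %[t], %[v]\n"
		     "	cbnz	%w[f], 1b"
		     : [t] "=&r" (tmp), [f] "=&r" (fail), [v] "+Q" (*p));
}

/* "LDADD": LSE atomic add that also returns the old value. */
static inline uint64_t inc_ldadd(uint64_t *p)
{
	uint64_t old;

	asm volatile("ldadd %[one], %[old], %[v]"
		     : [old] "=r" (old), [v] "+Q" (*p)
		     : [one] "r" (1UL));
	return old;
}

/* "PFRM_KEEP+STADD": prefetch for store (temporal) before the STADD. */
static inline void inc_prfm_keep_stadd(uint64_t *p)
{
	asm volatile("prfm pstl1keep, %[v]\n"
		     "stadd %[one], %[v]"
		     : [v] "+Q" (*p)
		     : [one] "r" (1UL));
}

/* "PFRM_STRM+STADD": prefetch for store (streaming/non-temporal). */
static inline void inc_prfm_strm_stadd(uint64_t *p)
{
	asm volatile("prfm pstl1strm, %[v]\n"
		     "stadd %[one], %[v]"
		     : [v] "+Q" (*p)
		     : [one] "r" (1UL));
}

The STADD/LDADD distinction matters for reading the tables: STADD has no
destination register, so the issuing CPU need not wait for the old value to
come back, whereas LDADD must return it.  That is consistent with the much
higher uncontended LDADD latencies (~63-74ns p50) versus STADD (~5.8ns p50)
shown above.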



