From: Catalin Marinas <catalin.marinas@arm.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: Overhead of arm64 LSE per-CPU atomics?
Date: Fri, 31 Oct 2025 22:43:35 +0000
Message-ID: <aQU7l-qMKJTx4znJ@arm.com>
In-Reply-To: <31847558-db84-4984-ab43-a5f6be00f5eb@paulmck-laptop>

On Fri, Oct 31, 2025 at 12:39:41PM -0700, Paul E. McKenney wrote:
> On Fri, Oct 31, 2025 at 06:30:31PM +0000, Catalin Marinas wrote:
> > On Thu, Oct 30, 2025 at 03:37:00PM -0700, Paul E. McKenney wrote:
> > > To make event tracing safe for PREEMPT_RT kernels, I have been creating
> > > optimized variants of SRCU readers that use per-CPU atomics.  This works
> > > quite well, but on ARM Neoverse V2, I am seeing about 100ns for a
> > > srcu_read_lock()/srcu_read_unlock() pair, or about 50ns for a single
> > > per-CPU atomic operation.  This contrasts with a handful of nanoseconds
> > > on x86 and similar on ARM for an atomic_set(&foo, atomic_read(&foo) + 1).
> > 
> > That's quite a difference. Does it get any better if
> > CONFIG_ARM64_LSE_ATOMICS is disabled? We don't have a way to disable it
> > on the kernel command line.
> 
> In other words, build with CONFIG_ARM64_USE_LSE_ATOMICS=n, correct?

Yes.

> Yes, this gets me more than an order of magnitude, and about 30% better
> than my workaround of disabling interrupts around a non-atomic increment
> of those counters, thank you!
> 
> Given that per-CPU atomics are usually not heavily contended, would it
> make sense to avoid LSE in that case?

In theory the LSE atomics should be just as fast, but the
microarchitecture decisions likely did not cover all the use-cases. I'll
raise this internally as well; maybe we'll get some ideas from the
hardware people.

> And I need to figure out whether I should recommend that Meta build
> its arm64 kernels with CONFIG_ARM64_USE_LSE_ATOMICS=n.  Any advice you
> might have would be deeply appreciated!  (I am of course also following
> up internally.)

I wouldn't advise turning them off just yet; they are beneficial for
other use-cases. But it needs more thinking (and not this late at night ;)).

> > Interestingly, we had this patch recently to force a prefetch before the
> > atomic:
> > 
> > https://lore.kernel.org/all/20250724120651.27983-1-yangyicong@huawei.com/
> > 
> > We rejected it but I wonder whether it improves the SRCU scenario.
> 
> No statistical difference on my system.  This is a 72-CPU Neoverse V2, in
> case that matters.

I just realised that patch doesn't touch percpu.h at all. So what about
something like (untested):

-----------------8<------------------------
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index 9abcc8ef3087..e381034324e1 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -70,6 +70,7 @@ __percpu_##name##_case_##sz(void *ptr, unsigned long val)		\
 	unsigned int loop;						\
 	u##sz tmp;							\
 									\
+	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
 	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[tmp], %[ptr]\n"			\
@@ -91,6 +92,7 @@ __percpu_##name##_return_case_##sz(void *ptr, unsigned long val)	\
 	unsigned int loop;						\
 	u##sz ret;							\
 									\
+	asm volatile("prfm pstl1strm, %a0\n" : : "p" (ptr));		\
 	asm volatile (ARM64_LSE_ATOMIC_INSN(				\
 	/* LL/SC */							\
 	"1:	ldxr" #sfx "\t%" #w "[ret], %[ptr]\n"			\
-----------------8<------------------------
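
For illustration only, the same idea (hint a write-prefetch, then do the
atomic) can be sketched in userspace C. This is not the kernel macro:
counter_add_prefetched() is a hypothetical name, and it relies on the
GCC/Clang builtins __builtin_prefetch() and __atomic_add_fetch(). With
GCC on arm64, a write prefetch with locality 0 is expected to lower to
"prfm pstl1strm", but that is an assumption worth checking in the
disassembly. Whether the add itself becomes an LSE instruction or an
LL/SC loop depends on compiler flags such as -march and
-moutline-atomics.

```c
#include <stdint.h>

/*
 * Userspace illustration only: hint that *ctr is about to be written,
 * then perform a relaxed atomic add. counter_add_prefetched() is a
 * hypothetical name, not a kernel function.
 */
static inline uint64_t counter_add_prefetched(uint64_t *ctr, uint64_t val)
{
	/* rw = 1 (prefetch for write), locality = 0 (no reuse expected) */
	__builtin_prefetch(ctr, 1, 0);

	/* Returns the new value, like the *_return per-CPU variants. */
	return __atomic_add_fetch(ctr, val, __ATOMIC_RELAXED);
}
```

Unlike the per-CPU helpers, this sketch makes no attempt to pin the
caller to a CPU, so it says nothing about preemption or migration.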

> Here are my results for the underlying this_cpu_inc()
> and this_cpu_dec() pair of operations:
> 
>                                  LSE Atomics Enabled (Stock)   LSE Atomics Disabled
> Without Yicong's Patch (Stock)                       110.786                  9.852
> With Yicong's Patch                                  109.873                  9.853
> 
> As you can see, disabling LSE gets about an order of magnitude
> improvement, and Yicong's patch has no statistically significant effect.
> 
> This and more can be found in the "Per-CPU Increment/Decrement"
> section of this Google document:
> 
> https://docs.google.com/document/d/1RoYRrTsabdeTXcldzpoMnpmmCjGbJNWtDXN6ZNr_4H8/edit?usp=sharing
> 
> Full disclosure: Calls to srcu_read_lock_fast() followed by
> srcu_read_unlock_fast() really use one this_cpu_inc() followed by another
> this_cpu_inc(), but I am not seeing any difference between the two.
> And testing the underlying primitives allows my tests to give reproducible
> results regardless of what state I have the SRCU code in.  ;-)
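
A rough userspace analogue of timing an increment/decrement pair might
look like the sketch below. It is not Paul's test harness and it does not
use the kernel's this_cpu_inc()/this_cpu_dec(): ns_per_inc_dec_pair() is
a hypothetical helper built on the GCC/Clang __atomic builtins and
clock_gettime(), and the counter is an ordinary variable rather than a
per-CPU one. Numbers from it are not comparable to the table above; the
point is only the shape of the measurement.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L	/* for clock_gettime() under strict -std= */
#endif
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: average nanoseconds per atomic inc/dec pair,
 * measured over iters rounds on a single uncontended counter.
 */
static double ns_per_inc_dec_pair(uint64_t *ctr, unsigned long iters)
{
	struct timespec t0, t1;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		__atomic_fetch_add(ctr, 1, __ATOMIC_RELAXED);
		__atomic_fetch_sub(ctr, 1, __ATOMIC_RELAXED);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Elapsed time in ns, divided by the number of pairs. */
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

Building it twice on arm64, once so that the compiler emits LSE
instructions and once so that it emits LL/SC loops, would give a
userspace counterpart to the enabled/disabled comparison, assuming the
toolchain honours the chosen -march and -moutline-atomics settings.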

Thanks. I'll go through your emails in more detail tomorrow/Monday.

-- 
Catalin

