Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)

From: Catalin Marinas <catalin.marinas@arm.com>
To: Yang Shi <shy828301@gmail.com>
Cc: Tejun Heo <tj@kernel.org>,
	lsf-pc@lists.linux-foundation.org, Linux MM <linux-mm@kvack.org>,
	"Christoph Lameter (Ampere)" <cl@gentwo.org>,
	dennis@kernel.org, urezki@gmail.com,
	Will Deacon <will@kernel.org>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Yang Shi <yang@os.amperecomputing.com>
Subject: Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
Date: Thu, 12 Feb 2026 17:54:19 +0000
Message-ID: <aY4Ty6G6A3478_JS@arm.com>
In-Reply-To: <CAHbLzkoE0UQLZSKQJttv1_XGT-6HPKdj5o7aYnpuiXEyvbAHxA@mail.gmail.com>

On Wed, Feb 11, 2026 at 03:58:50PM -0800, Yang Shi wrote:
> On Wed, Feb 11, 2026 at 3:29 PM Tejun Heo <tj@kernel.org> wrote:
> > On Wed, Feb 11, 2026 at 03:14:57PM -0800, Yang Shi wrote:
> > ...
> > > Overhead
> > > ========
> > > 1. Some extra virtual memory space, but it shouldn't be too much. I
> > > saw 960K with the Fedora default kernel config. Given terabytes of
> > > virtual memory space on a 64-bit machine, 960K is negligible.
> > > 2. Some extra physical memory for percpu kernel page tables: 4K *
> > > (nr_cpus - 1) for PGD pages, plus the page tables used by the percpu
> > > local mapping area. A couple of megabytes with the Fedora default
> > > kernel config on AmpereOne with 160 cores.
> > > 3. Percpu allocation and free will be slower due to the extra virtual
> > > memory allocation and page table manipulation. However, percpu memory
> > > is allocated by chunk, and one chunk typically holds a lot of percpu
> > > variables, so the slowdown should be negligible. The test results
> > > below also show this.
[...]
> > One side effect of this is per_cpu_ptr() of a given CPU disagreeing
> > with this_cpu_ptr(). E.g. if there are users that take this_cpu_ptr() and
> > use it outside a preempt-disabled block (which is a bit odd but allowed),
> > the end result would be surprising. Hmm... I wonder whether it'd be
> > worthwhile to keep this_cpu_ptr() returning the global address, i.e. make
> > it read the global offset from the local mapping and then return the
> > computed global address. This should still be pretty cheap and gets rid
> > of surprising and potentially extremely subtle corner cases.
> 
> Yes, this is going to be a problem. So we don't change how
> this_cpu_ptr() works and keep it returning the global address. I
> noticed this may cause confusion for the list APIs too. For example,
> when initializing a list embedded in a percpu variable, ->next
> and ->prev will be initialized to global addresses by using
> per_cpu_ptr(), but if the list is accessed via this_cpu_ptr(), the
> list head will be referenced by its local address. list_empty(),
> which compares the list head pointer with the ->next pointer, will
> then wrongly report the list as non-empty.
> 
> So we just use the local address for this_cpu_add/sub/inc/dec and so
> on, which just manipulate a scalar counter.
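
To make the list problem above concrete, here is a minimal userspace
sketch. It is only a model: the struct and helpers are simplified
stand-ins for the kernel's list_head API, and the second mapping of the
same memory is simulated with a byte copy at a different address rather
than a real per-CPU alias. All function names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Point an empty list at itself, as INIT_LIST_HEAD() does. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same comparison as list_empty(): is ->next the head itself? */
static bool list_is_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Initialise and test through the same ("global") address. */
static bool empty_via_global(void)
{
	struct list_head global;

	init_list_head(&global);
	return list_is_empty(&global);
}

/*
 * Initialise through the "global" address, then test through a second
 * address holding the same bytes. The copy models (assumption) a local
 * alias of the same percpu memory.
 */
static bool empty_via_alias(void)
{
	struct list_head global, local;

	init_list_head(&global);
	memcpy(&local, &global, sizeof(local));
	return list_is_empty(&local);
}
```

In this model empty_via_global() reports an empty list, while
empty_via_alias() does not, because ->next still holds the address used
at initialisation time.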

I wonder how much overhead is caused by calling into the scheduler on
preempt_enable(). It would be good to get some numbers for something
like the patch below (also removing the preempt disabling for
this_cpu_read() as I don't think it matters - a thread cannot
distinguish whether it was preempted between TPIDR read and variable
read or immediately after the variable read; we can't do this for writes
as other threads may notice unexpected updates).

Another wild hack could be to read the kernel instruction at
(current_pt_regs()->pc - 4) in arch_irqentry_exit_need_resched() and
return false if it's a read from TPIDR_EL1/2, together with removing the
preempt disabling. Or some other lighter way of detecting this_cpu_*
constructs without full preemption disabling.
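
As a rough illustration of the instruction check, the sketch below
matches an A64 instruction word against "mrs xN, tpidr_el1" and
"mrs xN, tpidr_el2" while ignoring the destination register. The helper
name and macros are hypothetical, it is plain userspace C, and it is not
wired into arch_irqentry_exit_need_resched(); in-kernel code would more
likely build on the existing instruction decoding helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A64 "MRS Xt, <sysreg>" keeps the destination register Rt in bits [4:0];
 * the remaining bits select the system register.
 */
#define MRS_TPIDR_EL1	0xd538d080u	/* mrs x0, tpidr_el1 */
#define MRS_TPIDR_EL2	0xd53cd040u	/* mrs x0, tpidr_el2 */
#define MRS_RT_MASK	0x0000001fu

/* Hypothetical helper: true if insn reads TPIDR_EL1 or TPIDR_EL2. */
static bool insn_reads_percpu_base(uint32_t insn)
{
	uint32_t op = insn & ~MRS_RT_MASK;	/* drop the Rt field */

	return op == MRS_TPIDR_EL1 || op == MRS_TPIDR_EL2;
}
```

A caller would pass the 32-bit word fetched from the address just before
the interrupted PC and skip the reschedule when the helper returns true.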

-----------------8<------------------------------------
diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index b57b2bb00967..7194cc997293 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -153,11 +153,17 @@ PERCPU_RET_OP(add, add, ldadd)
  * disabled.
  */
 
+#ifdef preempt_enable_no_resched_notrace
+#define _pcp_preempt_enable_notrace	preempt_enable_no_resched_notrace
+#else
+#define _pcp_preempt_enable_notrace	preempt_enable_notrace
+#endif
+
 #define _pcp_protect(op, pcp, ...)					\
 ({									\
 	preempt_disable_notrace();					\
 	op(raw_cpu_ptr(&(pcp)), __VA_ARGS__);				\
-	preempt_enable_notrace();					\
+	_pcp_preempt_enable_notrace();					\
 })
 
 #define _pcp_protect_return(op, pcp, args...)				\
@@ -165,18 +171,21 @@ PERCPU_RET_OP(add, add, ldadd)
 	typeof(pcp) __retval;						\
 	preempt_disable_notrace();					\
 	__retval = (typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args);	\
-	preempt_enable_notrace();					\
+	_pcp_preempt_enable_notrace();					\
 	__retval;							\
 })
 
+#define _pcp_return(op, pcp, args...)					\
+	((typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args))
+
 #define this_cpu_read_1(pcp)		\
-	_pcp_protect_return(__percpu_read_8, pcp)
+	_pcp_return(__percpu_read_8, pcp)
 #define this_cpu_read_2(pcp)		\
-	_pcp_protect_return(__percpu_read_16, pcp)
+	_pcp_return(__percpu_read_16, pcp)
 #define this_cpu_read_4(pcp)		\
-	_pcp_protect_return(__percpu_read_32, pcp)
+	_pcp_return(__percpu_read_32, pcp)
 #define this_cpu_read_8(pcp)		\
-	_pcp_protect_return(__percpu_read_64, pcp)
+	_pcp_return(__percpu_read_64, pcp)
 
 #define this_cpu_write_1(pcp, val)	\
 	_pcp_protect(__percpu_write_8, pcp, (unsigned long)val)
@@ -253,7 +262,7 @@ PERCPU_RET_OP(add, add, ldadd)
 	preempt_disable_notrace();					\
 	ptr__ = raw_cpu_ptr(&(pcp));					\
 	ret__ = cmpxchg128_local((void *)ptr__, old__, new__);		\
-	preempt_enable_notrace();					\
+	_pcp_preempt_enable_notrace();					\
 	ret__;								\
 })
