public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed
* [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
@ 2026-04-29 17:04 Yang Shi
  2026-04-29 17:04 ` [PATCH 01/11] arm64: mm: enable percpu kernel page table Yang Shi
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel


Introduction
============
This patch series implements the LSFMM 2026 proposal for optimizing
this_cpu_*() ops on ARM64.  For the details of the proposal, please refer to:
https://lore.kernel.org/linux-mm/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
I didn't repeat it in the cover letter because there is no change to the
proposal.
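
A minimal sketch of the idea (not the exact code in this series): today an
arm64 this_cpu access adds a per-CPU offset read from a system register,
while with a per-CPU kernel page table each CPU maps its own percpu data at
the same virtual address, so a single constant offset works everywhere.

	/*
	 * Current scheme (conceptually): the offset differs per CPU and is
	 * read from a system register (tpidr_el1/el2 holds the CPU's entry
	 * of __per_cpu_offset[]).
	 */
	ptr = (void *)((unsigned long)&var + __per_cpu_offset[cpu]);

	/*
	 * With the local percpu map: one constant offset
	 * (__per_cpu_local_off, introduced in patch 7), identical on every
	 * CPU, because each CPU's page table maps its own percpu data at
	 * the same virtual address.
	 */
	ptr = (void *)((unsigned long)&var + __per_cpu_local_off);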

The series is based on v7.1-rc1.  It is basically a minimum viable set of
patches.  There are still a few hacks in this series and it may break some
things, for example, KPTI and SMT machines which share TLB entries.  But it
should be good enough for now to demonstrate the core idea.  The main purpose
of the RFC is to gather feedback early, figure out missing parts and risks,
and make sure we are on the right track, and hopefully it can also help the
discussion at the upcoming LSFMM.

I broke the patches down into arch-dependent and arch-independent parts so
that interested people can more easily experiment on other architectures,
for example, s390.

A new kernel config, HAVE_LOCAL_PER_CPU_MAP, is introduced.  Architectures
which can support this feature select it.  Allocating and freeing the percpu
local mapping is guarded by this config so that other architectures don't pay
the cost.

 
Known Issues
============
1. KPTI
-------
We need to determine which CPU we are on, then switch to the right page table.
Currently the arm64 kernel fetches tramp_pg_dir as swapper_pg_dir minus a fixed
offset, and fetches swapper_pg_dir from ttbr1.  But ttbr1 may no longer hold
swapper_pg_dir on any CPU except CPU #0, so we need to figure out another way
to handle it.  Switching to tramp_pg_dir should be easy, but the reverse seems
harder because tramp_pg_dir just maps the trampoline vectors.
Maybe we can do a two-step switch: switch to swapper_pg_dir in the first step,
then switch to the per-cpu page table (for entry) or the tramp page table (for
exit).  Nobody should call this_cpu_*() in either the userspace -> kernel entry
path or the kernel -> userspace exit path.

2. Shared TLB machines
----------------------
Some machines may share TLB entries between CPUs; for example, SMT machines may
share the TLB between the two hardware threads of a single core.
The per-cpu page table just can't work in that case.  Maybe we need a new
cpufeature to indicate whether the per-cpu page table is allowed or not, and
then only enable it on machines which don't share TLBs.

 
Benchmark
=========
The benchmarks were done on a 160-core AmpereOne machine.  The baseline is the
v7.1-rc1 kernel.

1. Kernel Build
---------------
Run kernel build (make -j160) with the default Fedora kernel config in a
memcg.
13% - 18% sys time improvement
3% - 7% wall time improvement

2. stress-ng vm ops
-------------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
8.5% improvement

3. stress-ng vm ops + fork
--------------------------
stress-ng --mmapfork 160 --mmapfork-bytes 128M --mmapfork-ops 500
15% improvement


Regression test
===============
1. memcg creation
-----------------
Create 10K memcgs. Each memcg creation needs to allocate multiple percpu
variables, for example, percpu refcnt, rstat and objcg percpu refcnt.

Consumed 2112K more virtual memory for the percpu "local mapping", plus a few
more megabytes consumed by the per-cpu page tables.
No noticeable regression was found in elapsed time.

2. fork test
------------
stress-ng --fork 160 --fork-ops 10000000
fork() needs to allocate multiple percpu variables, for example, rss
counters and mm_cid_cpu.

Roughly 1% regression was found.  However, the stress-ng fork test has quite a
small address space; real-life workloads typically have a much larger address
space and do more complicated work.  The stress-ng mmapfork benchmark saw a
15% improvement.


Yang Shi (11):
      arm64: mm: enable percpu kernel page table
      arm64: mm: define percpu virtual space area
      arm64: smp: define setup_per_cpu_areas()
      mm: percpu: prepare to use dedicated percpu area
      arm64: mm: map local percpu first chunk
      mm: percpu: set up first chunk and reserve chunk
      arm64: mm: introduce __per_cpu_local_off
      vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush()
      mm: percpu: allocate and free local percpu vm area
      arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP
      arm64: percpu: use local percpu for this_cpu_*() APIs

 arch/arm64/Kconfig                   |   2 +-
 arch/arm64/include/asm/mmu.h         |   3 +++
 arch/arm64/include/asm/mmu_context.h |   6 +++++-
 arch/arm64/include/asm/percpu.h      |  17 ++++++++++-------
 arch/arm64/include/asm/pgtable.h     |  24 +++++++++++++++++++++---
 arch/arm64/kernel/setup.c            |   3 +++
 arch/arm64/kernel/smp.c              |  40 ++++++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mmu.c                  |  75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/ptdump.c               |   4 ++++
 drivers/base/arch_numa.c             |  51 +--------------------------------------------------
 include/linux/percpu.h               |   4 +++-
 include/linux/vmalloc.h              |   3 +++
 mm/Kconfig                           |   3 +++
 mm/internal.h                        |   5 ++++-
 mm/kmsan/hooks.c                     |  14 +++++++-------
 mm/percpu-internal.h                 |  15 +++++++++++++++
 mm/percpu-vm.c                       |  91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/percpu.c                          |  46 +++++++++++++++++++++++++++++++++++++---------
 mm/vmalloc.c                         | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
 19 files changed, 419 insertions(+), 99 deletions(-)


Thanks,
Yang



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 01/11] arm64: mm: enable percpu kernel page table
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 02/11] arm64: mm: define percpu virtual space area Yang Shi
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

Currently all CPUs share the same kernel page table (swapper_pg_dir).
This patch creates a kernel page table for each CPU so that each CPU uses
its own kernel page table.  CPU 0 keeps using swapper_pg_dir.  All the
kernel page tables share the same content, so we don't have to
duplicate the whole page table for all CPUs; we just need a different
pgd page for each CPU.  All kernel page table modifications (split,
creation, deletion, etc) actually still happen on swapper_pg_dir; the
modification needs to be synchronized to the other CPUs' page tables when
the pgd level is modified.

The percpu page table can't be shared across cores, so clear the CnP bit
too even though CNP is supported.

Some features may not work with it for now, for example, KPTI.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/mmu.h         |  1 +
 arch/arm64/include/asm/mmu_context.h |  6 +++-
 arch/arm64/include/asm/pgtable.h     |  3 ++
 arch/arm64/kernel/setup.c            |  3 ++
 arch/arm64/kernel/smp.c              |  8 +++++
 arch/arm64/mm/mmu.c                  | 53 ++++++++++++++++++++++++++++
 6 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 5e1211c540ab..8ed3b5f3cf84 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -63,6 +63,7 @@ static inline bool arm64_kernel_unmapped_at_el0(void)
 extern void arm64_memblock_init(void);
 extern void paging_init(void);
 extern void bootmem_init(void);
+extern void setup_percpu_pgd(void);
 extern void create_mapping_noalloc(phys_addr_t phys, unsigned long virt,
 				   phys_addr_t size, pgprot_t prot);
 extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 803b68758152..0ee900eed612 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -27,6 +27,7 @@
 #include <asm/tlbflush.h>
 
 extern bool rodata_full;
+extern pgd_t *percpu_pgd[NR_CPUS];
 
 static inline void contextidr_thread_switch(struct task_struct *next)
 {
@@ -138,7 +139,10 @@ void __cpu_replace_ttbr1(pgd_t *pgdp, bool cnp);
 
 static inline void cpu_enable_swapper_cnp(void)
 {
-	__cpu_replace_ttbr1(lm_alias(swapper_pg_dir), true);
+	unsigned int cpu = smp_processor_id();
+	pgd_t *ttbr1 = percpu_pgd[cpu];
+
+	__cpu_replace_ttbr1(ttbr1, false);
 }
 
 static inline void cpu_replace_ttbr1(pgd_t *pgdp)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4dfa42b7d053..38eec71ec383 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1216,6 +1216,9 @@ p4d_t *p4d_offset_lockless_folded(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
 
 #endif  /* CONFIG_PGTABLE_LEVELS > 4 */
 
+#define ARCH_PAGE_TABLE_SYNC_MASK \
+	(pgtable_l5_enabled() ? PGTBL_PGD_MODIFIED : PGTBL_P4D_MODIFIED)
+
 #define pgd_ERROR(e)	\
 	pr_err("%s:%d: bad pgd %016llx.\n", __FILE__, __LINE__, pgd_val(e))
 
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 23c05dc7a8f2..6d420ad59af4 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -360,6 +360,9 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
 	smp_init_cpus();
 	smp_build_mpidr_hash();
 
+	/* Must be called after smp_init_cpus */
+	setup_percpu_pgd();
+
 #ifdef CONFIG_ARM64_SW_TTBR0_PAN
 	/*
 	 * Make sure init_thread_info.ttbr0 always generates translation
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 1aa324104afb..88a82eb56fb3 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -55,6 +55,8 @@
 
 #include <trace/events/ipi.h>
 
+extern void idmap_cpu_replace_ttbr1(phys_addr_t pgdir);
+
 /*
  * as from 2.5, kernels no longer have an init_tasks structure
  * so we need some other way of telling a new secondary core
@@ -198,6 +200,12 @@ asmlinkage notrace void secondary_start_kernel(void)
 	struct mm_struct *mm = &init_mm;
 	const struct cpu_operations *ops;
 	unsigned int cpu = smp_processor_id();
+	typedef void (ttbr_replace_func)(phys_addr_t);
+	ttbr_replace_func *replace_ttbr;
+
+	phys_addr_t ttbr1 = phys_to_ttbr(virt_to_phys(percpu_pgd[cpu]));
+	replace_ttbr = (void *)__pa_symbol(idmap_cpu_replace_ttbr1);
+	replace_ttbr(ttbr1);
 
 	/*
 	 * All kernel threads share the same mm context; grab a
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index dd85e093ffdb..ed1545baa045 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -67,6 +67,59 @@ long __section(".mmuoff.data.write") __early_cpu_boot_status;
 static DEFINE_SPINLOCK(swapper_pgdir_lock);
 static DEFINE_MUTEX(fixmap_lock);
 
+pgd_t *percpu_pgd[NR_CPUS] __ro_after_init;
+bool percpu_pgd_setup_done __ro_after_init = false;
+
+void __init setup_percpu_pgd(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		void *addr;
+
+		if (cpu == 0) {
+			percpu_pgd[cpu] = swapper_pg_dir;
+			continue;
+		}
+
+		addr = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!addr)
+			panic("Can't alloc percpu pgd\n");
+
+		memcpy(addr, (void *)swapper_pg_dir, PAGE_SIZE);
+		percpu_pgd[cpu] = (pgd_t *)addr;
+	}
+
+	dsb(ishst);
+
+	percpu_pgd_setup_done = true;
+}
+
+void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
+{
+	unsigned long addr, next;
+	int cpu;
+	pgd_t *pgdp = pgd_offset_k(start);
+	pgd_t pgd;
+	unsigned int index = pgd_index(start);
+
+	BUG_ON(start > end);
+
+	if (!percpu_pgd_setup_done)
+		return;
+
+	addr = start;
+	do {
+		pgd = READ_ONCE(*pgdp);
+		next = pgd_addr_end(addr, end);
+		for_each_possible_cpu(cpu) {
+			if (cpu == 0)
+				continue;
+			set_pgd(percpu_pgd[cpu] + index, pgd);
+		}
+	} while (pgdp++, index++, addr = next, addr != end);
+}
+
 void noinstr set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 	pgd_t *fixmap_pgdp;
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 02/11] arm64: mm: define percpu virtual space area
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
  2026-04-29 17:04 ` [PATCH 01/11] arm64: mm: enable percpu kernel page table Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 03/11] arm64: smp: define setup_per_cpu_areas() Yang Shi
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

The percpu allocator returns an offset from the percpu base address.  The
percpu base address is determined by the first chunk, which is typically
at the low end of the vmalloc space; however, percpu variables are
typically allocated from the high end of the vmalloc space.  So the offset
could be quite big; it may be as large as the whole vmalloc space.  To support
the local percpu mapping in order to optimize this_cpu_*() ops, the percpu
allocator needs to allocate memory from the local percpu area too in the
following patch, and the offset to the local percpu base address must be the
same because the offset returned by the percpu allocator is used to access
both the global percpu and the local percpu areas.

We could halve the vmalloc space and dedicate one half to local percpu, but
that wastes too much address space.
So carve out dedicated global percpu and local percpu areas.  Each area is
2 * PGDIR_SIZE, i.e. 1TB with 4K page size, which should be big enough for percpu.
The percpu areas are PGDIR_SIZE aligned so that the percpu page tables only
need to be synced at the pgd level, minimizing page table sync overhead.
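
For reference, rough arithmetic behind the 1TB figure, assuming the common
4K-page, 4-level (48-bit VA) configuration:

	/*
	 * Assuming 4K pages with 4 levels of page tables (48-bit VA):
	 *   PGDIR_SHIFT = 39, so PGDIR_SIZE = 1UL << 39 = 512GB
	 *   PERCPU_SIZE = 2 * PGDIR_SIZE = 1TB per area
	 */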

The kernel virtual address space layout now looks like:

+-----------------+
|  Linear mapping |
+-----------------+
|  Modules        |
+-----------------+
|  Vmalloc        |
+-----------------+
|  Global Percpu  |
+-----------------+
|  Local Percpu   |
+-----------------+
|  Vmemmap        |
+-----------------+
|  PCI I/O        |
+-----------------+
|  Fixed map      |
+-----------------+

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/pgtable.h | 21 ++++++++++++++++++---
 arch/arm64/mm/mmu.c              |  4 ++++
 arch/arm64/mm/ptdump.c           |  4 ++++
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 38eec71ec383..9043b976682c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -18,14 +18,29 @@
  * VMALLOC range.
  *
  * VMALLOC_START: beginning of the kernel vmalloc space
- * VMALLOC_END: extends to the available space below vmemmap
+ * VMALLOC_END: extends to the space below global percpu area
  */
 #define VMALLOC_START		(MODULES_END)
+#define VMALLOC_END		(PERCPU_START - SZ_8M)
+
+/*
+ * PERCPU range
+ *
+ * PERCPU_START: beginning of global percpu area
+ * PERCPU_END: end of global percpu area
+ * LOCAL_PERCPU_START: beginning of local percpu area
+ * LOCAL_PERCPU_END: end of local percpu area, extend to the available
+ *                   space below vmemap
+ */
+#define PERCPU_SIZE		(2 * PGDIR_SIZE)
+#define PERCPU_START		(PERCPU_END - PERCPU_SIZE)
+#define PERCPU_END		(LOCAL_PERCPU_START)
+#define LOCAL_PERCPU_START	(LOCAL_PERCPU_END - PERCPU_SIZE)
 #if VA_BITS == VA_BITS_MIN
-#define VMALLOC_END		(VMEMMAP_START - SZ_8M)
+#define LOCAL_PERCPU_END	(ALIGN_DOWN(VMEMMAP_START, PGDIR_SIZE))
 #else
 #define VMEMMAP_UNUSED_NPAGES	((_PAGE_OFFSET(vabits_actual) - PAGE_OFFSET) >> PAGE_SHIFT)
-#define VMALLOC_END		(VMEMMAP_START + VMEMMAP_UNUSED_NPAGES * sizeof(struct page) - SZ_8M)
+#define LOCAL_PERCPU_END	(ALIGN_DOWN((VMEMMAP_START + VMEMMAP_UNUSED_NPAGES * sizeof(struct page)), PGDIR_SIZE))
 #endif
 
 #define vmemmap			((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ed1545baa045..7708dcc1b6a9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -108,6 +108,10 @@ void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
 	if (!percpu_pgd_setup_done)
 		return;
 
+	/* Don't sync local percpu area page table */
+	if (start >= LOCAL_PERCPU_START && end < LOCAL_PERCPU_END)
+		return;
+
 	addr = start;
 	do {
 		pgd = READ_ONCE(*pgdp);
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index ab9899ca1e5f..7d5696a48917 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -389,6 +389,10 @@ static int __init ptdump_init(void)
 		{ MODULES_END,		"Modules end" },
 		{ VMALLOC_START,	"vmalloc() area" },
 		{ VMALLOC_END,		"vmalloc() end" },
+		{ PERCPU_START,		"Global percpu start" },
+		{ PERCPU_END,		"Global percpu end" },
+		{ LOCAL_PERCPU_START,	"Local percpu start" },
+		{ LOCAL_PERCPU_END,	"Local percpu end" },
 		{ vmemmap_start,	"vmemmap start" },
 		{ VMEMMAP_END,		"vmemmap end" },
 		{ PCI_IO_START,		"PCI I/O start" },
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 03/11] arm64: smp: define setup_per_cpu_areas()
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
  2026-04-29 17:04 ` [PATCH 01/11] arm64: mm: enable percpu kernel page table Yang Shi
  2026-04-29 17:04 ` [PATCH 02/11] arm64: mm: define percpu virtual space area Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 04/11] mm: percpu: prepare to use dedicated percpu area Yang Shi
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

We need to modify setup_per_cpu_areas() to set up the local percpu area for
arm64, so the drivers/base/arch_numa.c implementation won't work anymore;
move it to the arm64 directory.  No functional change.

It looks like riscv is the only user of it after this change.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/kernel/smp.c  | 49 ++++++++++++++++++++++++++++++++++++++
 drivers/base/arch_numa.c | 51 +---------------------------------------
 2 files changed, 50 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 88a82eb56fb3..0cc8f4a9efa7 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -821,6 +821,55 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	}
 }
 
+extern int cpu_to_node_map[NR_CPUS];
+
+unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
+EXPORT_SYMBOL(__per_cpu_offset);
+
+int early_cpu_to_node(int cpu)
+{
+	return cpu_to_node_map[cpu];
+}
+
+static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
+{
+	return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+}
+
+void __init setup_per_cpu_areas(void)
+{
+	unsigned long delta;
+	unsigned int cpu;
+	int rc = -EINVAL;
+
+	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
+		/*
+		 * Always reserve area for module percpu variables.  That's
+		 * what the legacy allocator did.
+		 */
+		rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+					    pcpu_cpu_distance,
+					    early_cpu_to_node);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+		if (rc < 0)
+			pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
+				   pcpu_fc_names[pcpu_chosen_fc], rc);
+#endif
+	}
+
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+	if (rc < 0)
+		rc = pcpu_page_first_chunk(PERCPU_MODULE_RESERVE, early_cpu_to_node);
+#endif
+	if (rc < 0)
+		panic("Failed to initialize percpu areas (err=%d).", rc);
+
+	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
+	for_each_possible_cpu(cpu)
+		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
+}
+
 static const char *ipi_types[MAX_IPI] __tracepoint_string = {
 	[IPI_RESCHEDULE]	= "Rescheduling interrupts",
 	[IPI_CALL_FUNC]		= "Function call interrupts",
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..b3b91ceed6a9 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -16,7 +16,7 @@
 
 #include <asm/sections.h>
 
-static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
 
 bool numa_off;
 
@@ -140,55 +140,6 @@ void __init early_map_cpu_to_node(unsigned int cpu, int nid)
 		set_cpu_numa_node(cpu, nid);
 }
 
-#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
-EXPORT_SYMBOL(__per_cpu_offset);
-
-int early_cpu_to_node(int cpu)
-{
-	return cpu_to_node_map[cpu];
-}
-
-static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
-{
-	return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
-}
-
-void __init setup_per_cpu_areas(void)
-{
-	unsigned long delta;
-	unsigned int cpu;
-	int rc = -EINVAL;
-
-	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
-		/*
-		 * Always reserve area for module percpu variables.  That's
-		 * what the legacy allocator did.
-		 */
-		rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
-					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
-					    pcpu_cpu_distance,
-					    early_cpu_to_node);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-		if (rc < 0)
-			pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
-				   pcpu_fc_names[pcpu_chosen_fc], rc);
-#endif
-	}
-
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-	if (rc < 0)
-		rc = pcpu_page_first_chunk(PERCPU_MODULE_RESERVE, early_cpu_to_node);
-#endif
-	if (rc < 0)
-		panic("Failed to initialize percpu areas (err=%d).", rc);
-
-	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
-	for_each_possible_cpu(cpu)
-		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
-}
-#endif
-
 /*
  * Initialize NODE_DATA for a node on the local memory
  */
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 04/11] mm: percpu: prepare to use dedicated percpu area
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (2 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 03/11] arm64: smp: define setup_per_cpu_areas() Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 05/11] arm64: mm: map local percpu first chunk Yang Shi
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

Percpu variables are allocated from the vmalloc area by default.  The
architectures which support the local percpu map, for example ARM64, will
allocate percpu variables from a dedicated percpu area instead.

Introduce a new kernel config, CONFIG_HAVE_LOCAL_PER_CPU_MAP.  The
architectures which support the local percpu map need to select it.  If
it is enabled, allocate percpu variables from the dedicated percpu area.
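
A minimal sketch of how an architecture opts in (the actual arm64 select is
added in patch 10 of this series):

	config ARM64
		select HAVE_LOCAL_PER_CPU_MAP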

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 mm/Kconfig   |  3 +++
 mm/percpu.c  |  6 ++++++
 mm/vmalloc.c | 20 +++++++++++++++++---
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e8bf1e9e6ad9..ccdf58b63fb8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1022,6 +1022,9 @@ config NEED_PER_CPU_PAGE_FIRST_CHUNK
 config USE_PERCPU_NUMA_NODE_ID
 	bool
 
+config HAVE_LOCAL_PER_CPU_MAP
+	bool
+
 config HAVE_SETUP_PER_CPU_AREA
 	bool
 
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..daa2c88e6971 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3243,9 +3243,15 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 	}
 
 	/* allocate vm area, map the pages and copy static data */
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	vm.addr = (void *)ALIGN(PERCPU_START, PAGE_SIZE);
+	vm.size = num_possible_cpus() * ai->unit_size;
+	vm_area_add_early(&vm);
+#else
 	vm.flags = VM_ALLOC;
 	vm.size = num_possible_cpus() * ai->unit_size;
 	vm_area_register_early(&vm, PAGE_SIZE);
+#endif
 
 	for (unit = 0; unit < num_possible_cpus(); unit++) {
 		unsigned long unit_addr =
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index aa08651ec0df..068a6709062d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4841,9 +4841,15 @@ pvm_find_va_enclose_addr(unsigned long addr)
 static unsigned long
 pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
 {
-	unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+	unsigned long vmalloc_end;
 	unsigned long addr;
 
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	vmalloc_end = PERCPU_END & ~(align - 1);
+#else
+	vmalloc_end = VMALLOC_END & ~(align - 1);
+#endif
+
 	if (likely(*va)) {
 		list_for_each_entry_from_reverse((*va),
 				&free_vmap_area_list, list) {
@@ -4884,14 +4890,22 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
 				     size_t align)
 {
-	const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
-	const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
+	unsigned long vmalloc_start;
+	unsigned long vmalloc_end;
 	struct vmap_area **vas, *va;
 	struct vm_struct **vms;
 	int area, area2, last_area, term_area;
 	unsigned long base, start, size, end, last_end, orig_start, orig_end;
 	bool purged = false;
 
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	vmalloc_start = ALIGN(PERCPU_START, align);
+	vmalloc_end = PERCPU_END & ~(align - 1);
+#else
+	vmalloc_start = ALIGN(VMALLOC_START, align);
+	vmalloc_end = VMALLOC_END & ~(align - 1);
+#endif
+
 	/* verify parameters and allocate data structures */
 	BUG_ON(offset_in_page(align) || !is_power_of_2(align));
 	for (last_area = 0, area = 0; area < nr_vms; area++) {
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 05/11] arm64: mm: map local percpu first chunk
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (3 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 04/11] mm: percpu: prepare to use dedicated percpu area Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 06/11] mm: percpu: set up first chunk and reserve chunk Yang Shi
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

Allocate local percpu area and map to percpu page table for the first
chunk.

This doesn't work for PCPU_FC_EMBED because the percpu base address may
be in the linear mapping space in that case, which would result in the percpu
allocator returning a huge offset.  So the percpu local map can only work with
PCPU_FC_PAGE, which allocates percpu variables from the vmalloc area or the
dedicated percpu area.  So unselect NEED_PER_CPU_EMBED_FIRST_CHUNK if the
architecture supports the percpu local map.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/Kconfig           |  1 -
 arch/arm64/include/asm/mmu.h |  2 ++
 arch/arm64/kernel/smp.c      | 25 ++-----------------------
 arch/arm64/mm/mmu.c          | 18 ++++++++++++++++++
 mm/percpu-internal.h         | 12 ++++++++++++
 mm/percpu.c                  | 13 +++++++++++++
 6 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..0e12e531a5b2 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1525,7 +1525,6 @@ config NUMA
 	select GENERIC_ARCH_NUMA
 	select OF_NUMA
 	select HAVE_SETUP_PER_CPU_AREA
-	select NEED_PER_CPU_EMBED_FIRST_CHUNK
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
 	select USE_PERCPU_NUMA_NODE_ID
 	help
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 8ed3b5f3cf84..d81e5c483b55 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -73,6 +73,8 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
 extern int split_kernel_leaf_mapping(unsigned long start, unsigned long end);
 extern void linear_map_maybe_split_to_ptes(void);
+extern void map_local_percpu_first_chunk(pgd_t *pgdir, unsigned long virt,
+				struct page **pages, unsigned int nr);
 
 /*
  * This check is triggered during the early boot before the cpufeature
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 0cc8f4a9efa7..4caa6ebec12f 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -831,36 +831,15 @@ int early_cpu_to_node(int cpu)
 	return cpu_to_node_map[cpu];
 }
 
-static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
-{
-	return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
-}
-
 void __init setup_per_cpu_areas(void)
 {
 	unsigned long delta;
 	unsigned int cpu;
 	int rc = -EINVAL;
 
-	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
-		/*
-		 * Always reserve area for module percpu variables.  That's
-		 * what the legacy allocator did.
-		 */
-		rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
-					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
-					    pcpu_cpu_distance,
-					    early_cpu_to_node);
-#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-		if (rc < 0)
-			pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
-				   pcpu_fc_names[pcpu_chosen_fc], rc);
-#endif
-	}
-
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
-	if (rc < 0)
-		rc = pcpu_page_first_chunk(PERCPU_MODULE_RESERVE, early_cpu_to_node);
+	/* PCPU page table just can support PCPU_FC_PAGE */
+	rc = pcpu_page_first_chunk(PERCPU_MODULE_RESERVE, early_cpu_to_node);
 #endif
 	if (rc < 0)
 		panic("Failed to initialize percpu areas (err=%d).", rc);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 7708dcc1b6a9..81b662433677 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1044,6 +1044,24 @@ void __init linear_map_maybe_split_to_ptes(void)
 	}
 }
 
+void __init map_local_percpu_first_chunk(pgd_t *pgdir, unsigned long virt,
+			    struct page **pages, unsigned int nr)
+{
+	int i;
+
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0; i < nr; i++) {
+		phys_addr_t phys = page_to_phys(pages[i]);
+		__create_pgd_mapping_locked(pgdir, phys, virt, PAGE_SIZE, PAGE_KERNEL,
+				    early_pgtable_alloc, NO_EXEC_MAPPINGS);
+
+		virt += PAGE_SIZE;
+	}
+
+	arch_leave_lazy_mmu_mode();
+}
+
 /*
  * This function can only be used to modify existing table entries,
  * without allocating new levels of table. Note that this permits the
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 4b3d6ec43703..b33d1f5aba1b 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -5,6 +5,7 @@
 #include <linux/types.h>
 #include <linux/percpu.h>
 #include <linux/memcontrol.h>
+#include <linux/mmu_context.h>
 
 /*
  * pcpu_block_md is the metadata block struct.
@@ -162,6 +163,17 @@ static inline size_t pcpu_obj_full_size(size_t size)
 	return size * num_possible_cpus() + extra_size;
 }
 
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+extern void __init map_local_percpu_first_chunk(pgd_t *pgdir, unsigned long virt,
+                            struct page **pages, unsigned int nr);
+#else
+static inline void __init map_local_percpu_first_chunk(pgd_t *pgdir, unsigned long virt,
+                            struct page **pages, unsigned int nr)
+{
+	return;
+}
+#endif
+
 #ifdef CONFIG_PERCPU_STATS
 
 #include <linux/spinlock.h>
diff --git a/mm/percpu.c b/mm/percpu.c
index daa2c88e6971..59682b77089c 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3194,6 +3194,7 @@ void __init __weak pcpu_populate_pte(unsigned long addr)
 int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t cpu_to_nd_fn)
 {
 	static struct vm_struct vm;
+	static struct vm_struct pcpu_vm;
 	struct pcpu_alloc_info *ai;
 	char psize_str[16];
 	int unit_pages;
@@ -3247,6 +3248,10 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 	vm.addr = (void *)ALIGN(PERCPU_START, PAGE_SIZE);
 	vm.size = num_possible_cpus() * ai->unit_size;
 	vm_area_add_early(&vm);
+
+	pcpu_vm.addr = (void *)ALIGN(LOCAL_PERCPU_START, PAGE_SIZE);
+	pcpu_vm.size = ai->unit_size;
+	vm_area_add_early(&pcpu_vm);
 #else
 	vm.flags = VM_ALLOC;
 	vm.size = num_possible_cpus() * ai->unit_size;
@@ -3270,6 +3275,14 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 
 		/* copy static data */
 		memcpy((void *)unit_addr, __per_cpu_start, ai->static_size);
+
+		/*
+		 * Map percpu data to PERCPU map.
+		 *
+		 * PCPU_FC_EMBED can't support it.
+		 */
+		map_local_percpu_first_chunk(percpu_pgd[unit], (unsigned long)pcpu_vm.addr,
+				&pages[unit * unit_pages], unit_pages);
 	}
 
 	/* we're ready, commit */
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 06/11] mm: percpu: set up first chunk and reserve chunk
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (4 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 05/11] arm64: mm: map local percpu first chunk Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 07/11] arm64: mm: introduce __per_cpu_local_off Yang Shi
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

Set up the first chunk and reserve chunk with local percpu map.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 include/linux/percpu.h |  2 +-
 mm/percpu-internal.h   |  2 ++
 mm/percpu.c            | 24 +++++++++++++++---------
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 85bf8dd9f087..dba050f5b548 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -113,7 +113,7 @@ extern struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
 extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);
 
 extern void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
-					 void *base_addr);
+					 void *base_addr, void *local_base);
 
 extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index b33d1f5aba1b..64b48b99ac06 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -64,6 +64,8 @@ struct pcpu_chunk {
 	 * chunk_md.
 	 */
 	void			*base_addr ____cacheline_aligned_in_smp;
+	/* percpu local base address of the chunk */
+	void                    *local_base;
 
 	unsigned long		*alloc_map;	/* allocation map */
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
diff --git a/mm/percpu.c b/mm/percpu.c
index 59682b77089c..5148c5ccf9e3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1349,15 +1349,16 @@ static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
  * Chunk serving the region at @tmp_addr of @map_size.
  */
 static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
-							 int map_size)
+							 unsigned long local_tmp, int map_size)
 {
 	struct pcpu_chunk *chunk;
-	unsigned long aligned_addr;
+	unsigned long aligned_addr, aligned_local;
 	int start_offset, offset_bits, region_size, region_bits;
 	size_t alloc_size;
 
 	/* region calculations */
 	aligned_addr = tmp_addr & PAGE_MASK;
+	aligned_local = local_tmp & PAGE_MASK;
 
 	start_offset = tmp_addr - aligned_addr;
 	region_size = ALIGN(start_offset + map_size, PAGE_SIZE);
@@ -1370,6 +1371,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
 	INIT_LIST_HEAD(&chunk->list);
 
 	chunk->base_addr = (void *)aligned_addr;
+	chunk->local_base = (void *)aligned_local;
 	chunk->start_offset = start_offset;
 	chunk->end_offset = region_size - chunk->start_offset - map_size;
 
@@ -2562,7 +2564,7 @@ static void pcpu_dump_alloc_info(const char *lvl,
  * and available for dynamic allocation like any other chunk.
  */
 void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
-				   void *base_addr)
+				   void *base_addr, void *local_base)
 {
 	size_t size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
 	size_t static_size, dyn_size;
@@ -2572,7 +2574,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	unsigned int cpu;
 	int *unit_map;
 	int group, unit, i;
-	unsigned long tmp_addr;
+	unsigned long tmp_addr, local_tmp;
 	size_t alloc_size;
 
 #define PCPU_SETUP_BUG_ON(cond)	do {					\
@@ -2713,11 +2715,13 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 	 *   chunk.
 	 */
 	tmp_addr = (unsigned long)base_addr + static_size;
+	local_tmp = (unsigned long)local_base + static_size;
 	if (ai->reserved_size)
-		pcpu_reserved_chunk = pcpu_alloc_first_chunk(tmp_addr,
+		pcpu_reserved_chunk = pcpu_alloc_first_chunk(tmp_addr, local_tmp,
 						ai->reserved_size);
 	tmp_addr = (unsigned long)base_addr + static_size + ai->reserved_size;
-	pcpu_first_chunk = pcpu_alloc_first_chunk(tmp_addr, dyn_size);
+	local_tmp = (unsigned long)local_base + static_size + ai->reserved_size;
+	pcpu_first_chunk = pcpu_alloc_first_chunk(tmp_addr, local_tmp, dyn_size);
 
 	pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
 	pcpu_chunk_relocate(pcpu_first_chunk, -1);
@@ -3108,7 +3112,7 @@ int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 		PFN_DOWN(size_sum), ai->static_size, ai->reserved_size,
 		ai->dyn_size, ai->unit_size);
 
-	pcpu_setup_first_chunk(ai, base);
+	pcpu_setup_first_chunk(ai, base, NULL);
 	goto out_free;
 
 out_free_areas:
@@ -3256,6 +3260,8 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 	vm.flags = VM_ALLOC;
 	vm.size = num_possible_cpus() * ai->unit_size;
 	vm_area_register_early(&vm, PAGE_SIZE);
+
+	pcpu_vm.addr = NULL;
 #endif
 
 	for (unit = 0; unit < num_possible_cpus(); unit++) {
@@ -3290,7 +3296,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 		unit_pages, psize_str, ai->static_size,
 		ai->reserved_size, ai->dyn_size);
 
-	pcpu_setup_first_chunk(ai, vm.addr);
+	pcpu_setup_first_chunk(ai, vm.addr, pcpu_vm.addr);
 	goto out_free_ar;
 
 enomem:
@@ -3372,7 +3378,7 @@ void __init setup_per_cpu_areas(void)
 	ai->groups[0].nr_units = 1;
 	ai->groups[0].cpu_map[0] = 0;
 
-	pcpu_setup_first_chunk(ai, fc);
+	pcpu_setup_first_chunk(ai, fc, NULL);
 	pcpu_free_alloc_info(ai);
 }
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 07/11] arm64: mm: introduce __per_cpu_local_off
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (5 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 06/11] mm: percpu: set up first chunk and reserve chunk Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 08/11] vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush() Yang Shi
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

this_cpu_*() ops will use it to get the local percpu address.  It has
the same value on all CPUs.

Also introduce pcpu_local_base, which is the base address of the local
percpu map.
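
For illustration only (the actual accessor change is in patch 11, which
switches the arm64 this_cpu_*() APIs over to the local map), the effect is
roughly:

	/*
	 * Illustrative only; the real change is in patch 11.  Instead of
	 * reading the per-CPU offset from tpidr_el1/el2, the accessors can
	 * use the constant __per_cpu_local_off, which is the same on every
	 * CPU.
	 */
	#define __my_cpu_offset		__per_cpu_local_off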

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/kernel/smp.c | 4 ++++
 include/linux/percpu.h  | 2 ++
 mm/percpu.c             | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 4caa6ebec12f..62afabf86ba1 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -825,6 +825,9 @@ extern int cpu_to_node_map[NR_CPUS];
 
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
+/* Used to calculate pcpu local address, the offset is same for all CPUs */
+unsigned long __per_cpu_local_off __read_mostly;
+EXPORT_SYMBOL(__per_cpu_local_off);
 
 int early_cpu_to_node(int cpu)
 {
@@ -847,6 +850,7 @@ void __init setup_per_cpu_areas(void)
 	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
 	for_each_possible_cpu(cpu)
 		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
+	__per_cpu_local_off = (unsigned long)pcpu_local_base - (unsigned long)__per_cpu_start;
 }
 
 static const char *ipi_types[MAX_IPI] __tracepoint_string = {
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index dba050f5b548..e29ebd265087 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -74,6 +74,8 @@
 
 extern void *pcpu_base_addr;
 extern const unsigned long *pcpu_unit_offsets;
+/* percpu local mapping base */
+extern void *pcpu_local_base;
 
 struct pcpu_group_info {
 	int			nr_units;	/* aligned # of units */
diff --git a/mm/percpu.c b/mm/percpu.c
index 5148c5ccf9e3..17d0c2b0de5a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -145,6 +145,8 @@ static unsigned int pcpu_high_unit_cpu __ro_after_init;
 
 /* the address of the first chunk which starts with the kernel static area */
 void *pcpu_base_addr __ro_after_init;
+/* The address of the first chunk local mapping */
+void *pcpu_local_base __ro_after_init;
 
 static const int *pcpu_unit_map __ro_after_init;		/* cpu -> unit */
 const unsigned long *pcpu_unit_offsets __ro_after_init;	/* cpu -> unit offset */
@@ -3297,6 +3299,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 		ai->reserved_size, ai->dyn_size);
 
 	pcpu_setup_first_chunk(ai, vm.addr, pcpu_vm.addr);
+	pcpu_local_base = pcpu_vm.addr;
 	goto out_free_ar;
 
 enomem:
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 08/11] vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush()
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (6 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 07/11] arm64: mm: introduce __per_cpu_local_off Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 09/11] mm: percpu: allocate and free local percpu vm area Yang Shi
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

vmap{__vunmap}_range_noflush() assume they manipulate the init_mm pgd.  The
following patch will map the percpu local mapping into the percpu page tables
by calling them, so the assumption will no longer stand.  Make them take a
pgd pointer as a parameter.

Also make vmap_range_noflush() non-static; it will be called outside of
vmalloc in the following patch.

There is no functional change.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 mm/internal.h    |  5 ++++-
 mm/kmsan/hooks.c | 14 +++++++-------
 mm/vmalloc.c     | 25 +++++++++++++------------
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..1e54945f8750 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1553,10 +1553,13 @@ void clear_vm_uninitialized_flag(struct vm_struct *vm);
 int __must_check __vmap_pages_range_noflush(unsigned long addr,
 			       unsigned long end, pgprot_t prot,
 			       struct page **pages, unsigned int page_shift);
+int __must_check vmap_range_noflush(pgd_t *pgdir, unsigned long addr,
+			unsigned long end, phys_addr_t phys_addr,
+			pgprot_t prot, unsigned int max_page_shift);
 
 void vunmap_range_noflush(unsigned long start, unsigned long end);
 
-void __vunmap_range_noflush(unsigned long start, unsigned long end);
+void __vunmap_range_noflush(pgd_t *pgdir, unsigned long start, unsigned long end);
 
 static inline bool vma_is_single_threaded_private(struct vm_area_struct *vma)
 {
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 8f22d1f22981..e2a0faf344b9 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -135,8 +135,8 @@ static unsigned long vmalloc_origin(unsigned long addr)
 
 void kmsan_vunmap_range_noflush(unsigned long start, unsigned long end)
 {
-	__vunmap_range_noflush(vmalloc_shadow(start), vmalloc_shadow(end));
-	__vunmap_range_noflush(vmalloc_origin(start), vmalloc_origin(end));
+	__vunmap_range_noflush(init_mm.pgd, vmalloc_shadow(start), vmalloc_shadow(end));
+	__vunmap_range_noflush(init_mm.pgd, vmalloc_origin(start), vmalloc_origin(end));
 	flush_cache_vmap(vmalloc_shadow(start), vmalloc_shadow(end));
 	flush_cache_vmap(vmalloc_origin(start), vmalloc_origin(end));
 }
@@ -181,7 +181,7 @@ int kmsan_ioremap_page_range(unsigned long start, unsigned long end,
 			vmalloc_origin(start + off + PAGE_SIZE), prot, &origin,
 			PAGE_SHIFT);
 		if (mapped) {
-			__vunmap_range_noflush(
+			__vunmap_range_noflush(init_mm.pgd,
 				vmalloc_shadow(start + off),
 				vmalloc_shadow(start + off + PAGE_SIZE));
 			err = mapped;
@@ -203,10 +203,10 @@ int kmsan_ioremap_page_range(unsigned long start, unsigned long end,
 			__free_pages(shadow, 1);
 		if (origin)
 			__free_pages(origin, 1);
-		__vunmap_range_noflush(
+		__vunmap_range_noflush(init_mm.pgd,
 			vmalloc_shadow(start),
 			vmalloc_shadow(start + clean * PAGE_SIZE));
-		__vunmap_range_noflush(
+		__vunmap_range_noflush(init_mm.pgd,
 			vmalloc_origin(start),
 			vmalloc_origin(start + clean * PAGE_SIZE));
 	}
@@ -233,8 +233,8 @@ void kmsan_iounmap_page_range(unsigned long start, unsigned long end)
 	     i++, v_shadow += PAGE_SIZE, v_origin += PAGE_SIZE) {
 		shadow = kmsan_vmalloc_to_page_or_null((void *)v_shadow);
 		origin = kmsan_vmalloc_to_page_or_null((void *)v_origin);
-		__vunmap_range_noflush(v_shadow, vmalloc_shadow(end));
-		__vunmap_range_noflush(v_origin, vmalloc_origin(end));
+		__vunmap_range_noflush(init_mm.pgd, v_shadow, vmalloc_shadow(end));
+		__vunmap_range_noflush(init_mm.pgd, v_origin, vmalloc_origin(end));
 		if (shadow)
 			__free_pages(shadow, 1);
 		if (origin)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 068a6709062d..8ef7d9987e18 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -295,9 +295,9 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int vmap_range_noflush(unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift)
+int vmap_range_noflush(pgd_t *pgdir, unsigned long addr, unsigned long end,
+		       phys_addr_t phys_addr, pgprot_t prot,
+		       unsigned int max_page_shift)
 {
 	pgd_t *pgd;
 	unsigned long start;
@@ -314,7 +314,7 @@ static int vmap_range_noflush(unsigned long addr, unsigned long end,
 	BUG_ON(addr >= end);
 
 	start = addr;
-	pgd = pgd_offset_k(addr);
+	pgd = pgd_offset_pgd(pgdir, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		err = vmap_p4d_range(pgd, addr, next, phys_addr, prot,
@@ -334,8 +334,8 @@ int vmap_page_range(unsigned long addr, unsigned long end,
 {
 	int err;
 
-	err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
-				 ioremap_max_page_shift);
+	err = vmap_range_noflush(init_mm.pgd, addr, end, phys_addr,
+				 pgprot_nx(prot), ioremap_max_page_shift);
 	flush_cache_vmap(addr, end);
 	if (!err)
 		err = kmsan_ioremap_page_range(addr, end, phys_addr, prot,
@@ -478,7 +478,7 @@ static void vunmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
  *
  * This is an internal function only. Do not use outside mm/.
  */
-void __vunmap_range_noflush(unsigned long start, unsigned long end)
+void __vunmap_range_noflush(pgd_t *pgdir, unsigned long start, unsigned long end)
 {
 	unsigned long next;
 	pgd_t *pgd;
@@ -486,7 +486,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
 	pgtbl_mod_mask mask = 0;
 
 	BUG_ON(addr >= end);
-	pgd = pgd_offset_k(addr);
+	pgd = pgd_offset_pgd(pgdir, addr);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_bad(*pgd))
@@ -503,7 +503,7 @@ void __vunmap_range_noflush(unsigned long start, unsigned long end)
 void vunmap_range_noflush(unsigned long start, unsigned long end)
 {
 	kmsan_vunmap_range_noflush(start, end);
-	__vunmap_range_noflush(start, end);
+	__vunmap_range_noflush(init_mm.pgd, start, end);
 }
 
 /**
@@ -670,9 +670,10 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
 	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {
 		int err;
 
-		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
-					page_to_phys(pages[i]), prot,
-					page_shift);
+		err = vmap_range_noflush(init_mm.pgd, addr,
+					 addr + (1UL << page_shift),
+					 page_to_phys(pages[i]), prot,
+					 page_shift);
 		if (err)
 			return err;
 
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 09/11] mm: percpu: allocate and free local percpu vm area
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (7 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 08/11] vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush() Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 10/11] arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP Yang Shi
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

Allocate the local percpu vm area.  The delta between the allocated address
(the chunk local base) and pcpu_local_base must be the same as the delta
between the chunk base and pcpu_base_addr.  Each CPU's local percpu area
will be mapped into its own page table.  This section of the page table is
not shared between CPUs.
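
Expressed as code, the invariant kept for every chunk is:

	/* One offset must address both the global and the local view: */
	(unsigned long)chunk->local_base - (unsigned long)pcpu_local_base ==
	(unsigned long)chunk->base_addr - (unsigned long)pcpu_base_addr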

Also free the local percpu vm area and unmap it from the percpu page tables.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 include/linux/vmalloc.h |  3 ++
 mm/percpu-internal.h    |  1 +
 mm/percpu-vm.c          | 91 +++++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c            | 69 ++++++++++++++++++++++++++++---
 4 files changed, 159 insertions(+), 5 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b02c0c6b371..4b53992a063c 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -311,6 +311,9 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     size_t align);
 
 void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
+struct vm_struct *pcpu_get_local_vm_area(unsigned long hint,
+				     int unit_size, size_t align);
+
 # else
 static inline struct vm_struct **
 pcpu_get_vm_areas(const unsigned long *offsets,
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 64b48b99ac06..2c560e44ee58 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -71,6 +71,7 @@ struct pcpu_chunk {
 	struct pcpu_block_md	*md_blocks;	/* metadata blocks */
 
 	void			*data;		/* chunk data */
+	void			*local_data;	/* chunk local vm */
 	bool			immutable;	/* no [de]population allowed */
 	bool			isolated;	/* isolated from active chunk
 						   slots */
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 4f5937090590..1e6b8fdcab71 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -9,6 +9,7 @@
  * This is the default chunk allocator.
  */
 #include "internal.h"
+#include "percpu-internal.h"
 
 static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
 				    unsigned int cpu, int page_idx)
@@ -130,6 +131,11 @@ static void pcpu_pre_unmap_flush(struct pcpu_chunk *chunk,
 	flush_cache_vunmap(
 		pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
 		pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
+
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	flush_cache_vunmap((unsigned long)chunk->local_base + (page_start << PAGE_SHIFT),
+			(unsigned long)chunk->local_base + (page_end << PAGE_SHIFT));
+#endif
 }
 
 static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
@@ -137,6 +143,20 @@ static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)
 	vunmap_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT));
 }
 
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+static void __pcpu_unmap_pages_local(pgd_t *pgdir, unsigned long virt,
+			    int nr_pages)
+{
+	__vunmap_range_noflush(pgdir, virt, virt + (nr_pages << PAGE_SHIFT));
+}
+#else
+static void __pcpu_unmap_pages_local(pgd_t *pgdir, unsigned long virt,
+			    int nr_pages)
+{
+	return;
+}
+#endif
+
 /**
  * pcpu_unmap_pages - unmap pages out of a pcpu_chunk
  * @chunk: chunk of interest
@@ -166,6 +186,10 @@ static void pcpu_unmap_pages(struct pcpu_chunk *chunk,
 		}
 		__pcpu_unmap_pages(pcpu_chunk_addr(chunk, cpu, page_start),
 				   page_end - page_start);
+
+		__pcpu_unmap_pages_local(percpu_pgd[cpu],
+					(unsigned long)chunk->local_base + (page_start << PAGE_SHIFT),
+					page_end - page_start);
 	}
 }
 
@@ -188,6 +212,12 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
 	flush_tlb_kernel_range(
 		pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
 		pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
+
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	flush_tlb_kernel_range(
+		(unsigned long)chunk->local_base + (page_start << PAGE_SHIFT),
+		(unsigned long)chunk->local_base + (page_end << PAGE_SHIFT));
+#endif
 }
 
 static int __pcpu_map_pages(unsigned long addr, struct page **pages,
@@ -197,6 +227,32 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
 			PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
 }
 
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+static int __pcpu_map_pages_local(pgd_t *pgdir, unsigned long virt, struct page **pages,
+			    int nr_pages)
+{
+	unsigned int i;
+	int err = 0;
+
+	for (i = 0; i < nr_pages; i++) {
+		err = vmap_range_noflush(pgdir, virt, virt + PAGE_SIZE,
+				page_to_phys(pages[i]), PAGE_KERNEL, PAGE_SHIFT);
+		if (err)
+			return err;
+
+		virt += PAGE_SIZE;
+	}
+
+	return err;
+}
+#else
+static int __pcpu_map_pages_local(pgd_t *pgdir, unsigned long virt, struct page **pages,
+			    int nr_pages)
+{
+	return 0;
+}
+#endif
+
 /**
  * pcpu_map_pages - map pages into a pcpu_chunk
  * @chunk: chunk of interest
@@ -224,6 +280,13 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 		if (err < 0)
 			goto err;
 
+		err = __pcpu_map_pages_local(percpu_pgd[cpu],
+					(unsigned long)chunk->local_base + (page_start << PAGE_SHIFT),
+					&pages[pcpu_page_idx(cpu, page_start)],
+					page_end - page_start);
+		if (err < 0)
+			goto err;
+
 		for (i = page_start; i < page_end; i++)
 			pcpu_set_page_chunk(pages[pcpu_page_idx(cpu, i)],
 					    chunk);
@@ -233,6 +296,9 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	for_each_possible_cpu(tcpu) {
 		__pcpu_unmap_pages(pcpu_chunk_addr(chunk, tcpu, page_start),
 				   page_end - page_start);
+		__pcpu_unmap_pages_local(percpu_pgd[cpu],
+					(unsigned long)chunk->local_base + (page_start << PAGE_SHIFT),
+					page_end - page_start);
 		if (tcpu == cpu)
 			break;
 	}
@@ -258,6 +324,11 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
 	flush_cache_vmap(
 		pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),
 		pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));
+
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	flush_cache_vmap((unsigned long)chunk->local_base + (page_start << PAGE_SHIFT),
+			 (unsigned long)chunk->local_base + (page_end << PAGE_SHIFT));
+#endif
 }
 
 /**
@@ -349,6 +420,24 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	chunk->data = vms;
 	chunk->base_addr = vms[0]->addr - pcpu_group_offsets[0];
 
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+	unsigned long delta = (unsigned long)chunk->base_addr - (unsigned long)pcpu_base_addr;
+	unsigned long hint = delta + (unsigned long)pcpu_local_base;
+	struct vm_struct *local_vm = pcpu_get_local_vm_area(hint,
+					pcpu_unit_size, pcpu_atom_size);
+	if (!local_vm) {
+		pcpu_free_vm_areas(vms, pcpu_nr_groups);
+		pcpu_free_chunk(chunk);
+		return NULL;
+	}
+
+	chunk->local_base = local_vm->addr;
+	chunk->local_data = (void *)local_vm;
+#else
+	chunk->local_base = NULL;
+	chunk->local_data = NULL;
+#endif
+
 	pcpu_stats_chunk_alloc();
 	trace_percpu_create_chunk(chunk->base_addr);
 
@@ -365,6 +454,8 @@ static void pcpu_destroy_chunk(struct pcpu_chunk *chunk)
 
 	if (chunk->data)
 		pcpu_free_vm_areas(chunk->data, pcpu_nr_groups);
+	if (chunk->local_data)
+		free_vm_area((struct vm_struct *)chunk->local_data);
 	pcpu_free_chunk(chunk);
 }
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8ef7d9987e18..f224ffec5696 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4836,17 +4836,21 @@ pvm_find_va_enclose_addr(unsigned long addr)
  *   in - the VA we start the search(reverse order);
  *   out - the VA with the highest aligned end address.
  * @align: alignment for required highest address
+ * @pcpu: whether to allocate from the local percpu area
  *
  * Returns: determined end address within vmap_area
  */
 static unsigned long
-pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
+pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align, bool pcpu)
 {
 	unsigned long vmalloc_end;
 	unsigned long addr;
 
 #ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
-	vmalloc_end = PERCPU_END & ~(align - 1);
+	if (pcpu)
+		vmalloc_end = LOCAL_PERCPU_END & ~(align - 1);
+	else
+		vmalloc_end = PERCPU_END & ~(align - 1);
 #else
 	vmalloc_end = VMALLOC_END & ~(align - 1);
 #endif
@@ -4955,7 +4959,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	end = start + sizes[area];
 
 	va = pvm_find_va_enclose_addr(vmalloc_end);
-	base = pvm_determine_end_from_reverse(&va, align) - end;
+	base = pvm_determine_end_from_reverse(&va, align, false) - end;
 
 	while (true) {
 		/*
@@ -4976,7 +4980,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		 * base downwards and then recheck.
 		 */
 		if (base + end > va->va_end) {
-			base = pvm_determine_end_from_reverse(&va, align) - end;
+			base = pvm_determine_end_from_reverse(&va, align, false) - end;
 			term_area = area;
 			continue;
 		}
@@ -4986,7 +4990,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		 */
 		if (base + start < va->va_start) {
 			va = node_to_va(rb_prev(&va->rb_node));
-			base = pvm_determine_end_from_reverse(&va, align) - end;
+			base = pvm_determine_end_from_reverse(&va, align, false) - end;
 			term_area = area;
 			continue;
 		}
@@ -5149,6 +5153,63 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 		free_vm_area(vms[i]);
 	kfree(vms);
 }
+
+#ifdef CONFIG_HAVE_LOCAL_PER_CPU_MAP
+/* Find a free vm area starting from hint */
+struct vm_struct *pcpu_get_local_vm_area(unsigned long hint,
+				     int unit_size, size_t align)
+{
+	struct vmap_area *tmp_va, *va;
+	struct vm_struct *vm;
+	struct vmap_node *vn;
+	unsigned long end;
+	int ret;
+
+	va = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL);
+	vm = kzalloc(sizeof(struct vm_struct), GFP_KERNEL);
+	if (!va || !vm)
+		goto err_free;
+
+	spin_lock(&free_vmap_area_lock);
+
+	tmp_va = pvm_find_va_enclose_addr(hint);
+	if (!tmp_va)
+		goto err_unlock;
+
+	end = pvm_determine_end_from_reverse(&tmp_va, align, true);
+
+	if (hint + unit_size > end)
+		goto err_unlock;
+
+	ret = va_clip(&free_vmap_area_root,
+			&free_vmap_area_list, tmp_va, hint, unit_size);
+	if (ret)
+		goto err_unlock;
+
+	va->va_start = hint;
+	va->va_end = hint + unit_size;
+
+	spin_unlock(&free_vmap_area_lock);
+
+	vn = addr_to_node(va->va_start);
+
+	spin_lock(&vn->busy.lock);
+	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
+	setup_vmalloc_vm(vm, va, VM_ALLOC,
+			 pcpu_get_local_vm_area);
+	spin_unlock(&vn->busy.lock);
+
+	return vm;
+
+err_unlock:
+	spin_unlock(&free_vmap_area_lock);
+err_free:
+	kmem_cache_free(vmap_area_cachep, va);
+	kfree(vm);
+
+	return NULL;
+}
+#endif
 #endif	/* CONFIG_SMP */
 
 #ifdef CONFIG_PRINTK
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 10/11] arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (8 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 09/11] mm: percpu: allocate and free local percpu vm area Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-29 17:04 ` [PATCH 11/11] arm64: percpu: use local percpu for this_cpu_*() APIs Yang Shi
  2026-04-30 19:02 ` [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

ARM64 supports the local percpu mapping, so select HAVE_LOCAL_PER_CPU_MAP
by default.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 0e12e531a5b2..1094154c1c45 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1526,6 +1526,7 @@ config NUMA
 	select OF_NUMA
 	select HAVE_SETUP_PER_CPU_AREA
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
+	select HAVE_LOCAL_PER_CPU_MAP
 	select USE_PERCPU_NUMA_NODE_ID
 	help
 	  Enable NUMA (Non-Uniform Memory Access) support.
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 11/11] arm64: percpu: use local percpu for this_cpu_*() APIs
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (9 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 10/11] arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP Yang Shi
@ 2026-04-29 17:04 ` Yang Shi
  2026-04-30 19:02 ` [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-29 17:04 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: yang, linux-mm, linux-arm-kernel, linux-kernel

Use the local percpu address for the this_cpu_*() APIs.  Because each
CPU's percpu variables are mapped at the same virtual address, their
addresses can be calculated with __per_cpu_local_off, which has the same
value on all CPUs.  So the preempt_disable()/preempt_enable() pair is not
needed anymore.  This optimization improves the performance of
this_cpu_*() operations.

Kernel build test on AmpereOne (160 cores) with default Fedora kernel
config in a memcg roughly showed 13% - 15% sys time improvement.
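
For reference, a minimal usage sketch (illustration only, not part of the
patch; the nr_events counter and count_event() helper are hypothetical):
callers of this_cpu_*() do not change, only the expansion does, because
each op is an atomic (LL/SC or LSE) on a virtual address that every CPU's
private page table resolves to its own copy.

/* nr_events/count_event() are made up for illustration. */
DEFINE_PER_CPU(u64, nr_events);

static void count_event(void)
{
	/*
	 * With this patch this expands to an atomic add on
	 * local_cpu_ptr(&nr_events); no preempt_disable()/
	 * preempt_enable() pair is emitted around it.
	 */
	this_cpu_inc(nr_events);
}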

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/percpu.h | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h
index b57b2bb00967..15db56f981de 100644
--- a/arch/arm64/include/asm/percpu.h
+++ b/arch/arm64/include/asm/percpu.h
@@ -12,6 +12,7 @@
 #include <asm/stack_pointer.h>
 #include <asm/sysreg.h>
 
+extern unsigned long __per_cpu_local_off;
 static inline void set_my_cpu_offset(unsigned long off)
 {
 	asm volatile(ALTERNATIVE("msr tpidr_el1, %0",
@@ -153,19 +154,21 @@ PERCPU_RET_OP(add, add, ldadd)
  * disabled.
  */
 
+#define local_cpu_ptr(ptr)						\
+({									\
+	__verify_pcpu_ptr(ptr);						\
+	SHIFT_PERCPU_PTR(ptr, __per_cpu_local_off);			\
+})
+
 #define _pcp_protect(op, pcp, ...)					\
 ({									\
-	preempt_disable_notrace();					\
-	op(raw_cpu_ptr(&(pcp)), __VA_ARGS__);				\
-	preempt_enable_notrace();					\
+	op(local_cpu_ptr(&(pcp)), __VA_ARGS__);				\
 })
 
 #define _pcp_protect_return(op, pcp, args...)				\
 ({									\
 	typeof(pcp) __retval;						\
-	preempt_disable_notrace();					\
-	__retval = (typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args);	\
-	preempt_enable_notrace();					\
+	__retval = (typeof(pcp))op(local_cpu_ptr(&(pcp)), ##args);	\
 	__retval;							\
 })
 
@@ -251,7 +254,7 @@ PERCPU_RET_OP(add, add, ldadd)
 	old__ = o;							\
 	new__ = n;							\
 	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp));					\
+	ptr__ = local_cpu_ptr(&(pcp));					\
 	ret__ = cmpxchg128_local((void *)ptr__, old__, new__);		\
 	preempt_enable_notrace();					\
 	ret__;								\
-- 
2.47.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
  2026-04-29 17:04 [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) Yang Shi
                   ` (10 preceding siblings ...)
  2026-04-29 17:04 ` [PATCH 11/11] arm64: percpu: use local percpu for this_cpu_*() APIs Yang Shi
@ 2026-04-30 19:02 ` Yang Shi
  11 siblings, 0 replies; 13+ messages in thread
From: Yang Shi @ 2026-04-30 19:02 UTC (permalink / raw)
  To: cl, dennis, tj, urezki, catalin.marinas, will, ryan.roberts,
	david, akpm, hca, gor, agordeev
  Cc: linux-mm, linux-arm-kernel, linux-kernel



On 4/29/26 10:04 AM, Yang Shi wrote:
> Introduction
> ============
> This patch series implemented the LSFMM 2026 proposal for optimizing
> this_cpu_*() ops on ARM64. For the details of the proposal, Please refer to:
> https://lore.kernel.org/linux-mm/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
> I didn't repeat it in the cover letter because there is no change to the
> proposal.
>
> The series is based on 7.1-rc1. It is basically minimum viable patches.
> There are still a few hacks in this series and it may break something,
> for example, KPTI, SMT machines which shared TLB, etc. But it shoule be
> good enough for now to demonstrate the core idea. The main purpose of the
> RFC is to gather feedback early, figure out missing parts and risks, and
> make sure we are on the right track, as well as hopefully it can help the
> discussion for the upcoming LSFMM.
>
> I broke the patches down to arch-dependent and arch-independent parts so that
> hopefully the interested persons can do experiments on other architectures,
> for example, S390, easier.
>
> A new kernel config is introduced, HAVE_LOCAL_PER_CPU_MAP. The architectures
> which can support this feature will select it. Allocating and freeing percpu
> local mapping is protected by this config so that others won't pay the cost.
>
>   
> Known Issues
> ============
> 1. KPIT
> -------
> We need determine what CPU we are on, then switch to the right page table.
> Currently arm64 kernel fetches tramp_pg_dir via swapper_pg_dir - fixed_offset,
> and fetches swapper_pg_dir from ttbr1. But ttbr1 may not hold swapper_pg_dir
> anymore except CPU #0. So we need to figure out the other way to handle it.
> Switching to tramp_pg_dir should be easy, but the reverse seems harder because
> tramp_pg_dir just maps the trampoline vectors.
> Maybe we can do two steps switch. Switch to swapper_pg_dir at the first step,
> then switch to per cpu page table (for entry) or tramp page table (for exit).
> Nobody should call this_cpu_*() at either userspace -> kernel entry stage or
> kernel -> userspace exit stage.
>
> 2. Shared TLB machines
> ----------------------
> Some machines may share TLB between CPUs, for example, SMT machines may share
> TLB between the two hardware threads in one single core.
> The per cpu page table just can't work with it. Maybe we need a new
> cpufeature to indicate whether per cpu page table is allowed or not. Then
> just enable it for not-shared-TLB machines.

Adding a few more known issues that I forgot to list.

3. Memory hotplug/unplug
------------------------
The linear mapping and/or vmemmap may get out of sync because memory
hotplug/unplug updates the page tables via __create_pgd_mapping() and
__remove_pgd_mapping(), which have no mechanism to sync the per-CPU page
tables. It should not be hard to resolve, though; a rough sketch follows.
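
A minimal sketch of one possible fix, assuming the percpu_pgd[] array
introduced earlier in this series (sync_percpu_pgds() itself is a
hypothetical helper, not part of the posted patches): once hot(un)plug
has updated swapper_pg_dir, copy the affected top-level entries into
every CPU's private page table.

static void sync_percpu_pgds(unsigned long start, unsigned long end)
{
	unsigned long addr;
	int cpu;

	for (addr = start; addr < end; addr = pgd_addr_end(addr, end)) {
		/* Whatever swapper_pg_dir now has (populated or cleared) */
		pgd_t *src = pgd_offset_pgd(swapper_pg_dir, addr);

		for_each_possible_cpu(cpu) {
			pgd_t *dst = pgd_offset_pgd(percpu_pgd[cpu], addr);

			set_pgd(dst, READ_ONCE(*src));
		}
	}
}

TLB maintenance and serialization against concurrent hotplug would of
course still be needed; this only shows the page table propagation.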

4. 2-level and 3-level page tables
----------------------------------
Page table sync needs to be made to work for these configurations; for
now the series only works with 4-level page tables. This is not hard to
fix either.

5. Confusing /proc/vmallocinfo
------------------------------
The percpu areas were allocated from the vmalloc area before; now they
are not, so they should not show up in /proc/vmallocinfo anymore.


Yang

>
>   
> Benchmark
> =========
> The benchmarks are done on 160 core AmpereOne machine. The baseline is
> v7.1-rc1 kernel.
>
> 1. Kernel Build
> ---------------
> Run kernel build (make -j160) with the default Fedora kernel config in a
> memcg.
> 13% - 18% sys time improvment
> 3% - 7% wall time improvement
>
> 2. stress-ng vm ops
> -------------------
> stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
> 8.5% improvement
>
> 3. stress-ng vm ops + fork
> ----------------------
> stress-ng --mmapfork 160 --mmapfork-bytes 128M --mmapfork-ops 500
> 15% improvement
>
>
> Regression test
> ===============
> 1. memcg creation
> -----------------
> Create 10K memcgs. Each memcg creation needs to allocate multiple percpu
> variables, for example, percpu refcnt, rstat and objcg percpu refcnt.
>
> Consumed 2112K more virtual memory for percpu “local mapping” and a few
> more mega bytes consumed by per cpu page tables.
> No noticeable regression was found for elapsed time.
>
> 2. fork test
> ------------
> stress-ng --fork 160 --fork-ops 10000000
> fork() needs to allocate multiple percpu variables, for example, rss
> counters and mm_cid_cpu.
>
> Roughly 1% regression was found. However stress-ng fork test has quites
> small address space, the real life workloads typically have much larger
> address space and do more complicated works. The stress-ng mmapfork
> benchmark saw 15% improvement.
>
>
> Yang Shi (11):
>        arm64: mm: enable percpu kernel page table
>        arm64: mm: define percpu virtual space area
>        arm64: smp: define setup_per_cpu_areas()
>        mm: percpu: prepare to use dedicated percpu area
>        arm64: mm: map local percpu first chunk
>        mm: percpu: set up first chunk and reserve chunk
>        arm64: mm: introduce __per_cpu_local_off
>        vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush()
>        mm: percpu: allocate and free local percpu vm area
>        arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP
>        arm64: percpu: use local percpu for this_cpu_*() APIs
>
>   arch/arm64/Kconfig                   |   2 +-
>   arch/arm64/include/asm/mmu.h         |   3 +++
>   arch/arm64/include/asm/mmu_context.h |   6 +++++-
>   arch/arm64/include/asm/percpu.h      |  17 ++++++++++-------
>   arch/arm64/include/asm/pgtable.h     |  24 +++++++++++++++++++++---
>   arch/arm64/kernel/setup.c            |   3 +++
>   arch/arm64/kernel/smp.c              |  40 ++++++++++++++++++++++++++++++++++++++++
>   arch/arm64/mm/mmu.c                  |  75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   arch/arm64/mm/ptdump.c               |   4 ++++
>   drivers/base/arch_numa.c             |  51 +--------------------------------------------------
>   include/linux/percpu.h               |   4 +++-
>   include/linux/vmalloc.h              |   3 +++
>   mm/Kconfig                           |   3 +++
>   mm/internal.h                        |   5 ++++-
>   mm/kmsan/hooks.c                     |  14 +++++++-------
>   mm/percpu-internal.h                 |  15 +++++++++++++++
>   mm/percpu-vm.c                       |  91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   mm/percpu.c                          |  46 +++++++++++++++++++++++++++++++++++++---------
>   mm/vmalloc.c                         | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>   19 files changed, 419 insertions(+), 99 deletions(-)
>
>
> Thanks,
> Yang
>



^ permalink raw reply	[flat|nested] 13+ messages in thread
