public inbox for linux-arm-kernel@lists.infradead.org
From: Yang Shi <yang@os.amperecomputing.com>
To: cl@gentwo.org, dennis@kernel.org, tj@kernel.org,
	urezki@gmail.com, catalin.marinas@arm.com, will@kernel.org,
	ryan.roberts@arm.com, david@kernel.org,
	akpm@linux-foundation.org, hca@linux.ibm.com, gor@linux.ibm.com,
	agordeev@linux.ibm.com
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
Date: Thu, 30 Apr 2026 12:02:12 -0700
Message-ID: <59860460-5592-42cc-9593-4780fb720e18@os.amperecomputing.com>
In-Reply-To: <20260429170758.3018959-1-yang@os.amperecomputing.com>



On 4/29/26 10:04 AM, Yang Shi wrote:
> Introduction
> ============
> This patch series implements the LSFMM 2026 proposal for optimizing
> this_cpu_*() ops on ARM64. For the details of the proposal, please refer to:
> https://lore.kernel.org/linux-mm/CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com/
> I didn't repeat it in the cover letter because there is no change to the
> proposal.
>
> The series is based on 7.1-rc1. It is basically a minimum viable set of
> patches. There are still a few hacks in this series and it may break
> something, for example KPTI, SMT machines that share a TLB, etc. But it
> should be good enough for now to demonstrate the core idea. The main purpose
> of the RFC is to gather feedback early, figure out missing parts and risks,
> and make sure we are on the right track. Hopefully it can also help the
> discussion at the upcoming LSFMM.
>
> I broke the patches down into arch-dependent and arch-independent parts so
> that interested people can more easily experiment on other architectures,
> for example s390.
>
> A new kernel config is introduced, HAVE_LOCAL_PER_CPU_MAP. The architectures
> which can support this feature will select it. Allocating and freeing percpu
> local mapping is protected by this config so that others won't pay the cost.
>
>   
> Known Issues
> ============
> 1. KPTI
> -------
> We need to determine which CPU we are on, then switch to the right page
> table. Currently the arm64 kernel fetches tramp_pg_dir via
> swapper_pg_dir - fixed_offset, and fetches swapper_pg_dir from ttbr1. But
> ttbr1 may no longer hold swapper_pg_dir except on CPU #0, so we need to
> figure out another way to handle it. Switching to tramp_pg_dir should be
> easy, but the reverse seems harder because tramp_pg_dir only maps the
> trampoline vectors.
> Maybe we can do a two-step switch: switch to swapper_pg_dir first, then
> switch to the per-CPU page table (for entry) or the tramp page table (for
> exit). Nobody should call this_cpu_*() at either the userspace -> kernel
> entry stage or the kernel -> userspace exit stage.
>
> 2. Shared TLB machines
> ----------------------
> Some machines may share a TLB between CPUs, for example SMT machines may
> share a TLB between the two hardware threads of a single core.
> The per-CPU page table just can't work with that. Maybe we need a new
> cpufeature to indicate whether per-CPU page tables are allowed, and then
> only enable them on machines that do not share TLBs.

Adding more known issues that I forgot to list.

3. Memory hotplug/unplug
------------------------
The linear mapping and/or vmemmap may get out of sync because
__create_pgd_mapping() and __remove_pgd_mapping() are called to update
the page tables for memory hotplug/unplug, and they have no mechanism
to sync page tables. But it should not be hard to resolve.
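
To make the out-of-sync problem concrete, here is a minimal userspace
sketch. It is a toy model under stated assumptions, not the code in this
series: it assumes each CPU has its own copy of the top-level table, that
one top-level slot holds the CPU-local percpu mapping, and that hotplug
only updates a single reference table. All names (toy_tables,
toy_hotplug_update, toy_sync_range, TOY_LOCAL_SLOT) are hypothetical, and
the model ignores TLB maintenance and locking entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model. None of these names exist in the kernel or the series. */
#define TOY_NR_CPUS       4
#define TOY_PTRS_PER_PGD  8
#define TOY_LOCAL_SLOT    5  /* assumed: slot holding the CPU-local percpu mapping */

typedef uint64_t toy_pgd_t;

struct toy_tables {
	toy_pgd_t reference[TOY_PTRS_PER_PGD];           /* stands in for swapper_pg_dir */
	toy_pgd_t percpu[TOY_NR_CPUS][TOY_PTRS_PER_PGD]; /* one top-level table per CPU */
};

/* Models a hotplug update: only the reference table is modified. */
static void toy_hotplug_update(struct toy_tables *t, size_t idx, toy_pgd_t val)
{
	assert(t != NULL && idx < TOY_PTRS_PER_PGD && idx != TOY_LOCAL_SLOT);
	t->reference[idx] = val;
}

/*
 * Copies top-level entries [first, last] from the reference table into
 * every per-CPU table, leaving the CPU-local slot untouched.
 * Returns the number of entries that had to be rewritten.
 */
static size_t toy_sync_range(struct toy_tables *t, size_t first, size_t last)
{
	size_t written = 0;

	assert(t != NULL && first <= last && last < TOY_PTRS_PER_PGD);
	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		for (size_t idx = first; idx <= last; idx++) {
			if (idx == TOY_LOCAL_SLOT)
				continue;
			if (t->percpu[cpu][idx] != t->reference[idx]) {
				t->percpu[cpu][idx] = t->reference[idx];
				written++;
			}
		}
	}
	return written;
}
```

In this model, calling toy_hotplug_update() alone leaves every per-CPU
table stale, which is the situation described above. A follow-up
toy_sync_range() over the affected slots brings them back in line
without disturbing the CPU-local slot.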

4. 2-level and 3-level page tables
----------------------------------
Page table sync needs to be made to work for them; for now it should
only work with 4-level page tables. It is not hard either.
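
As background for why the level count matters, the sketch below
reproduces the arithmetic arm64 uses to derive the top-level shift from
the page size and the number of levels (it mirrors the formula behind
ARM64_HW_PGTABLE_LEVEL_SHIFT() and PGDIR_SHIFT). The toy_* helper names
are hypothetical and this is a standalone illustration, not code from
the series. With fewer levels, each top-level entry covers a smaller
span and the intermediate levels are folded, so a sync written against
the 4-level layout cannot be assumed to carry over unchanged.

```c
#include <assert.h>

/* Hypothetical helpers; they mirror the arm64 shift arithmetic only. */

/* Shift of a hardware level n (0 = topmost of four, 3 = PTE level). */
static unsigned int toy_level_shift(unsigned int page_shift, unsigned int level)
{
	assert(level <= 3);
	return (page_shift - 3) * (4 - level) + 3;
}

/* Shift of the top-level table for a configuration with 'levels' levels. */
static unsigned int toy_pgdir_shift(unsigned int page_shift, unsigned int levels)
{
	assert(levels >= 2 && levels <= 4);
	return toy_level_shift(page_shift, 4 - levels);
}

/* Bytes of virtual address space covered by one top-level entry. */
static unsigned long long toy_pgd_entry_span(unsigned int page_shift,
					     unsigned int levels)
{
	return 1ULL << toy_pgdir_shift(page_shift, levels);
}
```

For example, with 4K pages (page_shift 12) one top-level entry spans
512G with 4 levels but only 1G with 3 levels.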

5. Confusing /proc/vmallocinfo
------------------------------
The percpu areas were previously allocated from the vmalloc area; now
they are not. So they should no longer show up in /proc/vmallocinfo.


Yang

>
>   
> Benchmark
> =========
> The benchmarks were run on a 160-core AmpereOne machine. The baseline is
> the v7.1-rc1 kernel.
>
> 1. Kernel Build
> ---------------
> Run kernel build (make -j160) with the default Fedora kernel config in a
> memcg.
> 13% - 18% sys time improvement
> 3% - 7% wall time improvement
>
> 2. stress-ng vm ops
> -------------------
> stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
> 8.5% improvement
>
> 3. stress-ng vm ops + fork
> --------------------------
> stress-ng --mmapfork 160 --mmapfork-bytes 128M --mmapfork-ops 500
> 15% improvement
>
>
> Regression test
> ===============
> 1. memcg creation
> -----------------
> Create 10K memcgs. Each memcg creation needs to allocate multiple percpu
> variables, for example, percpu refcnt, rstat and objcg percpu refcnt.
>
> Consumed 2112K more virtual memory for the percpu “local mapping” and a few
> more megabytes for the per-CPU page tables.
> No noticeable regression was found for elapsed time.
>
> 2. fork test
> ------------
> stress-ng --fork 160 --fork-ops 10000000
> fork() needs to allocate multiple percpu variables, for example, rss
> counters and mm_cid_cpu.
>
> Roughly a 1% regression was found. However, the stress-ng fork test has
> quite a small address space, whereas real-life workloads typically have
> much larger address spaces and do more complicated work. The stress-ng
> mmapfork benchmark saw a 15% improvement.
>
>
> Yang Shi (11):
>        arm64: mm: enable percpu kernel page table
>        arm64: mm: define percpu virtual space area
>        arm64: smp: define setup_per_cpu_areas()
>        mm: percpu: prepare to use dedicated percpu area
>        arm64: mm: map local percpu first chunk
>        mm: percpu: set up first chunk and reserve chunk
>        arm64: mm: introduce __per_cpu_local_off
>        vmalloc: pass in pgd pointer for vmap{__vunmap}_range_noflush()
>        mm: percpu: allocate and free local percpu vm area
>        arm64: kconfig: select HAVE_LOCAL_PER_CPU_MAP
>        arm64: percpu: use local percpu for this_cpu_*() APIs
>
>   arch/arm64/Kconfig                   |   2 +-
>   arch/arm64/include/asm/mmu.h         |   3 +++
>   arch/arm64/include/asm/mmu_context.h |   6 +++++-
>   arch/arm64/include/asm/percpu.h      |  17 ++++++++++-------
>   arch/arm64/include/asm/pgtable.h     |  24 +++++++++++++++++++++---
>   arch/arm64/kernel/setup.c            |   3 +++
>   arch/arm64/kernel/smp.c              |  40 ++++++++++++++++++++++++++++++++++++++++
>   arch/arm64/mm/mmu.c                  |  75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   arch/arm64/mm/ptdump.c               |   4 ++++
>   drivers/base/arch_numa.c             |  51 +--------------------------------------------------
>   include/linux/percpu.h               |   4 +++-
>   include/linux/vmalloc.h              |   3 +++
>   mm/Kconfig                           |   3 +++
>   mm/internal.h                        |   5 ++++-
>   mm/kmsan/hooks.c                     |  14 +++++++-------
>   mm/percpu-internal.h                 |  15 +++++++++++++++
>   mm/percpu-vm.c                       |  91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   mm/percpu.c                          |  46 +++++++++++++++++++++++++++++++++++++---------
>   mm/vmalloc.c                         | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>   19 files changed, 419 insertions(+), 99 deletions(-)
>
>
> Thanks,
> Yang
>



