From: Yang Shi <yang@os.amperecomputing.com>
To: Yang Shi <shy828301@gmail.com>,
lsf-pc@lists.linux-foundation.org, Linux MM <linux-mm@kvack.org>,
"Christoph Lameter (Ampere)" <cl@gentwo.org>,
dennis@kernel.org, Tejun Heo <tj@kernel.org>,
urezki@gmail.com, Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
Date: Fri, 8 May 2026 15:52:19 -0700 [thread overview]
Message-ID: <cdce23e4-3788-48c4-a3f5-e9b8f93ff4ef@os.amperecomputing.com> (raw)
In-Reply-To: <CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com>
Hi folks,
Just FYI: attached are the slides presented at LSFMM. Thanks for all the
feedback and suggestions.
Yang
On 2/11/26 3:14 PM, Yang Shi wrote:
> Background
> =========
> The this_cpu_*() APIs operate on the local copy of a percpu variable
> for the current processor. In order to obtain the address of this
> CPU-specific variable, a CPU-specific offset has to be added to the
> variable's address.
> On x86 this address calculation can be done by prefixing an
> instruction with a segment register, so x86 can increment a percpu
> counter with a single instruction. Since the address calculation and
> the RMW operation occur within one instruction, the whole thing is
> atomic vs the scheduler and no preemption disabling is needed.
> For example:
>   INC %gs:[my_counter]
> See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
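>
> For reference, a minimal sketch of the usage pattern (standard kernel
> API from <linux/percpu.h>, not part of the proposal itself):
>
>   /* a statically allocated percpu counter */
>   DEFINE_PER_CPU(unsigned long, my_counter);
>
>   static void count_event(void)
>   {
>           /* on x86 this compiles down to a single %gs-prefixed INC */
>           this_cpu_inc(my_counter);
>   }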
>
> ARM64 and some other non-x86 architectures don't have a segment
> register. The address of the current CPU's copy of a percpu variable
> has to be calculated first, and only then can that address be used
> for an operation on the percpu data. This process must be atomic vs
> the scheduler, therefore it is necessary to disable preemption,
> perform the address calculation and then the increment operation. On
> ARM64 the CPU-specific offset also lives in a system register that
> needs to be read. The code flow looks like:
>   Disable preemption
>   Calculate the address of the current CPU's copy by using the offset
>   Manipulate the counter
>   Enable preemption
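>
> Roughly, in C, the fast path today looks like this (a simplified
> sketch of the arm64-style this_cpu_*() implementation, not the
> literal kernel macros):
>
>   preempt_disable();                /* keep the task on this CPU           */
>   ptr = raw_cpu_ptr(&my_counter);   /* per-CPU offset (TPIDR_EL1/EL2)      */
>                                     /* added to the variable's address     */
>   *ptr += 1;                        /* the RMW itself (LSE atomic on arm64) */
>   preempt_enable();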
>
> This process is inefficient relative to x86 and has to be repeated for
> every access to per cpu data.
> ARM64 has an atomic increment instruction, but it cannot apply a
> per-CPU base implicitly the way an x86 segment register does. So an
> address calculation is always necessary even if the atomic
> instruction is used.
> A page table allows us to remap addresses. So if the atomic
> instruction used a fixed virtual address, and the local processor's
> page tables mapped that area to the local percpu data, then we could
> also get down to a single instruction on ARM64 (and hopefully on some
> other non-x86 architectures too) and be as efficient as x86.
>
> So, the code flow should just become:
> INC VIRTUAL_BASE + percpu_variable_offset
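>
> In C terms this would collapse to something like (illustrative only,
> reusing the VIRTUAL_BASE name from above):
>
>   /* the base is a constant, identical on every CPU; the per-CPU page
>    * table makes it resolve to this CPU's copy */
>   unsigned long *ptr = (unsigned long *)(VIRTUAL_BASE + percpu_variable_offset);
>
>   *ptr += 1;   /* a single atomic RMW, no preempt_disable()/enable() needed */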
>
> In order to do that we need to have the same virtual address mapped
> differently for each processor. This means we need different page
> tables for each processor. These page tables
> can map almost all of the address space in the same way. The only area
> that will be special is the area starting at VIRTUAL_BASE.
>
> In addition, the percpu counters can also be accessed from other CPUs
> by using the per_cpu_ptr() APIs. This is usually done by counter
> initialization code. For example:
>
>   for_each_possible_cpu(cpu) {
>           p = per_cpu_ptr(ptr, cpu);
>           initialize(p);
>   }
>
> Percpu allocator
> =============
> When calling alloc_percpu(), the kernel allocates a contiguous
> virtual memory area from the vmalloc area, called a "chunk". The
> chunk looks like:
>
>   | CPU 0 | CPU 1 | …… | CPU n |
>
> The size of the chunk is percpu_unit_size * nr_cpus. The kernel then
> maps it to physical memory and returns an offset.
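>
> A minimal usage sketch of the dynamic API (standard kernel
> interfaces, shown only for context):
>
>   /* the returned "pointer" is really an offset into the chunk */
>   int __percpu *cnt = alloc_percpu(int);
>
>   if (!cnt)
>           return -ENOMEM;
>
>   this_cpu_inc(*cnt);                     /* bump this CPU's copy */
>   free_percpu(cnt);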
>
> Design
> ======
> To improve the performance of this_cpu_ops on ARM64 and potentially
> some other non-x86 architectures, Christoph Lameter and I propose the
> solution below.
>
> To remove the preemption disable/enable, we need to guarantee that
> the this_cpu_*() APIs convert the offset returned by alloc_percpu()
> to a pointer that is the same on all CPUs, without breaking the
> per_cpu_ptr() use case.
> To achieve this, we need to modify the percpu allocator to allocate
> extra virtual memory other than the virtual memory area shown in the
> above diagram. The size of the extra allocation is percpu_unit_size.
> this_cpu_*() APIs will convert the offset returned by alloc_percpu()
> to a pointer to this area. It is the same for all CPUs. I call the
> extra allocated area “local mapping” and the original area “global
> mapping” in order to simplify the discussion. So the percpu chunk will
> look like:
>   | CPU 0 | CPU 1 | …… | CPU n | xxxxxxxxx | CPU |
>    \__________________________/             \___/
>           global mapping                 local mapping
>
> this_cpu_*() APIs will just access the local mapping, while
> per_cpu_ptr() APIs continue to use the global mapping.
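>
> Conceptually, the two pointer conversions would look like this
> (PCPU_LOCAL_BASE is a hypothetical name, used only to illustrate the
> address math):
>
>   /* today (simplified): add a per-CPU offset, so the result differs per CPU */
>   static void *per_cpu_ptr_sketch(void *off, int cpu)
>   {
>           return (void *)((unsigned long)off + per_cpu_offset(cpu));
>   }
>
>   /* proposed fast path: a constant base, identical on every CPU; the
>    * per-CPU page table translates it to that CPU's copy, so no
>    * preemption disabling is needed */
>   static void *this_cpu_ptr_sketch(void *off)
>   {
>           return (void *)(PCPU_LOCAL_BASE + (unsigned long)off);
>   }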
>
> The local mapping requires mapping to different physical memory on
> different CPUs (the physical memory already mapped by the global
> mapping, so no extra physical memory needs to be allocated) in order
> to manipulate the right copy. This can be achieved by using a percpu
> kernel page table in arch-dependent code: each CPU sees its own copy
> of the kernel page table instead of sharing one single kernel page
> table. However, most of the page table contents can be shared except
> the area for the percpu local mapping, so the CPUs can basically
> share the PUD/PMD/PTE levels and only the PGDs differ.
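>
> A very rough sketch of the arch-side setup this implies
> (pcpu_pgd_alloc() and pcpu_map_local_window() are hypothetical
> helpers; the real PoC is arch-specific):
>
>   for_each_possible_cpu(cpu) {
>           pgd_t *pgd = pcpu_pgd_alloc(cpu);       /* one PGD page per CPU */
>
>           /* share everything with the regular kernel page table ... */
>           memcpy(pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
>
>           /* ... except the local mapping window, which is pointed at
>            * this CPU's own unit inside the chunk's global mapping */
>           pcpu_map_local_window(pgd, cpu);
>   }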
>
> The kernel also maintains a base address for the global mapping in
> order to convert the offset returned by alloc_percpu() to the correct
> pointer. The local mapping also needs a base address, and the offset
> between the local mapping base address and the allocated local
> mapping area must be the same as the offset returned by
> alloc_percpu(). So the local mapping has to live in a specific
> address range. This may need a dedicated percpu local mapping area
> which can’t be used by vmalloc() in order to avoid conflicts.
>
> I have done a PoC on ARM64. Hopefully I can post it to the mailing
> list to ease the discussion before the conference.
>
> Overhead
> ========
> 1. Some extra virtual memory space. But it shouldn’t be too much; I
> saw 960K with the Fedora default kernel config. Given terabytes of
> virtual address space on a 64-bit machine, 960K is negligible.
> 2. Some extra physical memory for the percpu kernel page tables: 4K *
> (nr_cpus - 1) for the PGD pages (roughly 636K for 160 CPUs), plus the
> page tables used for the percpu local mapping area. A couple of
> megabytes with the Fedora default kernel config on AmpereOne with 160
> cores.
> 3. Percpu allocation and free will be slower due to the extra virtual
> memory allocation and page table manipulation. However, percpu memory
> is allocated by chunk, and one chunk typically holds a lot of percpu
> variables, so the slowdown should be negligible. The test results
> below also confirm this.
>
> Performance Test
> ==============
> I have done a PoC on ARM64. So all the tests were run on AmpereOne
> with 160 cores.
> 1. Kernel build
> --------------------
> Ran a kernel build (make -j160) with the default Fedora kernel config
> in a memcg. Roughly 13% - 15% systime improvement for the kernel
> build workload.
>
> 2. stress-ng
> ----------------
> stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
> 6% improvement for systime
>
> 3. vm-scalability
> ----------------------
> Single digit (0 – 8%) improvement for systime for some vm-scalability test cases
>
> 4. will-it-scale
> ------------------
> 3% - 8% improvement for pagefault cases from will-it-scale
> Profiling page_fault3_processes from will-it-scale also shows the
> reduction in percpu counter manipulation (perf diff output):
> 5.91% -1.82% [kernel.kallsyms] [k] mod_memcg_lruvec_state
> 2.84% -1.30% [kernel.kallsyms] [k] percpu_counter_add_batch
>
> Regression Test
> =============
> Create 10K cgroups.
> Creating cgroups needs to call the percpu allocator multiple times.
> For example, creating one memcg needs to allocate the percpu refcnt,
> rstat and the objcg percpu refcnt.
>
> It consumed 2112K more virtual memory for the percpu local mapping,
> and a few more megabytes were consumed by the percpu page tables that
> map the local mapping. The memory consumption depends on the number
> of CPUs.
>
> Execution time is basically the same. No noticeable regression is
> found. The profiling shows (perf diff):
>   0.35%  -0.33%  [kernel.kallsyms]  [k] percpu_ref_get_many
>   0.61%  -0.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch
>   0.34%  +0.02%  [kernel.kallsyms]  [k] pcpu_alloc_noprof
>   0.00%  +0.05%  [kernel.kallsyms]  [k] free_percpu.part.0
> The gain from manipulating percpu counters outweighs the slowdown
> from percpu allocation and free; there is even a small net gain.
>
> Future usecases
> =============
> Some potential use cases may be unlocked by the percpu page table,
> for example kernel text replication, off the top of my head. Anyway,
> this is not the main point of this proposal.
>
> Key attendees
> ===========
> This work will require changes to the percpu allocator, vmalloc
> (which just needs a new interface that takes a pgd pointer as an
> argument) and arch-dependent code (the percpu page table
> implementation is arch-dependent). So the percpu allocator
> maintainers, the vmalloc maintainer and arch experts (for example,
> ARM64) should be key attendees. I don't know who can attend, so I
> just list all of them.
>
> Christoph Lameter (co-presenter and percpu allocator maintainer)
> Dennis Zhou/Tejun Heo (percpu allocator maintainers)
> Uladzislau Rezki (vmalloc maintainer)
> Catalin Marinas/Will Deacon/Ryan Roberts (ARM64 memory management)
>
> Thanks,
> Yang
[-- Attachment #2: percpu_LSF2026.pdf --]
[-- Type: application/pdf, Size: 321099 bytes --]