From: Yang Shi <yang@os.amperecomputing.com>
To: Yang Shi <shy828301@gmail.com>,
lsf-pc@lists.linux-foundation.org, Linux MM <linux-mm@kvack.org>,
"Christoph Lameter (Ampere)" <cl@gentwo.org>,
dennis@kernel.org, Tejun Heo <tj@kernel.org>,
urezki@gmail.com, Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [LSF/MM/BPF TOPIC] Improve this_cpu_ops performance for ARM64 (and potentially other architectures)
Date: Fri, 8 May 2026 15:52:19 -0700 [thread overview]
Message-ID: <cdce23e4-3788-48c4-a3f5-e9b8f93ff4ef@os.amperecomputing.com> (raw)
In-Reply-To: <CAHbLzkpcN-T8MH6=W3jCxcFj1gVZp8fRqe231yzZT-rV_E_org@mail.gmail.com>
Hi folks,
Just FYI: attached are the slides presented at LSFMM. Thanks for all the
feedback and suggestions.
Yang
On 2/11/26 3:14 PM, Yang Shi wrote:
> Background
> =========
> The this_cpu_*() APIs operate on the local copy of a percpu variable
> for the current processor. In order to obtain the address of this
> CPU-specific variable, a CPU-specific offset has to be added to the
> variable's address.
> On x86 this address calculation can be done by prefixing an
> instruction with a segment register, so x86 can increment a percpu
> counter with a single instruction. Since the address calculation and
> the RMW operation occur within one instruction, the whole thing is
> atomic vs the scheduler and no preemption disabling is needed.
> For example:
>   INC %gs:[my_counter]
> See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more details.
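>
> For reference, a minimal sketch of the usage pattern (standard kernel
> API from <linux/percpu.h>, not part of the proposal itself):
>
>   /* a statically allocated percpu counter */
>   DEFINE_PER_CPU(unsigned long, my_counter);
>
>   static void count_event(void)
>   {
>           /* on x86 this compiles down to a single %gs-prefixed INC */
>           this_cpu_inc(my_counter);
>   }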
>
> ARM64 and some other non-x86 architectures don't have a segment
> register. The address of the current CPU's copy of a percpu variable
> has to be calculated first, and only then can that address be used
> for an operation on the percpu data. This process must be atomic vs
> the scheduler, therefore it is necessary to disable preemption,
> perform the address calculation and then the increment operation. On
> ARM64 the CPU-specific offset also lives in a system register that
> needs to be read. The code flow looks like:
>   Disable preemption
>   Calculate the address of the current CPU's copy by using the offset
>   Manipulate the counter
>   Enable preemption
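>
> Roughly, in C, the fast path today looks like this (a simplified
> sketch of the arm64-style this_cpu_*() implementation, not the
> literal kernel macros):
>
>   preempt_disable();                /* keep the task on this CPU           */
>   ptr = raw_cpu_ptr(&my_counter);   /* per-CPU offset (TPIDR_EL1/EL2)      */
>                                     /* added to the variable's address     */
>   *ptr += 1;                        /* the RMW itself (LSE atomic on arm64) */
>   preempt_enable();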
>
> This process is inefficient relative to x86 and has to be repeated for
> every access to per cpu data.
> ARM64 has an atomic increment instruction, but it cannot apply a
> per-CPU base implicitly the way an x86 segment register does. So an
> address calculation is always necessary even if the atomic
> instruction is used.
> A page table allows us to remap addresses. So if the atomic
> instruction used a fixed virtual address, and the local processor's
> page tables mapped that area to the local percpu data, then we could
> also get down to a single instruction on ARM64 (and hopefully on some
> other non-x86 architectures too) and be as efficient as x86.
>
> So, the code flow should just become:
> INC VIRTUAL_BASE + percpu_variable_offset
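>
> In C terms this would collapse to something like (illustrative only,
> reusing the VIRTUAL_BASE name from above):
>
>   /* the base is a constant, identical on every CPU; the per-CPU page
>    * table makes it resolve to this CPU's copy */
>   unsigned long *ptr = (unsigned long *)(VIRTUAL_BASE + percpu_variable_offset);
>
>   *ptr += 1;   /* a single atomic RMW, no preempt_disable()/enable() needed */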
>
> In order to do that we need to have the same virtual address mapped
> differently for each processor. This means we need different page
> tables for each processor. These page tables
> can map almost all of the address space in the same way. The only area
> that will be special is the area starting at VIRTUAL_BASE.
>
> In addition, the percpu counters can also be accessed from other CPUs
> by using the per_cpu_ptr() APIs. This is usually done by counter
> initialization code. For example:
>
>   for_each_possible_cpu(cpu) {
>           p = per_cpu_ptr(ptr, cpu);
>           initialize(p);
>   }
>
> Percpu allocator
> =============
> When calling alloc_percpu(), the kernel allocates a contiguous
> virtual memory area from the vmalloc area, called a "chunk". The
> chunk looks like:
>
>   | CPU 0 | CPU 1 | …… | CPU n |
>
> The size of the chunk is percpu_unit_size * nr_cpus. The kernel then
> maps it to physical memory and returns an offset.
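>
> A minimal usage sketch of the dynamic API (standard kernel
> interfaces, shown only for context):
>
>   /* the returned "pointer" is really an offset into the chunk */
>   int __percpu *cnt = alloc_percpu(int);
>
>   if (!cnt)
>           return -ENOMEM;
>
>   this_cpu_inc(*cnt);                     /* bump this CPU's copy */
>   free_percpu(cnt);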
>
> Design
> ======
> To improve the performance of this_cpu_ops on ARM64 and potentially
> some other non-x86 architectures, Christoph Lameter and I propose the
> solution below.
>
> To remove the preemption disable/enable, we need to guarantee that
> the this_cpu_*() APIs convert the offset returned by alloc_percpu()
> to a pointer that is the same on all CPUs, without breaking the
> per_cpu_ptr() use case.
> To achieve this, we need to modify the percpu allocator to allocate
> extra virtual memory other than the virtual memory area shown in the
> above diagram. The size of the extra allocation is percpu_unit_size.
> this_cpu_*() APIs will convert the offset returned by alloc_percpu()
> to a pointer to this area. It is the same for all CPUs. I call the
> extra allocated area “local mapping” and the original area “global
> mapping” in order to simplify the discussion. So the percpu chunk will
> look like:
>   | CPU 0 | CPU 1 | …… | CPU n | xxxxxxxxx | CPU |
>    \__________________________/             \___/
>           global mapping                 local mapping
>
> this_cpu_*() APIs will just access the local mapping, while
> per_cpu_ptr() APIs continue to use the global mapping.
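>
> Conceptually, the two pointer conversions would look like this
> (PCPU_LOCAL_BASE is a hypothetical name, used only to illustrate the
> address math):
>
>   /* today (simplified): add a per-CPU offset, so the result differs per CPU */
>   static void *per_cpu_ptr_sketch(void *off, int cpu)
>   {
>           return (void *)((unsigned long)off + per_cpu_offset(cpu));
>   }
>
>   /* proposed fast path: a constant base, identical on every CPU; the
>    * per-CPU page table translates it to that CPU's copy, so no
>    * preemption disabling is needed */
>   static void *this_cpu_ptr_sketch(void *off)
>   {
>           return (void *)(PCPU_LOCAL_BASE + (unsigned long)off);
>   }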
>
> The local mapping requires mapping to different physical memory on
> different CPUs (the physical memory already mapped by the global
> mapping, so no extra physical memory needs to be allocated) in order
> to manipulate the right copy. This can be achieved by using a percpu
> kernel page table in arch-dependent code: each CPU sees its own copy
> of the kernel page table instead of sharing one single kernel page
> table. However, most of the page table contents can be shared except
> the area for the percpu local mapping, so the CPUs can basically
> share the PUD/PMD/PTE levels and only the PGDs differ.
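>
> A very rough sketch of the arch-side setup this implies
> (pcpu_pgd_alloc() and pcpu_map_local_window() are hypothetical
> helpers; the real PoC is arch-specific):
>
>   for_each_possible_cpu(cpu) {
>           pgd_t *pgd = pcpu_pgd_alloc(cpu);       /* one PGD page per CPU */
>
>           /* share everything with the regular kernel page table ... */
>           memcpy(pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
>
>           /* ... except the local mapping window, which is pointed at
>            * this CPU's own unit inside the chunk's global mapping */
>           pcpu_map_local_window(pgd, cpu);
>   }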
>
> The kernel also maintains a base address for the global mapping in
> order to convert the offset returned by alloc_percpu() to the correct
> pointer. The local mapping also needs a base address, and the offset
> between the local mapping base address and the allocated local
> mapping area must be the same as the offset returned by
> alloc_percpu(). So the local mapping has to live in a specific
> address range. This may need a dedicated percpu local mapping area
> which can’t be used by vmalloc() in order to avoid conflicts.
>
> I have done a PoC on ARM64. Hopefully I can post it to the mailing
> list to ease the discussion before the conference.
>
> Overhead
> ========
> 1. Some extra virtual memory space. But it shouldn’t be too much; I
> saw 960K with the Fedora default kernel config. Given terabytes of
> virtual address space on a 64-bit machine, 960K is negligible.
> 2. Some extra physical memory for the percpu kernel page tables: 4K *
> (nr_cpus - 1) for the PGD pages (roughly 636K for 160 CPUs), plus the
> page tables used for the percpu local mapping area. A couple of
> megabytes with the Fedora default kernel config on AmpereOne with 160
> cores.
> 3. Percpu allocation and free will be slower due to the extra virtual
> memory allocation and page table manipulation. However, percpu memory
> is allocated by chunk, and one chunk typically holds a lot of percpu
> variables, so the slowdown should be negligible. The test results
> below also confirm this.
>
> Performance Test
> ==============
> I have done a PoC on ARM64. So all the tests were run on AmpereOne
> with 160 cores.
> 1. Kernel build
> --------------------
> Ran a kernel build (make -j160) with the default Fedora kernel config
> in a memcg. Roughly 13% - 15% systime improvement for the kernel
> build workload.
>
> 2. stress-ng
> ----------------
> stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
> 6% improvement for systime
>
> 3. vm-scalability
> ----------------------
> Single digit (0 – 8%) improvement for systime for some vm-scalability test cases
>
> 4. will-it-scale
> ------------------
> 3% - 8% improvement for pagefault cases from will-it-scale
> Profiling page_fault3_processes from will-it-scale also shows the
> reduction in percpu counter manipulation (perf diff output):
> 5.91% -1.82% [kernel.kallsyms] [k] mod_memcg_lruvec_state
> 2.84% -1.30% [kernel.kallsyms] [k] percpu_counter_add_batch
>
> Regression Test
> =============
> Create 10K cgroups.
> Creating cgroups needs to call the percpu allocator multiple times.
> For example, creating one memcg needs to allocate the percpu refcnt,
> rstat and the objcg percpu refcnt.
>
> It consumed 2112K more virtual memory for the percpu local mapping,
> and a few more megabytes were consumed by the percpu page tables that
> map the local mapping. The memory consumption depends on the number
> of CPUs.
>
> Execution time is basically the same. No noticeable regression is
> found. The profiling shows (perf diff):
>   0.35%  -0.33%  [kernel.kallsyms]  [k] percpu_ref_get_many
>   0.61%  -0.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch
>   0.34%  +0.02%  [kernel.kallsyms]  [k] pcpu_alloc_noprof
>   0.00%  +0.05%  [kernel.kallsyms]  [k] free_percpu.part.0
> The gain from manipulating percpu counters outweighs the slowdown
> from percpu allocation and free; there is even a small net gain.
>
> Future usecases
> =============
> Some potential use cases may be unlocked by the percpu page table,
> for example kernel text replication, off the top of my head. Anyway,
> this is not the main point of this proposal.
>
> Key attendees
> ===========
> This work will require changes to the percpu allocator, vmalloc
> (which just needs a new interface that takes a pgd pointer as an
> argument) and arch-dependent code (the percpu page table
> implementation is arch-dependent). So the percpu allocator
> maintainers, the vmalloc maintainer and arch experts (for example,
> ARM64) should be key attendees. I don't know who can attend, so I
> just list all of them.
>
> Christoph Lameter (co-presenter and percpu allocator maintainer)
> Dennis Zhou/Tejun Heo (percpu allocator maintainers)
> Uladzislau Rezki (vmalloc maintainer)
> Catalin Marinas/Will Deacon/Ryan Roberts (ARM64 memory management)
>
> Thanks,
> Yang
[-- Attachment #2: percpu_LSF2026.pdf --]
[-- Type: application/pdf, Size: 321099 bytes --]