Hi folks,

Just FYI. Attached are the slides presented at LSFMM. Thanks for all the
feedback and suggestions.

Yang

On 2/11/26 3:14 PM, Yang Shi wrote:

Background
==========
The APIs using this_cpu_*() operate on a local copy of a percpu variable
for the current processor. In order to obtain the address of this CPU's
copy, a CPU-specific offset has to be added to the variable's address.

On x86 this address calculation can be expressed by prefixing an
instruction with a segment register, so x86 can increment a percpu
counter with a single instruction. Since the address calculation and the
RMW operation occur within one instruction, the sequence is atomic with
respect to the scheduler, so no preemption disable is needed. For
example:

    INC %gs:[my_counter]

See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more
details.

ARM64 and some other non-x86 architectures don't have a segment
register. The address of the current CPU's percpu variable has to be
calculated first, and only then can that address be used for an
operation on the percpu data. This sequence must be atomic with respect
to the scheduler, therefore it is necessary to disable preemption,
perform the address calculation, and then do the increment. The
CPU-specific offset lives in an MSR that also needs to be read on ARM64.
The code flow looks like:

    Disable preemption
    Calculate the address of the current CPU's copy by using the offset
    Manipulate the counter
    Enable preemption

This is inefficient relative to x86 and has to be repeated for every
access to percpu data.

ARM64 does have an atomic increment instruction, but it does not allow
addressing via a segment register the way x86 does, so an address
calculation is always necessary even when the atomic instruction is
used.

A page table allows us to remap addresses. So if the atomic instruction
used a fixed virtual address, and the page tables of the local processor
mapped that area to the local percpu data, then we could also get a
single-instruction sequence on ARM64 (and hopefully on some other
non-x86 architectures too) and be as efficient as x86. The code flow
would simply become:

    INC VIRTUAL_BASE + percpu_variable_offset

In order to do that we need the same virtual address to be mapped
differently on each processor, which means we need different page tables
for each processor. These page tables can map almost all of the address
space in the same way; the only special area is the one starting at
VIRTUAL_BASE.

In addition, percpu counters can also be accessed from other CPUs by
using the per_cpu_ptr() APIs. This is usually done by counter
initialization code, for example:

    for_each_possible_cpu(cpu) {
        p = per_cpu_ptr(ptr, cpu);
        initialize(p);
    }
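To make the contrast concrete, here is a rough C sketch of the two flows
described above. It is illustrative only, not the actual kernel macros:
PCPU_LOCAL_BASE is a made-up name standing in for VIRTUAL_BASE, and the
atomic details are elided.

    /* An offset ("fake pointer") handed out by the dynamic percpu allocator. */
    unsigned long __percpu *my_counter = alloc_percpu(unsigned long);

    /*
     * Today's flow on ARM64, roughly what this_cpu_add(*my_counter, 1)
     * boils down to: the address must be formed first, so the whole
     * sequence has to be protected from preemption.
     */
    preempt_disable();
    unsigned long *p = raw_cpu_ptr(my_counter); /* per-CPU base from the MSR + offset */
    *p += 1;                                    /* or a single atomic add on p */
    preempt_enable();

    /*
     * With a per-CPU page table, the same virtual address resolves to a
     * different physical copy on each CPU, so one access is enough and
     * no preemption disable is needed.
     */
    *(unsigned long *)(PCPU_LOCAL_BASE + (unsigned long)my_counter) += 1;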
Percpu allocator
================
When alloc_percpu() is called, the kernel allocates a contiguous virtual
memory area from the vmalloc area, called a "chunk". The chunk looks
like:

    | CPU 0 | CPU 1 | …… | CPU n |

The size of the chunk is percpu_unit_size * nr_cpus. The kernel then
maps the chunk to physical memory and returns an offset.

Design
======
To improve the performance of this_cpu_*() ops on ARM64, and potentially
on some other non-x86 architectures, Christoph Lameter and I proposed
the solution below.

To remove the preemption disable/enable, the this_cpu_*() APIs need to
convert the offset returned by alloc_percpu() to a pointer that is the
same on all CPUs, without breaking the per_cpu_ptr() use case.

To achieve this, the percpu allocator is modified to allocate extra
virtual memory beyond the virtual memory area shown in the diagram
above. The size of the extra allocation is percpu_unit_size. The
this_cpu_*() APIs convert the offset returned by alloc_percpu() to a
pointer into this area, which is the same on all CPUs. To simplify the
discussion, I call the extra allocated area the "local mapping" and the
original area the "global mapping". The percpu chunk then looks like:

    | CPU 0 | CPU 1 | …… | CPU n | xxxxxxxxx | CPU |
    |<------ global mapping ---->|           |<--->|
                                              local mapping

The this_cpu_*() APIs access only the local mapping; the per_cpu_ptr()
APIs continue to use the global mapping.

The local mapping has to map to different physical memory on different
CPUs (the physical memory already mapped by the global mapping, so no
extra physical memory is allocated) in order to manipulate the right
copy. This is achieved with per-CPU kernel page tables in arch-dependent
code: each CPU sees its own copy of the kernel page table instead of
sharing a single one. However, most of the page table contents can still
be shared; only the area covering the percpu local mapping differs. So
the CPUs can basically share the PUD/PMD/PTE levels and differ only in
the PGD.

The kernel already maintains a base address for the global mapping in
order to convert the offset returned by alloc_percpu() to the correct
pointer. The local mapping also needs a base address, and the offset
between the local mapping base address and the allocated local mapping
area must be the same as the offset returned by alloc_percpu(). So the
local mapping has to live in a specific address range, which may require
a dedicated percpu local mapping area that vmalloc() cannot use, in
order to avoid conflicts.

I have done a PoC on ARM64. I hope to post it to the mailing list before
the conference to ease the discussion.
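As a rough illustration of the two mappings, the sketch below shows the
pointer conversions this design implies. It is a sketch only;
pcpu_unit_base[] and pcpu_local_base are hypothetical names for the base
addresses discussed above, not existing kernel symbols.

    /* The offset returned by alloc_percpu(); identical on every CPU. */
    unsigned long __percpu *off = alloc_percpu(unsigned long);

    /*
     * per_cpu_ptr(off, cpu): global mapping.  Each CPU's copy lives at a
     * different virtual address, so the caller names the CPU explicitly.
     */
    unsigned long *remote = (void *)(pcpu_unit_base[cpu] + (unsigned long)off);

    /*
     * this_cpu_*() under this proposal: local mapping.  The virtual address
     * is the same on every CPU; the per-CPU page table (private PGD entry,
     * shared lower levels) makes it resolve to the physical pages of that
     * CPU's copy in the global mapping.
     */
    unsigned long *local = (void *)(pcpu_local_base + (unsigned long)off);

The requirement that the local pointer be pcpu_local_base plus the very
same offset is what forces the local mapping into a dedicated address
range that vmalloc() must not otherwise hand out.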
Overhead
========
1. Some extra virtual memory space, but not much: I saw 960K with the
   default Fedora kernel config. Given the terabytes of virtual address
   space on a 64-bit machine, 960K is negligible.
2. Some extra physical memory for the percpu kernel page tables:
   4K * (nr_cpus - 1) for PGD pages, plus the page tables used by the
   percpu local mapping area. A couple of megabytes with the default
   Fedora kernel config on AmpereOne with 160 cores.
3. Percpu allocation and free will be slower due to the extra virtual
   memory allocation and page table manipulation. However, percpu memory
   is allocated by chunk, and one chunk typically holds a lot of percpu
   variables, so the slowdown should be negligible. The test results
   below also bear this out.

Performance Test
================
The PoC is on ARM64, so all tests were run on AmpereOne with 160 cores.

1. Kernel build
---------------
Run a kernel build (make -j160) with the default Fedora kernel config in
a memcg.
Roughly 13% - 15% systime improvement for my kernel build workload.

2. stress-ng
------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
6% improvement in systime.

3. vm-scalability
-----------------
Single digit (0 - 8%) systime improvement for some vm-scalability test
cases.

4. will-it-scale
----------------
3% - 8% improvement for the page fault cases from will-it-scale.
Profiling page_fault3_processes from will-it-scale also shows the
reduction in percpu counter manipulation (perf diff output):

    5.91%  -1.82%  [kernel.kallsyms]  [k] mod_memcg_lruvec_state
    2.84%  -1.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch

Regression Test
===============
Create 10K cgroups. Creating a cgroup calls the percpu allocator
multiple times; for example, creating one memcg allocates the percpu
refcnt, rstat, and the objcg percpu refcnt.

It consumed 2112K more virtual memory for the percpu local mapping, plus
a few more megabytes of percpu page tables to map the local mapping. The
memory consumption depends on the number of CPUs.

Execution time is basically the same; no noticeable regression was
found. The profiling shows (perf diff):

    0.35%  -0.33%  [kernel.kallsyms]  [k] percpu_ref_get_many
    0.61%  -0.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch
    0.34%  +0.02%  [kernel.kallsyms]  [k] pcpu_alloc_noprof
    0.00%  +0.05%  [kernel.kallsyms]  [k] free_percpu.part.0

The gain from manipulating percpu counters outweighs the slowdown in
percpu allocation and free; there is even a small net gain.

Future usecases
===============
Per-CPU page tables may unlock other use cases, for example kernel text
replication, but that is not the main point of this proposal.

Key attendees
=============
This work requires changes to the percpu allocator, to vmalloc (a new
interface that takes a pgd pointer as an argument), and to
arch-dependent code (the percpu page table implementation is
arch-dependent). So the percpu allocator maintainers, the vmalloc
maintainer, and arch experts (for example, ARM64) should be key
attendees. I don't know who can attend, so I just list all of them:

Christoph Lameter (co-presenter and percpu allocator maintainer)
Dennis Zhou/Tejun Heo (percpu allocator maintainers)
Uladzislau Rezki (vmalloc maintainer)
Catalin Marinas/Will Deacon/Ryan Roberts (ARM64 memory management)

Thanks,
Yang