Hi folks,

Just FYI. Attached are the slides presented at LSFMM. Thanks for all the
feedback and suggestions.

Yang

On 2/11/26 3:14 PM, Yang Shi wrote:

Background
==========
The APIs using this_cpu_*() operate on a local copy of a percpu variable
for the current processor. In order to obtain the address of this CPU's
copy, a CPU-specific offset has to be added to the variable's address.

On x86 this address calculation can be expressed by prefixing an
instruction with a segment register, so x86 can increment a percpu
counter with a single instruction. Since the address calculation and the
RMW operation occur within one instruction, the sequence is atomic with
respect to the scheduler, so no preemption disable is needed. For
example:

    INC %gs:[my_counter]

See https://www.kernel.org/doc/Documentation/this_cpu_ops.txt for more
details.

ARM64 and some other non-x86 architectures don't have a segment
register. The address of the current CPU's percpu variable has to be
calculated first, and only then can that address be used for an
operation on the percpu data. This sequence must be atomic with respect
to the scheduler, therefore it is necessary to disable preemption,
perform the address calculation, and then do the increment. The
CPU-specific offset lives in an MSR that also needs to be read on ARM64.
The code flow looks like:

    Disable preemption
    Calculate the address of the current CPU's copy by using the offset
    Manipulate the counter
    Enable preemption

This is inefficient relative to x86 and has to be repeated for every
access to percpu data.

ARM64 does have an atomic increment instruction, but it does not allow
addressing via a segment register the way x86 does, so an address
calculation is always necessary even when the atomic instruction is
used.

A page table allows us to remap addresses. So if the atomic instruction
used a fixed virtual address, and the page tables of the local processor
mapped that area to the local percpu data, then we could also get a
single-instruction sequence on ARM64 (and hopefully on some other
non-x86 architectures too) and be as efficient as x86. The code flow
would simply become:

    INC VIRTUAL_BASE + percpu_variable_offset

In order to do that we need the same virtual address to be mapped
differently on each processor, which means we need different page tables
for each processor. These page tables can map almost all of the address
space in the same way; the only special area is the one starting at
VIRTUAL_BASE.

In addition, percpu counters can also be accessed from other CPUs by
using the per_cpu_ptr() APIs. This is usually done by counter
initialization code, for example:

    for_each_possible_cpu(cpu) {
        p = per_cpu_ptr(ptr, cpu);
        initialize(p);
    }
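To make the contrast concrete, here is a rough C sketch of the two flows
described above. It is illustrative only, not the actual kernel macros:
PCPU_LOCAL_BASE is a made-up name standing in for VIRTUAL_BASE, and the
atomic details are elided.

    /* An offset ("fake pointer") handed out by the dynamic percpu allocator. */
    unsigned long __percpu *my_counter = alloc_percpu(unsigned long);

    /*
     * Today's flow on ARM64, roughly what this_cpu_add(*my_counter, 1)
     * boils down to: the address must be formed first, so the whole
     * sequence has to be protected from preemption.
     */
    preempt_disable();
    unsigned long *p = raw_cpu_ptr(my_counter); /* per-CPU base from the MSR + offset */
    *p += 1;                                    /* or a single atomic add on p */
    preempt_enable();

    /*
     * With a per-CPU page table, the same virtual address resolves to a
     * different physical copy on each CPU, so one access is enough and
     * no preemption disable is needed.
     */
    *(unsigned long *)(PCPU_LOCAL_BASE + (unsigned long)my_counter) += 1;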
Percpu allocator
================
When alloc_percpu() is called, the kernel allocates a contiguous virtual
memory area from the vmalloc area, called a "chunk". The chunk looks
like:

    | CPU 0 | CPU 1 | …… | CPU n |

The size of the chunk is percpu_unit_size * nr_cpus. The kernel then
maps the chunk to physical memory and returns an offset.

Design
======
To improve the performance of this_cpu_*() ops on ARM64, and potentially
on some other non-x86 architectures, Christoph Lameter and I proposed
the solution below.

To remove the preemption disable/enable, the this_cpu_*() APIs need to
convert the offset returned by alloc_percpu() to a pointer that is the
same on all CPUs, without breaking the per_cpu_ptr() use case.

To achieve this, the percpu allocator is modified to allocate extra
virtual memory beyond the virtual memory area shown in the diagram
above. The size of the extra allocation is percpu_unit_size. The
this_cpu_*() APIs convert the offset returned by alloc_percpu() to a
pointer into this area, which is the same on all CPUs. To simplify the
discussion, I call the extra allocated area the "local mapping" and the
original area the "global mapping". The percpu chunk then looks like:

    | CPU 0 | CPU 1 | …… | CPU n | xxxxxxxxx | CPU |
    |<------ global mapping ---->|           |<--->|
                                              local mapping

The this_cpu_*() APIs access only the local mapping; the per_cpu_ptr()
APIs continue to use the global mapping.

The local mapping has to map to different physical memory on different
CPUs (the physical memory already mapped by the global mapping, so no
extra physical memory is allocated) in order to manipulate the right
copy. This is achieved with per-CPU kernel page tables in arch-dependent
code: each CPU sees its own copy of the kernel page table instead of
sharing a single one. However, most of the page table contents can still
be shared; only the area covering the percpu local mapping differs. So
the CPUs can basically share the PUD/PMD/PTE levels and differ only in
the PGD.

The kernel already maintains a base address for the global mapping in
order to convert the offset returned by alloc_percpu() to the correct
pointer. The local mapping also needs a base address, and the offset
between the local mapping base address and the allocated local mapping
area must be the same as the offset returned by alloc_percpu(). So the
local mapping has to live in a specific address range, which may require
a dedicated percpu local mapping area that vmalloc() cannot use, in
order to avoid conflicts.

I have done a PoC on ARM64. I hope to post it to the mailing list before
the conference to ease the discussion.
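As a rough illustration of the two mappings, the sketch below shows the
pointer conversions this design implies. It is a sketch only;
pcpu_unit_base[] and pcpu_local_base are hypothetical names for the base
addresses discussed above, not existing kernel symbols.

    /* The offset returned by alloc_percpu(); identical on every CPU. */
    unsigned long __percpu *off = alloc_percpu(unsigned long);

    /*
     * per_cpu_ptr(off, cpu): global mapping.  Each CPU's copy lives at a
     * different virtual address, so the caller names the CPU explicitly.
     */
    unsigned long *remote = (void *)(pcpu_unit_base[cpu] + (unsigned long)off);

    /*
     * this_cpu_*() under this proposal: local mapping.  The virtual address
     * is the same on every CPU; the per-CPU page table (private PGD entry,
     * shared lower levels) makes it resolve to the physical pages of that
     * CPU's copy in the global mapping.
     */
    unsigned long *local = (void *)(pcpu_local_base + (unsigned long)off);

The requirement that the local pointer be pcpu_local_base plus the very
same offset is what forces the local mapping into a dedicated address
range that vmalloc() must not otherwise hand out.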
Overhead
========
1. Some extra virtual memory space, but not much: I saw 960K with the
   default Fedora kernel config. Given the terabytes of virtual address
   space on a 64-bit machine, 960K is negligible.
2. Some extra physical memory for the percpu kernel page tables:
   4K * (nr_cpus - 1) for PGD pages, plus the page tables used by the
   percpu local mapping area. A couple of megabytes with the default
   Fedora kernel config on AmpereOne with 160 cores.
3. Percpu allocation and free will be slower due to the extra virtual
   memory allocation and page table manipulation. However, percpu memory
   is allocated by chunk, and one chunk typically holds a lot of percpu
   variables, so the slowdown should be negligible. The test results
   below also bear this out.

Performance Test
================
The PoC is on ARM64, so all tests were run on AmpereOne with 160 cores.

1. Kernel build
---------------
Run a kernel build (make -j160) with the default Fedora kernel config in
a memcg.
Roughly 13% - 15% systime improvement for my kernel build workload.

2. stress-ng
------------
stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000
6% improvement in systime.

3. vm-scalability
-----------------
Single digit (0 - 8%) systime improvement for some vm-scalability test
cases.

4. will-it-scale
----------------
3% - 8% improvement for the page fault cases from will-it-scale.
Profiling page_fault3_processes from will-it-scale also shows the
reduction in percpu counter manipulation (perf diff output):

    5.91%  -1.82%  [kernel.kallsyms]  [k] mod_memcg_lruvec_state
    2.84%  -1.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch

Regression Test
===============
Create 10K cgroups. Creating a cgroup calls the percpu allocator
multiple times; for example, creating one memcg allocates the percpu
refcnt, rstat, and the objcg percpu refcnt.

It consumed 2112K more virtual memory for the percpu local mapping, plus
a few more megabytes of percpu page tables to map the local mapping. The
memory consumption depends on the number of CPUs.

Execution time is basically the same; no noticeable regression was
found. The profiling shows (perf diff):

    0.35%  -0.33%  [kernel.kallsyms]  [k] percpu_ref_get_many
    0.61%  -0.30%  [kernel.kallsyms]  [k] percpu_counter_add_batch
    0.34%  +0.02%  [kernel.kallsyms]  [k] pcpu_alloc_noprof
    0.00%  +0.05%  [kernel.kallsyms]  [k] free_percpu.part.0

The gain from manipulating percpu counters outweighs the slowdown in
percpu allocation and free; there is even a small net gain.

Future usecases
===============
Per-CPU page tables may unlock other use cases, for example kernel text
replication, but that is not the main point of this proposal.

Key attendees
=============
This work requires changes to the percpu allocator, to vmalloc (a new
interface that takes a pgd pointer as an argument), and to
arch-dependent code (the percpu page table implementation is
arch-dependent). So the percpu allocator maintainers, the vmalloc
maintainer, and arch experts (for example, ARM64) should be key
attendees. I don't know who can attend, so I just list all of them:

Christoph Lameter (co-presenter and percpu allocator maintainer)
Dennis Zhou/Tejun Heo (percpu allocator maintainers)
Uladzislau Rezki (vmalloc maintainer)
Catalin Marinas/Will Deacon/Ryan Roberts (ARM64 memory management)

Thanks,
Yang