The Linux Kernel Mailing List
* Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
       [not found] <20260429170758.3018959-1-yang@os.amperecomputing.com>
@ 2026-05-12  9:02 ` David Hildenbrand (Arm)
  2026-05-14  0:00   ` Yang Shi
  0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-12  9:02 UTC (permalink / raw)
  To: Yang Shi, cl, dennis, tj, urezki, catalin.marinas, will,
	ryan.roberts, akpm, hca, gor, agordeev
  Cc: linux-mm, linux-arm-kernel, linux-kernel

> =========
> The benchmarks are done on 160 core AmpereOne machine. The baseline is
> v7.1-rc1 kernel.
> 
> 1. Kernel Build
> ---------------
> Run kernel build (make -j160) with the default Fedora kernel config in a
> memcg.
> 13% - 18% sys time improvement
> 3% - 7% wall time improvement

This is pretty impressive!

There was quite some feedback during the LSF/MM session, what's the current plan?

Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
there a way forward?


Finally, in the LSF/MM session, there was the question why the preemption
handling is even required. Can you describe what the problem is?

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
  2026-05-12  9:02 ` [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) David Hildenbrand (Arm)
@ 2026-05-14  0:00   ` Yang Shi
  2026-05-15 16:28     ` Heiko Carstens
  0 siblings, 1 reply; 4+ messages in thread
From: Yang Shi @ 2026-05-14  0:00 UTC (permalink / raw)
  To: David Hildenbrand (Arm), cl, dennis, tj, urezki, catalin.marinas,
	will, ryan.roberts, akpm, hca, gor, agordeev
  Cc: linux-mm, linux-arm-kernel, linux-kernel



On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
>> =========
>> The benchmarks are done on 160 core AmpereOne machine. The baseline is
>> v7.1-rc1 kernel.
>>
>> 1. Kernel Build
>> ---------------
>> Run kernel build (make -j160) with the default Fedora kernel config in a
>> memcg.
>> 13% - 18% sys time improvement
>> 3% - 7% wall time improvement
> This is pretty impressive!

Thank you.

>
> There was quite some feedback during the LSF/MM session, what's the current plan?

We didn't talk about the plan in the LSF/MM session because time ran 
out. I had a hallway conversation with Ryan. He said he will try to 
replicate the performance benchmarks on some other ARM64 machines.

He raised a concern about CnP (Common not Private), but neither he nor 
I could find machines with a shared TLB. We do need some help running 
the patchset on those machines because disabling CnP may have 
performance implications.

I plan to polish up the patchset. There is still a lot of work to do to 
get it into better shape. Sounds like a plan?

I'm not sure whether the s390 folks will implement this on s390 or not; 
anyway, they are cc'ed.

>
> Also, it was raised that Linus so far didn't enjoy per-process page tables. Is
> there a way forward?

Yeah, it was discussed. My point is that it makes some sense for x86 
not to have per-CPU page tables: userspace and kernel share the same 
page table on x86, so the number of kernel page tables would be 
unbounded. But ARM64 is different. The hardware supports separate 
userspace and kernel page tables, so the number of kernel page tables 
is bounded by the number of CPUs. And my regression tests didn't show 
any noticeable regression from setting up the percpu local mapping for 
160 cores (meaning 160 kernel page tables).

So we should maximize the hardware benefit IMHO. And it should be up to 
the architecture maintainers.

>
>
> Finally, in the LSF/MM session, there was the question why the preemption
> handling is even required. Can you describe what the problem is?

Someone questioned why not just remove preempt_disable/enable because 
we just care about the sum of the counters. It may be OK for some 
cases, for example some simple statistics, but it may cause problems 
for a lot of use cases, for example:
     - __this_cpu_*() ops don't use atomic instructions. If they happen 
to access the same counter concurrently with this_cpu_*() ops, the 
counter may be corrupted.
     - this_cpu_write() may write a value or a pointer; after migration 
it may corrupt the remote CPU's copy.
     - The percpu counter may call into a slow path to flush the 
per-CPU counters into a global counter when a threshold is reached; an 
imprecise per-CPU counter may result in suboptimal behavior, for 
example calling into the slow path more often than necessary.
     - The statistics may go out of sync, or deviate more than 
expected, because a counter flush is skipped when the threshold is 
compared against the wrong value.
     - AFAIK the scheduler may use percpu counters for some percpu 
locks; an imprecise counter may cause lockups and misbehavior.
     - Some subsystems maintain percpu state, then make decisions based 
on that state. Corrupted percpu state may cause various problems.
     - this_cpu_cmpxchg() may compare against the remote CPU's value 
and result in an indefinite loop.

There are a lot of other cases that I may not be aware of because 
percpu is widely used by various subsystems. Anyway, the contract is 
that this_cpu_*() ops only access the local CPU's copy. Accessing a 
remote CPU's data is definitely not expected and may cause various 
problems.

Thanks,
Yang

>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
  2026-05-14  0:00   ` Yang Shi
@ 2026-05-15 16:28     ` Heiko Carstens
  2026-05-15 18:35       ` Yang Shi
  0 siblings, 1 reply; 4+ messages in thread
From: Heiko Carstens @ 2026-05-15 16:28 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand (Arm), cl, dennis, tj, urezki, catalin.marinas,
	will, ryan.roberts, akpm, gor, agordeev, linux-mm,
	linux-arm-kernel, linux-kernel

On Wed, May 13, 2026 at 05:00:19PM -0700, Yang Shi wrote:
> On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
> > There was quite some feedback during the LSF/MM session, what's the current plan?
...
> I'm not sure whether the s390 folks will implement this on s390 or not;
> anyway, they are cc'ed.

I'm not sure yet. However, after having a look at the architecture
documentation a couple of weeks ago, I think it shouldn't be too hard to get
this working on s390 as well. I was a bit concerned about TLB flushing if
changes to the kernel mapping happen with per-CPU page tables, but as of now
I believe this shouldn't cause any harm (famous last words...).

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series)
  2026-05-15 16:28     ` Heiko Carstens
@ 2026-05-15 18:35       ` Yang Shi
  0 siblings, 0 replies; 4+ messages in thread
From: Yang Shi @ 2026-05-15 18:35 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: David Hildenbrand (Arm), cl, dennis, tj, urezki, catalin.marinas,
	will, ryan.roberts, akpm, gor, agordeev, linux-mm,
	linux-arm-kernel, linux-kernel



On 5/15/26 9:28 AM, Heiko Carstens wrote:
> On Wed, May 13, 2026 at 05:00:19PM -0700, Yang Shi wrote:
>> On 5/12/26 2:02 AM, David Hildenbrand (Arm) wrote:
>>> There was quite some feedback during the LSF/MM session, what's the current plan?
> ...
>> I'm not sure whether the s390 folks will implement this on s390 or not;
>> anyway, they are cc'ed.
> I'm not sure yet. However, after having a look at the architecture
> documentation a couple of weeks ago, I think it shouldn't be too hard to get
> this working on s390 as well. I was a bit concerned about TLB flushing if
> changes to the kernel mapping happen with per-CPU page tables, but as of now
> I believe this shouldn't cause any harm (famous last words...).

Yeah, it shouldn't. The kernel needs to flush the TLB on all CPUs when 
a kernel mapping is changed, regardless of percpu page tables. There 
should not be any extra overhead in most cases.

Some extra TLB flushing is needed for the "percpu local mapping area", 
but all CPUs use the same virtual address, so we should just need one 
more TLB flush call with that virtual address for all CPUs. In 
addition, percpu chunk destruction happens asynchronously in a 
workqueue: unmapping page tables, flushing the TLB and freeing pages 
all happen in the workqueue when the whole chunk is freed. The fast 
path basically just updates an allocation bitmap.

Thanks,
Yang



^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-05-15 18:35 UTC | newest]

Thread overview: 4+ messages
     [not found] <20260429170758.3018959-1-yang@os.amperecomputing.com>
2026-05-12  9:02 ` [RFC v1 PATCH 0/11] Optimize this_cpu_*() ops for non-x86 (ARM64 for this series) David Hildenbrand (Arm)
2026-05-14  0:00   ` Yang Shi
2026-05-15 16:28     ` Heiko Carstens
2026-05-15 18:35       ` Yang Shi
