Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator


From: Gabriel Krisman Bertazi <krisman@suse.de>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Jan Kara <jack@suse.cz>,  Harry Yoo <harry.yoo@oracle.com>,
	linux-mm@kvack.org,  lsf-pc@lists.linux-foundation.org,
	 Mateusz Guzik <mjguzik@gmail.com>,
	 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Tejun Heo <tj@kernel.org>,  Christoph Lameter <cl@gentwo.org>,
	 Dennis Zhou <dennis@kernel.org>,
	 Vlastimil Babka <vbabka@suse.cz>,  Hao Li <hao.li@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] Ways to mitigate limitations of percpu memory allocator
Date: Fri, 06 Mar 2026 11:26:22 -0500
Message-ID: <87seaczzyp.fsf@mailhost.krisman.be>
In-Reply-To: <qz3f4p2ra6nq5cx3vlacmwif2ih5ojbf7s3ydzw6d7tgqn24lj@pnynq4l6oovc> (Pedro Falcato's message of "Fri, 6 Mar 2026 15:35:36 +0000")

Pedro Falcato <pfalcato@suse.de> writes:

> On Thu, Mar 05, 2026 at 12:48:21PM +0100, Jan Kara wrote:
>
>> On Thu 05-03-26 11:33:21, Pedro Falcato wrote:
>> > On Fri, Feb 27, 2026 at 03:41:50PM +0900, Harry Yoo wrote:
>> > > Hi folks, I'd like to discuss ways to mitigate limitations of
>> > > percpu memory allocator. 
>> > > 
>> > > While the percpu memory allocator has served its role well,
>> > > it has a few problems: 1) its global lock contention, and
>> > > 2) lack of features to avoid high initialization cost of percpu memory.
>> > > 
>> > > Global lock contention
>> > > =======================
>> > > 
>> > > Percpu allocator has a global lock when allocating or freeing memory.
>> > > Of course, caching percpu memory is not always worth it, because
>> > > it would meaningfully increase memory usage.
>> > > 
>> > > However, some users (e.g., fork+exec, tc filter) suffer from
>> > > the lock contention when many CPUs allocate / free percpu memory
>> > > concurrently.
>> > > 
>> > > That said, we need a way to cache percpu memory per cpu, in a selective
>> > > way. As an opt-in approach, Mateusz Guzik proposed [1] keeping percpu
>> > > memory in slab objects and letting slab cache them per cpu,
>> > > with slab ctor+dtor pair: allocate percpu memory and
>> > > associate it with slab object in constructor, and free it when
>> > > deallocating slabs (with resurrecting slab destructor feature).
>> > > 
>> > > This only works when percpu memory is associated with slab objects.
>> > > I would like to hear if anybody thinks it's still worth redesigning
>> > > percpu memory allocator for better scalability.
>> > 
>> > I think this (make alloc_percpu actually scale) is the obvious suggestion.
>> > Everything else is just papering over the cracks.
>> 
>> I disagree. There are two separate (although related) issues that need
>> solving. One issue is certainly scalability of the percpu allocator.
>> Another issue (which is also visible in singlethreaded workloads) is that
>> a percpu counter creation has a rather large cost even if the allocator is
>> totally uncontended - this is because of the initialization (and final
>> summarization) cost. And this is very visible e.g. in the fork() intensive
>> loads such as shell scripts where we currently allocate several percpu
>> arrays for each fork() and significant part of the fork() cost is currently
>> the initialization of percpu arrays on larger machines. Reducing this
>> overhead is a separate goal.
>
> I agree that it's a separate issue. But it's as much of an issue for
> single-threaded processes as much as multi-threaded. Say you have a 64 core
> CPU. Why should you pay for 64 separate cores when you only spawned 2 threads?
> (and, yes, this is a not-so-rare situation, like lld which spawns up to 16
> threads (https://reviews.llvm.org/D147493), even if you have hundreds
> of CPUs)

True.  Still, being an up-front initialization cost, it matters most
for the shortest-lived tasks.  I'd imagine that even for something
like lld doing 16 clone syscalls, the overhead of a single percpu
counter initialization is a very small blip in the profile, not worth
special-casing.  The single-threaded case is the obvious one to
optimize in this sense.
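
To make the "up-front cost vs. task lifetime" argument concrete, here
is a rough userspace model, not kernel code.  The names
pcpu_counter_model, pcpu_model_init() and init_share() are made up for
illustration, and "one store per possible CPU" is an assumed stand-in
for the real initialization work:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a percpu counter: one slot per possible CPU. */
struct pcpu_counter_model {
	long *slots;
	size_t nr_cpus;
};

/*
 * Zero every per-CPU slot.  Returns the number of slots touched, which
 * stands in for the up-front cost: it grows with nr_cpus no matter how
 * many CPUs the task will actually run on.
 */
size_t pcpu_model_init(struct pcpu_counter_model *c, long *storage,
		       size_t nr_cpus)
{
	size_t touched = 0;

	c->slots = storage;
	c->nr_cpus = nr_cpus;
	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		c->slots[cpu] = 0;
		touched++;
	}
	return touched;
}

/*
 * Fraction of a task's total work spent on initialization, with both
 * quantities expressed in the same arbitrary unit.
 */
double init_share(size_t init_cost, size_t useful_work)
{
	assert(init_cost + useful_work > 0);
	return (double)init_cost / (double)(init_cost + useful_work);
}
```

In this model, a 64-CPU init that precedes 64 units of useful work
accounts for half of the task's cost, while the same init ahead of
64000 units is about 0.1%.  That is the sense in which the cost only
shows up for very short-lived tasks.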

> So perhaps the best way to go about this problem would be to go back to
> per-task RSS accounting. This one had problems with many-task RSS accuracy,
> but the current one has problems for many-cpu RSS accuracy.

The current pcpu one has a much smaller accuracy error than per-task,
which justified its inclusion in the first place, no?  IIRC, there was a
real use case where the worse accuracy mattered for process selection
during OOM.
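
As a back-of-the-envelope sketch of the accuracy trade-off, assuming
that each context (a CPU for percpu counters, a thread for per-task
accounting) can hold back up to a fixed amount of unsynchronized delta
before folding it into the shared total.  worst_case_drift() is a
hypothetical helper, and the slack values a caller passes in are
illustrative parameters, not constants taken from the kernel:

```c
#include <assert.h>
#include <limits.h>

/*
 * Worst-case difference between an approximate read and the true
 * value, in pages, when each of nr_contexts may hold up to
 * per_context_slack pages that have not been folded into the total.
 */
unsigned long worst_case_drift(unsigned long nr_contexts,
			       unsigned long per_context_slack)
{
	/* Guard against overflow in the multiplication below. */
	assert(per_context_slack == 0 ||
	       nr_contexts <= ULONG_MAX / per_context_slack);
	return nr_contexts * per_context_slack;
}
```

Under these assumptions the percpu bound scales with the CPU count and
the per-task bound with the thread count, so which scheme is worse
depends on which of the two is larger for a given workload.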

> A single-threaded
> optimization could patch over the problem for the vast majority of programs,
> but exceptions exist.

>
> Or another possible idea: lazily initialize these cpu counters somehow,
> on task switch.

>
> I'm afraid that while the solution presented by Mathieu fixes a problem with
> the current scheme (insane inaccuracy on large-cpu-count), it might also add
> to the percpu allocation + init problem (this might not be true, I have not
> paid too much attention).

-- 
Gabriel Krisman Bertazi
