Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Harry Yoo <harry.yoo@oracle.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jan Kara <jack@suse.cz>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Gabriel Krisman Bertazi <krisman@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Michal Hocko <mhocko@kernel.org>, Dennis Zhou <dennis@kernel.org>,
	Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@gentwo.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Date: Mon, 1 Dec 2025 19:38:55 +0900	[thread overview]
Message-ID: <aS1wPxSdd-WcU__K@hyeyoo> (raw)
In-Reply-To: <CAGudoHEyX1gdwG_MVf-M2KMHBE1Juo6VbfSyp3rbXoS+5JaNtw@mail.gmail.com>

On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > > What would really reduce memory allocation overhead on fork
> > > is to move all those fields into a top level
> > > "struct mm_percpu_struct" as a first step. This would
> > > merge 3 per-cpu allocations into one when forking a new
> > > task.
> > >
> > > Then the second step is to create a mm_percpu_struct
> > > cache to bypass the per-cpu allocator.
> > >
> > > I suspect that by doing just that we'd get most of the
> > > performance benefits provided by the single-threaded special-case
> > > proposed here.
> >
> > I don't think so. Because in the profiles I have been doing for these
> > loads the biggest cost wasn't actually the per-cpu allocation itself but
> > the cost of zeroing the allocated counter for many CPUs (and then the
> > counter summarization on exit) and you're not going to get rid of that with
> > just reshuffling per-cpu fields and adding slab allocator in front.
> >
> 
> The entire ordeal has been discussed several times already. I'm rather
> disappointed there is a new patchset posted which does not address any
> of it and goes straight to special-casing single-threaded operation.
> 
> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which are tractable
> with the same kind of thing which would sort out the multi-threaded
> problem
> 2. unfortunately there is an increasing number of multi-threaded (and
> often short lived) processes (example: lld, the linker form the llvm
> project; more broadly plenty of things Rust where people think
> threading == performance)
> 
> Bottom line is, solutions like the one proposed in the patchset are at
> best a stopgap and even they leave performance on the table for the
> case they are optimizing for.
> 
> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
> 
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-pronnes and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.
> 
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
> 
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
>         for_each_possible_cpu(i) {
>                 struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
> 
>                 pcpu_cid->cid = MM_CID_UNSET;
>                 pcpu_cid->recent_cid = MM_CID_UNSET;
>                 pcpu_cid->time = 0;
>         }
> 
> There is no way this is not visible already on 256 threads.
> 
> Preferably some magic would be done to init this on first use on given
> CPU.There is some bitmap tracking CPU presence, maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
> 
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
> 
> So instead the code could reach out to the percpu allocator to
> allocate memory for both cid and rss (as mentined by Mathieu), but
> have it returned uninitialized and loop over it once sorting out both
> cid and rss in the same body. This should be drastically faster than
> the current code.
> 
> But one may observe it is an invariant the values sum up to 0 on process exit.
> 
> So if one was to make sure the first time this is handed out by the
> percpu allocator the values are all 0s and then cache the area
> somewhere for future allocs/frees of mm, there would be no need to do
> the zeroing on alloc.

That's what slab constructor is for!

> On the free side summing up rss counters in check_mm() is only there
> for debugging purposes. Suppose it is useful enough that it needs to
> stay. Even then, as implemented right now, this is just slow for no
> reason:
> 
>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>                 long x = percpu_counter_sum(&mm->rss_stat[i]);
> [snip]
>         }
> 
> That's *four* loops with extra overhead of irq-trips for every single
> one. This can be patched up to only do one loop, possibly even with
> irqs enabled the entire time.
> 
> Doing the loop is still slower than not doing it, but his may be just
> fast enough to obsolete the ideas like in the proposed patchset.
> 
> While per-cpu level caching for all possible allocations seems like
> the easiest way out, it in fact does *NOT* fully solve problem -- you
> are still going to globally serialize in lru_gen_add_mm() (and the del
> part), pgd_alloc() and other places.
> 
> Or to put it differently, per-cpu caching of mm_struct itself makes no
> sense in the current kernel (with the patchset or not) because on the
> way to finish the alloc or free you are going to globally serialize
> several times and *that* is the issue to fix in the long run. You can
> make the problematic locks fine-grained (and consequently alleviate
> the scalability aspect), but you are still going to suffer the
> overhead of taking them.
> 
> As far as I'm concerned the real long term solution(tm) would make the
> cached mm's retain the expensive to sort out state -- list presence,
> percpu memory and whatever else.
> 
> To that end I see 2 feasible approaches:
> 1. a dedicated allocator with coarse granularity
> 
> Instead of per-cpu, you could have an instance for every n threads
> (let's say 8 or whatever). this would pose a tradeoff between total
> memory usage and scalability outside of a microbenchmark setting. you
> are still going to serialize in some cases, but only once on alloc and
> once on free, not several times and you are still cheaper
> single-threaded. This is faster all around.
> 
> 2. dtor support in the slub allocator
> 
> ctor does the hard work and dtor undoes it. There is an unfinished
> patchset by Harry which implements the idea[1].

Apologies for not reposting it for a while. I have limited capacity to push
this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
branch after rebasing it onto the latest slab/for-next.

https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads

My review on the version is limited, but did a little bit of testing.

> There is a serious concern about deadlock potential stemming from
> running arbitrary dtor code during memory reclaim. I already described
> elsewhere how with a little bit of discipline supported by lockdep
> this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
> take any locks if you hold them and you have to disable interrupts) +
> mark dtors as only allowed to hold a leaf spinlock et voila, code
> guaranteed to not deadlock). But then all code trying to cache its
> state in to be undone with dtor has to be patched to facilitate it.
> Again bugs in the area sorted out by lockdep.
> 
> The good news is that folks were apparently open to punting reclaim of
> such memory into a workqueue, which completely alleviates that concern
> anyway.

I took the good news and switched to using workqueue to reclaim slabs
(for caches with dtor) in v2.

> So happens if fork + exit is involved there are numerous other
> bottlenecks which overshadow the above, but that's a rant for another
> day. Here we can pretend for a minute they are solved.
> 
> [1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads

-- 
Cheers,
Harry / Hyeonggon

next prev parent reply	other threads:[~2025-12-01 10:39 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-27 23:36 [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 1/4] lib/percpu_counter: Split out a helper to insert into hotplug list Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 2/4] lib: Support lazy initialization of per-cpu counters Gabriel Krisman Bertazi
2025-11-27 23:36 ` [RFC PATCH 3/4] mm: Avoid percpu MM counters on single-threaded tasks Gabriel Krisman Bertazi
2025-11-28  1:48   ` kernel test robot
2025-11-28  3:55   ` kernel test robot
2025-11-27 23:36 ` [RFC PATCH 4/4] mm: Split a slow path for updating mm counters Gabriel Krisman Bertazi
2025-12-01 10:19   ` David Hildenbrand (Red Hat)
2025-11-28 13:30 ` [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks Mathieu Desnoyers
2025-11-28 20:10   ` Jan Kara
2025-11-28 20:12     ` Mathieu Desnoyers
2025-11-29  5:57     ` Mateusz Guzik
2025-11-29  7:50       ` Mateusz Guzik
2025-12-01 10:38       ` Harry Yoo [this message]
2025-12-01 11:31         ` Mateusz Guzik
2025-12-01 14:47           ` Mathieu Desnoyers
2025-12-01 15:23       ` Gabriel Krisman Bertazi
2025-12-01 19:16         ` Harry Yoo
2025-12-03 11:02         ` Mateusz Guzik
2025-12-03 11:54           ` Mateusz Guzik
2025-12-03 14:36             ` Mateusz Guzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aS1wPxSdd-WcU__K@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@gentwo.org \
    --cc=david@redhat.com \
    --cc=dennis@kernel.org \
    --cc=jack@suse.cz \
    --cc=krisman@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhocko@kernel.org \
    --cc=mjguzik@gmail.com \
    --cc=rppt@kernel.org \
    --cc=shakeel.butt@linux.dev \
    --cc=surenb@google.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.