From: Harry Yoo <harry.yoo@oracle.com>
To: Pedro Falcato <pfalcato@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>,
Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
Mateusz Guzik <mjguzik@gmail.com>,
Jamal Hadi Salim <jhs@mojatatu.com>,
Cong Wang <xiyou.wangcong@gmail.com>,
Jiri Pirko <jiri@resnulli.us>, Vlad Buslov <vladbu@nvidia.com>,
Yevgeny Kliteynik <kliteyn@nvidia.com>, Jan Kara <jack@suse.cz>,
Byungchul Park <byungchul@sk.com>,
linux-mm@kvack.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
Date: Mon, 28 Apr 2025 10:18:34 +0900 [thread overview]
Message-ID: <aA7XagXweCLdATTH@harry> (raw)
In-Reply-To: <vd3k2bljkzow6ozzan2hkeiyytcqe2g6gavroej23457erucza@fknlr6cmzvo7>
On Fri, Apr 25, 2025 at 11:42:27AM +0100, Pedro Falcato wrote:
> On Fri, Apr 25, 2025 at 07:12:02PM +0900, Harry Yoo wrote:
> > On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> > > On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > > > Overview
> > > > ========
> > > >
> > > > The slab destructor feature existed in early days of slab allocator(s).
> > > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > > for destructors") in 2007 due to lack of serious use cases at that time.
> > > >
> > > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > > constructor/destructor pair to mitigate the global serialization point
> > > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > > percpu memory during its lifetime.
> > > >
> > > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > > so each allocate–free cycle requires two expensive acquire/release on
> > > > that mutex.
> > > >
> > > > We can mitigate this contention by retaining the percpu regions after
> > > > the object is freed and releasing them only when the backing slab pages
> > > > are freed.
> > > >
> > > > How to do this with slab constructors and destructors: the constructor
> > > > allocates percpu memory, and the destructor frees it when the slab pages
> > > > are reclaimed; this slightly alters the constructor’s semantics,
> > > > as it can now fail.
> > > >
> > >
> > > I really really really really don't like this. We're opening a pandora's box
> > > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > > solution there would be, what, failing dtors? which says a lot about the whole
> > > situation...
> > >
> > > Case in point:
> >
> > <...snip...>
> >
> > > Then there are obviously other problems like: whatever you're calling must
> > > not ever require the slab allocator (directly or indirectly) and must not
> > > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > > is a no-go (AIUI!) already because of such issues.
> >
> > Could you please elaborate more on this?
>
> Well, as discussed multiple-times both on-and-off-list, the pcpu allocator is
> not a problem here because the freeing path takes a spinlock, not a mutex.
Yes, and it seems to be a leaf spinlock (no code path in the kernel takes
any other locks while holding the lock).
> But obviously you can see the fun locking horror dependency chains
> we're creating with this patchset.
>
> ->ctor() needs to be super careful calling things, avoiding
> any sort of loop.
You mean recursion to avoid exhausting the kernel stack?
It'd be fine as long as ->ctor does not allocate objects from
the same cache (and of course it should not).
> ->dtor() needs to be super careful calling things, avoiding
> _any_ sort of direct reclaim possibilities.
Why would a dtor _ever_ need to perform direct reclamation?
>You also now need to pass a gfp_t to both ->ctor and ->dtor.
Passing gfp_t to ->ctor agreed, but why to ->dtor?
Destructors should not allocate any memory and thus no need for
'Get Free Page' flags.
> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
> lock holding, etc, can be hidden with a normal kmem_cache_alloc;
Speaking of deadlocks involving ->ctor, of course, you shouldn't allocate
mm_struct while holding pcpu_lock, or allocate a pgd while holding pgd_lock.
I don't think avoiding deadlocks caused by ->ctor is that difficult, and
lockdep can detect them even if someone makes a mistake.
> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
> grabbing locks
AIUI it can't recurse back onto the slab allocator.
'Taking only leaf spinlocks' is a very restrictive rule.
For example, slab takes list_lock, and it is not a leaf spinlock because
in some path slab can take other locks while holding it. And thus you can't
call kmem_cache_free() directly in ->dtor.
'Doing fancy things and grabbing locks' in dtor is safe as long as
you 1) disable interrupts and 2) take only leaf spinlocks.
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
> to dispose of slabs. It can also uncontrollably do $whatever.
'Can uncontrollably do $whatever' is not true.
It's safe as long as ->dtor only takes leaf spinlocks.
> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
> ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
> cache_free/slab disposal issues
>
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
> issues by purely starting up shrinkers on direct reclaim as well.
I don't see that is a problem (as long as ->dtor only takes leaf spinlocks).
--
Cheers,
Harry / Hyeonggon
next prev parent reply other threads:[~2025-04-28 1:25 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-24 8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
2025-04-24 8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
2025-04-24 9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
2025-04-24 9:58 ` Harry Yoo
2025-04-24 15:00 ` Mateusz Guzik
2025-04-24 11:28 ` Pedro Falcato
2025-04-24 15:20 ` Mateusz Guzik
2025-04-24 16:11 ` Mateusz Guzik
2025-04-25 7:40 ` Harry Yoo
2025-04-25 10:12 ` Harry Yoo
2025-04-25 10:42 ` Pedro Falcato
2025-04-28 1:18 ` Harry Yoo [this message]
2025-04-30 19:49 ` Mateusz Guzik
2025-05-12 11:00 ` Harry Yoo
2025-04-24 15:50 ` Christoph Lameter (Ampere)
2025-04-24 16:03 ` Mateusz Guzik
2025-04-24 16:39 ` Christoph Lameter (Ampere)
2025-04-24 17:26 ` Mateusz Guzik
2025-04-24 18:47 ` Tejun Heo
2025-04-25 10:10 ` Harry Yoo
2025-04-25 19:03 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aA7XagXweCLdATTH@harry \
--to=harry.yoo@oracle.com \
--cc=akpm@linux-foundation.org \
--cc=byungchul@sk.com \
--cc=cl@gentwo.org \
--cc=dennis@kernel.org \
--cc=jack@suse.cz \
--cc=jhs@mojatatu.com \
--cc=jiri@resnulli.us \
--cc=kliteyn@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mjguzik@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=pfalcato@suse.de \
--cc=rientjes@google.com \
--cc=tj@kernel.org \
--cc=vbabka@suse.cz \
--cc=vladbu@nvidia.com \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.