netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>, Vlad Buslov <vladbu@nvidia.com>,
	Yevgeny Kliteynik <kliteyn@nvidia.com>, Jan Kara <jack@suse.cz>,
	Byungchul Park <byungchul@sk.com>,
	linux-mm@kvack.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem
Date: Thu, 24 Apr 2025 18:58:01 +0900	[thread overview]
Message-ID: <aAoLKVwC5JCe7fbv@harry> (raw)
In-Reply-To: <CAGudoHGkNn032RVuJdwYqpzfQgAB8pv=hEzdR1APsFOOSQnq1Q@mail.gmail.com>

On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > Overview
> > ========
> >
> > The slab destructor feature existed in early days of slab allocator(s).
> > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > for destructors") in 2007 due to lack of serious use cases at that time.
> >
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
> >
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> >
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> >
> > This series is functional (although not compatible with MM debug
> > features yet), but still far from perfect. I’m actively refining it and
> > would appreciate early feedback before I improve it further. :)
> >
> 
> Thanks for looking into this.

You're welcome. Thanks for the proposal.

> The dtor thing poses a potential problem where a dtor acquiring
> arbitrary locks can result in a deadlock during memory reclaim.

AFAICT, MM does not reclaim slab memory unless we register shrinker
interface to reclaim it. Or am I missing something?

Hmm let's say it does anyway, then is this what you worry about?

someone requests percpu memory
-> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
-> allocates pages from buddy
-> buddy reclaims slab memory
-> slab destructor calls pcpu_alloc_mutex (deadlock!)

> So for this to be viable one needs to ensure that in the worst case
> this only ever takes leaf-locks (i.e., locks which are last in any
> dependency chain -- no locks are being taken if you hold one).

Oh, then you can't allocate memory while holding pcpu_lock or
pcpu_alloc_mutex?

> This
> needs to demonstrate the percpu thing qualifies or needs to refactor
> it to that extent.
>
> > This series is based on slab/for-next [2].
> >
> > Performance Improvement
> > =======================
> >
> > I measured the benefit of this series for two different users:
> > exec() and tc filter insertion/removal.
> >
> > exec() throughput
> > -----------------
> >
> > The performance of exec() is important when short-lived processes are
> > frequently created. For example: shell-heavy workloads and running many
> > test cases [3].
> >
> > I measured exec() throughput with a microbenchmark:
> >   - 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
> >   - 4.56% gain on a desktop with 24 hardware threads, and
> >   - Even 4% gain on a single-threaded exec() throughput.
> >
> > Further investigation showed that this was due to the overhead of
> > acquiring/releasing pcpu_alloc_mutex and its contention.
> >
> > See patch 7 for more detail on the experiment.
> >
> > Traffic Filter Insertion and Removal
> > ------------------------------------
> >
> > Each tc filter allocates three percpu memory regions per tc_action object,
> > so frequently inserting and removing filters contend heavily on the same
> > mutex.
> >
> > In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> > more detail), I observed a 26% reduction in system time and observed
> > much less contention on pcpu_alloc_mutex with this series.
> >
> > I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> > about tc filter insertion rate; these changes may still benefit
> > workloads they run today.
> >
> > Feedback Needed from Percpu Allocator Folks
> > ===========================================
> >
> > As percpu allocator is directly affected by this series, this work
> > will need support from the percpu allocator maintainers, and we need to
> > address their concerns.
> >
> > They will probably say "This is a percpu memory allocator scalability
> > issue and we need to make it scalable"? I don't know.
> >
> > What do you say?
> >
> > Some hanging thoughts:
> > - Tackling the problem on the slab side is much simpler, because the slab
> >   allocator already caches objects per CPU. Re-creating similar logic
> >   inside the percpu allocator would be redundant.
> >
> >   Also, since this is opt-in per slab cache, other percpu allocator
> >   users remain unaffected.
> >
> > - If fragmentation is a concern, we could probably allocate larger percpu
> >   chunks and partition them for slab objects.
> >
> > - If memory overhead becomes an issue, we could introduce a shrinker
> >   to free empty slabs (and thus releasing underlying percpu memory chunks).
> >
> > Patch Sequence
> > ==============
> >
> > Patch #1 refactors freelist_shuffle() to allow the slab constructor to
> > fail in the next patch.
> >
> > Patch #2 allows the slab constructor fail.
> >
> > Patch #3 implements the slab destructor feature.
> >
> > Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
> >
> > Patch #5, #6 implements APIs to charge and uncharge percpu memory and
> > percpu counter.
> >
> > Patch #7 converts mm_struct to use the slab ctor/dtor pair.
> >
> > Known issues with MM debug features
> > ===================================
> >
> > The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
> > and DEBUG_OBJECTS.
> >
> > KASAN reports an error when a percpu counter is inserted into the
> > percpu_counters linked list because the counter has not been allocated
> > yet.
> >
> > DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed, while
> > the associated percpu memory is still resident in memory.
> >
> > I don't expect fixing these issues to be too difficult, but I need to
> > think a little bit to fix it.
> >
> > [1] https://urldefense.com/v3/__https://lore.kernel.org/linux-mm/CAGudoHFc*Km-3usiy4Wdm1JkM*YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com__;Kys!!ACWV5N9M2RV99hQ!K8JHFp0DM1nkYDDnSbJNnwLOl-6PSEXnUlekFs6paI9bGha34XCp9q9wKF6E8S1I4ZHZKpnI6wgKqLM$ 
> >
> > [2] https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab*for-next__;Lw!!ACWV5N9M2RV99hQ!K8JHFp0DM1nkYDDnSbJNnwLOl-6PSEXnUlekFs6paI9bGha34XCp9q9wKF6E8S1I4ZHZKpnIGu8ThaA$ 
> >
> > [3] https://urldefense.com/v3/__https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3__;!!ACWV5N9M2RV99hQ!K8JHFp0DM1nkYDDnSbJNnwLOl-6PSEXnUlekFs6paI9bGha34XCp9q9wKF6E8S1I4ZHZKpnIN_ctSTM$ 
> >
> > [4] https://urldefense.com/v3/__https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com__;!!ACWV5N9M2RV99hQ!K8JHFp0DM1nkYDDnSbJNnwLOl-6PSEXnUlekFs6paI9bGha34XCp9q9wKF6E8S1I4ZHZKpnIDPKy5XU$ 
> >
> > Harry Yoo (7):
> >   mm/slab: refactor freelist shuffle
> >   treewide, slab: allow slab constructor to return an error
> >   mm/slab: revive the destructor feature in slab allocator
> >   net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
> >     alloc
> >   mm/percpu: allow (un)charging objects without alloc/free
> >   lib/percpu_counter: allow (un)charging percpu counters without
> >     alloc/free
> >   kernel/fork: improve exec() throughput with slab ctor/dtor pair
> >
> >  arch/powerpc/include/asm/svm.h            |   2 +-
> >  arch/powerpc/kvm/book3s_64_mmu_radix.c    |   3 +-
> >  arch/powerpc/mm/init-common.c             |   3 +-
> >  arch/powerpc/platforms/cell/spufs/inode.c |   3 +-
> >  arch/powerpc/platforms/pseries/setup.c    |   2 +-
> >  arch/powerpc/platforms/pseries/svm.c      |   4 +-
> >  arch/sh/mm/pgtable.c                      |   3 +-
> >  arch/sparc/mm/tsb.c                       |   8 +-
> >  block/bdev.c                              |   3 +-
> >  drivers/dax/super.c                       |   3 +-
> >  drivers/gpu/drm/i915/i915_request.c       |   3 +-
> >  drivers/misc/lkdtm/heap.c                 |  12 +--
> >  drivers/usb/mon/mon_text.c                |   5 +-
> >  fs/9p/v9fs.c                              |   3 +-
> >  fs/adfs/super.c                           |   3 +-
> >  fs/affs/super.c                           |   3 +-
> >  fs/afs/super.c                            |   5 +-
> >  fs/befs/linuxvfs.c                        |   3 +-
> >  fs/bfs/inode.c                            |   3 +-
> >  fs/btrfs/inode.c                          |   3 +-
> >  fs/ceph/super.c                           |   3 +-
> >  fs/coda/inode.c                           |   3 +-
> >  fs/debugfs/inode.c                        |   3 +-
> >  fs/dlm/lowcomms.c                         |   3 +-
> >  fs/ecryptfs/main.c                        |   5 +-
> >  fs/efs/super.c                            |   3 +-
> >  fs/erofs/super.c                          |   3 +-
> >  fs/exfat/cache.c                          |   3 +-
> >  fs/exfat/super.c                          |   3 +-
> >  fs/ext2/super.c                           |   3 +-
> >  fs/ext4/super.c                           |   3 +-
> >  fs/fat/cache.c                            |   3 +-
> >  fs/fat/inode.c                            |   3 +-
> >  fs/fuse/inode.c                           |   3 +-
> >  fs/gfs2/main.c                            |   9 +-
> >  fs/hfs/super.c                            |   3 +-
> >  fs/hfsplus/super.c                        |   3 +-
> >  fs/hpfs/super.c                           |   3 +-
> >  fs/hugetlbfs/inode.c                      |   3 +-
> >  fs/inode.c                                |   3 +-
> >  fs/isofs/inode.c                          |   3 +-
> >  fs/jffs2/super.c                          |   3 +-
> >  fs/jfs/super.c                            |   3 +-
> >  fs/minix/inode.c                          |   3 +-
> >  fs/nfs/inode.c                            |   3 +-
> >  fs/nfs/nfs42xattr.c                       |   3 +-
> >  fs/nilfs2/super.c                         |   6 +-
> >  fs/ntfs3/super.c                          |   3 +-
> >  fs/ocfs2/dlmfs/dlmfs.c                    |   3 +-
> >  fs/ocfs2/super.c                          |   3 +-
> >  fs/openpromfs/inode.c                     |   3 +-
> >  fs/orangefs/super.c                       |   3 +-
> >  fs/overlayfs/super.c                      |   3 +-
> >  fs/pidfs.c                                |   3 +-
> >  fs/proc/inode.c                           |   3 +-
> >  fs/qnx4/inode.c                           |   3 +-
> >  fs/qnx6/inode.c                           |   3 +-
> >  fs/romfs/super.c                          |   3 +-
> >  fs/smb/client/cifsfs.c                    |   3 +-
> >  fs/squashfs/super.c                       |   3 +-
> >  fs/tracefs/inode.c                        |   3 +-
> >  fs/ubifs/super.c                          |   3 +-
> >  fs/udf/super.c                            |   3 +-
> >  fs/ufs/super.c                            |   3 +-
> >  fs/userfaultfd.c                          |   3 +-
> >  fs/vboxsf/super.c                         |   3 +-
> >  fs/xfs/xfs_super.c                        |   3 +-
> >  include/linux/mm_types.h                  |  40 ++++++---
> >  include/linux/percpu.h                    |  10 +++
> >  include/linux/percpu_counter.h            |   2 +
> >  include/linux/slab.h                      |  21 +++--
> >  ipc/mqueue.c                              |   3 +-
> >  kernel/fork.c                             |  65 +++++++++-----
> >  kernel/rcu/refscale.c                     |   3 +-
> >  lib/percpu_counter.c                      |  25 ++++++
> >  lib/radix-tree.c                          |   3 +-
> >  lib/test_meminit.c                        |   3 +-
> >  mm/kfence/kfence_test.c                   |   5 +-
> >  mm/percpu.c                               |  79 ++++++++++------
> >  mm/rmap.c                                 |   3 +-
> >  mm/shmem.c                                |   3 +-
> >  mm/slab.h                                 |  11 +--
> >  mm/slab_common.c                          |  43 +++++----
> >  mm/slub.c                                 | 105 ++++++++++++++++------
> >  net/sched/act_api.c                       |  82 +++++++++++------
> >  net/socket.c                              |   3 +-
> >  net/sunrpc/rpc_pipe.c                     |   3 +-
> >  security/integrity/ima/ima_iint.c         |   3 +-
> >  88 files changed, 518 insertions(+), 226 deletions(-)
> >
> > --
> > 2.43.0
> >
> 
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>

-- 
Cheers,
Harry / Hyeonggon

  reply	other threads:[~2025-04-24  9:59 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-24  8:07 [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 1/7] mm/slab: refactor freelist shuffle Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 2/7] treewide, slab: allow slab constructor to return an error Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 3/7] mm/slab: revive the destructor feature in slab allocator Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 4/7] net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu alloc Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 5/7] mm/percpu: allow (un)charging objects without alloc/free Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 6/7] lib/percpu_counter: allow (un)charging percpu counters " Harry Yoo
2025-04-24  8:07 ` [RFC PATCH 7/7] kernel/fork: improve exec() throughput with slab ctor/dtor pair Harry Yoo
2025-04-24  9:29 ` [RFC PATCH 0/7] Reviving the slab destructor to tackle the percpu allocator scalability problem Mateusz Guzik
2025-04-24  9:58   ` Harry Yoo [this message]
2025-04-24 15:00     ` Mateusz Guzik
2025-04-24 11:28 ` Pedro Falcato
2025-04-24 15:20   ` Mateusz Guzik
2025-04-24 16:11     ` Mateusz Guzik
2025-04-25  7:40     ` Harry Yoo
2025-04-25 10:12   ` Harry Yoo
2025-04-25 10:42     ` Pedro Falcato
2025-04-28  1:18       ` Harry Yoo
2025-04-30 19:49       ` Mateusz Guzik
2025-05-12 11:00         ` Harry Yoo
2025-04-24 15:50 ` Christoph Lameter (Ampere)
2025-04-24 16:03   ` Mateusz Guzik
2025-04-24 16:39     ` Christoph Lameter (Ampere)
2025-04-24 17:26       ` Mateusz Guzik
2025-04-24 18:47 ` Tejun Heo
2025-04-25 10:10   ` Harry Yoo
2025-04-25 19:03     ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aAoLKVwC5JCe7fbv@harry \
    --to=harry.yoo@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=byungchul@sk.com \
    --cc=cl@gentwo.org \
    --cc=dennis@kernel.org \
    --cc=jack@suse.cz \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=kliteyn@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mjguzik@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    --cc=vbabka@suse.cz \
    --cc=vladbu@nvidia.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).