Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation

From: Ming Lei <ming.lei@redhat.com>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, Hao Li <hao.li@linux.dev>,
	surenb@google.com
Subject: Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
Date: Tue, 24 Feb 2026 17:07:18 +0800
Message-ID: <aZ1qRhIGDAR7d56r@fedora>
In-Reply-To: <aZ0wX_QuxNTxXHMj@hyeyoo>

Hi Harry,

On Tue, Feb 24, 2026 at 02:00:15PM +0900, Harry Yoo wrote:
> On Tue, Feb 24, 2026 at 10:52:28AM +0800, Ming Lei wrote:
> > Hello Vlastimil and MM guys,
> 
> Hi Ming, thanks for the report!
> 
> > The SLUB "sheaves" series merged via 815c8e35511d ("Merge branch
> > 'slab/for-7.0/sheaves' into slab/for-next") introduces a severe
> > performance regression for workloads with persistent cross-CPU
> > alloc/free patterns. ublk null target benchmark IOPS drops
> > significantly compared to v6.19: from ~36M IOPS to ~13M IOPS (~64%
> > drop).
> > 
> > Bisecting within the sheaves series is blocked by a kernel panic at
> > 17c38c88294d ("slab: remove cpu (partial) slabs usage from allocation
> > paths"), so the exact first bad commit could not be identified.
> 
> Ouch. Why did it crash?

[   16.162422] Oops: general protection fault, probably for non-canonical address 0xdead000000000110: 0000 [#1] SMP NOPTI
[   16.162426] CPU: 44 UID: 0 PID: 908 Comm: (udev-worker) Not tainted 6.19.0-rc5_master+ #116 PREEMPT(lazy) 
[   16.162429] Hardware name: Giga Computing MZ73-LM2-000/MZ73-LM2-000, BIOS R19_F40 05/12/2025
[   16.162430] RIP: 0010:__put_partials+0x2f/0x140
[   16.162437] Code: 41 57 41 56 49 89 f6 41 55 49 89 fd 31 ff 41 54 45 31 e4 55 53 48 83 ec 18 48 c7 44 24 10 00 00 00 00 eb 03 48 89 df 4c9
[   16.162438] RSP: 0018:ff5117c0ca2dfa60 EFLAGS: 00010086
[   16.162441] RAX: 0000000000000001 RBX: ff1b266981200d80 RCX: 0000000000000246
[   16.162442] RDX: ff1b266981200d90 RSI: ff1b266981200d90 RDI: ff1b266981200d80
[   16.162442] RBP: dead000000000100 R08: 0000000000000000 R09: ffffffffa761bf5e
[   16.162443] R10: ffb6d4b7841d5400 R11: ff1b2669800575c0 R12: 0000000000000000
[   16.162444] R13: ff1b2669800575c0 R14: dead000000000100 R15: ffb6d4b7846be410
[   16.162445] FS:  00007f5fdccc23c0(0000) GS:ff1b267902427000(0000) knlGS:0000000000000000
[   16.162446] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.162446] CR2: 0000559824c6c058 CR3: 000000011fb49001 CR4: 0000000000f71ef0
[   16.162447] PKRU: 55555554
[   16.162448] Call Trace:
[   16.162450]  <TASK>
[   16.162452]  kmem_cache_free+0x410/0x490
[   16.162454]  do_readlinkat+0x14e/0x180
[   16.162459]  __x64_sys_readlinkat+0x1c/0x30
[   16.162461]  do_syscall_64+0x7e/0x6b0
[   16.162465]  ? post_alloc_hook+0xb9/0x140
[   16.162468]  ? get_page_from_freelist+0x478/0x720
[   16.162470]  ? path_openat+0xb3/0x2a0
[   16.162472]  ? __alloc_frozen_pages_noprof+0x192/0x350
[   16.162474]  ? count_memcg_events+0xd6/0x210
[   16.162476]  ? memcg1_commit_charge+0x7a/0xa0
[   16.162479]  ? mod_memcg_lruvec_state+0xe7/0x2d0
[   16.162481]  ? charge_memcg+0x48/0x80
[   16.162482]  ? lruvec_stat_mod_folio+0x85/0xd0
[   16.162484]  ? __folio_mod_stat+0x2d/0x90
[   16.162487]  ? set_ptes.isra.0+0x36/0x80
[   16.162490]  ? do_anonymous_page+0x100/0x4a0
[   16.162492]  ? __handle_mm_fault+0x45d/0x6f0
[   16.162493]  ? count_memcg_events+0xd6/0x210
[   16.162494]  ? handle_mm_fault+0x212/0x340
[   16.162495]  ? do_user_addr_fault+0x2b4/0x7b0
[   16.162500]  ? irqentry_exit+0x6d/0x540
[   16.162502]  ? exc_page_fault+0x7e/0x1a0
[   16.162503]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
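
One way to read the faulting address, as a rough sketch and not a root
cause: 0xdead000000000110 sits 0x10 above LIST_POISON1 as defined in
include/linux/poison.h (0x100 plus the x86_64 default
CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000), and RBP/R14 hold
exactly that poison value. That would be consistent with __put_partials()
following a pointer taken from an already-deleted list entry, but this is
only an interpretation of the register dump. The helper names below are
made up for illustration, and the 57-bit check assumes 5-level paging,
which the ff1b... pointers in the dump suggest.

```python
# Constants as defined in include/linux/poison.h for x86_64 with the default
# CONFIG_ILLEGAL_POINTER_VALUE. Helper names are illustrative only.
POISON_POINTER_DELTA = 0xdead000000000000
LIST_POISON1 = 0x100 + POISON_POINTER_DELTA


def is_canonical(addr, va_bits=57):
    """True if bits 63..(va_bits-1) of addr are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def offset_from_list_poison1(addr):
    """Byte offset of addr above LIST_POISON1."""
    return addr - LIST_POISON1


fault_addr = 0xdead000000000110   # from the "Oops:" line
rbp = 0xdead000000000100          # RBP and R14 in the register dump
rbx = 0xff1b266981200d80          # an ordinary kernel pointer from the dump

print(hex(offset_from_list_poison1(fault_addr)))    # 0x10
print(rbp == LIST_POISON1)                          # True
print(is_canonical(fault_addr), is_canonical(rbx))  # False True
```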

> 
> > Reproducer
> > ==========
> > 
> > Hardware: NUMA machine with >= 32 CPUs
> > Kernel:   v7.0-rc (with slab/for-7.0/sheaves merged)
> > 
> >     # build kublk selftest
> >     make -C tools/testing/selftests/ublk/
> > 
> >     # create ublk null target device with 16 queues
> >     tools/testing/selftests/ublk/kublk add -t null -q 16
> > 
> >     # run fio/t/io_uring benchmark: 16 jobs, 20 seconds, non-polled
> >     taskset -c 0-31 fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0
> > 
> >     # cleanup
> >     tools/testing/selftests/ublk/kublk del -n 0
> > 
> > Good: v6.19 (and 41f1a08645ab, the mainline parent of the slab merge)
> > Bad:  815c8e35511d (Merge branch 'slab/for-7.0/sheaves' into slab/for-next)
> 
> Thanks for such detailed steps to reproduce :)
> 
> > perf profile (bad kernel)
> > =========================
> > 
> > ~47% of CPU time is spent in bio allocation hitting the SLUB slow path,
> > with massive spinlock contention on the node partial list lock:
> > 
> > +   47.65%     1.21%  io_uring  [k] bio_alloc_bioset
> > -   44.87%     0.45%  io_uring  [k] kmem_cache_alloc_noprof
> >    - 44.41% kmem_cache_alloc_noprof
> >       - 43.89% ___slab_alloc
> >          + 41.16% get_from_any_partial
> >            0.91% get_from_partial_node
> >          + 0.87% alloc_from_new_slab
> >          + 0.65% allocate_slab
> > -   44.70%     0.21%  io_uring  [k] mempool_alloc_noprof
> >    - 44.49% mempool_alloc_noprof
> >       - 44.43% kmem_cache_alloc_noprof
> >          - 43.90% ___slab_alloc
> >             + 41.18% get_from_any_partial
> >               0.90% get_from_partial_node
> >             + 0.87% alloc_from_new_slab
> >             + 0.65% allocate_slab
> > +   41.23%     0.10%  io_uring  [k] get_from_any_partial
> > +   40.82%     0.48%  io_uring  [k] __raw_spin_lock_irqsave
> > -   40.75%     0.20%  io_uring  [k] get_from_partial_node
> >    - 40.56% get_from_partial_node
> >       - 38.83% __raw_spin_lock_irqsave
> >            38.65% native_queued_spin_lock_slowpath
> 
> That's pretty severe contention. Interestingly, the profile shows
> severe contention on the alloc path, but I don't see the free path here.
> I wonder why only the alloc path is suffering, hmm...

The free path looks fine:

+    2.84%     0.16%  kublk            [kernel.kallsyms]       [k] mempool_free
+    2.66%     0.17%  kublk            [kernel.kallsyms]       [k] security_uring_cmd
+    2.57%     0.36%  kublk            [kernel.kallsyms]       [k] __slab_free

> 
> Anyway, I think there may be two pieces contributing to this contention:
> 
> Part 1) We probably made the portion of slowpath bigger,
>         by caching a smaller number of objects per CPU
>         after transitioning to sheaves.
> 
> Part 2) We probably made the slowpath much slower.
> 
> We need to investigate those parts separately.
> 
> Regarding Part 1:
> 
> # Point 1. The CPU slab was not considered in the sheaf capacity calculation
> 
> calculate_sheaf_capacity() does not take into account that the CPU slab
> was also cached per CPU. Shouldn't we add oo_objects(s->oo) to the existing
> calculation to cache a number of objects similar to the CPU slab + percpu
> partial slab list layers that SLUB previously had?
> 
> # Point 2. SLUB no longer relies on the "slabs are half-full" assumption,
> # and that probably means we're caching fewer objects per CPU.
> 
> Because SLUB previously assumed "slabs are half-full" when calculating
> the number of slabs to cache per CPU, it could actually cache twice
> as many objects as intended when slabs are mostly empty.
> 
> Because sheaves track the number of objects precisely, that inaccuracy
> is gone. If the workload was previously benefiting from the inaccuracy,
> sheaves can make CPUs cache a smaller number of objects per CPU compared
> to the percpu slab caching layer.
> 
> Anyway, I guess we need to check how many objects are actually
> cached per CPU w/ and w/o sheaves, during the benchmark.

In the workload `fio/t/io_uring -p0 -n 16 -r 20 /dev/ublkb0`, queue depth
is 128, so there should be 128 inflight bios on these 16 tasks/cpus.
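
For comparison with that queue depth, here is a rough sketch of the
arithmetic behind Harry's Point 2. It assumes the pre-sheaves per-CPU
partial limit was computed as DIV_ROUND_UP(nr_objects * 2,
oo_objects(s->oo)), i.e. under the "slabs are half-full" assumption. The
input numbers (a target of 120 objects, 32 objects per slab) are
hypothetical and were not measured for the bio cache.

```python
def cpu_partial_slabs(nr_objects, objs_per_slab):
    """Slab limit under the half-full assumption: ceil(2 * nr_objects / objs_per_slab)."""
    return -(-(nr_objects * 2) // objs_per_slab)


def max_cached_objects(nr_objects, objs_per_slab):
    """Upper bound on free objects if every cached slab were entirely free."""
    return cpu_partial_slabs(nr_objects, objs_per_slab) * objs_per_slab


# Hypothetical inputs, chosen only to show the effect.
target = 120
per_slab = 32

slabs = cpu_partial_slabs(target, per_slab)
upper = max_cached_objects(target, per_slab)
print(slabs, upper, round(upper / target, 2))   # 8 256 2.13
```

With these made-up numbers, mostly-free slabs could approach roughly twice
the intended count, while a limit that counts objects exactly would stop at
the target. The real per-CPU numbers still need to be measured.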

> 
> After making sure the number of objects cached per CPU is the same as
> before, we could further investigate how much Part 2 plays into it.
> 
> Slightly off-topic, by the way, slab currently doesn't let system admins
> set custom sheaf_capacity. Instead, calculate_sheaf_capacity() sets
> the default capacity. I think we need to allow sys admins to set a custom
> sheaf_capacity in the very near future.
> 
> > Analysis
> > ========
> > 
> > The ublk null target workload exposes a cross-CPU slab allocation
> > pattern: bios are allocated on the io_uring submitter CPU during block
> > layer submission, but freed on a different CPU — the ublk daemon thread
> > that runs the completion via io_uring_cmd_complete_in_task() task work.
> > The completion CPU stays in the same LLC or NUMA node as the submission CPU.
> 
> Ok, so a submitter CPU keeps allocating objects, while a completion CPU
> keeps freeing objects.

Yes.
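
A toy model of that pattern, to make the slow-path pressure concrete. This
is not how SLUB or sheaves are implemented: it assumes a single
fixed-capacity per-CPU cache on each CPU, a refill of a whole batch when
the cache is empty, and a flush of a whole batch when it is full. The
class and function names and the capacity of 32 are invented for
illustration.

```python
class PerCpuCache:
    """Toy per-CPU object cache that only counts slow-path events."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.count = 0
        self.slow_allocs = 0
        self.slow_frees = 0

    def alloc(self):
        if self.count == 0:
            # Empty cache: go to the shared pool and refill a full batch.
            self.slow_allocs += 1
            self.count = self.capacity
        self.count -= 1

    def free(self):
        if self.count == self.capacity:
            # Full cache: flush the whole batch back to the shared pool.
            self.slow_frees += 1
            self.count = 0
        self.count += 1


def run(ops, capacity, cross_cpu):
    """Alternate one alloc and one free; return (slow allocs, slow frees)."""
    submitter = PerCpuCache(capacity)
    completer = PerCpuCache(capacity) if cross_cpu else submitter
    for _ in range(ops):
        submitter.alloc()
        completer.free()
    return submitter.slow_allocs, completer.slow_frees


print(run(1024, 32, cross_cpu=False))   # (1, 0)
print(run(1024, 32, cross_cpu=True))    # (32, 31)
```

In this model the same-CPU case touches the shared pool once, while the
cross-CPU case does so once per batch on both sides, so the slow-path rate
scales inversely with the per-CPU capacity.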


Thanks, 
Ming
