From: Ming Lei <ming.lei@redhat.com>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
Vlastimil Babka <vbabka@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org, Hao Li <hao.li@linux.dev>,
Christoph Hellwig <hch@infradead.org>
Subject: Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
Date: Wed, 11 Mar 2026 18:15:51 +0800
Message-ID: <abFA17ZmsW6RgcYI@fedora>
In-Reply-To: <abDA9UrJBT1wXh22@hyeyoo>
On Wed, Mar 11, 2026 at 10:10:13AM +0900, Harry Yoo wrote:
> On Fri, Mar 06, 2026 at 06:22:37PM +0800, Ming Lei wrote:
> > On Fri, Mar 06, 2026 at 09:47:27AM +0100, Vlastimil Babka (SUSE) wrote:
> > > On 3/6/26 05:55, Harry Yoo wrote:
> > > > On Thu, Feb 26, 2026 at 07:02:11PM +0100, Vlastimil Babka (SUSE) wrote:
> > > >> On 2/25/26 10:31, Ming Lei wrote:
> > > >> > Hi Vlastimil,
> > > >> >
> > > >> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> > > >> >> On 2/24/26 21:27, Vlastimil Babka wrote:
> > > >> >> >
> > > >> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
> > > >> >> > didn't anticipate this interaction with mempools. We could change them
> > > >> >> > but there might be others using a similar pattern. Maybe it would be for
> > > >> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
> > > >> >> > (but carefully as some deadlock avoidance depends on it, we might need
> > > >> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> > > >> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> > > >> >> Could you try this then, please? Thanks!
> > > >> >
> > > >> > Thanks for working on this issue!
> > > >> >
> > > >> > Unfortunately the patch doesn't make a difference on IOPS in the perf test,
> > > >> > follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
> > > >>
> > > >> what about this patch in addition to the previous one? Thanks.
> > > >>
> > > >> ----8<----
> > > >> From d3e8118c078996d1372a9f89285179d93971fdb2 Mon Sep 17 00:00:00 2001
> > > >> From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
> > > >> Date: Thu, 26 Feb 2026 18:59:56 +0100
> > > >> Subject: [PATCH] mm/slab: put barn on every online node
> > > >>
> > > >> Including memoryless nodes.
> > > >>
> > > >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > > >> ---
> > > >
> > > > Just taking a quick grasp...
> > > >
> > > >> @@ -6121,7 +6122,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> > > >> if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > > >> return;
> > > >>
> > > >> - if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> > > >> + if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
> > > >> + || !node_isset(slab_nid(slab), slab_nodes))
> > > >
> > > > I think you intended !node_isset(numa_mem_id(), slab_nodes)?
> > > >
> > > > "Skip freeing to pcs if it's remote free, but memoryless nodes is
> > > > an exception".
> > >
> > > Indeed, thanks! Ming, could you retry with that fixed up please?
> >
> > After applying the following change, IOPS is ~25M:
> >
> > - delta change on the two patches
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 085fe49eec68..56fe8bd956c0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -6142,7 +6142,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> > return;
> >
> > if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
> > - || !node_isset(slab_nid(slab), slab_nodes))
> > + || !node_isset(numa_mem_id(), slab_nodes))
> > && likely(!slab_test_pfmemalloc(slab))) {
> > if (likely(free_to_pcs(s, object, true)))
> > return;
> >
>
> Hi Ming, thanks a lot for helping testing!
>
> The stats look quite fine to me, but we're still seeing suboptimal IOPS.
>
> > - slab stat on patched `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`
>
> Does that not include Vlastimil's fb1091febd66 ("mm/slab: allow sheaf
> refill if blocking is not allowed")?
No, because fb1091febd66 isn't included in `815c8e35511d Merge branch
'slab/for-7.0/sheaves'`.
>
> Next time when testing it, could you please test on top of 7.0-rc3 w/
> the memoryless node patch (w/ the delta above) applied?
IOPS is the same between `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`
and 7.0-rc3 with the two patches.

IMO, it should be easier to compare & investigate by focusing on
815c8e35511d, given there are only 41 patches between v6.19-rc5 and
commit 815c8e35511d.
>
> Also, let us check a few things...
>
> 1) Does bumping up sheaf capacity change the slab stats & IOPS?
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 0c906fefc31b..5207279417e2 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -7611,13 +7611,13 @@ static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
> * should result in similar lock contention (barn or list_lock)
> */
> if (s->size >= PAGE_SIZE)
> - capacity = 4;
> + capacity = 6;
> else if (s->size >= 1024)
> - capacity = 12;
> + capacity = 24;
> else if (s->size >= 256)
> - capacity = 26;
> + capacity = 52;
> else
> - capacity = 60;
> + capacity = 120;
>
> /* Increment capacity to make sheaf exactly a kmalloc size bucket */
> size = struct_size_t(struct slab_sheaf, objects, capacity);
IOPS increases from 24M to 29M with this patch, on top of 7.0-rc3 with
Vlastimil's patchset from today.
>
> 2) Is there any change in NUMA locality between v6.19 vs. v7.0-rc3 (patched)?
> (e.g., measured via
> perf stat -e node-loads,node-load-misses,node-stores,node-store-misses)
root@tomsrv:~/temp/mm/7.0-rc3/patched# perf stat -a -e node-loads,node-load-misses,node-stores,node-store-misses
Error:
No supported events found.
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (node-loads).
"dmesg | grep -i perf" may provide additional information.
Looks like these events are not supported on this AMD Zen4 machine.
>
> 3) It's quite strange that blk_mq_sched_bio_merge() completely
> disappeared in v7.0-rc2 profile [1] . Is there any change
> in read/write io merge rate? (/proc/diskstats) between v6.19 and
> v7.0-rc3?
It isn't strange.

IOPS drops from 34M on v6.19-rc5 to 13M on v7.0-rc2, so
blk_mq_sched_bio_merge() no longer stands out in the profile, even though
that code path is run for each bio (IO).

The workload is totally random READ IO, so IO merging shouldn't happen.
Thanks,
Ming