public inbox for linux-kernel@vger.kernel.org
From: Harry Yoo <harry.yoo@oracle.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-block@vger.kernel.org, Hao Li <hao.li@linux.dev>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation
Date: Wed, 11 Mar 2026 10:10:13 +0900	[thread overview]
Message-ID: <abDA9UrJBT1wXh22@hyeyoo> (raw)
In-Reply-To: <aaqq7YmUcOht3GWH@fedora>

On Fri, Mar 06, 2026 at 06:22:37PM +0800, Ming Lei wrote:
> On Fri, Mar 06, 2026 at 09:47:27AM +0100, Vlastimil Babka (SUSE) wrote:
> > On 3/6/26 05:55, Harry Yoo wrote:
> > > On Thu, Feb 26, 2026 at 07:02:11PM +0100, Vlastimil Babka (SUSE) wrote:
> > >> On 2/25/26 10:31, Ming Lei wrote:
> > >> > Hi Vlastimil,
> > >> > 
> > >> > On Wed, Feb 25, 2026 at 09:45:03AM +0100, Vlastimil Babka (SUSE) wrote:
> > >> >> On 2/24/26 21:27, Vlastimil Babka wrote:
> > >> >> > 
> > >> >> > It made sense to me not to refill sheaves when we can't reclaim, but I
> > >> >> > didn't anticipate this interaction with mempools. We could change them
> > >> >> > but there might be others using a similar pattern. Maybe it would be for
> > >> >> > the best to just drop that heuristic from __pcs_replace_empty_main()
> > >> >> > (but carefully as some deadlock avoidance depends on it, we might need
> > >> >> > to e.g. replace it with gfpflags_allow_spinning()). I'll send a patch
> > >> >> > tomorrow to test this theory, unless someone beats me to it (feel free to).
> > >> >> Could you try this then, please? Thanks!
> > >> > 
> > >> > Thanks for working on this issue!
> > >> > 
> > >> > Unfortunately the patch doesn't make a difference on IOPS in the perf test,
> > >> > follows the collected perf profile on linus tree(basically 7.0-rc1 with your patch):
> > >> 
> > >> what about this patch in addition to the previous one? Thanks.
> > >> 
> > >> ----8<----
> > >> From d3e8118c078996d1372a9f89285179d93971fdb2 Mon Sep 17 00:00:00 2001
> > >> From: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
> > >> Date: Thu, 26 Feb 2026 18:59:56 +0100
> > >> Subject: [PATCH] mm/slab: put barn on every online node
> > >> 
> > >> Including memoryless nodes.
> > >> 
> > >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> > >> ---
> > > 
> > > Just taking a quick glance...
> > > 
> > >> @@ -6121,7 +6122,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> > >>  	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > >>  		return;
> > >>  
> > >> -	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
> > >> +	if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
> > >> +			|| !node_isset(slab_nid(slab), slab_nodes))
> > > 
> > > I think you intended !node_isset(numa_mem_id(), slab_nodes)?
> > > 
> > > "Skip freeing to pcs if it's remote free, but memoryless nodes is
> > >  an exception".
> > 
> > Indeed, thanks! Ming, could you retry with that fixed up please?
> 
> After applying the following change, IOPS is ~25M:
> 
> - delta change on the two patches
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 085fe49eec68..56fe8bd956c0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -6142,7 +6142,7 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>                 return;
>  
>         if (likely(!IS_ENABLED(CONFIG_NUMA) || (slab_nid(slab) == numa_mem_id())
> -                       || !node_isset(slab_nid(slab), slab_nodes))
> +                       || !node_isset(numa_mem_id(), slab_nodes))
>             && likely(!slab_test_pfmemalloc(slab))) {
>                 if (likely(free_to_pcs(s, object, true)))
>                         return;
>

Hi Ming, thanks a lot for helping with testing!

The stats look quite fine to me, but we're still seeing suboptimal IOPS.

> - slab stat on patched `815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next`

Does that not include Vlastimil's patch (fb1091febd66 "mm/slab: allow
sheaf refill if blocking is not allowed")?

Next time, could you please test on top of 7.0-rc3 with the memoryless
node patch (with the delta above) applied?

Also, let us check a few things...

1) Does bumping up sheaf capacity change the slab stats & IOPS?

diff --git a/mm/slub.c b/mm/slub.c
index 0c906fefc31b..5207279417e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -7611,13 +7611,13 @@ static unsigned int calculate_sheaf_capacity(struct kmem_cache *s,
 	 * should result in similar lock contention (barn or list_lock)
 	 */
 	if (s->size >= PAGE_SIZE)
-		capacity = 4;
+		capacity = 6;
 	else if (s->size >= 1024)
-		capacity = 12;
+		capacity = 24;
 	else if (s->size >= 256)
-		capacity = 26;
+		capacity = 52;
 	else
-		capacity = 60;
+		capacity = 120;
 
 	/* Increment capacity to make sheaf exactly a kmalloc size bucket */
 	size = struct_size_t(struct slab_sheaf, objects, capacity);

2) Is there any change in NUMA locality between v6.19 vs. v7.0-rc3 (patched)?
   (e.g., measured via
    perf stat -e node-loads,node-load-misses,node-stores,node-store-misses)

3) It's quite strange that blk_mq_sched_bio_merge() completely
   disappeared from the v7.0-rc2 profile [1]. Is there any change in the
   read/write I/O merge rate (/proc/diskstats) between v6.19 and
   v7.0-rc3?

[1] https://lore.kernel.org/linux-mm/aamluV66pLIdo66g@fedora

> # (cd /sys/kernel/slab/bio-256/ && find . -type f -exec grep -aH . {} \;)
> ./remote_node_defrag_ratio:100
> ./total_objects:7395 N1=3876 N5=3519
> ./alloc_fastpath:507619662 C0=70 C1=27608632 C3=28990301 C5=35098386 C6=9 C7=35782152 C8=115 C9=31757274 C10=32 C11=30087065 C12=34 C13=31615065 C14=7 C15=31798233 C17=30695955 C18=128 C19=32204853 C20=64 C21=36842392 C23=36212376 C25=30013640 C27=29055001 C29=29990232 C30=48 C31=29867595 C36=2 C50=1
> ./cpu_slabs:0
> ./objects:7232 N1=3816 N5=3416
> ./sheaf_return_slow:0
> ./objects_partial:500 N1=195 N5=305
> ./sheaf_return_fast:0
> ./cpu_partial:0
> ./free_slowpath:20 C4=20
> ./barn_get_fail:260 C1=6 C3=26 C5=26 C7=7 C9=5 C10=2 C11=26 C12=2 C13=10 C14=1 C15=19 C17=8 C18=5 C19=19 C20=1 C21=9 C23=22 C25=11 C27=21 C29=26 C31=6 C36=1 C50=1
> ./sheaf_prefill_oversize:0
> ./skip_kfence:0
> ./min_partial:5
> ./order_fallback:0
> ./sheaf_capacity:28
> ./sheaf_flush:28 C24=28
> ./free_rcu_sheaf:0
> ./sheaf_alloc:178 C0=4 C2=9 C3=1 C4=9 C5=65 C6=4 C8=5 C10=8 C11=1 C12=4 C13=1 C14=8 C15=1 C16=5 C18=8 C19=1 C20=3 C22=10 C23=1 C24=5 C25=1 C26=7 C27=1 C28=10 C29=1 C30=2 C31=1 C36=1 C50=1
> ./sheaf_free:0
> ./sheaf_prefill_slow:0
> ./sheaf_prefill_fast:0
> ./poison:0
> ./red_zone:0
> ./free_slab:0
> ./slabs:145 N1=76 N5=69
> ./barn_get:18129029 C0=3 C1=986017 C3=1035342 C5=1253488 C6=1 C7=1277927 C8=5 C9=1134184 C11=1074513 C13=1129100 C15=1135633 C17=1096277 C19=1150155 C20=2 C21=1315791 C23=1293278 C25=1071905 C27=1037658 C29=1071054 C30=2 C31=1066694
> ./alloc_slowpath:0
> ./destroy_by_rcu:1
> ./free_rcu_sheaf_fail:0
> ./barn_put:18129105 C0=986015 C2=1035357 C4=1253502 C6=1277924 C8=1134182 C10=1074529 C12=1129101 C14=1135641 C16=1096273 C18=1150168 C20=1315792 C22=1293288 C24=1071905 C26=1037668 C28=1071069 C30=1066691
> ./usersize:0
> ./sanity_checks:0
> ./barn_put_fail:1 C24=1
> ./align:64
> ./alloc_node_mismatch:0
> ./alloc_slab:145 C1=3 C3=19 C5=6 C7=3 C9=3 C10=2 C11=18 C12=2 C13=6 C14=1 C15=12 C17=8 C18=3 C19=12 C21=2 C23=5 C25=7 C27=12 C29=15 C31=4 C36=1 C50=1
> ./free_remove_partial:0
> ./aliases:0
> ./store_user:0
> ./trace:0
> ./reclaim_account:0
> ./order:2
> ./sheaf_refill:7280 C1=168 C3=728 C5=728 C7=196 C9=140 C10=56 C11=728 C12=56 C13=280 C14=28 C15=532 C17=224 C18=140 C19=532 C20=28 C21=252 C23=616 C25=308 C27=588 C29=728 C31=168 C36=28 C50=28
> ./object_size:256
> ./free_fastpath:507615526 C0=27608438 C2=28990052 C4=35098103 C6=35781903 C8=31757101 C10=30086841 C12=31614841 C14=31797983 C16=30695700 C18=32204722 C19=1 C20=36842201 C22=36212117 C24=30013416 C26=29054742 C28=29989974 C30=29867383 C31=4 C39=2 C47=2
> ./hwcache_align:1
> ./cmpxchg_double_fail:0
> ./objs_per_slab:51
> ./partial:13 N1=5 N5=8
> ./slabs_cpu_partial:0(0)
> ./free_add_partial:117 C1=3 C3=7 C5=19 C7=4 C9=2 C11=8 C13=4 C15=7 C18=2 C19=7 C20=1 C21=7 C23=17 C24=3 C25=4 C27=9 C29=11 C31=2
> ./slab_size:320
> ./cache_dma:0
> 
> 
> Thanks,
> Ming
> 

-- 
Cheers,
Harry / Hyeonggon


Thread overview: 40+ messages
2026-02-24  2:52 [Regression] mm:slab/sheaves: severe performance regression in cross-CPU slab allocation Ming Lei
2026-02-24  5:00 ` Harry Yoo
2026-02-24  9:07   ` Ming Lei
2026-02-25  5:32     ` Hao Li
2026-02-25  6:54       ` Harry Yoo
2026-02-25  7:06         ` Hao Li
2026-02-25  7:19           ` Harry Yoo
2026-02-25  8:19             ` Hao Li
2026-02-25  8:41               ` Harry Yoo
2026-02-25  8:54                 ` Hao Li
2026-02-25  8:21             ` Harry Yoo
2026-02-24  6:51 ` Hao Li
2026-02-24  7:10   ` Harry Yoo
2026-02-24  7:41     ` Hao Li
2026-02-24 20:27 ` Vlastimil Babka
2026-02-25  5:24   ` Harry Yoo
2026-02-25  8:45   ` Vlastimil Babka (SUSE)
2026-02-25  9:31     ` Ming Lei
2026-02-25 11:29       ` Vlastimil Babka (SUSE)
2026-02-25 12:24         ` Ming Lei
2026-02-25 13:22           ` Vlastimil Babka (SUSE)
2026-02-26 18:02       ` Vlastimil Babka (SUSE)
2026-02-27  9:23         ` Ming Lei
2026-03-05 13:05           ` Vlastimil Babka (SUSE)
2026-03-05 15:48             ` Ming Lei
2026-03-06  1:01               ` Ming Lei
2026-03-06  4:17               ` Hao Li
2026-03-06  4:55         ` Harry Yoo
2026-03-06  8:32           ` Hao Li
2026-03-06  8:47           ` Vlastimil Babka (SUSE)
2026-03-06 10:22             ` Ming Lei
2026-03-11  1:10               ` Harry Yoo [this message]
2026-03-11 10:15                 ` Ming Lei
2026-03-11 10:43                   ` Ming Lei
2026-03-12  4:11                   ` Harry Yoo
2026-03-12 11:26 ` Hao Li
2026-03-12 11:56   ` Ming Lei
2026-03-12 12:13     ` Hao Li
2026-03-12 14:50       ` Ming Lei
2026-03-13  3:26         ` Hao Li
