Re: [PATCH v2] slub: keep empty main sheaf as spare in __pcs_replace_empty_main()

From: Zhao Liu <zhao1.liu@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>, Hao Li <haolee.swjtu@gmail.com>
Cc: akpm@linux-foundation.org, harry.yoo@oracle.com, cl@gentwo.org,
	rientjes@google.com, roman.gushchin@linux.dev,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	tim.c.chen@intel.com, yu.c.chen@intel.com, zhao1.liu@intel.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in __pcs_replace_empty_main()
Date: Thu, 15 Jan 2026 18:12:44 +0800
Message-ID: <aWi9nAbIkTfYFoMM@intel.com>
In-Reply-To: <a231264a-2da5-4468-a276-777fc0241246@suse.cz>

Hi Babka & Hao,

> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>          */
>  
>         if (pcs->main->size == 0) {
> -               barn_put_empty_sheaf(barn, pcs->main);
> +               if (!pcs->spare) {
> +                       pcs->spare = pcs->main;
> +               } else {
> +                       barn_put_empty_sheaf(barn, pcs->main);
> +               }
>                 pcs->main = full;
>                 return pcs;
>         }

I noticed the previous lkp regression report and tested this fix:

* will-it-scale.per_process_ops

Compared with v6.19-rc4 (f0b9d8eb98df), this fix gives the following
results:

nr_tasks   Delta
1          + 3.593%
8          + 3.094%
64         +60.247%
128        +49.344%
192        +27.500%
256        -12.077%

For nr_tasks 1-192 there are improvements. I think this is expected,
since keeping the empty main sheaf as the spare reduces contention on
the barn spinlock: fewer barn_put_empty_sheaf() and
barn_get_empty_sheaf() calls.

So (maybe too late),

Tested-by: Zhao Liu <zhao1.liu@intel.com>



But I have two more questions that might need consideration.

# Question 1: Regression for 256 tasks

In the above test, the case with nr_tasks: 256 shows a "slight"
regression. I did more testing around that point:

(These are single-round tests; the 256-task data has jitter.)

nr_tasks   Delta
244        + 0.308%
248        - 0.805%
252        +12.070%
256        -11.441%
258        + 2.070%
260        + 1.252%
264        + 2.369%
268        -11.479%
272        + 2.130%
292        + 8.714%
296        +10.905%
298        +17.196%
300        +11.783%
302        + 6.620%
304        + 3.112%
308        - 5.924%

It can be seen that most cases show improvement, though a few may
experience slight regression.

Based on the configuration of my machine:

    GNR - 2 sockets with the following NUMA topology:

    NUMA:
      NUMA node(s):              4
      NUMA node0 CPU(s):         0-42,172-214
      NUMA node1 CPU(s):         43-85,215-257
      NUMA node2 CPU(s):         86-128,258-300
      NUMA node3 CPU(s):         129-171,301-343

Since I set the CPU affinity per core, the 256-task case is roughly
the point where Node 0 and Node 1 are filled.

The following is the perf diff data comparing the two runs, w/o the fix and with this fix:

# Baseline  Delta Abs  Shared Object            Symbol
# ........  .........  .......................  ....................................
#
    61.76%     +4.78%  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
     0.93%     -0.32%  [kernel.vmlinux]         [k] __slab_free
     0.39%     -0.31%  [kernel.vmlinux]         [k] barn_get_empty_sheaf
     1.35%     -0.30%  [kernel.vmlinux]         [k] mas_leaf_max_gap
     3.22%     -0.30%  [kernel.vmlinux]         [k] __kmem_cache_alloc_bulk
     1.73%     -0.20%  [kernel.vmlinux]         [k] __cond_resched
     0.52%     -0.19%  [kernel.vmlinux]         [k] _raw_spin_lock_irqsave
     0.92%     +0.18%  [kernel.vmlinux]         [k] _raw_spin_lock
     1.91%     -0.15%  [kernel.vmlinux]         [k] zap_pmd_range.isra.0
     1.37%     -0.13%  [kernel.vmlinux]         [k] mas_wr_node_store
     1.29%     -0.12%  [kernel.vmlinux]         [k] free_pud_range
     0.92%     -0.11%  [kernel.vmlinux]         [k] __mmap_region
     0.12%     -0.11%  [kernel.vmlinux]         [k] barn_put_empty_sheaf
     0.20%     -0.09%  [kernel.vmlinux]         [k] barn_replace_empty_sheaf
     0.31%     +0.09%  [kernel.vmlinux]         [k] get_partial_node
     0.29%     -0.07%  [kernel.vmlinux]         [k] __rcu_free_sheaf_prepare
     0.12%     -0.07%  [kernel.vmlinux]         [k] intel_idle_xstate
     0.21%     -0.07%  [kernel.vmlinux]         [k] __kfree_rcu_sheaf
     0.26%     -0.07%  [kernel.vmlinux]         [k] down_write
     0.53%     -0.06%  libc.so.6                [.] __mmap
     0.66%     -0.06%  [kernel.vmlinux]         [k] mas_walk
     0.48%     -0.06%  [kernel.vmlinux]         [k] mas_prev_slot
     0.45%     -0.06%  [kernel.vmlinux]         [k] mas_find
     0.38%     -0.06%  [kernel.vmlinux]         [k] mas_wr_store_type
     0.23%     -0.06%  [kernel.vmlinux]         [k] do_vmi_align_munmap
     0.21%     -0.05%  [kernel.vmlinux]         [k] perf_event_mmap_event
     0.32%     -0.05%  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
     0.19%     -0.05%  [kernel.vmlinux]         [k] downgrade_write
     0.59%     -0.05%  [kernel.vmlinux]         [k] mas_next_slot
     0.31%     -0.05%  [kernel.vmlinux]         [k] __mmap_new_vma
     0.44%     -0.05%  [kernel.vmlinux]         [k] kmem_cache_alloc_noprof
     0.28%     -0.05%  [kernel.vmlinux]         [k] __vma_enter_locked
     0.41%     -0.05%  [kernel.vmlinux]         [k] memcpy
     0.48%     -0.04%  [kernel.vmlinux]         [k] mas_store_gfp
     0.14%     +0.04%  [kernel.vmlinux]         [k] __put_partials
     0.19%     -0.04%  [kernel.vmlinux]         [k] mas_empty_area_rev
     0.30%     -0.04%  [kernel.vmlinux]         [k] do_syscall_64
     0.25%     -0.04%  [kernel.vmlinux]         [k] mas_preallocate
     0.15%     -0.04%  [kernel.vmlinux]         [k] rcu_free_sheaf
     0.22%     -0.04%  [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.49%     -0.04%  libc.so.6                [.] __munmap
     0.91%     -0.04%  [kernel.vmlinux]         [k] rcu_all_qs
     0.21%     -0.04%  [kernel.vmlinux]         [k] __vm_munmap
     0.24%     -0.04%  [kernel.vmlinux]         [k] mas_store_prealloc
     0.19%     -0.04%  [kernel.vmlinux]         [k] __kmalloc_cache_noprof
     0.34%     -0.04%  [kernel.vmlinux]         [k] build_detached_freelist
     0.19%     -0.03%  [kernel.vmlinux]         [k] vms_complete_munmap_vmas
     0.36%     -0.03%  [kernel.vmlinux]         [k] mas_rev_awalk
     0.05%     -0.03%  [kernel.vmlinux]         [k] shuffle_freelist
     0.19%     -0.03%  [kernel.vmlinux]         [k] down_write_killable
     0.19%     -0.03%  [kernel.vmlinux]         [k] kmem_cache_free
     0.27%     -0.03%  [kernel.vmlinux]         [k] up_write
     0.13%     -0.03%  [kernel.vmlinux]         [k] vm_area_alloc
     0.18%     -0.03%  [kernel.vmlinux]         [k] arch_get_unmapped_area_topdown
     0.08%     -0.03%  [kernel.vmlinux]         [k] userfaultfd_unmap_complete
     0.10%     -0.03%  [kernel.vmlinux]         [k] tlb_gather_mmu
     0.30%     -0.02%  [kernel.vmlinux]         [k] ___slab_alloc

I think the interesting item is "get_partial_node". It seems this fix
makes "get_partial_node" slightly more frequent. Hmm, however, I still
can't figure out why this is happening. Do you have any thoughts on it?

# Question 2: sheaf capacity

Back to the original commit which triggered the lkp regression: I did
more testing to check whether this fix could fully close the regression
gap.

The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
vm_area_struct cache"), and the next commit, 59faa4da7cd4 ("maple_tree:
use percpu sheaves for maple_node_cache"), has the regression.

I compared v6.19-rc4 (f0b9d8eb98df) w/o fix & with fix against the
baseline:

nr_tasks   w/o fix     with fix
1          - 3.643%    - 0.181%
8          -12.523%    - 9.816%
64         -50.378%    -20.482%
128        -36.736%    - 5.518%
192        -22.963%    - 1.777%
256        -32.926%    -41.026%

It appears that under extreme conditions the regression remains
significant. I remembered your suggestion about a larger capacity and
did the following testing:

nr_tasks   59faa4da7cd4   59faa4da7cd4      59faa4da7cd4    59faa4da7cd4     59faa4da7cd4
                          (with this fix)   (cap: 32->64)   (cap: 32->128)   (cap: 32->256)
1          - 8.789%       - 8.805%          - 8.185%        - 9.912%         - 8.673%
8          -12.256%       - 9.219%          -10.460%        -10.070%         - 8.819%
64         -38.915%       - 8.172%          - 4.700%        + 4.571%         + 8.793%
128        - 8.032%       +11.377%          +23.232%        +26.940%         +30.573%
192        - 1.220%       + 9.758%          +20.573%        +22.645%         +25.768%
256        - 6.570%       + 9.967%          +21.663%        +30.103%         +33.876%

Compared with the baseline (3accabda4), a larger capacity could
significantly improve the sheaves' scalability.

So, I'd like to know if you think dynamically or adaptively adjusting
capacity is a worthwhile idea.

Thanks for your patience.

Regards,
Zhao

