From: Zhao Liu <zhao1.liu@intel.com>
To: Vlastimil Babka <vbabka@suse.cz>, Hao Li <haolee.swjtu@gmail.com>
Cc: akpm@linux-foundation.org, harry.yoo@oracle.com, cl@gentwo.org,
rientjes@google.com, roman.gushchin@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
tim.c.chen@intel.com, yu.c.chen@intel.com, zhao1.liu@intel.com
Subject: Re: [PATCH v2] slub: keep empty main sheaf as spare in __pcs_replace_empty_main()
Date: Thu, 15 Jan 2026 18:12:44 +0800
Message-ID: <aWi9nAbIkTfYFoMM@intel.com>
In-Reply-To: <a231264a-2da5-4468-a276-777fc0241246@suse.cz>
Hi Babka & Hao,
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>  	 */
>  
>  	if (pcs->main->size == 0) {
> -		barn_put_empty_sheaf(barn, pcs->main);
> +		if (!pcs->spare) {
> +			pcs->spare = pcs->main;
> +		} else {
> +			barn_put_empty_sheaf(barn, pcs->main);
> +		}
>  		pcs->main = full;
>  		return pcs;
>  	}
I noticed the previous lkp regression report and tested this fix:
* will-it-scale.per_process_ops
Compared with v6.19-rc4 (f0b9d8eb98df), with this fix, I have these
results:
nr_tasks      Delta
1           + 3.593%
8           + 3.094%
64          +60.247%
128         +49.344%
192         +27.500%
256         -12.077%
For the cases with nr_tasks 1-192, there are improvements. I think
this is expected, since keeping a pre-cached spare sheaf reduces spinlock
contention: there are fewer calls to barn_put_empty_sheaf() and
barn_get_empty_sheaf().
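
To make that reasoning concrete, here is a minimal userspace sketch of the
two variants of the replace path. It is only a toy model: the toy_* types
and the lock_acquisitions counter are hypothetical stand-ins (only the
main/spare field names come from the diff), and it assumes that each
barn_put_empty_sheaf() call costs one barn lock round trip.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures, not the real ones. */
struct toy_sheaf { unsigned int size; };

struct toy_pcs {
	struct toy_sheaf *main;
	struct toy_sheaf *spare;
};

struct toy_barn {
	unsigned long lock_acquisitions;	/* assumed: one per barn call */
};

/* Models barn_put_empty_sheaf(): handing a sheaf back takes the barn lock. */
static void toy_barn_put_empty(struct toy_barn *barn, struct toy_sheaf *sheaf)
{
	(void)sheaf;
	barn->lock_acquisitions++;
}

/* Behaviour before the fix: the empty main sheaf always goes to the barn. */
static void toy_replace_main_old(struct toy_barn *barn, struct toy_pcs *pcs,
				 struct toy_sheaf *full)
{
	toy_barn_put_empty(barn, pcs->main);
	pcs->main = full;
}

/* Behaviour with the fix: keep the empty main as spare if the slot is free. */
static void toy_replace_main_new(struct toy_barn *barn, struct toy_pcs *pcs,
				 struct toy_sheaf *full)
{
	if (!pcs->spare)
		pcs->spare = pcs->main;
	else
		toy_barn_put_empty(barn, pcs->main);
	pcs->main = full;
}
```

In this model the patched path touches the barn only when a spare is
already cached, which is the case the fix is meant to make rarer.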
So (maybe too late),
Tested-by: Zhao Liu <zhao1.liu@intel.com>
But I found two more questions that might need consideration.
# Question 1: Regression for 256 tasks
For the above test - the case with nr_tasks: 256, there's a "slight"
regression. I did more testing:
(This is a single-round test; the 256-tasks data has jitter.)
nr_tasks      Delta
244           0.308%
248          -0.805%
252          12.070%
256         -11.441%
258           2.070%
260           1.252%
264           2.369%
268         -11.479%
272           2.130%
292           8.714%
296          10.905%
298          17.196%
300          11.783%
302           6.620%
304           3.112%
308          -5.924%
It can be seen that most cases show improvement, though a few may
experience slight regression.
Based on the configuration of my machine:
GNR - 2 sockets with the following NUMA topology:
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-42,172-214
NUMA node1 CPU(s): 43-85,215-257
NUMA node2 CPU(s): 86-128,258-300
NUMA node3 CPU(s): 129-171,301-343
Since I pin each task to its own CPU, the 256-task case is roughly
the point at which Node 0 and Node 1 are completely filled.
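
As a rough check of that occupancy claim, the sketch below counts how many
CPUs of each node are busy for a given nr_tasks. The CPU ranges are copied
from the listing above; the pinning order is an assumption (task i pinned
to CPU i, which the test setup does not state), and the names cpu_range,
topo and busy_cpus_on_node are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_range { int first, last, node; };

/* CPU ranges copied from the NUMA listing above. */
static const struct cpu_range topo[] = {
	{   0,  42, 0 }, { 172, 214, 0 },
	{  43,  85, 1 }, { 215, 257, 1 },
	{  86, 128, 2 }, { 258, 300, 2 },
	{ 129, 171, 3 }, { 301, 343, 3 },
};

/*
 * Count the CPUs of @node among CPU ids 0..nr_tasks-1.
 * ASSUMPTION: task i is pinned to CPU i.
 */
static int busy_cpus_on_node(int nr_tasks, int node)
{
	int busy = 0;

	for (size_t i = 0; i < sizeof(topo) / sizeof(topo[0]); i++) {
		int last;

		if (topo[i].node != node)
			continue;
		/* Clamp the range to the highest occupied CPU id. */
		last = topo[i].last < nr_tasks - 1 ? topo[i].last : nr_tasks - 1;
		if (last >= topo[i].first)
			busy += last - topo[i].first + 1;
	}
	return busy;
}
```

Under that assumption, 256 tasks occupy 86 of 86 CPUs on node 0 and 84 of
86 on node 1, while nodes 2 and 3 have 43 of 86 busy each.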
The following is the perf data comparing 2 tests w/o fix & with this fix:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ....................... ....................................
#
61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
0.93% -0.32% [kernel.vmlinux] [k] __slab_free
0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
0.26% -0.07% [kernel.vmlinux] [k] down_write
0.53% -0.06% libc.so.6 [.] __mmap
0.66% -0.06% [kernel.vmlinux] [k] mas_walk
0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
0.45% -0.06% [kernel.vmlinux] [k] mas_find
0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
0.41% -0.05% [kernel.vmlinux] [k] memcpy
0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
0.14% +0.04% [kernel.vmlinux] [k] __put_partials
0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
0.49% -0.04% libc.so.6 [.] __munmap
0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
0.27% -0.03% [kernel.vmlinux] [k] up_write
0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
I think the interesting item is "get_partial_node". It seems this fix
makes "get_partial_node" slightly more frequent. Hmm, however, I still
can't figure out why this is happening. Do you have any thoughts on it?
# Question 2: sheaf capacity
Back to the original commit that triggered the lkp regression. I did more
testing to check whether this fix could fully close the regression gap.
The baseline is commit 3accabda4 ("mm, vma: use percpu sheaves for
vm_area_struct cache"); its next commit, 59faa4da7cd4 ("maple_tree:
use percpu sheaves for maple_node_cache"), has the regression.
I compared v6.19-rc4 (f0b9d8eb98df) w/o the fix & with the fix against the
baseline:
nr_tasks    w/o fix     with fix
1           - 3.643%    - 0.181%
8           -12.523%    - 9.816%
64          -50.378%    -20.482%
128         -36.736%    - 5.518%
192         -22.963%    - 1.777%
256         -32.926%    -41.026%
It appears that under extreme conditions, the regression remains significant.
I remembered your suggestion about a larger capacity and did the following
testing:
nr_tasks  59faa4da7cd4  59faa4da7cd4     59faa4da7cd4   59faa4da7cd4    59faa4da7cd4
                        (with this fix)  (cap: 32->64)  (cap: 32->128)  (cap: 32->256)
1         -8.789%       -8.805%          -8.185%        -9.912%         -8.673%
8         -12.256%      -9.219%          -10.460%       -10.070%        -8.819%
64        -38.915%      -8.172%          -4.700%        4.571%          8.793%
128       -8.032%       11.377%          23.232%        26.940%         30.573%
192       -1.220%       9.758%           20.573%        22.645%         25.768%
256       -6.570%       9.967%           21.663%        30.103%         33.876%
Compared with the baseline (3accabda4), a larger capacity could
significantly improve the sheaves' scalability.
So, I'd like to know if you think dynamically or adaptively adjusting
capacity is a worthwhile idea.
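
As a back-of-the-envelope illustration of why capacity matters here, the
sketch below computes how many full sheaves one CPU consumes during a burst
of allocations. It is a deliberately crude model, not the kernel's
accounting: sheaves_needed() is a hypothetical helper, and it assumes a
pure allocation burst with no interleaved frees and no spare sheaf.

```c
#include <assert.h>

/*
 * Toy model: number of full sheaves of @capacity objects that a single CPU
 * goes through while allocating @nr_objects back to back. Every sheaf that
 * is not already cached on the CPU has to come from the barn, so this is a
 * rough proxy for barn lock traffic.
 * ASSUMPTION: no frees in between and no spare sheaf.
 */
static unsigned long sheaves_needed(unsigned long nr_objects,
				    unsigned int capacity)
{
	return (nr_objects + capacity - 1) / capacity;	/* round up */
}
```

In this model a burst of 4096 objects goes through 128 sheaves at capacity
32 but only 16 at capacity 256, so barn visits shrink in proportion to the
capacity.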
Thanks for your patience.
Regards,
Zhao