* [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
@ 2026-01-23 6:52 Vlastimil Babka
2026-01-23 6:52 ` [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Vlastimil Babka @ 2026-01-23 6:52 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
Vlastimil Babka, kernel test robot, stable, Paul E. McKenney
Percpu sheaves caching was introduced as opt-in but the goal was to
eventually move all caches to them. This is the next step, enabling
sheaves for all caches (except the two bootstrap ones) and then removing
the per cpu (partial) slabs and lots of associated code.
Besides (hopefully) improved performance, this removes the rather
complicated code related to the lockless fastpaths (using
this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
kmalloc_nolock().
The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
without repeating the "alien" array flushing of SLUB, and to allow
flushing objects from sheaves to slabs mostly without the node
list_lock.
Sending this v4 because various changes accumulated in the branch due to
review and -next exposure (see the list below). Thanks for all the
reviews!
Git branch for the v4
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=sheaves-for-all-v4
Which is a snapshot of:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/sheaves-for-all
Based on:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-7.0/sheaves-base
- includes a sheaves optimization that seemed minor, but an lkp test robot
result showed significant improvements:
https://lore.kernel.org/all/202512291555.56ce2e53-lkp@intel.com/
(could be an uncommon corner case workload though)
- includes the kmalloc_nolock() fix commit a4ae75d1b6a2 that is undone
as part of this series
Significant (but not critical) remaining TODOs:
- Integration of rcu sheaves handling with kfree_rcu batching.
- Currently the kfree_rcu batching is almost completely bypassed. I'm
thinking it could be adjusted to handle rcu sheaves in addition to
individual objects, to get the best of both.
- Performance evaluation. Petr Tesarik has been doing that on the RFC
with some promising results (thanks!) and also found a memory leak.
Note that, like many things, this caching scheme change is a tradeoff, as
summarized by Christoph:
https://lore.kernel.org/all/f7c33974-e520-387e-9e2f-1e523bfe1545@gentwo.org/
- Objects allocated from sheaves should have better temporal locality
(likely recently freed, thus cache hot) but worse spatial locality
(likely from many different slabs, increasing memory usage and
possibly TLB pressure on kernel's direct map).
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Changes in v4:
- Fix up both missing and spurious r-b tags from v3, and add new ones
(big thanks to Hao Li, Harry, and Suren!)
- Fix infinite recursion with kmemleak (Breno Leitao)
- Use cache_has_sheaves() in pcs_destroy() (Suren)
- Use cache_has_sheaves() in kvfree_rcu_barrier_on_cache() (Hao Li)
- Bypass sheaf for remote object free also in kfree_nolock() (Harry)
- WRITE_ONCE slab->counters in __update_freelist_slow() so
get_partial_node_bulk() can stop being paranoid (Harry)
- Tweak conditions in alloc_from_new_slab() (Hao Li, Suren)
- Rename get_partial*() functions to get_from_partial*() (Suren)
- Rename variable freelist to object in ___slab_alloc() (Suren)
- Separate struct partial_bulk_context instead of extending.
- Rename flush_cpu_slab() to flush_cpu_sheaves() (Hao Li)
- Add "mm/slab: fix false lockdep warning in __kfree_rcu_sheaf()" from
Harry.
- Add counting of FREE_SLOWPATH stat to some missing places (Suren, Hao
Li)
- Link to v3: https://patch.msgid.link/20260116-sheaves-for-all-v3-0-5595cb000772@suse.cz
Changes in v3:
- Rebase to current slab/for-7.0/sheaves which itself is rebased to
slab/for-next-fixes to include commit a4ae75d1b6a2 ("slab: fix
kmalloc_nolock() context check for PREEMPT_RT")
- Revert a4ae75d1b6a2 as part of "slab: simplify kmalloc_nolock()" as
it's no longer necessary.
- Add cache_has_sheaves() helper to test for s->sheaf_capacity, use it
in more places instead of s->cpu_sheaves tests that were missed
(Hao Li)
- Fix a bug where kmalloc_nolock() could end up trying to allocate empty
sheaf (not compatible with !allow_spin) in __pcs_replace_full_main()
(Hao Li)
- Fix missing inc_slabs_node() in ___slab_alloc() ->
alloc_from_new_slab() path. (Hao Li)
- Also a bug where refill_objects() -> alloc_from_new_slab ->
free_new_slab_nolock() (previously defer_deactivate_slab()) would
do inc_slabs_node() without matching dec_slabs_node()
- Make __free_slab call free_frozen_pages_nolock() when !allow_spin.
This was correct in the first RFC. (Hao Li)
- Add patch to make SLAB_CONSISTENCY_CHECKS prevent merging.
- Add tags from several people (thanks!)
- Fix checkpatch warnings.
- Link to v2: https://patch.msgid.link/20260112-sheaves-for-all-v2-0-98225cfb50cf@suse.cz
Changes in v2:
- Rebased to v6.19-rc1+slab.git slab/for-7.0/sheaves
- Some of the preliminary patches from the RFC went in there.
- Incorporate feedback/reports from many people (thanks!), including:
- Make caches with sheaves mergeable.
- Fix a major memory leak.
- Cleanup of stat items.
- Link to v1: https://patch.msgid.link/20251023-sheaves-for-all-v1-0-6ffa2c9941c0@suse.cz
---
Harry Yoo (1):
mm/slab: fix false lockdep warning in __kfree_rcu_sheaf()
Vlastimil Babka (21):
mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()
slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE
mm/slab: move and refactor __kmem_cache_alias()
mm/slab: make caches with sheaves mergeable
slab: add sheaves to most caches
slab: introduce percpu sheaves bootstrap
slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
slab: handle kmalloc sheaves bootstrap
slab: add optimized sheaf refill from partial list
slab: remove cpu (partial) slabs usage from allocation paths
slab: remove SLUB_CPU_PARTIAL
slab: remove the do_slab_free() fastpath
slab: remove defer_deactivate_slab()
slab: simplify kmalloc_nolock()
slab: remove struct kmem_cache_cpu
slab: remove unused PREEMPT_RT specific macros
slab: refill sheaves from all nodes
slab: update overview comments
slab: remove frozen slab checks from __slab_free()
mm/slub: remove DEACTIVATE_TO_* stat items
mm/slub: cleanup and repurpose some stat items
include/linux/slab.h | 6 -
mm/Kconfig | 11 -
mm/internal.h | 1 +
mm/page_alloc.c | 5 +
mm/slab.h | 65 +-
mm/slab_common.c | 61 +-
mm/slub.c | 2689 ++++++++++++++++++--------------------------------
7 files changed, 1031 insertions(+), 1807 deletions(-)
---
base-commit: a66f9c0f1ba2dd05fa994c800ebc63f265155f91
change-id: 20251002-sheaves-for-all-86ac13dc47a5
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()
2026-01-23 6:52 [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
@ 2026-01-23 6:52 ` Vlastimil Babka
2026-01-27 16:08 ` Liam R. Howlett
2026-01-29 15:18 ` [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Hao Li
2026-03-26 12:43 ` [REGRESSION] " Aishwarya Rambhadran
2 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2026-01-23 6:52 UTC (permalink / raw)
To: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
Vlastimil Babka, kernel test robot, stable
After we submit the rcu_free sheaves to call_rcu() we need to make sure
the rcu callbacks complete. kvfree_rcu_barrier() does that via
flush_all_rcu_sheaves() but kvfree_rcu_barrier_on_cache() doesn't. Fix
that.
This currently causes no issues because the caches with sheaves we have
are never destroyed. The problem flagged by kernel test robot was
reported for a patch that enables sheaves for (almost) all caches, and
occurred only with CONFIG_KASAN. Harry Yoo found the root cause [1]:
It turns out the object freed by sheaf_flush_unused() was in KASAN
percpu quarantine list (confirmed by dumping the list) by the time
__kmem_cache_shutdown() returns an error.
Quarantined objects are supposed to be flushed by kasan_cache_shutdown(),
but things go wrong if the rcu callback (rcu_free_sheaf_nobarn()) is
processed after kasan_cache_shutdown() finishes.
That's why rcu_barrier() in __kmem_cache_shutdown() didn't help,
because it's called after kasan_cache_shutdown().
Calling rcu_barrier() in kvfree_rcu_barrier_on_cache() guarantees
that it'll be added to the quarantine list before kasan_cache_shutdown()
is called. So it's a valid fix!
[1] https://lore.kernel.org/all/aWd6f3jERlrB5yeF@hyeyoo/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202601121442.c530bed3-lkp@intel.com
Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
Cc: stable@vger.kernel.org
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Tested-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slab_common.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index eed7ea556cb1..ee994ec7f251 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -2133,8 +2133,11 @@ EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
*/
void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
{
- if (s->cpu_sheaves)
+ if (s->cpu_sheaves) {
flush_rcu_sheaves_on_cache(s);
+ rcu_barrier();
+ }
+
/*
* TODO: Introduce a version of __kvfree_rcu_barrier() that works
* on a specific slab cache.
--
2.52.0
* Re: [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()
2026-01-23 6:52 ` [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
@ 2026-01-27 16:08 ` Liam R. Howlett
0 siblings, 0 replies; 15+ messages in thread
From: Liam R. Howlett @ 2026-01-27 16:08 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Hao Li, Andrew Morton, Uladzislau Rezki,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable
* Vlastimil Babka <vbabka@suse.cz> [260123 01:53]:
> After we submit the rcu_free sheaves to call_rcu() we need to make sure
> the rcu callbacks complete. kvfree_rcu_barrier() does that via
> flush_all_rcu_sheaves() but kvfree_rcu_barrier_on_cache() doesn't. Fix
> that.
>
> This currently causes no issues because the caches with sheaves we have
> are never destroyed. The problem flagged by kernel test robot was
> reported for a patch that enables sheaves for (almost) all caches, and
> occurred only with CONFIG_KASAN. Harry Yoo found the root cause [1]:
>
> It turns out the object freed by sheaf_flush_unused() was in KASAN
> percpu quarantine list (confirmed by dumping the list) by the time
> __kmem_cache_shutdown() returns an error.
>
> Quarantined objects are supposed to be flushed by kasan_cache_shutdown(),
> but things go wrong if the rcu callback (rcu_free_sheaf_nobarn()) is
> processed after kasan_cache_shutdown() finishes.
>
> That's why rcu_barrier() in __kmem_cache_shutdown() didn't help,
> because it's called after kasan_cache_shutdown().
>
> Calling rcu_barrier() in kvfree_rcu_barrier_on_cache() guarantees
> that it'll be added to the quarantine list before kasan_cache_shutdown()
> is called. So it's a valid fix!
>
> [1] https://lore.kernel.org/all/aWd6f3jERlrB5yeF@hyeyoo/
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202601121442.c530bed3-lkp@intel.com
> Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
> Cc: stable@vger.kernel.org
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Tested-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/slab_common.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index eed7ea556cb1..ee994ec7f251 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -2133,8 +2133,11 @@ EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
> */
> void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> {
> - if (s->cpu_sheaves)
> + if (s->cpu_sheaves) {
> flush_rcu_sheaves_on_cache(s);
> + rcu_barrier();
> + }
> +
> /*
> * TODO: Introduce a version of __kvfree_rcu_barrier() that works
> * on a specific slab cache.
>
> --
> 2.52.0
>
>
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-23 6:52 [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
2026-01-23 6:52 ` [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
@ 2026-01-29 15:18 ` Hao Li
2026-01-29 15:28 ` Vlastimil Babka
2026-03-26 12:43 ` [REGRESSION] " Aishwarya Rambhadran
2 siblings, 1 reply; 15+ messages in thread
From: Hao Li @ 2026-01-29 15:18 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
Hi Vlastimil,
I conducted a detailed performance evaluation of each patch on my setup.
During my tests, I observed two points in the series where performance
regressions occurred:
Patch 10: I noticed a ~16% regression in my environment. My hypothesis is
that with this patch, the allocation fast path bypasses the percpu partial
list, leading to increased contention on the node list.
Patch 12: This patch seems to introduce an additional ~9.7% regression. I
suspect this might be because the free path also loses buffering from the
percpu partial list, further exacerbating node list contention.
These are the only two patches in the series where I observed noticeable
regressions. The rest of the patches did not show significant performance
changes in my tests.
I hope these test results are helpful.
--
Thanks,
Hao
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-29 15:18 ` [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Hao Li
@ 2026-01-29 15:28 ` Vlastimil Babka
2026-01-29 16:06 ` Hao Li
2026-01-30 4:50 ` Hao Li
0 siblings, 2 replies; 15+ messages in thread
From: Vlastimil Babka @ 2026-01-29 15:28 UTC (permalink / raw)
To: Hao Li
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On 1/29/26 16:18, Hao Li wrote:
> Hi Vlastimil,
>
> I conducted a detailed performance evaluation of each patch on my setup.
Thanks! What was the benchmark(s) used? Importantly, does it rely on
vma/maple_node objects? So previously those would become kind of double
cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
more than they should) since sheaves introduction in 6.18, and now they are
not double cached anymore?
> During my tests, I observed two points in the series where performance
> regressions occurred:
>
> Patch 10: I noticed a ~16% regression in my environment. My hypothesis is
> that with this patch, the allocation fast path bypasses the percpu partial
> list, leading to increased contention on the node list.
That makes sense.
> Patch 12: This patch seems to introduce an additional ~9.7% regression. I
> suspect this might be because the free path also loses buffering from the
> percpu partial list, further exacerbating node list contention.
Hmm yeah... we did put the previously full slabs there, avoiding the lock.
> These are the only two patches in the series where I observed noticeable
> regressions. The rest of the patches did not show significant performance
> changes in my tests.
>
> I hope these test results are helpful.
They are, thanks. I'd however hope it's just some particular test that has
these regressions, which can be explained by the loss of double caching.
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-29 15:28 ` Vlastimil Babka
@ 2026-01-29 16:06 ` Hao Li
2026-01-29 16:44 ` Liam R. Howlett
2026-01-30 4:50 ` Hao Li
1 sibling, 1 reply; 15+ messages in thread
From: Hao Li @ 2026-01-29 16:06 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
> On 1/29/26 16:18, Hao Li wrote:
> > Hi Vlastimil,
> >
> > I conducted a detailed performance evaluation of each patch on my setup.
>
> Thanks! What was the benchmark(s) used?
I'm currently using the mmap2 test case from will-it-scale. The machine is still
an AMD 2-socket system, with 2 nodes per socket, totaling 192 CPUs, with SMT
disabled. For each test run, I used 64, 128, and 192 processes respectively.
> Importantly, does it rely on vma/maple_node objects?
Yes, this test primarily puts a lot of pressure on maple_node.
> So previously those would become kind of double
> cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
> more than they should) since sheaves introduction in 6.18, and now they are
> not double cached anymore?
Exactly, since version 6.18, maple_node has indeed benefited from a dual-layer
cache.
I did wonder if this isn't a performance regression but rather the
performance returning to its baseline after removing one layer of caching.
However, verifying this idea would require completely disabling the sheaf
mechanism on version 6.19-rc5 while leaving the rest of the SLUB code untouched.
It would be great to hear any suggestions on how this might be approached.
>
> > During my tests, I observed two points in the series where performance
> > regressions occurred:
> >
> > Patch 10: I noticed a ~16% regression in my environment. My hypothesis is
> > that with this patch, the allocation fast path bypasses the percpu partial
> > list, leading to increased contention on the node list.
>
> That makes sense.
>
> > Patch 12: This patch seems to introduce an additional ~9.7% regression. I
> > suspect this might be because the free path also loses buffering from the
> > percpu partial list, further exacerbating node list contention.
>
> Hmm yeah... we did put the previously full slabs there, avoiding the lock.
>
> > These are the only two patches in the series where I observed noticeable
> > regressions. The rest of the patches did not show significant performance
> > changes in my tests.
> >
> > I hope these test results are helpful.
>
> They are, thanks. I'd however hope it's just some particular test that has
> these regressions,
Yes, I hope so too. And the mmap2 test case is indeed quite extreme.
> which can be explained by the loss of double caching.
If we could compare it with a version that only uses the
CPU partial list, the answer might become clearer.
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-29 16:06 ` Hao Li
@ 2026-01-29 16:44 ` Liam R. Howlett
2026-01-30 4:38 ` Hao Li
0 siblings, 1 reply; 15+ messages in thread
From: Liam R. Howlett @ 2026-01-29 16:44 UTC (permalink / raw)
To: Hao Li
Cc: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
* Hao Li <hao.li@linux.dev> [260129 11:07]:
> On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
> > On 1/29/26 16:18, Hao Li wrote:
> > > Hi Vlastimil,
> > >
> > > I conducted a detailed performance evaluation of each patch on my setup.
> >
> > Thanks! What was the benchmark(s) used?
Yes, thank you for running the benchmarks!
>
> I'm currently using the mmap2 test case from will-it-scale. The machine is still
> an AMD 2-socket system, with 2 nodes per socket, totaling 192 CPUs, with SMT
> disabled. For each test run, I used 64, 128, and 192 processes respectively.
What about the other tests you ran in the detailed evaluation, were
there other regressions? It might be worth including the list of tests
that showed issues and some of the raw results (maybe at the end of your
email) to show what you saw more clearly. I did notice you had done
this previously.
Was the regression in the threaded or processes version of mmap2?
>
> > Importantly, does it rely on vma/maple_node objects?
>
> Yes, this test primarily puts a lot of pressure on maple_node.
>
> > So previously those would become kind of double
> > cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
> > more than they should) since sheaves introduction in 6.18, and now they are
> > not double cached anymore?
>
> Exactly, since version 6.18, maple_node has indeed benefited from a dual-layer
> cache.
>
> I did wonder if this isn't a performance regression but rather the
> performance returning to its baseline after removing one layer of caching.
>
> However, verifying this idea would require completely disabling the sheaf
> mechanism on version 6.19-rc5 while leaving the rest of the SLUB code untouched.
> It would be great to hear any suggestions on how this might be approached.
You could use perf record to capture the differences on the two kernels.
You could also use perf to look at the differences between three kernel
versions:
1. pre-sheaves entirely
2. the 'dual layer' cache
3. The final version
In these scenarios, it's not worth looking at the numbers, but just the
differences since the debug required to get meaningful information makes
the results hugely slow and, potentially, not as consistent. Sometimes
I run them multiple times to ensure what I'm seeing makes sense for a
particular comparison (and the server didn't just rotate the logs or
whatever..)
>
> >
> > > During my tests, I observed two points in the series where performance
> > > regressions occurred:
> > >
> > > Patch 10: I noticed a ~16% regression in my environment. My hypothesis is
> > > that with this patch, the allocation fast path bypasses the percpu partial
> > > list, leading to increased contention on the node list.
> >
> > That makes sense.
> >
> > > Patch 12: This patch seems to introduce an additional ~9.7% regression. I
> > > suspect this might be because the free path also loses buffering from the
> > > percpu partial list, further exacerbating node list contention.
> >
> > Hmm yeah... we did put the previously full slabs there, avoiding the lock.
> >
> > > These are the only two patches in the series where I observed noticeable
> > > regressions. The rest of the patches did not show significant performance
> > > changes in my tests.
> > >
> > > I hope these test results are helpful.
> >
> > They are, thanks. I'd however hope it's just some particular test that has
> > these regressions,
>
> Yes, I hope so too. And the mmap2 test case is indeed quite extreme.
>
> > which can be explained by the loss of double caching.
>
> If we could compare it with a version that only uses the
> CPU partial list, the answer might become clearer.
In my experience, micro-benchmarks are good at identifying specific
failure points of a patch set, but unless an entire area of benchmarks
regress (ie all mmap threaded), then they rarely tell the whole story.
Are the benchmarks consistently slower? This specific test is sensitive
to alignment because of the 128MB mmap/munmap operation. Sometimes, you
will see a huge spike at a particular process/thread count that moves
around in tests like this. Was your run consistently lower?
Thanks,
Liam
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-29 16:44 ` Liam R. Howlett
@ 2026-01-30 4:38 ` Hao Li
0 siblings, 0 replies; 15+ messages in thread
From: Hao Li @ 2026-01-30 4:38 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin, Andrew Morton, Uladzislau Rezki,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On Thu, Jan 29, 2026 at 11:44:21AM -0500, Liam R. Howlett wrote:
> * Hao Li <hao.li@linux.dev> [260129 11:07]:
> > On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
> > > On 1/29/26 16:18, Hao Li wrote:
> > > > Hi Vlastimil,
> > > >
> > > > I conducted a detailed performance evaluation of each patch on my setup.
> > >
> > > Thanks! What was the benchmark(s) used?
>
> Yes, Thank you for running the benchmarks!
>
> >
> > I'm currently using the mmap2 test case from will-it-scale. The machine is still
> > an AMD 2-socket system, with 2 nodes per socket, totaling 192 CPUs, with SMT
> > disabled. For each test run, I used 64, 128, and 192 processes respectively.
>
> What about the other tests you ran in the detailed evaluation, were
> there other regressions? It might be worth including the list of tests
> that showed issues and some of the raw results (maybe at the end of your
> email) to show what you saw more clearly. I did notice you had done
> this previously.
Hi, Liam
I only ran the mmap2 test case of will-it-scale. I now have some new test
results and will share the raw data later.
>
> Was the regression in the threaded or processes version of mmap2?
It's processes version.
>
> >
> > > Importantly, does it rely on vma/maple_node objects?
> >
> > Yes, this test primarily puts a lot of pressure on maple_node.
> >
> > > So previously those would become kind of double
> > > cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
> > > more than they should) since sheaves introduction in 6.18, and now they are
> > > not double cached anymore?
> >
> > Exactly, since version 6.18, maple_node has indeed benefited from a dual-layer
> > cache.
> >
> > I did wonder if this isn't a performance regression but rather the
> > performance returning to its baseline after removing one layer of caching.
> >
> > However, verifying this idea would require completely disabling the sheaf
> > mechanism on version 6.19-rc5 while leaving the rest of the SLUB code untouched.
> > It would be great to hear any suggestions on how this might be approached.
>
> You could use perf record to capture the differences on the two kernels.
> You could also use perf to look at the differences between three kernel
> versions:
> 1. pre-sheaves entirely
> 2. the 'dual layer' cache
> 3. The final version
That's right, this is exactly the test I just completed. I will send a separate
email later.
>
> In these scenarios, it's not worth looking at the numbers, but just the
> differences since the debug required to get meaningful information makes
> the results hugely slow and, potentially, not as consistent. Sometimes
> I run them multiple times to ensure what I'm seeing makes sense for a
> particular comparison (and the server didn't just rotate the logs or
> whatever..)
Yes, that's right. This is important. I also ran it multiple times to observe
data stability and took the average value.
>
> >
> > >
> > > > During my tests, I observed two points in the series where performance
> > > > regressions occurred:
> > > >
> > > > Patch 10: I noticed a ~16% regression in my environment. My hypothesis is
> > > > that with this patch, the allocation fast path bypasses the percpu partial
> > > > list, leading to increased contention on the node list.
> > >
> > > That makes sense.
> > >
> > > > Patch 12: This patch seems to introduce an additional ~9.7% regression. I
> > > > suspect this might be because the free path also loses buffering from the
> > > > percpu partial list, further exacerbating node list contention.
> > >
> > > Hmm yeah... we did put the previously full slabs there, avoiding the lock.
> > >
> > > > These are the only two patches in the series where I observed noticeable
> > > > regressions. The rest of the patches did not show significant performance
> > > > changes in my tests.
> > > >
> > > > I hope these test results are helpful.
> > >
> > > They are, thanks. I'd however hope it's just some particular test that has
> > > these regressions,
> >
> > Yes, I hope so too. And the mmap2 test case is indeed quite extreme.
> >
> > > which can be explained by the loss of double caching.
> >
> > If we could compare it with a version that only uses the
> > CPU partial list, the answer might become clearer.
>
> In my experience, micro-benchmarks are good at identifying specific
> failure points of a patch set, but unless an entire area of benchmarks
> regress (ie all mmap threaded), then they rarely tell the whole story.
Yes. This make sense to me.
>
> Are the benchmarks consistently slower? This specific test is sensitive
> to alignment because of the 128MB mmap/munmap operation. Sometimes, you
> will see a huge spike at a particular process/thread count that moves
> around in tests like this. Was your run consistently lower?
Yes, my test results have been quite stable, probably because the machine was
relatively idle.
Thanks for your reply and the discussion!
--
Thanks,
Hao
>
> Thanks,
> Liam
>
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-29 15:28 ` Vlastimil Babka
2026-01-29 16:06 ` Hao Li
@ 2026-01-30 4:50 ` Hao Li
2026-01-30 6:17 ` Hao Li
2026-02-04 18:02 ` Vlastimil Babka
1 sibling, 2 replies; 15+ messages in thread
From: Hao Li @ 2026-01-30 4:50 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
>
> So previously those would become kind of double
> cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
> more than they should) since sheaves introduction in 6.18, and now they are
> not double cached anymore?
>
I've conducted new tests, and here are the details of three scenarios:
1. Checked out commit 9d4e6ab865c4, which represents the state before the
introduction of the sheaves mechanism.
2. Tested with 6.19-rc5, which includes sheaves but does not yet apply the
"sheaves for all" patchset.
3. Applied the "sheaves for all" patchset and also included the "avoid
list_lock contention" patch.
Results:
For scenario 2 (with sheaves but without "sheaves for all"), there is a
noticeable performance improvement compared to scenario 1:
will-it-scale.128.processes +34.3%
will-it-scale.192.processes +35.4%
will-it-scale.64.processes +31.5%
will-it-scale.per_process_ops +33.7%
For scenario 3 (after applying "sheaves for all"), performance slightly
regressed compared to scenario 1:
will-it-scale.128.processes -1.3%
will-it-scale.192.processes -4.2%
will-it-scale.64.processes -1.2%
will-it-scale.per_process_ops -2.1%
Analysis:
So when the sheaf size for maple nodes is set to 32 by default, the performance
of fully adopting the sheaves mechanism roughly matches the performance of the
previous approach that relied solely on the percpu slab partial list.
The performance regression observed with the "sheaves for all" patchset can
actually be explained as follows: moving from scenario 1 to scenario 2
introduces an additional cache layer, which temporarily boosts performance.
When moving from scenario 2 to scenario 3, this additional cache layer is
removed, so performance reverts to its original level.
So I think the performance of the percpu partial list and the sheaves mechanism
is roughly the same, which is consistent with our expectations.
--
Thanks,
Hao
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-30 4:50 ` Hao Li
@ 2026-01-30 6:17 ` Hao Li
2026-02-04 18:02 ` Vlastimil Babka
1 sibling, 0 replies; 15+ messages in thread
From: Hao Li @ 2026-01-30 6:17 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On Fri, Jan 30, 2026 at 12:50:25PM +0800, Hao Li wrote:
> On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
> >
> > So previously those would become kind of double
> > cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
> > more than they should) since sheaves introduction in 6.18, and now they are
> > not double cached anymore?
> >
>
> I've conducted new tests, and here are the details of three scenarios:
>
> 1. Checked out commit 9d4e6ab865c4, which represents the state before the
> introduction of the sheaves mechanism.
> 2. Tested with 6.19-rc5, which includes sheaves but does not yet apply the
> "sheaves for all" patchset.
> 3. Applied the "sheaves for all" patchset and also included the "avoid
> list_lock contention" patch.
Here is my testing environment information and the raw test data.
Command:
cd will-it-scale/
python3 ./runtest.py mmap2 25 process 0 0 64 128 192
Env:
CPU(s): 192
Thread(s) per core: 1
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 4
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191
Memory: 1.5T
Raw data:
1. Checked out commit 9d4e6ab865c4, which represents the state before the
introduction of the sheaves mechanism.
{
"time.elapsed_time": 93.88,
"time.elapsed_time.max": 93.88,
"time.file_system_inputs": 2640,
"time.file_system_outputs": 128,
"time.involuntary_context_switches": 417738,
"time.major_page_faults": 54,
"time.maximum_resident_set_size": 90012,
"time.minor_page_faults": 80569,
"time.page_size": 4096,
"time.percent_of_cpu_this_job_got": 5707,
"time.system_time": 5272.97,
"time.user_time": 85.59,
"time.voluntary_context_switches": 2436,
"will-it-scale.128.processes": 28445014,
"will-it-scale.128.processes_idle": 33.89,
"will-it-scale.192.processes": 39899678,
"will-it-scale.192.processes_idle": 1.29,
"will-it-scale.64.processes": 15645502,
"will-it-scale.64.processes_idle": 66.75,
"will-it-scale.per_process_ops": 224832,
"will-it-scale.time.elapsed_time": 93.88,
"will-it-scale.time.elapsed_time.max": 93.88,
"will-it-scale.time.file_system_inputs": 2640,
"will-it-scale.time.file_system_outputs": 128,
"will-it-scale.time.involuntary_context_switches": 417738,
"will-it-scale.time.major_page_faults": 54,
"will-it-scale.time.maximum_resident_set_size": 90012,
"will-it-scale.time.minor_page_faults": 80569,
"will-it-scale.time.page_size": 4096,
"will-it-scale.time.percent_of_cpu_this_job_got": 5707,
"will-it-scale.time.system_time": 5272.97,
"will-it-scale.time.user_time": 85.59,
"will-it-scale.time.voluntary_context_switches": 2436,
"will-it-scale.workload": 83990194
}
2. Tested with 6.19-rc5, which includes sheaves but does not yet apply the
"sheaves for all" patchset.
{
"time.elapsed_time": 93.86000000000001,
"time.elapsed_time.max": 93.86000000000001,
"time.file_system_inputs": 1952,
"time.file_system_outputs": 160,
"time.involuntary_context_switches": 766225,
"time.major_page_faults": 50.666666666666664,
"time.maximum_resident_set_size": 90012,
"time.minor_page_faults": 80635,
"time.page_size": 4096,
"time.percent_of_cpu_this_job_got": 5738,
"time.system_time": 5251.88,
"time.user_time": 134.57666666666665,
"time.voluntary_context_switches": 2539,
"will-it-scale.128.processes": 38223543.333333336,
"will-it-scale.128.processes_idle": 33.833333333333336,
"will-it-scale.192.processes": 54039039,
"will-it-scale.192.processes_idle": 1.26,
"will-it-scale.64.processes": 20579207.666666668,
"will-it-scale.64.processes_idle": 66.74333333333334,
"will-it-scale.per_process_ops": 300541,
"will-it-scale.time.elapsed_time": 93.86000000000001,
"will-it-scale.time.elapsed_time.max": 93.86000000000001,
"will-it-scale.time.file_system_inputs": 1952,
"will-it-scale.time.file_system_outputs": 160,
"will-it-scale.time.involuntary_context_switches": 766225,
"will-it-scale.time.major_page_faults": 50.666666666666664,
"will-it-scale.time.maximum_resident_set_size": 90012,
"will-it-scale.time.minor_page_faults": 80635,
"will-it-scale.time.page_size": 4096,
"will-it-scale.time.percent_of_cpu_this_job_got": 5738,
"will-it-scale.time.system_time": 5251.88,
"will-it-scale.time.user_time": 134.57666666666665,
"will-it-scale.time.voluntary_context_switches": 2539,
"will-it-scale.workload": 112841790
}
3. Applied the "sheaves for all" patchset and also included the "avoid
list_lock contention" patch.
{
"time.elapsed_time": 93.86666666666667,
"time.elapsed_time.max": 93.86666666666667,
"time.file_system_inputs": 1800,
"time.file_system_outputs": 149.33333333333334,
"time.involuntary_context_switches": 421120,
"time.major_page_faults": 37,
"time.maximum_resident_set_size": 90016,
"time.minor_page_faults": 80645,
"time.page_size": 4096,
"time.percent_of_cpu_this_job_got": 5714.666666666667,
"time.system_time": 5256.176666666667,
"time.user_time": 108.88333333333333,
"time.voluntary_context_switches": 2513,
"will-it-scale.128.processes": 28067051.333333332,
"will-it-scale.128.processes_idle": 33.82,
"will-it-scale.192.processes": 38232965.666666664,
"will-it-scale.192.processes_idle": 1.2733333333333334,
"will-it-scale.64.processes": 15464041.333333334,
"will-it-scale.64.processes_idle": 66.76333333333334,
"will-it-scale.per_process_ops": 220009.33333333334,
"will-it-scale.time.elapsed_time": 93.86666666666667,
"will-it-scale.time.elapsed_time.max": 93.86666666666667,
"will-it-scale.time.file_system_inputs": 1800,
"will-it-scale.time.file_system_outputs": 149.33333333333334,
"will-it-scale.time.involuntary_context_switches": 421120,
"will-it-scale.time.major_page_faults": 37,
"will-it-scale.time.maximum_resident_set_size": 90016,
"will-it-scale.time.minor_page_faults": 80645,
"will-it-scale.time.page_size": 4096,
"will-it-scale.time.percent_of_cpu_this_job_got": 5714.666666666667,
"will-it-scale.time.system_time": 5256.176666666667,
"will-it-scale.time.user_time": 108.88333333333333,
"will-it-scale.time.voluntary_context_switches": 2513,
"will-it-scale.workload": 81764058.33333333
}
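The deltas quoted earlier in the thread can be re-derived from the raw
will-it-scale numbers above with a short Python sketch (the values below are
copied from the JSON dumps; small rounding differences against the quoted
percentages are expected):

```python
# Recompute the relative deltas from the raw will-it-scale numbers.
# Values are copied from the JSON dumps above; expect small rounding
# differences vs. the percentages quoted earlier in the thread.

scenarios = {
    "pre-sheaves (9d4e6ab865c4)": {
        "128.processes": 28445014,
        "192.processes": 39899678,
        "64.processes": 15645502,
        "per_process_ops": 224832,
    },
    "sheaves opt-in (6.19-rc5)": {
        "128.processes": 38223543.333,
        "192.processes": 54039039,
        "64.processes": 20579207.667,
        "per_process_ops": 300541,
    },
    "sheaves for all": {
        "128.processes": 28067051.333,
        "192.processes": 38232965.667,
        "64.processes": 15464041.333,
        "per_process_ops": 220009.333,
    },
}

def delta_pct(base, new):
    """Relative change of new vs. base, in percent."""
    return (new - base) / base * 100

base = scenarios["pre-sheaves (9d4e6ab865c4)"]
for name in ("sheaves opt-in (6.19-rc5)", "sheaves for all"):
    for metric, val in scenarios[name].items():
        print(f"{name:28s} {metric:16s} {delta_pct(base[metric], val):+6.1f}%")
```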
>
>
> Results:
>
> For scenario 2 (with sheaves but without "sheaves for all"), there is a
> noticeable performance improvement compared to scenario 1:
>
> will-it-scale.128.processes +34.3%
> will-it-scale.192.processes +35.4%
> will-it-scale.64.processes +31.5%
> will-it-scale.per_process_ops +33.7%
>
> For scenario 3 (after applying "sheaves for all"), performance slightly
> regressed compared to scenario 1:
>
> will-it-scale.128.processes -1.3%
> will-it-scale.192.processes -4.2%
> will-it-scale.64.processes -1.2%
> will-it-scale.per_process_ops -2.1%
>
> Analysis:
>
> So when the sheaf size for maple nodes is set to 32 by default, the performance
> of fully adopting the sheaves mechanism roughly matches the performance of the
> previous approach that relied solely on the percpu slab partial list.
>
> The performance regression observed with the "sheaves for all" patchset can
> actually be explained as follows: moving from scenario 1 to scenario 2
> introduces an additional cache layer, which temporarily boosts performance.
> When moving from scenario 2 to scenario 3, this additional cache layer is
> removed, so performance reverts to its original level.
>
> So I think the performance of the percpu partial list and the sheaves mechanism
> is roughly the same, which is consistent with our expectations.
>
> --
> Thanks,
> Hao
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-01-30 4:50 ` Hao Li
2026-01-30 6:17 ` Hao Li
@ 2026-02-04 18:02 ` Vlastimil Babka
2026-02-04 18:24 ` Christoph Lameter (Ampere)
1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2026-02-04 18:02 UTC (permalink / raw)
To: Hao Li
Cc: Harry Yoo, Petr Tesarik, Christoph Lameter, David Rientjes,
Roman Gushchin, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On 1/30/26 05:50, Hao Li wrote:
> On Thu, Jan 29, 2026 at 04:28:01PM +0100, Vlastimil Babka wrote:
>>
>> So previously those would become kind of double
>> cached by both sheaves and cpu (partial) slabs (and thus hopefully benefited
>> more than they should) since sheaves introduction in 6.18, and now they are
>> not double cached anymore?
>>
>
> I've conducted new tests, and here are the details of three scenarios:
>
> 1. Checked out commit 9d4e6ab865c4, which represents the state before the
> introduction of the sheaves mechanism.
> 2. Tested with 6.19-rc5, which includes sheaves but does not yet apply the
> "sheaves for all" patchset.
> 3. Applied the "sheaves for all" patchset and also included the "avoid
> list_lock contention" patch.
>
>
> Results:
>
> For scenario 2 (with sheaves but without "sheaves for all"), there is a
> noticeable performance improvement compared to scenario 1:
>
> will-it-scale.128.processes +34.3%
> will-it-scale.192.processes +35.4%
> will-it-scale.64.processes +31.5%
> will-it-scale.per_process_ops +33.7%
>
> For scenario 3 (after applying "sheaves for all"), performance slightly
> regressed compared to scenario 1:
>
> will-it-scale.128.processes -1.3%
> will-it-scale.192.processes -4.2%
> will-it-scale.64.processes -1.2%
> will-it-scale.per_process_ops -2.1%
>
> Analysis:
>
> So when the sheaf size for maple nodes is set to 32 by default, the performance
> of fully adopting the sheaves mechanism roughly matches the performance of the
> previous approach that relied solely on the percpu slab partial list.
>
> The performance regression observed with the "sheaves for all" patchset can
> actually be explained as follows: moving from scenario 1 to scenario 2
> introduces an additional cache layer, which temporarily boosts performance.
> When moving from scenario 2 to scenario 3, this additional cache layer is
> removed, so performance reverts to its original level.
>
> So I think the performance of the percpu partial list and the sheaves mechanism
> is roughly the same, which is consistent with our expectations.
Thanks!
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-02-04 18:02 ` Vlastimil Babka
@ 2026-02-04 18:24 ` Christoph Lameter (Ampere)
2026-02-06 16:44 ` Vlastimil Babka
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-02-04 18:24 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Hao Li, Harry Yoo, Petr Tesarik, David Rientjes, Roman Gushchin,
Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On Wed, 4 Feb 2026, Vlastimil Babka wrote:
> > So I think the performance of the percpu partial list and the sheaves mechanism
> > is roughly the same, which is consistent with our expectations.
>
> Thanks!
There are other considerations that usually do not show up well in
benchmark tests.
The sheaves cannot do the spatial optimizations that cpu partial lists
provide. Fragmentation in slab caches (and therefore the number of
partial slab pages) will increase since:
1. The objects are not immediately returned to their slab pages but end up
in some queuing structure.
2. Available objects from a single slab page are not allocated in sequence,
which would empty partial pages and remove them from the partial lists.
Objects are put into some queue on free and are processed on a FIFO basis.
Objects allocated may come from lots of different slab pages, potentially
increasing TLB pressure.
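The spatial-locality effect described above can be sketched with a toy model
(pure illustration, not kernel code; the slab and object counts are made-up
values):

```python
# Toy model of the spatial-locality argument: a sheaf-like FIFO queue
# hands out objects in free order, touching many distinct slabs, while
# draining one partial slab at a time concentrates allocations.
# All sizes/counts are made-up illustrative values.
from collections import deque

OBJS_PER_SLAB = 8
NUM_SLABS = 16

# Objects identified by (slab_id, index). Simulate frees interleaved
# across slabs, as happens when objects die in an order unrelated to
# their slab placement.
freed = [(slab, i) for i in range(OBJS_PER_SLAB) for slab in range(NUM_SLABS)]

# Sheaf-like behavior: allocate in the order the objects were freed.
sheaf = deque(freed)
fifo_batch = [sheaf.popleft() for _ in range(OBJS_PER_SLAB)]
fifo_slabs = {slab for slab, _ in fifo_batch}

# Partial-slab-like behavior: take all available objects of one slab
# before moving on, emptying that slab's partial page.
slab_batch = sorted(freed)[:OBJS_PER_SLAB]
slab_slabs = {slab for slab, _ in slab_batch}

print(f"FIFO queue batch touches {len(fifo_slabs)} distinct slabs")
print(f"per-slab batch touches {len(slab_slabs)} distinct slab")
```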
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves
2026-02-04 18:24 ` Christoph Lameter (Ampere)
@ 2026-02-06 16:44 ` Vlastimil Babka
0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka @ 2026-02-06 16:44 UTC (permalink / raw)
To: Christoph Lameter (Ampere)
Cc: Hao Li, Harry Yoo, Petr Tesarik, David Rientjes, Roman Gushchin,
Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney
On 2/4/26 19:24, Christoph Lameter (Ampere) wrote:
> On Wed, 4 Feb 2026, Vlastimil Babka wrote:
>
>> > So I think the performance of the percpu partial list and the sheaves mechanism
>> > is roughly the same, which is consistent with our expectations.
>>
>> Thanks!
>
> There are other considerations that usually do not show up well in
> benchmark tests.
>
> The sheaves cannot do the spatial optimizations that cpu partial lists
> provide. Fragmentation in slab caches (and therefore the number of
> partial slab pages) will increase since:
>
> 1. The objects are not immediately returned to their slab pages but end up
> in some queuing structure.
>
> 2. Available objects from a single slab page are not allocated in sequence,
> which would empty partial pages and remove them from the partial lists.
>
> Objects are put into some queue on free and are processed on a FIFO basis.
> Objects allocated may come from lots of different slab pages, potentially
> increasing TLB pressure.
IIUC this is what you said before [1] and the cover letter has a link and a
summary of it.
[1] https://lore.kernel.org/all/f7c33974-e520-387e-9e2f-1e523bfe1545@gentwo.org/
^ permalink raw reply [flat|nested] 15+ messages in thread
* [REGRESSION] slab: replace cpu (partial) slabs with sheaves
2026-01-23 6:52 [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
2026-01-23 6:52 ` [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
2026-01-29 15:18 ` [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Hao Li
@ 2026-03-26 12:43 ` Aishwarya Rambhadran
2026-03-26 14:42 ` Vlastimil Babka (SUSE)
2 siblings, 1 reply; 15+ messages in thread
From: Aishwarya Rambhadran @ 2026-03-26 12:43 UTC (permalink / raw)
To: Vlastimil Babka, Harry Yoo, Petr Tesarik, Christoph Lameter,
David Rientjes, Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney, ryan.roberts
Hi Vlastimil, Harry,
We have observed a few kernel performance benchmark regressions,
mainly in perf & vmalloc workloads, when comparing v6.19 mainline
kernel results against later releases in the v7.0 cycle.
Independent bisections on different machines consistently point
to commits within the slab percpu sheaves series. However, towards
the end of the bisection, the signal becomes less clear, so it's
not yet certain which specific commit within the series is the
root cause.
The workloads were triggered on AWS Graviton3 (arm64) & AWS Intel
Sapphire Rapids (x86_64) systems in which the regressions are
reproducible across different kernel release candidates.
(R)/(I) mean statistically significant regression/improvement,
where "statistically significant" means the 95% confidence
intervals do not overlap.
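The overlap rule above can be sketched in a few lines of Python (the sample
values are invented purely for illustration; the tool's actual statistics may
differ in detail):

```python
# Sketch of the significance rule: flag a result only when the 95%
# confidence intervals of base and candidate runs do not overlap.
# Sample values below are invented for illustration.
from statistics import mean, stdev

def ci95(samples):
    """Approximate 95% CI of the mean (normal ~1.96 factor)."""
    half = 1.96 * stdev(samples) / len(samples) ** 0.5
    return (mean(samples) - half, mean(samples) + half)

def overlaps(a, b):
    """True if intervals a and b share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

base = [996, 1002, 990, 1001, 998, 995]      # hypothetical fork ops/sec
candidate = [905, 899, 910, 902, 907, 900]   # roughly 10% lower

significant = not overlaps(ci95(base), ci95(candidate))
flag = "(R)" if significant and mean(candidate) < mean(base) else ""
print(f"delta {mean(candidate) / mean(base) - 1:+.1%} {flag}")
```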
Given below are the performance benchmark results generated by the
Fastpath Tool for different kernel -rc versions relative to the
base version v6.19, executed on the mentioned SUTs. The perf/
syscall benchmarks (execve/fork) regress consistently by ~6–11% on
both arm64 and x86_64 across v7.0-rc1 to rc5, while vmalloc
workloads show smaller but stable regressions (~2–10%), particularly
in kvfree_rcu paths.
Regressions on AWS Intel Sapphire Rapids (x86_64):
| Benchmark       | Result Class                                             | 6-19-0 (base) | 7-0-0-rc1  | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784 | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
| micromm/vmalloc | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 262605.17     | -4.94%     | -7.48%      | (R) -8.11%              | -4.51%      | -6.23%      | -3.47%      |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 253198.67     | -7.56%     | (R) -10.57% | (R) -10.13%             | (R) -7.07%  | -6.37%      | -6.55%      |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               | 197904.67     | -2.07%     | -3.38%      | -2.07%                  | -2.97%      | (R) -4.30%  | -3.39%      |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1707089.83    | -2.63%     | (R) -3.69%  | (R) -3.25%              | (R) -2.87%  | -2.22%      | (R) -3.63%  |
| perf/syscall    | execve (ops/sec)                                         | 1202.92       | (R) -7.15% | (R) -7.05%  | (R) -7.03%              | (R) -7.93%  | (R) -6.51%  | (R) -7.36%  |
|                 | fork (ops/sec)                                           | 996.00        | (R) -9.00% | (R) -10.27% | (R) -9.92%              | (R) -11.19% | (R) -10.69% | (R) -10.28% |
Regressions on AWS Graviton3 (arm64):
| Benchmark       | Result Class                                             | 6-19-0 (base) | 7-0-0-rc1  | 7-0-0-rc2  | 7-0-0-rc2-gaf4e9ef3d784 | 7-0-0-rc3  | 7-0-0-rc4  | 7-0-0-rc5  |
| micromm/vmalloc | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           | 320101.50     | (R) -4.72% | (R) -3.81% | (R) -5.05%              | -3.06%     | -3.16%     | (R) -3.91% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | 522072.83     | (R) -2.15% | -1.25%     | (R) -2.16%              | (R) -2.13% | -2.10%     | -1.82%     |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | 1041640.33    | -0.50%     | (R) -2.04% | -1.43%                  | -0.69%     | -1.78%     | (R) -2.03% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2255794.00    | -1.51%     | (R) -2.24% | (R) -2.33%              | -1.14%     | -0.94%     | -1.60%     |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 343543.83     | (R) -4.50% | (R) -3.54% | (R) -5.00%              | (R) -4.88% | (R) -4.01% | (R) -5.54% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 342290.33     | (R) -5.15% | (R) -3.24% | (R) -3.76%              | (R) -5.37% | (R) -3.74% | (R) -5.51% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1209666.83    | -2.43%     | -2.09%     | -1.19%                  | (R) -4.39% | -1.81%     | -3.15%     |
| perf/syscall    | execve (ops/sec)                                         | 1219.58       |            | (R) -8.12% | (R) -7.37%              | (R) -7.60% | (R) -7.86% | (R) -7.71% |
|                 | fork (ops/sec)                                           | 863.67        |            | (R) -7.24% | (R) -7.07%              | (R) -6.42% | (R) -6.93% | (R) -6.55% |
The details of the latest bisections carried out for the regressions
listed above are given below:
- Graviton3 (arm64)
  good: v6.19 (05f7e89ab973)
  bad: v7.0-rc2 (11439c4635ed)
  workload: perf/syscall (execve)
  bisected to: f1427a1d6415 ("slab: make percpu sheaves compatible with
  kmalloc_nolock()/kfree_nolock()")
- Sapphire Rapids (x86_64)
  good: v6.19 (05f7e89ab973)
  bad: v7.0-rc3 (1f318b96cc84)
  workload: perf/syscall (fork)
  bisected to: f1427a1d6415 ("slab: make percpu sheaves compatible with
  kmalloc_nolock()/kfree_nolock()")
- Graviton3 (arm64)
  good: v6.19 (05f7e89ab973)
  bad: v7.0-rc3 (1f318b96cc84)
  workload: perf/syscall (execve)
  bisected to: f3421f8d154c ("slab: introduce percpu sheaves bootstrap")
I'm aware that some fixes for the sheaves series have already been
merged around v7.0-rc3; however, these do not appear to resolve the
regressions described above completely. Are there additional fixes or
follow-ups in progress that I should evaluate? I can investigate
further and provide additional data, if that would be useful.
Thank you.
Aishwarya Rambhadran
On 23/01/26 12:22 PM, Vlastimil Babka wrote:
> Percpu sheaves caching was introduced as opt-in but the goal was to
> eventually move all caches to them. This is the next step, enabling
> sheaves for all caches (except the two bootstrap ones) and then removing
> the per cpu (partial) slabs and lots of associated code.
>
> Besides (hopefully) improved performance, this removes the rather
> complicated code related to the lockless fastpaths (using
> this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
> kmalloc_nolock().
>
> The lockless slab freelist+counters update operation using
> try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
> without repeating the "alien" array flushing of SLUB, and to allow
> flushing objects from sheaves to slabs mostly without the node
> list_lock.
>
> Sending this v4 because various changes accumulated in the branch due to
> review and -next exposure (see the list below). Thanks for all the
> reviews!
>
> Git branch for the v4
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=sheaves-for-all-v4
>
> Which is a snapshot of:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/sheaves-for-all
>
> Based on:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-7.0/sheaves-base
> - includes a sheaves optimization that seemed minor but there was lkp
> test robot result with significant improvements:
> https://lore.kernel.org/all/202512291555.56ce2e53-lkp@intel.com/
> (could be an uncommon corner case workload though)
> - includes the kmalloc_nolock() fix commit a4ae75d1b6a2 that is undone
> as part of this series
>
> Significant (but not critical) remaining TODOs:
> - Integration of rcu sheaves handling with kfree_rcu batching.
> - Currently the kfree_rcu batching is almost completely bypassed. I'm
> thinking it could be adjusted to handle rcu sheaves in addition to
> individual objects, to get the best of both.
> - Performance evaluation. Petr Tesarik has been doing that on the RFC
> with some promising results (thanks!) and also found a memory leak.
>
> Note that as many things, this caching scheme change is a tradeoff, as
> summarized by Christoph:
>
> https://lore.kernel.org/all/f7c33974-e520-387e-9e2f-1e523bfe1545@gentwo.org/
>
> - Objects allocated from sheaves should have better temporal locality
> (likely recently freed, thus cache hot) but worse spatial locality
> (likely from many different slabs, increasing memory usage and
> possibly TLB pressure on kernel's direct map).
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> Changes in v4:
> - Fix up both missing and spurious r-b tags from v3, and add new ones
> (big thanks to Hao Li, Harry, and Suren!)
> - Fix infinite recursion with kmemleak (Breno Leitao)
> - Use cache_has_sheaves() in pcs_destroy() (Suren)
> - Use cache_has_sheaves() in kvfree_rcu_barrier_on_cache() (Hao Li)
> - Bypass sheaf for remote object free also in kfree_nolock() (Harry)
> - WRITE_ONCE slab->counters in __update_freelist_slow() so
> get_partial_node_bulk() can stop being paranoid (Harry)
> - Tweak conditions in alloc_from_new_slab() (Hao Li, Suren)
> - Rename get_partial*() functions to get_from_partial*() (Suren)
> - Rename variable freelist to object in ___slab_alloc() (Suren)
> - Separate struct partial_bulk_context instead of extending.
> - Rename flush_cpu_slab() to flush_cpu_sheaves() (Hao Li)
> - Add "mm/slab: fix false lockdep warning in __kfree_rcu_sheaf()" from
> Harry.
> - Add counting of FREE_SLOWPATH stat to some missing places (Suren, Hao
> Li)
> - Link to v3: https://patch.msgid.link/20260116-sheaves-for-all-v3-0-5595cb000772@suse.cz
>
> Changes in v3:
> - Rebase to current slab/for-7.0/sheaves which itself is rebased to
> slab/for-next-fixes to include commit a4ae75d1b6a2 ("slab: fix
> kmalloc_nolock() context check for PREEMPT_RT")
> - Revert a4ae75d1b6a2 as part of "slab: simplify kmalloc_nolock()" as
> it's no longer necessary.
> - Add cache_has_sheaves() helper to test for s->sheaf_capacity, use it
> in more places instead of s->cpu_sheaves tests that were missed
> (Hao Li)
> - Fix a bug where kmalloc_nolock() could end up trying to allocate empty
> sheaf (not compatible with !allow_spin) in __pcs_replace_full_main()
> (Hao Li)
> - Fix missing inc_slabs_node() in ___slab_alloc() ->
> alloc_from_new_slab() path. (Hao Li)
> - Also a bug where refill_objects() -> alloc_from_new_slab ->
> free_new_slab_nolock() (previously defer_deactivate_slab()) would
> do inc_slabs_node() without matching dec_slabs_node()
> - Make __free_slab call free_frozen_pages_nolock() when !allow_spin.
> This was correct in the first RFC. (Hao Li)
> - Add patch to make SLAB_CONSISTENCY_CHECKS prevent merging.
> - Add tags from several people (thanks!)
> - Fix checkpatch warnings.
> - Link to v2: https://patch.msgid.link/20260112-sheaves-for-all-v2-0-98225cfb50cf@suse.cz
>
> Changes in v2:
> - Rebased to v6.19-rc1+slab.git slab/for-7.0/sheaves
> - Some of the preliminary patches from the RFC went in there.
> - Incorporate feedback/reports from many people (thanks!), including:
> - Make caches with sheaves mergeable.
> - Fix a major memory leak.
> - Cleanup of stat items.
> - Link to v1: https://patch.msgid.link/20251023-sheaves-for-all-v1-0-6ffa2c9941c0@suse.cz
>
> ---
> Harry Yoo (1):
> mm/slab: fix false lockdep warning in __kfree_rcu_sheaf()
>
> Vlastimil Babka (21):
> mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()
> slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGE
> mm/slab: move and refactor __kmem_cache_alias()
> mm/slab: make caches with sheaves mergeable
> slab: add sheaves to most caches
> slab: introduce percpu sheaves bootstrap
> slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
> slab: handle kmalloc sheaves bootstrap
> slab: add optimized sheaf refill from partial list
> slab: remove cpu (partial) slabs usage from allocation paths
> slab: remove SLUB_CPU_PARTIAL
> slab: remove the do_slab_free() fastpath
> slab: remove defer_deactivate_slab()
> slab: simplify kmalloc_nolock()
> slab: remove struct kmem_cache_cpu
> slab: remove unused PREEMPT_RT specific macros
> slab: refill sheaves from all nodes
> slab: update overview comments
> slab: remove frozen slab checks from __slab_free()
> mm/slub: remove DEACTIVATE_TO_* stat items
> mm/slub: cleanup and repurpose some stat items
>
> include/linux/slab.h | 6 -
> mm/Kconfig | 11 -
> mm/internal.h | 1 +
> mm/page_alloc.c | 5 +
> mm/slab.h | 65 +-
> mm/slab_common.c | 61 +-
> mm/slub.c | 2689 ++++++++++++++++++--------------------------------
> 7 files changed, 1031 insertions(+), 1807 deletions(-)
> ---
> base-commit: a66f9c0f1ba2dd05fa994c800ebc63f265155f91
> change-id: 20251002-sheaves-for-all-86ac13dc47a5
>
> Best regards,
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [REGRESSION] slab: replace cpu (partial) slabs with sheaves
2026-03-26 12:43 ` [REGRESSION] " Aishwarya Rambhadran
@ 2026-03-26 14:42 ` Vlastimil Babka (SUSE)
0 siblings, 0 replies; 15+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-26 14:42 UTC (permalink / raw)
To: Aishwarya Rambhadran, Vlastimil Babka, Harry Yoo, Petr Tesarik,
Christoph Lameter, David Rientjes, Roman Gushchin
Cc: Hao Li, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
Suren Baghdasaryan, Sebastian Andrzej Siewior, Alexei Starovoitov,
linux-mm, linux-kernel, linux-rt-devel, bpf, kasan-dev,
kernel test robot, stable, Paul E. McKenney, ryan.roberts
On 3/26/26 13:43, Aishwarya Rambhadran wrote:
> Hi Vlastimil, Harry,
Hi!
> We have observed a few kernel performance benchmark regressions,
> mainly in perf & vmalloc workloads, when comparing v6.19 mainline
> kernel results against later releases in the v7.0 cycle.
> Independent bisections on different machines consistently point
> to commits within the slab percpu sheaves series. However, towards
> the end of the bisection, the signal becomes less clear, so it's
> not yet certain which specific commit within the series is the
> root cause.
>
> The workloads were triggered on AWS Graviton3 (arm64) & AWS Intel
> Sapphire Rapids (x86_64) systems in which the regressions are
> reproducible across different kernel release candidates.
> (R)/(I) mean statistically significant regression/improvement,
> where "statistically significant" means the 95% confidence
> intervals do not overlap.
>
> Below given are the performance benchmark results generated by
> Fastpath Tool, for different kernel -rc versions relative to the
> base version v6.19, executed on the mentioned SUTs. The perf/
> syscall benchmarks (execve/fork) regress consistently by ~6–11% on
> both arm64 and x86_64 across v7.0-rc1 to rc5, while vmalloc
> workloads show smaller but stable regressions (~2–10%), particularly
> in kvfree_rcu paths.
>
> Regressions on AWS Intel Sapphire Rapids (x86_64) :
The table formatting is broken for me, can you resend it please? Maybe a
.txt attachment would work better.
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | Benchmark       | Result Class                                             | 6-19-0 (base)   | 7-0-0-rc1   | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784   | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
> +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
> | micromm/vmalloc | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 262605.17       | -4.94%      | -7.48%      | (R) -8.11%                | -4.51%      | -6.23%      | -3.47%      |
> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 253198.67       | -7.56%      | (R) -10.57% | (R) -10.13%               | (R) -7.07%  | -6.37%      | -6.55%      |
> |                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               | 197904.67       | -2.07%      | -3.38%      | -2.07%                    | -2.97%      | (R) -4.30%  | -3.39%      |
> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1707089.83      | -2.63%      | (R) -3.69%  | (R) -3.25%                | (R) -2.87%  | -2.22%      | (R) -3.63%  |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | perf/syscall    | execve (ops/sec)                                         | 1202.92         | (R) -7.15%  | (R) -7.05%  | (R) -7.03%                | (R) -7.93%  | (R) -6.51%  | (R) -7.36%  |
> |                 | fork (ops/sec)                                           | 996.00          | (R) -9.00%  | (R) -10.27% | (R) -9.92%                | (R) -11.19% | (R) -10.69% | (R) -10.28% |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>
> Regressions on AWS Graviton3 (arm64):
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | Benchmark       | Result Class                                             | 6-19-0 (base)   | 7-0-0-rc1   | 7-0-0-rc2   | 7-0-0-rc2-gaf4e9ef3d784   | 7-0-0-rc3   | 7-0-0-rc4   | 7-0-0-rc5   |
> +=================+==========================================================+=================+=============+=============+===========================+=============+=============+=============+
> | micromm/vmalloc | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           | 320101.50       | (R) -4.72%  | (R) -3.81%  | (R) -5.05%                | -3.06%      | -3.16%      | (R) -3.91%  |
> |                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           | 522072.83       | (R) -2.15%  | -1.25%      | (R) -2.16%                | (R) -2.13%  | -2.10%      | -1.82%      |
> |                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          | 1041640.33      | -0.50%      | (R) -2.04%  | -1.43%                    | -0.69%      | -1.78%      | (R) -2.03%  |
> |                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         | 2255794.00      | -1.51%      | (R) -2.24%  | (R) -2.33%                | -1.14%      | -0.94%      | -1.60%      |
> |                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 343543.83       | (R) -4.50%  | (R) -3.54%  | (R) -5.00%                | (R) -4.88%  | (R) -4.01%  | (R) -5.54%  |
> |                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) | 342290.33       | (R) -5.15%  | (R) -3.24%  | (R) -3.76%                | (R) -5.37%  | (R) -3.74%  | (R) -5.51%  |
> |                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  | 1209666.83      | -2.43%      | -2.09%      | -1.19%                    | (R) -4.39%  | -1.81%      | -3.15%      |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
> | perf/syscall    | execve (ops/sec)                                         | 1219.58         |             | (R) -8.12%  | (R) -7.37%                | (R) -7.60%  | (R) -7.86%  | (R) -7.71%  |
> |                 | fork (ops/sec)                                           | 863.67          |             | (R) -7.24%  | (R) -7.07%                | (R) -6.42%  | (R) -6.93%  | (R) -6.55%  |
> +-----------------+----------------------------------------------------------+-----------------+-------------+-------------+---------------------------+-------------+-------------+-------------+
>
>
> The details of the latest bisections carried out for the
> regressions listed above are given below:
> - Graviton3 (arm64)
>   good: v6.19 (05f7e89ab973)
>   bad: v7.0-rc2 (11439c4635ed)
>   workload: perf/syscall (execve)
>   bisected to: f1427a1d6415 ("slab: make percpu sheaves compatible
>   with kmalloc_nolock()/kfree_nolock()")
>
> - Sapphire Rapids (x86_64)
>   good: v6.19 (05f7e89ab973)
>   bad: v7.0-rc3 (1f318b96cc84)
>   workload: perf/syscall (fork)
>   bisected to: f1427a1d6415 ("slab: make percpu sheaves compatible
>   with kmalloc_nolock()/kfree_nolock()")
>
> - Graviton3 (arm64)
>   good: v6.19 (05f7e89ab973)
>   bad: v7.0-rc3 (1f318b96cc84)
>   workload: perf/syscall (execve)
>   bisected to: f3421f8d154c ("slab: introduce percpu sheaves bootstrap")
Yeah, none of these are likely to have introduced the regression.
We've seen other reports, e.g. from lkp, pointing to later commits that
remove the cpu (partial) slabs. The theory is that on benchmarks that
stress the vma and maple node caches (fork and execve are likely those),
the introduction of sheaves in 6.18 (for those caches only) resulted in
~doubled percpu caching capacity (and likely an associated performance
increase), as the sheaves were backed by cpu (partial) slabs. Removing
the latter in the 7.0 series then looks like a regression in isolation.
A regression of vmalloc related to kvfree_rcu might be new. Although if it's
kvfree_rcu() of vmalloc'd objects, it would be weird. More likely they are
kvmalloc'd but small enough to be actually kmalloc'd? What are the details
of that test?
> I'm aware that some fixes for the sheaves series have already been
> merged around v7.0-rc3; however, these do not appear to resolve the
> regressions described above completely. Are there additional fixes or
> follow-ups in progress that I should evaluate? I can investigate
> further and provide additional data, if that would be useful.
We have some followups planned for 7.1 that would make a difference for
systems with memoryless nodes. That would mean "numactl -H" shows nodes that
have cpus but no memory, or that memory is all ZONE_MOVABLE and not ZONE_NORMAL.
Thanks,
Vlastimil
> Thank you.
> Aishwarya Rambhadran
>
>
> On 23/01/26 12:22 PM, Vlastimil Babka wrote:
Thread overview: 15+ messages
2026-01-23 6:52 [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Vlastimil Babka
2026-01-23 6:52 ` [PATCH v4 01/22] mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache() Vlastimil Babka
2026-01-27 16:08 ` Liam R. Howlett
2026-01-29 15:18 ` [PATCH v4 00/22] slab: replace cpu (partial) slabs with sheaves Hao Li
2026-01-29 15:28 ` Vlastimil Babka
2026-01-29 16:06 ` Hao Li
2026-01-29 16:44 ` Liam R. Howlett
2026-01-30 4:38 ` Hao Li
2026-01-30 4:50 ` Hao Li
2026-01-30 6:17 ` Hao Li
2026-02-04 18:02 ` Vlastimil Babka
2026-02-04 18:24 ` Christoph Lameter (Ampere)
2026-02-06 16:44 ` Vlastimil Babka
2026-03-26 12:43 ` [REGRESSION] " Aishwarya Rambhadran
2026-03-26 14:42 ` Vlastimil Babka (SUSE)