* [PATCH v5 00/14] SLUB percpu sheaves
@ 2025-07-23 13:34 Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 01/14] slab: add opt-in caching layer of " Vlastimil Babka
                   ` (14 more replies)
  0 siblings, 15 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett,
	Liam R. Howlett

Hi,

This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). My hope is therefore that it can eventually be enabled
for all kmem caches and replace the cpu (partial) slabs.

This v5 is posted for review and testing/benchmarking purposes. After
6.17-rc1 I hope to post a rebased v6 and start including it in
linux-next.

Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays "magazines" as in the original Bonwick paper. The
per-NUMA-node cache of sheaves is thus called a "barn".

This caching may seem similar to the arrays in SLAB, but there are some
important differences:

- does not distinguish NUMA locality, thus there are no per-node
  "shared" arrays (with possible lock contention) and no "alien" arrays
  that would need periodic flushing
  - NUMA restricted allocations and strict_numa mode are still honoured,
    the percpu sheaves are bypassed for those allocations
  - a later patch (for separate evaluation) makes freeing remote objects
    bypass sheaves, so sheaves contain mostly (though not strictly)
    local objects
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
  for guaranteed and efficient allocations in a restricted context, when
  the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)

The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why the VMA
and maple node caches are sheaf-enabled in the patchset. In v5 I include
Liam's patches for the full maple tree conversion that uses the improved
preallocation API.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, the local double cmpxchg is
  replaced, thanks to local_trylock(), by effectively a preempt_disable()
  and no atomic operations. The same goes for freeing, which is otherwise
  a local double cmpxchg only for short-lived allocations (where the same
  slab is still active on the same cpu when the object is freed) and a
  more costly locked double cmpxchg otherwise.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
  separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by the kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.

- Preallocation support. A prefilled sheaf can be privately borrowed to
  perform a short term operation that is not allowed to block in the
  middle and may need to allocate some objects. If an upper bound (worst
  case) for the number of allocations is known, but on average much fewer
  allocations are actually needed, borrowing and returning a sheaf is much
  more efficient than a bulk allocation for the worst case followed by a
  bulk free of the many unused objects. Maple tree write operations should
  benefit from this. (A usage sketch follows after this list.)

- Compatibility with slub_debug. When slub_debug is enabled for a cache,
  we simply don't create the percpu sheaves so that the debugging hooks
  (at the node partial list slowpaths) are reached as before. The same
  thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
  reusing the (ineffective) paths for requests exceeding the cache's
  sheaf_capacity. This is in line with the existing approach where
  debugging bypasses the fast paths and SLUB_TINY prefers memory
  savings over performance.
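
To illustrate the intended usage pattern, here is a rough sketch. The
function and variable names here are only illustrative (the actual API
is added by the sheaf prefilling patch in this series;
kmem_cache_return_sheaf() is also mentioned in the changelog below):

  struct slab_sheaf *sheaf;
  void *obj;

  /* illustrative API names - see the prefilling patch for the real ones */

  /* outside the restricted context: prefill up to the worst case */
  sheaf = kmem_cache_prefill_sheaf(cache, GFP_KERNEL, max_objs);
  if (!sheaf)
          return -ENOMEM;

  /* in the restricted section: allocate from the prefilled sheaf */
  obj = kmem_cache_alloc_from_sheaf(cache, GFP_NOWAIT, sheaf);

  /* afterwards: return the sheaf together with any unused objects */
  kmem_cache_return_sheaf(cache, GFP_KERNEL, sheaf);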

GIT TREES:

this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v5r0
It is based on v6.16-rc1.

this series plus a microbenchmark hacked into slub_kunit:
https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v5-benchmarking

It allows evaluating the overhead of the added sheaves code, and the
benefits for single-threaded allocations/frees of varying batch size. I
plan to look into adding multi-threaded scenarios too.

The last commit there also adds sheaves to every cache to allow
measuring the effects on caches other than vma and maple node. Note these
measurements should be compared to slab_nomerge boots without sheaves,
as adding sheaves makes caches unmergeable.

RESULTS:

In order to get some numbers that should be only due to differences in
implementation, and not to cache layout side-effects in users of the slab
objects etc., I have started with an in-kernel microbenchmark that does
allocations and frees from a slab cache with or without sheaves and/or
memcg. It either alternates single object alloc and free, or allocates
10 objects and frees them, then 100, then 1000 - in order to see the
effects of exhausting percpu sheaves or barn, or (without sheaves) the
percpu slabs. The order of objects to free can also be shuffled instead
of FIFO - to stress the non-sheaf freeing slowpath more.
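
The core of one iteration is roughly the following (a simplified sketch
of the slub_kunit hack; batch and objs[] are placeholders, and the real
code also covers the memcg and shuffled variants):

  /* sketch only - not the actual benchmark code */
  for (i = 0; i < batch; i++)
          objs[i] = kmem_cache_alloc(s, GFP_KERNEL);

  /* optionally shuffle objs[] here to stress non-LIFO freeing */

  for (i = 0; i < batch; i++)
          kmem_cache_free(s, objs[i]);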

Measurements done on Ryzen 7 5700, bare metal.

The first question was how just having the sheaves implementation affects
existing no-sheaf caches due to the extra (unused) code. I have experimented
with changing inlining and adding unlikely() to the sheaves case. The
optimum seems to be what's currently in the implementation - fast-path
sheaves usage is inlined, any handling of the main sheaf being empty on
alloc or full on free is a separate function, and the if (s->sheaf_capacity)
check has neither likely() nor unlikely(). When I added unlikely() it
destroyed the performance of sheaves completely.

So the result is that with batch size 10, there's 2.4% overhead, and the
other cases are all impacted less than this. Hopefully that's acceptable
given the plan that eventually there would be sheaves everywhere and the
current cpu (partial) slabs scheme would be removed.

As for the benefits of enabling sheaves (capacity=32), see the results
below; it all looks good here. Of course this microbenchmark is not a
complete story, for at least these reasons:

- no kfree_rcu() evaluation
- it doesn't show barn spinlock contention effects. In theory this
  shouldn't be worse than without sheaves, because after exhausting the
  cpu (partial) slabs the list_lock has to be taken. Sheaf capacity vs
  capacity of partial slabs is a matter of tuning.

 ---------------------------------
 BATCH SIZE: 1 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115660272
 bench: no memcg, sheaves
 average (excl. iter 0): 95734972
 sheaves better by 17.2%
 bench: memcg, no sheaves
 average (excl. iter 0): 163682964
 bench: memcg, sheaves
 average (excl. iter 0): 144792803
 sheaves better by 11.5%

 ---------------------------------
 BATCH SIZE: 10 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115496906
 bench: no memcg, sheaves
 average (excl. iter 0): 97781102
 sheaves better by 15.3%
 bench: memcg, no sheaves
 average (excl. iter 0): 162771491
 bench: memcg, sheaves
 average (excl. iter 0): 144746490
 sheaves better by 11.0%

 ---------------------------------
 BATCH SIZE: 100 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 151796052
 bench: no memcg, sheaves
 average (excl. iter 0): 104641753
 sheaves better by 31.0%
 bench: memcg, no sheaves
 average (excl. iter 0): 200733436
 bench: memcg, sheaves
 average (excl. iter 0): 151340989
 sheaves better by 24.6%

 ---------------------------------
 BATCH SIZE: 1000 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 187623118
 bench: no memcg, sheaves
 average (excl. iter 0): 130914624
 sheaves better by 30.2%
 bench: memcg, no sheaves
 average (excl. iter 0): 240239575
 bench: memcg, sheaves
 average (excl. iter 0): 181474462
 sheaves better by 24.4%

 ---------------------------------
 BATCH SIZE: 10 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115110219
 bench: no memcg, sheaves
 average (excl. iter 0): 100597405
 sheaves better by 12.6%
 bench: memcg, no sheaves
 average (excl. iter 0): 163573377
 bench: memcg, sheaves
 average (excl. iter 0): 144535545
 sheaves better by 11.6%

 ---------------------------------
 BATCH SIZE: 100 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 152457970
 bench: no memcg, sheaves
 average (excl. iter 0): 108720274
 sheaves better by 28.6%
 bench: memcg, no sheaves
 average (excl. iter 0): 203478732
 bench: memcg, sheaves
 average (excl. iter 0): 151241821
 sheaves better by 25.6%

 ---------------------------------
 BATCH SIZE: 1000 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 189950559
 bench: no memcg, sheaves
 average (excl. iter 0): 177934450
 sheaves better by 6.3%
 bench: memcg, no sheaves
 average (excl. iter 0): 242988187
 bench: memcg, sheaves
 average (excl. iter 0): 221609979
 sheaves better by 8.7%

Vlastimil

---
Changes in v5:
- Apply review tags (Harry, Suren) except where changed too much (first
  patch).
- Handle CONFIG_SLUB_TINY by not creating percpu sheaves (Harry)
- Apply review feedback (typos, comments).
- Extract handling sheaf slow paths to separate non-inline functions
  __pcs_handle_empty() and __pcs_handle_full().
- Fix empty sheaf leak in rcu_free_sheaf() (Suren)
- Add "allow NUMA restricted allocations to use percpu sheaves".
- Add Liam's maple tree full sheaf conversion patches for easier
  evaluation.
- Rebase to v6.16-rc1.
- Link to v4: https://patch.msgid.link/20250425-slub-percpu-caches-v4-0-8a636982b4a4@suse.cz

Changes in v4:
- slub_debug disables sheaves for the cache in order to work properly
- strict_numa mode works as intended
- added a separate patch to make freeing remote objects skip sheaves
- various code refactoring suggested by Suren and Harry
- removed less useful stat counters and added missing ones for barn
  and prefilled sheaf events
- Link to v3: https://lore.kernel.org/r/20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz

Changes in v3:
- Squash localtry_lock conversion so it's used immediately.
- Incorporate feedback and add tags from Suren and Harry - thanks!
  - Mostly adding comments and some refactoring.
  - Fixes for kfree_rcu_sheaf() vmalloc handling, cpu hotremove
    flushing.
  - Fix wrong condition in kmem_cache_return_sheaf() that may have
    affected performance negatively.
  - Refactoring of free_to_pcs()
- Link to v2: https://lore.kernel.org/r/20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz

Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
  anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
  implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
  nodes.
- Userspace Testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
  prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

---
Liam R. Howlett (6):
      tools: Add testing support for changes to rcu and slab for sheaves
      tools: Add sheaves support to testing infrastructure
      testing/radix-tree/maple: Increase readers and reduce delay for faster machines
      maple_tree: Sheaf conversion
      maple_tree: Add single node allocation support to maple state
      maple_tree: Convert forking to use the sheaf interface

Vlastimil Babka (8):
      slab: add opt-in caching layer of percpu sheaves
      slab: add sheaf support for batching kfree_rcu() operations
      slab: sheaf prefilling for guaranteed allocations
      slab: determine barn status racily outside of lock
      maple_tree: use percpu sheaves for maple_node_cache
      mm, vma: use percpu sheaves for vm_area_struct cache
      mm, slub: skip percpu sheaves for remote object freeing
      mm, slab: allow NUMA restricted allocations to use percpu sheaves

 include/linux/maple_tree.h            |    6 +-
 include/linux/slab.h                  |   47 +
 lib/maple_tree.c                      |  393 +++-----
 lib/test_maple_tree.c                 |    8 +
 mm/slab.h                             |    4 +
 mm/slab_common.c                      |   32 +-
 mm/slub.c                             | 1646 +++++++++++++++++++++++++++++++--
 mm/vma_init.c                         |    1 +
 tools/include/linux/slab.h            |   65 +-
 tools/testing/radix-tree/maple.c      |  639 +++----------
 tools/testing/shared/linux.c          |  112 ++-
 tools/testing/shared/linux/rcupdate.h |   22 +
 12 files changed, 2104 insertions(+), 871 deletions(-)
---
base-commit: 82efd569a8909f2b13140c1b3de88535aea0b051
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>



* [PATCH v5 01/14] slab: add opt-in caching layer of percpu sheaves
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-18 10:09   ` Harry Yoo
  2025-08-19  4:19   ` Suren Baghdasaryan
  2025-07-23 13:34 ` [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will set up a caching layer of percpu arrays called
sheaves of the given capacity for the created cache.
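
For example, a cache would opt in roughly like this (the capacity value
and the cache/struct names below are only for illustration):

  struct kmem_cache_args args = {
          .sheaf_capacity = 32,
  };

  s = kmem_cache_create("foo_cache", sizeof(struct foo), &args,
                        SLAB_HWCACHE_ALIGN);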

Allocations from the cache will allocate via the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
put the object back into one of the sheaves.

When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless the limit on full sheaves is exceeded. In that
case a sheaf is flushed to slab(s) by an internal bulk free operation.
Flushing sheaves and barns is also wired to the existing cpu flushing
and cache shrinking operations.

The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
the sheaves are bypassed.
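
In other words (illustrative):

  /* may be served from a percpu sheaf */
  obj = kmem_cache_alloc(s, GFP_KERNEL);

  /* bypasses the sheaves and honours the requested node */
  obj = kmem_cache_alloc_node(s, GFP_KERNEL, nid);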

The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fall back to bulk
alloc/free to slabs directly to avoid double copying.
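
Existing users of the bulk API therefore need no changes, e.g.
(illustrative):

  void *objs[16];

  if (kmem_cache_alloc_bulk(s, GFP_KERNEL, ARRAY_SIZE(objs), objs)) {
          /* objects may now come from (and return to) percpu sheaves */
          kmem_cache_free_bulk(s, ARRAY_SIZE(objs), objs);
  }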

The sheaf_capacity value is exported in sysfs for observability.

Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
count objects allocated or freed using the sheaves (and thus not
counting towards the other alloc/free path counters). Counters
sheaf_refill and sheaf_flush count objects filled or flushed from or to
slab pages, and can be used to assess how effective the caching is. The
refill and flush operations will also count towards the usual
alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
the backing slabs.  For barn operations, barn_get and barn_put count how
many full sheaves were taken from or put to the barn, and the _fail
variants count how many such requests could not be satisfied, mainly
because the barn was either empty or full. While the barn also holds
empty sheaves to make some operations easier, these are not critical
enough to warrant their own counters.  Finally, there are
sheaf_alloc/sheaf_free counters.

Access to the percpu sheaves is protected by local_trylock() when
potential callers include irq context, and local_lock() otherwise (such
as when we already know the gfp flags allow blocking). The trylock
failures should be rare and we can easily fall back. Each per-NUMA-node
barn has a spin_lock.

When slub_debug is enabled for a cache with sheaf_capacity also
specified, the latter is ignored so that allocations and frees reach the
slow path where debugging hooks are processed. Similarly, we ignore it
with CONFIG_SLUB_TINY which prefers low memory usage to performance.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |   31 ++
 mm/slab.h            |    2 +
 mm/slab_common.c     |    5 +-
 mm/slub.c            | 1101 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 1092 insertions(+), 47 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..6cfd085907afb8fc6e502ff7a1a1830c52ff9125 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -335,6 +335,37 @@ struct kmem_cache_args {
 	 * %NULL means no constructor.
 	 */
 	void (*ctor)(void *);
+	/**
+	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
+	 *
+	 * With a non-zero value, allocations from the cache go through caching
+	 * arrays called sheaves. Each cpu has a main sheaf that's always
+	 * present, and a spare sheaf that may not be present. When both become
+	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
+	 * from the per-node barn.
+	 *
+	 * When no full sheaf is available, and gfp flags allow blocking, a
+	 * sheaf is allocated and filled from slab(s) using bulk allocation.
+	 * Otherwise the allocation falls back to the normal operation
+	 * allocating a single object from a slab.
+	 *
+	 * Analogously, when freeing and both percpu sheaves are full, the barn
+	 * may replace one of them with an empty sheaf, unless the limit on full
+	 * sheaves is exceeded. In that case a sheaf is bulk freed to slab pages.
+	 *
+	 * The sheaves do not enforce NUMA placement of objects, so allocations
+	 * via kmem_cache_alloc_node() with a node specified other than
+	 * NUMA_NO_NODE will bypass them.
+	 *
+	 * Bulk allocation and free operations also try to use the cpu sheaves
+	 * and barn, but fallback to using slab pages directly.
+	 *
+	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
+	 * is ignored.
+	 *
+	 * %0 means no sheaves will be created.
+	 */
+	unsigned int sheaf_capacity;
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e095e8db934564d559494cd81746ec..1980330c2fcb4a4613a7e4f7efc78b349993fd89 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -259,6 +259,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_sheaves __percpu *cpu_sheaves;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
@@ -272,6 +273,7 @@ struct kmem_cache {
 	/* Number of per cpu partial slabs to keep around */
 	unsigned int cpu_partial_slabs;
 #endif
+	unsigned int sheaf_capacity;
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bfe7c40eeee1a01c175766935c1e3c0304434a53..e2b197e47866c30acdbd1fee4159f262a751c5a7 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
 		return 1;
 #endif
 
+	if (s->cpu_sheaves)
+		return 1;
+
 	/*
 	 * We may have set a slab to be unmergeable during bootstrap.
 	 */
@@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
-	if (!args->usersize)
+	if (!args->usersize && !args->sheaf_capacity)
 		s = __kmem_cache_alias(name, object_size, args->align, flags,
 				       args->ctor);
 	if (s)
diff --git a/mm/slub.c b/mm/slub.c
index 31e11ef256f90ad8a21d6b090f810f4c991a68d6..6543aaade60b0adaab232b2256d65c1042c62e1c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
 #endif
 
 enum stat_item {
+	ALLOC_PCS,		/* Allocation from percpu sheaf */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PCS,		/* Free to percpu sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -372,6 +374,14 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
+	SHEAF_FLUSH,		/* Objects flushed from a sheaf */
+	SHEAF_REFILL,		/* Objects refilled to a sheaf */
+	SHEAF_ALLOC,		/* Allocation of an empty sheaf */
+	SHEAF_FREE,		/* Freeing of an empty sheaf */
+	BARN_GET,		/* Got full sheaf from barn */
+	BARN_GET_FAIL,		/* Failed to get full sheaf from barn */
+	BARN_PUT,		/* Put full sheaf to barn */
+	BARN_PUT_FAIL,		/* Failed to put full sheaf to barn */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -418,6 +428,33 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
 #endif
 }
 
+#define MAX_FULL_SHEAVES	10
+#define MAX_EMPTY_SHEAVES	10
+
+struct node_barn {
+	spinlock_t lock;
+	struct list_head sheaves_full;
+	struct list_head sheaves_empty;
+	unsigned int nr_full;
+	unsigned int nr_empty;
+};
+
+struct slab_sheaf {
+	union {
+		struct rcu_head rcu_head;
+		struct list_head barn_list;
+	};
+	unsigned int size;
+	void *objects[];
+};
+
+struct slub_percpu_sheaves {
+	local_trylock_t lock;
+	struct slab_sheaf *main; /* never NULL when unlocked */
+	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct node_barn *barn;
+};
+
 /*
  * The slab lists for all objects.
  */
@@ -430,6 +467,7 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+	struct node_barn *barn;
 };
 
 static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
@@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
  */
 static nodemask_t slab_nodes;
 
-#ifndef CONFIG_SLUB_TINY
 /*
  * Workqueue used for flush_cpu_slab().
  */
 static struct workqueue_struct *flushwq;
-#endif
+
+struct slub_flush_work {
+	struct work_struct work;
+	struct kmem_cache *s;
+	bool skip;
+};
+
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 /********************************************************************
  * 			Core slab cache functions
@@ -2437,6 +2482,359 @@ static void *setup_object(struct kmem_cache *s, void *object)
 	return object;
 }
 
+static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
+					s->sheaf_capacity), gfp);
+
+	if (unlikely(!sheaf))
+		return NULL;
+
+	stat(s, SHEAF_ALLOC);
+
+	return sheaf;
+}
+
+static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	kfree(sheaf);
+
+	stat(s, SHEAF_FREE);
+}
+
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+				   size_t size, void **p);
+
+
+static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
+			 gfp_t gfp)
+{
+	int to_fill = s->sheaf_capacity - sheaf->size;
+	int filled;
+
+	if (!to_fill)
+		return 0;
+
+	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
+					 &sheaf->objects[sheaf->size]);
+
+	sheaf->size += filled;
+
+	stat_add(s, SHEAF_REFILL, filled);
+
+	if (filled < to_fill)
+		return -ENOMEM;
+
+	return 0;
+}
+
+
+static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (!sheaf)
+		return NULL;
+
+	if (refill_sheaf(s, sheaf, gfp)) {
+		free_empty_sheaf(s, sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+/*
+ * Maximum number of objects freed during a single flush of main pcs sheaf.
+ * Translates directly to an on-stack array size.
+ */
+#define PCS_BATCH_MAX	32U
+
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
+
+/*
+ * Free all objects from the main sheaf. In order to perform
+ * __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
+ * object pointers are moved to an on-stack array under the lock. To bound the
+ * stack usage, limit each batch to PCS_BATCH_MAX.
+ *
+ * returns true if at least partially flushed
+ */
+static bool sheaf_flush_main(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned int batch, remaining;
+	void *objects[PCS_BATCH_MAX];
+	struct slab_sheaf *sheaf;
+	bool ret = false;
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return ret;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	sheaf = pcs->main;
+
+	batch = min(PCS_BATCH_MAX, sheaf->size);
+
+	sheaf->size -= batch;
+	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
+
+	remaining = sheaf->size;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	__kmem_cache_free_bulk(s, batch, &objects[0]);
+
+	stat_add(s, SHEAF_FLUSH, batch);
+
+	ret = true;
+
+	if (remaining)
+		goto next_batch;
+
+	return ret;
+}
+
+/*
+ * Free all objects from a sheaf that's unused, i.e. not linked to any
+ * cpu_sheaves, so we need no locking or batching. The locking is also not
+ * necessary when flushing cpu's sheaves (both spare and main) during cpu
+ * hotremove as the cpu is not executing anymore.
+ */
+static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	if (!sheaf->size)
+		return;
+
+	stat_add(s, SHEAF_FLUSH, sheaf->size);
+
+	__kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+
+	sheaf->size = 0;
+}
+
+/*
+ * Caller needs to make sure migration is disabled in order to fully flush
+ * single cpu's sheaves
+ *
+ * must not be called from an irq
+ *
+ * flushing operations are rare so let's keep it simple and flush to slabs
+ * directly, skipping the barn
+ */
+static void pcs_flush_all(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *spare;
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	spare = pcs->spare;
+	pcs->spare = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (spare) {
+		sheaf_flush_unused(s, spare);
+		free_empty_sheaf(s, spare);
+	}
+
+	sheaf_flush_main(s);
+}
+
+static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	/* The cpu is not executing anymore so we don't need pcs->lock */
+	sheaf_flush_unused(s, pcs->main);
+	if (pcs->spare) {
+		sheaf_flush_unused(s, pcs->spare);
+		free_empty_sheaf(s, pcs->spare);
+		pcs->spare = NULL;
+	}
+}
+
+static void pcs_destroy(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		/* can happen when unwinding failed create */
+		if (!pcs->main)
+			continue;
+
+		/*
+		 * We have already passed __kmem_cache_shutdown() so everything
+		 * was flushed and there should be no objects allocated from
+		 * slabs, otherwise kmem_cache_destroy() would have aborted.
+		 * Therefore something would have to be really wrong if the
+		 * warnings here trigger, and we should rather leave objects and
+		 * sheaves to leak in that case.
+		 */
+
+		WARN_ON(pcs->spare);
+
+		if (!WARN_ON(pcs->main->size)) {
+			free_empty_sheaf(s, pcs->main);
+			pcs->main = NULL;
+		}
+	}
+
+	free_percpu(s->cpu_sheaves);
+	s->cpu_sheaves = NULL;
+}
+
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *empty = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_empty) {
+		empty = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&empty->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+/*
+ * The following two functions are used mainly in cases where we have to undo an
+ * intended action due to a race or cpu migration. Thus they do not check the
+ * empty or full sheaf limits for simplicity.
+ */
+
+static void barn_put_empty_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_add(&sheaf->barn_list, &barn->sheaves_empty);
+	barn->nr_empty++;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+}
+
+static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_add(&sheaf->barn_list, &barn->sheaves_full);
+	barn->nr_full++;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+}
+
+/*
+ * If a full sheaf is available, return it and put the supplied empty one to
+ * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
+ * change.
+ */
+static struct slab_sheaf *
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&full->barn_list);
+		list_add(&empty->barn_list, &barn->sheaves_empty);
+		barn->nr_full--;
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+/*
+ * If an empty sheaf is available, return it and put the supplied full one to
+ * barn. But if there are too many full sheaves, reject this with -E2BIG.
+ */
+static struct slab_sheaf *
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+{
+	struct slab_sheaf *empty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full >= MAX_FULL_SHEAVES) {
+		empty = ERR_PTR(-E2BIG);
+	} else if (!barn->nr_empty) {
+		empty = ERR_PTR(-ENOMEM);
+	} else {
+		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
+					 barn_list);
+		list_del(&empty->barn_list);
+		list_add(&full->barn_list, &barn->sheaves_full);
+		barn->nr_empty--;
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static void barn_init(struct node_barn *barn)
+{
+	spin_lock_init(&barn->lock);
+	INIT_LIST_HEAD(&barn->sheaves_full);
+	INIT_LIST_HEAD(&barn->sheaves_empty);
+	barn->nr_full = 0;
+	barn->nr_empty = 0;
+}
+
+static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
+{
+	struct list_head empty_list;
+	struct list_head full_list;
+	struct slab_sheaf *sheaf, *sheaf2;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(&empty_list);
+	INIT_LIST_HEAD(&full_list);
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_splice_init(&barn->sheaves_full, &full_list);
+	barn->nr_full = 0;
+	list_splice_init(&barn->sheaves_empty, &empty_list);
+	barn->nr_empty = 0;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
+		sheaf_flush_unused(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+	}
+
+	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
+		free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Slab allocation and freeing
  */
@@ -3312,11 +3710,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 	put_partials_cpu(s, c);
 }
 
-struct slub_flush_work {
-	struct work_struct work;
-	struct kmem_cache *s;
-	bool skip;
-};
+static inline void flush_this_cpu_slab(struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+
+	if (c->slab)
+		flush_slab(s, c);
+
+	put_partials(s);
+}
+
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+	return c->slab || slub_percpu_partial(c);
+}
+
+#else /* CONFIG_SLUB_TINY */
+static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
+static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
+static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
+#endif /* CONFIG_SLUB_TINY */
+
+static bool has_pcs_used(int cpu, struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!s->cpu_sheaves)
+		return false;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	return (pcs->spare || pcs->main->size);
+}
+
+static void pcs_flush_all(struct kmem_cache *s);
 
 /*
  * Flush cpu slab.
@@ -3326,30 +3755,18 @@ struct slub_flush_work {
 static void flush_cpu_slab(struct work_struct *w)
 {
 	struct kmem_cache *s;
-	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
 
 	sfw = container_of(w, struct slub_flush_work, work);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
 
-	if (c->slab)
-		flush_slab(s, c);
+	if (s->cpu_sheaves)
+		pcs_flush_all(s);
 
-	put_partials(s);
+	flush_this_cpu_slab(s);
 }
 
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
-
-	return c->slab || slub_percpu_partial(c);
-}
-
-static DEFINE_MUTEX(flush_lock);
-static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
-
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
@@ -3360,7 +3777,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		if (!has_cpu_slab(cpu, s)) {
+		if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
@@ -3396,19 +3813,15 @@ static int slub_cpu_dead(unsigned int cpu)
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
+	list_for_each_entry(s, &slab_caches, list) {
 		__flush_cpu_slab(s, cpu);
+		if (s->cpu_sheaves)
+			__pcs_flush_all_cpu(s, cpu);
+	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
 
-#else /* CONFIG_SLUB_TINY */
-static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
-static inline void flush_all(struct kmem_cache *s) { }
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
-static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
-#endif /* CONFIG_SLUB_TINY */
-
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -4158,6 +4571,199 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
 }
 
+static struct slub_percpu_sheaves *
+__pcs_handle_empty(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
+{
+	struct slab_sheaf *empty = NULL;
+	struct slab_sheaf *full;
+	bool can_alloc;
+
+	if (pcs->spare && pcs->spare->size > 0) {
+		swap(pcs->main, pcs->spare);
+		return pcs;
+	}
+
+	full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+	if (full) {
+		stat(s, BARN_GET);
+		pcs->main = full;
+		return pcs;
+	}
+
+	stat(s, BARN_GET_FAIL);
+
+	can_alloc = gfpflags_allow_blocking(gfp);
+
+	if (can_alloc) {
+		if (pcs->spare) {
+			empty = pcs->spare;
+			pcs->spare = NULL;
+		} else {
+			empty = barn_get_empty_sheaf(pcs->barn);
+		}
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (!can_alloc)
+		return NULL;
+
+	if (empty) {
+		if (!refill_sheaf(s, empty, gfp)) {
+			full = empty;
+		} else {
+			/*
+			 * we must be very low on memory so don't bother
+			 * with the barn
+			 */
+			free_empty_sheaf(s, empty);
+		}
+	} else {
+		full = alloc_full_sheaf(s, gfp);
+	}
+
+	if (!full)
+		return NULL;
+
+	/*
+	 * we can reach here only when gfpflags_allow_blocking
+	 * so this must not be an irq
+	 */
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	/*
+	 * If we are returning an empty sheaf, we either got it from the
+	 * barn or had to allocate one. If we are returning a full
+	 * sheaf, it's due to racing or being migrated to a different
+	 * cpu. Breaching the barn's sheaf limits should thus be rare
+	 * enough, so just ignore them to simplify the recovery.
+	 */
+
+	if (pcs->main->size == 0) {
+		barn_put_empty_sheaf(pcs->barn, pcs->main);
+		pcs->main = full;
+		return pcs;
+	}
+
+	if (!pcs->spare) {
+		pcs->spare = full;
+		return pcs;
+	}
+
+	if (pcs->spare->size == 0) {
+		barn_put_empty_sheaf(pcs->barn, pcs->spare);
+		pcs->spare = full;
+		return pcs;
+	}
+
+	barn_put_full_sheaf(pcs->barn, full);
+	stat(s, BARN_PUT);
+
+	return pcs;
+}
+
+static __fastpath_inline
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slub_percpu_sheaves *pcs;
+	void *object;
+
+#ifdef CONFIG_NUMA
+	if (static_branch_unlikely(&strict_numa)) {
+		if (current->mempolicy)
+			return NULL;
+	}
+#endif
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return NULL;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+		pcs = __pcs_handle_empty(s, pcs, gfp);
+		if (unlikely(!pcs))
+			return NULL;
+	}
+
+	object = pcs->main->objects[--pcs->main->size];
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, ALLOC_PCS);
+
+	return object;
+}
+
+static __fastpath_inline
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned int allocated = 0;
+	unsigned int batch;
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return allocated;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *full;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			stat(s, BARN_GET);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		stat(s, BARN_GET_FAIL);
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		/*
+		 * Once full sheaves in the barn are depleted, let the bulk
+		 * allocation continue from slab pages, otherwise we would just
+		 * be copying arrays of pointers twice.
+		 */
+		return allocated;
+	}
+
+do_alloc:
+
+	main = pcs->main;
+	batch = min(size, main->size);
+
+	main->size -= batch;
+	memcpy(p, main->objects + main->size, batch * sizeof(void *));
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat_add(s, ALLOC_PCS, batch);
+
+	allocated += batch;
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return allocated;
+}
+
+
 /*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
@@ -4182,7 +4788,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_sheaves && node == NUMA_NO_NODE)
+		object = alloc_from_pcs(s, gfpflags);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -4554,6 +5164,274 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+/*
+ * pcs is locked. We should have got rid of the spare sheaf and obtained an
+ * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
+ * as a main sheaf, and make the current main sheaf a spare sheaf.
+ *
+ * However due to having relinquished the cpu_sheaves lock when obtaining
+ * the empty sheaf, we need to handle some unlikely but possible cases.
+ *
+ * If we put any sheaf to the barn here, it's because we were interrupted or
+ * have been migrated to a different cpu, which should be rare enough, so just
+ * ignore the barn's limits to simplify the handling.
+ *
+ * An alternative scenario that gets us here is when we fail
+ * barn_replace_full_sheaf(), because there's no empty sheaf available in the
+ * barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
+ * limit on full sheaves was not exceeded, we assume it didn't change and just
+ * put the full sheaf there.
+ */
+static void __pcs_install_empty_sheaf(struct kmem_cache *s,
+		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
+{
+	/* This is what we expect to find if nobody interrupted us. */
+	if (likely(!pcs->spare)) {
+		pcs->spare = pcs->main;
+		pcs->main = empty;
+		return;
+	}
+
+	/*
+	 * Unlikely because if the main sheaf had space, we would have just
+	 * freed to it. Get rid of our empty sheaf.
+	 */
+	if (pcs->main->size < s->sheaf_capacity) {
+		barn_put_empty_sheaf(pcs->barn, empty);
+		return;
+	}
+
+	/* Also unlikely for the same reason. */
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		barn_put_empty_sheaf(pcs->barn, empty);
+		return;
+	}
+
+	/*
+	 * We probably failed barn_replace_full_sheaf() due to no empty sheaf
+	 * available there, but we allocated one, so finish the job.
+	 */
+	barn_put_full_sheaf(pcs->barn, pcs->main);
+	stat(s, BARN_PUT);
+	pcs->main = empty;
+}
+
+static struct slub_percpu_sheaves *
+__pcs_handle_full(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
+{
+	struct slab_sheaf *empty;
+	bool put_fail;
+
+restart:
+	put_fail = false;
+
+	if (!pcs->spare) {
+		empty = barn_get_empty_sheaf(pcs->barn);
+		if (empty) {
+			pcs->spare = pcs->main;
+			pcs->main = empty;
+			return pcs;
+		}
+		goto alloc_empty;
+	}
+
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		return pcs;
+	}
+
+	empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+
+	if (!IS_ERR(empty)) {
+		stat(s, BARN_PUT);
+		pcs->main = empty;
+		return pcs;
+	}
+
+	if (PTR_ERR(empty) == -E2BIG) {
+		/* Since we got here, spare exists and is full */
+		struct slab_sheaf *to_flush = pcs->spare;
+
+		stat(s, BARN_PUT_FAIL);
+
+		pcs->spare = NULL;
+		local_unlock(&s->cpu_sheaves->lock);
+
+		sheaf_flush_unused(s, to_flush);
+		empty = to_flush;
+		goto got_empty;
+	}
+
+	/*
+	 * We could not replace full sheaf because barn had no empty
+	 * sheaves. We can still allocate it and put the full sheaf in
+	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
+	 * make sure to count the fail.
+	 */
+	put_fail = true;
+
+alloc_empty:
+	local_unlock(&s->cpu_sheaves->lock);
+
+	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+	if (empty)
+		goto got_empty;
+
+	if (put_fail)
+		stat(s, BARN_PUT_FAIL);
+
+	if (!sheaf_flush_main(s))
+		return NULL;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return NULL;
+
+	/*
+	 * we flushed the main sheaf so it should be empty now,
+	 * but in case we got preempted or migrated, we need to
+	 * check again
+	 */
+	if (pcs->main->size == s->sheaf_capacity)
+		goto restart;
+
+	return pcs;
+
+got_empty:
+	if (!local_trylock(&s->cpu_sheaves->lock)) {
+		barn_put_empty_sheaf(pcs->barn, empty);
+		return NULL;
+	}
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	__pcs_install_empty_sheaf(s, pcs, empty);
+
+	return pcs;
+}
+
+/*
+ * Free an object to the percpu sheaves.
+ * The object is expected to have passed slab_free_hook() already.
+ */
+static __fastpath_inline
+bool free_to_pcs(struct kmem_cache *s, void *object)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return false;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		pcs = __pcs_handle_full(s, pcs);
+		if (unlikely(!pcs))
+			return false;
+	}
+
+	pcs->main->objects[pcs->main->size++] = object;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, FREE_PCS);
+
+	return true;
+}
+
+/*
+ * Bulk free objects to the percpu sheaves.
+ * Unlike free_to_pcs() this includes the calls to all necessary hooks
+ * and the fallback to freeing to slab pages.
+ */
+static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main, *empty;
+	unsigned int batch, i = 0;
+	bool init;
+
+	init = slab_want_init_on_free(s);
+
+	while (i < size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--size];
+			if (!size)
+				return;
+			continue;
+		}
+
+		i++;
+	}
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fallback;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (likely(pcs->main->size < s->sheaf_capacity))
+		goto do_free;
+
+	if (!pcs->spare) {
+		empty = barn_get_empty_sheaf(pcs->barn);
+		if (!empty)
+			goto no_empty;
+
+		pcs->spare = pcs->main;
+		pcs->main = empty;
+		goto do_free;
+	}
+
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		goto do_free;
+	}
+
+	empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+	if (IS_ERR(empty)) {
+		stat(s, BARN_PUT_FAIL);
+		goto no_empty;
+	}
+
+	stat(s, BARN_PUT);
+	pcs->main = empty;
+
+do_free:
+	main = pcs->main;
+	batch = min(size, s->sheaf_capacity - main->size);
+
+	memcpy(main->objects + main->size, p, batch * sizeof(void *));
+	main->size += batch;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat_add(s, FREE_PCS, batch);
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return;
+
+no_empty:
+	local_unlock(&s->cpu_sheaves->lock);
+
+	/*
+	 * if we depleted all empty sheaves in the barn or there are too
+	 * many full sheaves, free the rest to slab pages
+	 */
+fallback:
+	__kmem_cache_free_bulk(s, size, p);
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4640,7 +5518,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	memcg_slab_free_hook(s, slab, &object, 1);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+		return;
+
+	if (!s->cpu_sheaves || !free_to_pcs(s, object))
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -5236,6 +6117,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (!size)
 		return;
 
+	/*
+	 * freeing to sheaves is incompatible with the detached freelist, so
+	 * once we go that way, we have to do everything differently
+	 */
+	if (s && s->cpu_sheaves) {
+		free_to_pcs_bulk(s, size, p);
+		return;
+	}
+
 	do {
 		struct detached_freelist df;
 
@@ -5354,7 +6244,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 				 void **p)
 {
-	int i;
+	unsigned int i = 0;
 
 	if (!size)
 		return 0;
@@ -5363,9 +6253,20 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p);
-	if (unlikely(i == 0))
-		return 0;
+	if (s->cpu_sheaves)
+		i = alloc_from_pcs_bulk(s, size, p);
+
+	if (i < size) {
+		/*
+		 * If we ran out of memory, don't bother with freeing back to
+		 * the percpu sheaves, we have bigger problems.
+		 */
+		if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
+			if (i > 0)
+				__kmem_cache_free_bulk(s, i, p);
+			return 0;
+		}
+	}
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -5375,11 +6276,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 		    slab_want_init_on_alloc(flags, s), s->object_size))) {
 		return 0;
 	}
-	return i;
+
+	return size;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
-
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
@@ -5513,7 +6414,7 @@ static inline int calculate_order(unsigned int size)
 }
 
 static void
-init_kmem_cache_node(struct kmem_cache_node *n)
+init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
@@ -5523,6 +6424,9 @@ init_kmem_cache_node(struct kmem_cache_node *n)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+	n->barn = barn;
+	if (barn)
+		barn_init(barn);
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5553,6 +6457,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 }
 #endif /* CONFIG_SLUB_TINY */
 
+static int init_percpu_sheaves(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+		int nid;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		local_trylock_init(&pcs->lock);
+
+		nid = cpu_to_mem(cpu);
+
+		pcs->barn = get_node(s, nid)->barn;
+		pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+
+		if (!pcs->main)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static struct kmem_cache *kmem_cache_node;
 
 /*
@@ -5588,7 +6516,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	init_kmem_cache_node(n, NULL);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
 
 	/*
@@ -5604,6 +6532,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn) {
+			WARN_ON(n->barn->nr_full);
+			WARN_ON(n->barn->nr_empty);
+			kfree(n->barn);
+			n->barn = NULL;
+		}
+
 		s->node[node] = NULL;
 		kmem_cache_free(kmem_cache_node, n);
 	}
@@ -5612,6 +6547,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 void __kmem_cache_release(struct kmem_cache *s)
 {
 	cache_random_seq_destroy(s);
+	if (s->cpu_sheaves)
+		pcs_destroy(s);
 #ifndef CONFIG_SLUB_TINY
 	free_percpu(s->cpu_slab);
 #endif
@@ -5624,20 +6561,29 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_node_mask(node, slab_nodes) {
 		struct kmem_cache_node *n;
+		struct node_barn *barn = NULL;
 
 		if (slab_state == DOWN) {
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+			if (!barn)
+				return 0;
+		}
+
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
-
 		if (!n) {
-			free_kmem_cache_nodes(s);
+			kfree(barn);
 			return 0;
 		}
 
-		init_kmem_cache_node(n);
+		init_kmem_cache_node(n, barn);
+
 		s->node[node] = n;
 	}
 	return 1;
@@ -5894,6 +6840,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn)
+			barn_shrink(s, n->barn);
 		free_partial(s, n);
 		if (n->nr_partial || node_nr_slabs(n))
 			return 1;
@@ -6097,6 +7045,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 			INIT_LIST_HEAD(promote + i);
 
+		if (n->barn)
+			barn_shrink(s, n->barn);
+
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -6209,12 +7160,24 @@ static int slab_mem_going_online_callback(void *arg)
 	 */
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
+		struct node_barn *barn = NULL;
+
 		/*
 		 * The structure may already exist if the node was previously
 		 * onlined and offlined.
 		 */
 		if (get_node(s, nid))
 			continue;
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
+
+			if (!barn) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
 		 *      since memory is not yet available from the node that
@@ -6222,10 +7185,13 @@ static int slab_mem_going_online_callback(void *arg)
 		 */
 		n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
 		if (!n) {
+			kfree(barn);
 			ret = -ENOMEM;
 			goto out;
 		}
-		init_kmem_cache_node(n);
+
+		init_kmem_cache_node(n, barn);
+
 		s->node[nid] = n;
 	}
 	/*
@@ -6444,6 +7410,17 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 
 	set_cpu_partial(s);
 
+	if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY)
+					&& !(s->flags & SLAB_DEBUG_FLAGS)) {
+		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+		if (!s->cpu_sheaves) {
+			err = -ENOMEM;
+			goto out;
+		}
+		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
+		s->sheaf_capacity = args->sheaf_capacity;
+	}
+
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -6460,6 +7437,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 	if (!alloc_kmem_cache_cpus(s))
 		goto out;
 
+	if (s->cpu_sheaves) {
+		err = init_percpu_sheaves(s);
+		if (err)
+			goto out;
+	}
+
 	err = 0;
 
 	/* Mutex is not taken during early boot */
@@ -6481,7 +7464,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 		__kmem_cache_release(s);
 	return err;
 }
-
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -6912,6 +7894,12 @@ static ssize_t order_show(struct kmem_cache *s, char *buf)
 }
 SLAB_ATTR_RO(order);
 
+static ssize_t sheaf_capacity_show(struct kmem_cache *s, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", s->sheaf_capacity);
+}
+SLAB_ATTR_RO(sheaf_capacity);
+
 static ssize_t min_partial_show(struct kmem_cache *s, char *buf)
 {
 	return sysfs_emit(buf, "%lu\n", s->min_partial);
@@ -7259,8 +8247,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PCS, free_cpu_sheaf);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -7285,6 +8275,14 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
+STAT_ATTR(SHEAF_REFILL, sheaf_refill);
+STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
+STAT_ATTR(SHEAF_FREE, sheaf_free);
+STAT_ATTR(BARN_GET, barn_get);
+STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
+STAT_ATTR(BARN_PUT, barn_put);
+STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -7315,6 +8313,7 @@ static struct attribute *slab_attrs[] = {
 	&object_size_attr.attr,
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
+	&sheaf_capacity_attr.attr,
 	&min_partial_attr.attr,
 	&cpu_partial_attr.attr,
 	&objects_partial_attr.attr,
@@ -7346,8 +8345,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_sheaf_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_sheaf_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
@@ -7372,6 +8373,14 @@ static struct attribute *slab_attrs[] = {
 	&cpu_partial_free_attr.attr,
 	&cpu_partial_node_attr.attr,
 	&cpu_partial_drain_attr.attr,
+	&sheaf_flush_attr.attr,
+	&sheaf_refill_attr.attr,
+	&sheaf_alloc_attr.attr,
+	&sheaf_free_attr.attr,
+	&barn_get_attr.attr,
+	&barn_get_fail_attr.attr,
+	&barn_put_attr.attr,
+	&barn_put_fail_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.50.1



* [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 01/14] slab: add opt-in caching layer of " Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-07-23 16:39   ` Uladzislau Rezki
  2025-07-23 13:34 ` [PATCH v5 03/14] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.

kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or, when the barn is full, flush it to
slab pages using bulk free. Then a new empty sheaf must be obtained to
put more objects there.

It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.

Expected advantages:
- batching the kfree_rcu() operations, which could eventually replace the
  existing batching
- sheaves can be reused for allocations via barn instead of being
  flushed to slabs, which is more efficient
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)

Possible disadvantage:
- objects might be waiting for more than their grace period (the grace
  period is determined by the last object freed into the sheaf), increasing
  memory usage - but the existing batching does that too.

Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.

Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.
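
For illustration, callers need no changes to benefit - an existing
kfree_rcu() user of a sheaf-enabled cache gets the rcu_free sheaf
batching transparently. A minimal sketch (the structure and cache below
are made up for the example and assume the usual slab/rcu headers; only
the kfree_rcu() call matters):

	struct foo {
		struct rcu_head rcu;
		long payload;
	};

	/* assume foo_cache was created with kmem_cache_args.sheaf_capacity set */
	static void foo_free_rcu(struct foo *f)
	{
		/*
		 * Queued on this cpu's rcu_free sheaf when one can be obtained
		 * with GFP_NOWAIT, otherwise falls back to the existing
		 * kfree_rcu() batching.
		 */
		kfree_rcu(f, rcu);
	}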

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h        |   2 +
 mm/slab_common.c |  24 +++++++
 mm/slub.c        | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 214 insertions(+), 5 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 1980330c2fcb4a4613a7e4f7efc78b349993fd89..44c9b70eaabbd87c06fb39b79dfb791d515acbde 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -459,6 +459,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
 			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e2b197e47866c30acdbd1fee4159f262a751c5a7..2d806e02568532a1000fd3912db6978e945dcfa8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
 		kvfree_rcu_list(head);
 }
 
+static bool kfree_rcu_sheaf(void *obj)
+{
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+
+	if (is_vmalloc_addr(obj))
+		return false;
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio)))
+		return false;
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+	if (s->cpu_sheaves)
+		return __kfree_rcu_sheaf(s, obj);
+
+	return false;
+}
+
 static bool
 need_offload_krc(struct kfree_rcu_cpu *krcp)
 {
@@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
+	if (kfree_rcu_sheaf(ptr))
+		return;
+
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
diff --git a/mm/slub.c b/mm/slub.c
index 6543aaade60b0adaab232b2256d65c1042c62e1c..f6d86cd3983533784583f1df6add186c4a74cd97 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -350,6 +350,8 @@ enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_PCS,		/* Free to percpu sheaf */
+	FREE_RCU_SHEAF,		/* Free to rcu_free sheaf */
+	FREE_RCU_SHEAF_FAIL,	/* Failed to free to a rcu_free sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -444,6 +446,7 @@ struct slab_sheaf {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
 	};
+	struct kmem_cache *cache;
 	unsigned int size;
 	void *objects[];
 };
@@ -452,6 +455,7 @@ struct slub_percpu_sheaves {
 	local_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
 	struct node_barn *barn;
 };
 
@@ -2490,6 +2494,8 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
 	if (unlikely(!sheaf))
 		return NULL;
 
+	sheaf->cache = s;
+
 	stat(s, SHEAF_ALLOC);
 
 	return sheaf;
@@ -2614,6 +2620,43 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 	sheaf->size = 0;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf)
+{
+	bool init = slab_want_init_on_free(s);
+	void **p = &sheaf->objects[0];
+	unsigned int i = 0;
+
+	while (i < sheaf->size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, true))) {
+			p[i] = p[--sheaf->size];
+			continue;
+		}
+
+		i++;
+	}
+}
+
+static void rcu_free_sheaf_nobarn(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	sheaf_flush_unused(s, sheaf);
+
+	free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
@@ -2626,7 +2669,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 static void pcs_flush_all(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
-	struct slab_sheaf *spare;
+	struct slab_sheaf *spare, *rcu_free;
 
 	local_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -2634,6 +2677,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 	spare = pcs->spare;
 	pcs->spare = NULL;
 
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
 	local_unlock(&s->cpu_sheaves->lock);
 
 	if (spare) {
@@ -2641,6 +2687,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 		free_empty_sheaf(s, spare);
 	}
 
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+
 	sheaf_flush_main(s);
 }
 
@@ -2657,6 +2706,11 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 		free_empty_sheaf(s, pcs->spare);
 		pcs->spare = NULL;
 	}
+
+	if (pcs->rcu_free) {
+		call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+		pcs->rcu_free = NULL;
+	}
 }
 
 static void pcs_destroy(struct kmem_cache *s)
@@ -2682,6 +2736,7 @@ static void pcs_destroy(struct kmem_cache *s)
 		 */
 
 		WARN_ON(pcs->spare);
+		WARN_ON(pcs->rcu_free);
 
 		if (!WARN_ON(pcs->main->size)) {
 			free_empty_sheaf(s, pcs->main);
@@ -3742,7 +3797,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
 
 	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-	return (pcs->spare || pcs->main->size);
+	return (pcs->spare || pcs->rcu_free || pcs->main->size);
 }
 
 static void pcs_flush_all(struct kmem_cache *s);
@@ -5339,6 +5394,127 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
 	return true;
 }
 
+static void rcu_free_sheaf(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct node_barn *barn;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+
+	s = sheaf->cache;
+
+	/*
+	 * This may remove some objects due to slab_free_hook() returning false,
+	 * so that the sheaf might no longer be completely full. But it's easier
+	 * to handle it as full (unless it became completely empty), as the code
+	 * handles it fine. The only downside is that sheaf will serve fewer
+	 * allocations when reused. It only happens due to debugging, which is a
+	 * performance hit anyway.
+	 */
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	barn = get_node(s, numa_mem_id())->barn;
+
+	/* due to slab_free_hook() */
+	if (unlikely(sheaf->size == 0))
+		goto empty;
+
+	/*
+	 * Checking nr_full/nr_empty outside lock avoids contention in case the
+	 * barn is at the respective limit. Due to the race we might go over the
+	 * limit but that should be rare and harmless.
+	 */
+
+	if (data_race(barn->nr_full) < MAX_FULL_SHEAVES) {
+		stat(s, BARN_PUT);
+		barn_put_full_sheaf(barn, sheaf);
+		return;
+	}
+
+	stat(s, BARN_PUT_FAIL);
+	sheaf_flush_unused(s, sheaf);
+
+empty:
+	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
+		barn_put_empty_sheaf(barn, sheaf);
+		return;
+	}
+
+	free_empty_sheaf(s, sheaf);
+}
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_sheaf;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fail;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(!pcs->rcu_free)) {
+
+		struct slab_sheaf *empty;
+
+		if (pcs->spare && pcs->spare->size == 0) {
+			pcs->rcu_free = pcs->spare;
+			pcs->spare = NULL;
+			goto do_free;
+		}
+
+		empty = barn_get_empty_sheaf(pcs->barn);
+
+		if (empty) {
+			pcs->rcu_free = empty;
+			goto do_free;
+		}
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty)
+			goto fail;
+
+		if (!local_trylock(&s->cpu_sheaves->lock)) {
+			barn_put_empty_sheaf(pcs->barn, empty);
+			goto fail;
+		}
+
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		if (unlikely(pcs->rcu_free))
+			barn_put_empty_sheaf(pcs->barn, empty);
+		else
+			pcs->rcu_free = empty;
+	}
+
+do_free:
+
+	rcu_sheaf = pcs->rcu_free;
+
+	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
+
+	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+		rcu_sheaf = NULL;
+	else
+		pcs->rcu_free = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (rcu_sheaf)
+		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+
+	stat(s, FREE_RCU_SHEAF);
+	return true;
+
+fail:
+	stat(s, FREE_RCU_SHEAF_FAIL);
+	return false;
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -5348,10 +5524,8 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *main, *empty;
+	bool init = slab_want_init_on_free(s);
 	unsigned int batch, i = 0;
-	bool init;
-
-	init = slab_want_init_on_free(s);
 
 	while (i < size) {
 		struct slab *slab = virt_to_slab(p[i]);
@@ -6838,6 +7012,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	flush_all_cpus_locked(s);
+
+	/* we might have rcu sheaves in flight */
+	if (s->cpu_sheaves)
+		rcu_barrier();
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		if (n->barn)
@@ -8251,6 +8430,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_PCS, free_cpu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -8349,6 +8530,8 @@ static struct attribute *slab_attrs[] = {
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_cpu_sheaf_attr.attr,
+	&free_rcu_sheaf_attr.attr,
+	&free_rcu_sheaf_fail_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 03/14] slab: sheaf prefilling for guaranteed allocations
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 01/14] slab: add opt-in caching layer of " Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 04/14] slab: determine barn status racily outside of lock Vlastimil Babka
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Add functions for efficient guaranteed allocations e.g. in a critical
section that cannot sleep, when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.

kmem_cache_prefill_sheaf() returns a sheaf containing at least given
number of objects.

kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until depleted.

kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
it to the cache's sheaf capacity for better efficiency of sheaves handling,
but it's not strictly necessary for the refill to succeed.

kmem_cache_refill_sheaf() can be used to refill a previously obtained
sheaf to requested size. If the current size is sufficient, it does
nothing. If the requested size exceeds cache's sheaf_capacity and the
sheaf's current capacity, the sheaf will be replaced with a new one,
hence the indirect pointer parameter.

kmem_cache_sheaf_size() can be used to query the current size.
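
To make the intended usage concrete, a sketch of a caller follows. It is
illustrative only - foo_cache stands for any sheaf-enabled cache and the
counts are assumptions, with actually_needed <= max_needed:

	static int foo_reserve_and_alloc(struct kmem_cache *foo_cache, void **objs,
					 unsigned int max_needed,
					 unsigned int actually_needed)
	{
		struct slab_sheaf *sheaf;
		unsigned int i;

		/* may sleep; done before entering the restricted context */
		sheaf = kmem_cache_prefill_sheaf(foo_cache, GFP_KERNEL, max_needed);
		if (!sheaf)
			return -ENOMEM;

		/* restricted context - these allocations are guaranteed to succeed */
		for (i = 0; i < actually_needed; i++)
			objs[i] = kmem_cache_alloc_from_sheaf(foo_cache,
							      GFP_KERNEL, sheaf);

		/* hand the (possibly partially used) sheaf back */
		kmem_cache_return_sheaf(foo_cache, GFP_KERNEL, sheaf);

		return 0;
	}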

The implementation supports requesting sizes that exceed the cache's
sheaf_capacity, but it is not efficient - such "oversize" sheaves are
allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
might be especially inefficient when replacing a sheaf with a new one of
a larger capacity. It is therefore better to size the cache's
sheaf_capacity accordingly to make oversize sheaves exceptional.
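
Growing a sheaf that was already obtained looks like this (again only a
sketch; new_size is an assumption and requesting more than the sheaf's
capacity replaces it through the indirect pointer):

	if (kmem_cache_refill_sheaf(foo_cache, GFP_KERNEL, &sheaf, new_size)) {
		/* refill failed, the original sheaf is left intact */
		return -ENOMEM;
	}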

CONFIG_SLUB_STATS counters are added for sheaf prefill and return
operations. A prefill or return is considered _fast when it is able to
grab or return a percpu spare sheaf (even if the sheaf needs a refill to
satisfy the request, as those should amortize over time), and _slow
otherwise (when the barn or even sheaf allocation/freeing has to be
involved). sheaf_prefill_oversize is provided to determine how many
prefills were oversize (counter for oversize returns is not necessary as
all oversize refills result in oversize returns).

When slub_debug is enabled for a cache with sheaves, no percpu sheaves
exist for it, but the prefill functionality is still provided simply by
all prefilled sheaves becoming oversize. If percpu sheaves are not
created for a cache due to not passing the sheaf_capacity argument on
cache creation, the prefills also work through oversize sheaves, but
there's a WARN_ON_ONCE() to indicate the omission.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |  16 ++++
 mm/slub.c            | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 281 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 6cfd085907afb8fc6e502ff7a1a1830c52ff9125..3ff70547db49d0880b1b6cb100527936e88ca509 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -829,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
 				   int node) __assume_slab_alignment __malloc;
 #define kmem_cache_alloc_node(...)	alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+				       struct slab_sheaf *sheaf);
+
+void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
+			struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_from_sheaf(...)	\
+			alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
+
 /*
  * These macros allow declaring a kmem_buckets * parameter alongside size, which
  * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
diff --git a/mm/slub.c b/mm/slub.c
index f6d86cd3983533784583f1df6add186c4a74cd97..8b3093ee2e02c9ff4e149ac54833db4972b414a3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -384,6 +384,11 @@ enum stat_item {
 	BARN_GET_FAIL,		/* Failed to get full sheaf from barn */
 	BARN_PUT,		/* Put full sheaf to barn */
 	BARN_PUT_FAIL,		/* Failed to put full sheaf to barn */
+	SHEAF_PREFILL_FAST,	/* Sheaf prefill grabbed the spare sheaf */
+	SHEAF_PREFILL_SLOW,	/* Sheaf prefill found no spare sheaf */
+	SHEAF_PREFILL_OVERSIZE,	/* Allocation of oversize sheaf for prefill */
+	SHEAF_RETURN_FAST,	/* Sheaf return reattached spare sheaf */
+	SHEAF_RETURN_SLOW,	/* Sheaf return could not reattach spare */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -445,6 +450,8 @@ struct slab_sheaf {
 	union {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
+		/* only used for prefilled sheafs */
+		unsigned int capacity;
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
@@ -2797,6 +2804,30 @@ static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf
 	spin_unlock_irqrestore(&barn->lock, flags);
 }
 
+static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *sheaf = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_full--;
+	} else if (barn->nr_empty) {
+		sheaf = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return sheaf;
+}
+
 /*
  * If a full sheaf is available, return it and put the supplied empty one to
  * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
@@ -4919,6 +4950,230 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
 
+/*
+ * returns a sheaf that has at least the requested size
+ * when prefilling is needed, do so with given gfp flags
+ *
+ * return NULL if sheaf allocation or prefilling failed
+ */
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *sheaf = NULL;
+
+	if (unlikely(size > s->sheaf_capacity)) {
+
+		/*
+		 * slab_debug disables cpu sheaves intentionally so all
+		 * prefilled sheaves become "oversize" and we give up on
+		 * performance for the debugging. Same with SLUB_TINY.
+		 * Creating a cache without sheaves and then requesting a
+		 * prefilled sheaf is however not expected, so warn.
+		 */
+		WARN_ON_ONCE(s->sheaf_capacity == 0 &&
+			     !IS_ENABLED(CONFIG_SLUB_TINY) &&
+			     !(s->flags & SLAB_DEBUG_FLAGS));
+
+		sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
+		if (!sheaf)
+			return NULL;
+
+		stat(s, SHEAF_PREFILL_OVERSIZE);
+		sheaf->cache = s;
+		sheaf->capacity = size;
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, size,
+					     &sheaf->objects[0])) {
+			kfree(sheaf);
+			return NULL;
+		}
+
+		sheaf->size = size;
+
+		return sheaf;
+	}
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (pcs->spare) {
+		sheaf = pcs->spare;
+		pcs->spare = NULL;
+		stat(s, SHEAF_PREFILL_FAST);
+	} else {
+		stat(s, SHEAF_PREFILL_SLOW);
+		sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
+		if (sheaf && sheaf->size)
+			stat(s, BARN_GET);
+		else
+			stat(s, BARN_GET_FAIL);
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+
+	if (!sheaf)
+		sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (sheaf && sheaf->size < size) {
+		if (refill_sheaf(s, sheaf, gfp)) {
+			sheaf_flush_unused(s, sheaf);
+			free_empty_sheaf(s, sheaf);
+			sheaf = NULL;
+		}
+	}
+
+	if (sheaf)
+		sheaf->capacity = s->sheaf_capacity;
+
+	return sheaf;
+}
+
+/*
+ * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * If the sheaf cannot simply become the percpu spare sheaf, but there's space
+ * for a full sheaf in the barn, we try to refill the sheaf back to the cache's
+ * sheaf_capacity to avoid handling partially full sheaves.
+ *
+ * If the refill fails because gfp is e.g. GFP_NOWAIT, or the barn is full, the
+ * sheaf is instead flushed and freed.
+ */
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+			     struct slab_sheaf *sheaf)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct node_barn *barn;
+
+	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
+		sheaf_flush_unused(s, sheaf);
+		kfree(sheaf);
+		return;
+	}
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (!pcs->spare) {
+		pcs->spare = sheaf;
+		sheaf = NULL;
+		stat(s, SHEAF_RETURN_FAST);
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (!sheaf)
+		return;
+
+	stat(s, SHEAF_RETURN_SLOW);
+
+	/* Accessing pcs->barn outside local_lock is safe. */
+	barn = pcs->barn;
+
+	/*
+	 * If the barn has too many full sheaves or we fail to refill the sheaf,
+	 * simply flush and free it.
+	 */
+	if (data_race(pcs->barn->nr_full) >= MAX_FULL_SHEAVES ||
+	    refill_sheaf(s, sheaf, gfp)) {
+		sheaf_flush_unused(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+		return;
+	}
+
+	barn_put_full_sheaf(barn, sheaf);
+	stat(s, BARN_PUT);
+}
+
+/*
+ * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
+ * the given size
+ *
+ * the sheaf might be replaced by a new one when requesting more than
+ * s->sheaf_capacity objects. If such replacement is necessary but the refill
+ * fails (returning -ENOMEM), the existing sheaf is left intact
+ *
+ * In practice we always refill to full sheaf's capacity.
+ */
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+			    struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+
+	/*
+	 * TODO: do we want to support *sheaf == NULL to be equivalent of
+	 * kmem_cache_prefill_sheaf() ?
+	 */
+	if (!sheafp || !(*sheafp))
+		return -EINVAL;
+
+	sheaf = *sheafp;
+	if (sheaf->size >= size)
+		return 0;
+
+	if (likely(sheaf->capacity >= size)) {
+		if (likely(sheaf->capacity == s->sheaf_capacity))
+			return refill_sheaf(s, sheaf, gfp);
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
+					     &sheaf->objects[sheaf->size])) {
+			return -ENOMEM;
+		}
+		sheaf->size = sheaf->capacity;
+
+		return 0;
+	}
+
+	/*
+	 * We had a regular sized sheaf and need an oversize one, or we had an
+	 * oversize one already but need a larger one now.
+	 * This should be a very rare path so let's not complicate it.
+	 */
+	sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+	if (!sheaf)
+		return -ENOMEM;
+
+	kmem_cache_return_sheaf(s, gfp, *sheafp);
+	*sheafp = sheaf;
+	return 0;
+}
+
+/*
+ * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * Guaranteed not to fail for as many allocations as was the requested size.
+ * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
+ *
+ * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT.
+ * The memcg charging is forced over the limit if necessary, to avoid failure.
+ */
+void *
+kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
+				   struct slab_sheaf *sheaf)
+{
+	void *ret = NULL;
+	bool init;
+
+	if (sheaf->size == 0)
+		goto out;
+
+	ret = sheaf->objects[--sheaf->size];
+
+	init = slab_want_init_on_alloc(gfp, s);
+
+	/* add __GFP_NOFAIL to force successful memcg charging */
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+out:
+	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
+
+	return ret;
+}
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
 /*
  * To avoid unnecessary overhead, we pass through large allocation requests
  * directly to the page allocator. We use __GFP_COMP, because we will need to
@@ -8464,6 +8719,11 @@ STAT_ATTR(BARN_GET, barn_get);
 STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
 STAT_ATTR(BARN_PUT, barn_put);
 STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
+STAT_ATTR(SHEAF_PREFILL_FAST, sheaf_prefill_fast);
+STAT_ATTR(SHEAF_PREFILL_SLOW, sheaf_prefill_slow);
+STAT_ATTR(SHEAF_PREFILL_OVERSIZE, sheaf_prefill_oversize);
+STAT_ATTR(SHEAF_RETURN_FAST, sheaf_return_fast);
+STAT_ATTR(SHEAF_RETURN_SLOW, sheaf_return_slow);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -8564,6 +8824,11 @@ static struct attribute *slab_attrs[] = {
 	&barn_get_fail_attr.attr,
 	&barn_put_attr.attr,
 	&barn_put_fail_attr.attr,
+	&sheaf_prefill_fast_attr.attr,
+	&sheaf_prefill_slow_attr.attr,
+	&sheaf_prefill_oversize_attr.attr,
+	&sheaf_return_fast_attr.attr,
+	&sheaf_return_slow_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 04/14] slab: determine barn status racily outside of lock
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (2 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 03/14] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Whether many of the barn operations can succeed is determined by the
current number of full or empty sheaves. Taking the barn->lock just to
find out that e.g. there are no empty sheaves results in unnecessary
overhead and lock contention. Thus perform these checks outside of the
lock with a data_race() annotated variable read and fail quickly without
taking the lock.

Checks for sheaf availability that racily succeed obviously have to be
repeated under the lock for correctness, but we can skip repeating the
checks if there are too many sheaves on the given list, as the limits
don't need to be strict.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/slub.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8b3093ee2e02c9ff4e149ac54833db4972b414a3..339d91c6ea29be99a14a8914117fab0e3e6ed26b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2760,9 +2760,12 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *empty = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_empty) {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty,
 					 struct slab_sheaf, barn_list);
 		list_del(&empty->barn_list);
@@ -2809,6 +2812,9 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *sheaf = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full) && !data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
 	if (barn->nr_full) {
@@ -2839,9 +2845,12 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
 	struct slab_sheaf *full = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full) {
+	if (likely(barn->nr_full)) {
 		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
 					barn_list);
 		list_del(&full->barn_list);
@@ -2864,19 +2873,23 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
 	struct slab_sheaf *empty;
 	unsigned long flags;
 
+	/* we don't repeat this check under barn->lock as it's not critical */
+	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
+		return ERR_PTR(-E2BIG);
+	if (!data_race(barn->nr_empty))
+		return ERR_PTR(-ENOMEM);
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full >= MAX_FULL_SHEAVES) {
-		empty = ERR_PTR(-E2BIG);
-	} else if (!barn->nr_empty) {
-		empty = ERR_PTR(-ENOMEM);
-	} else {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
 					 barn_list);
 		list_del(&empty->barn_list);
 		list_add(&full->barn_list, &barn->sheaves_full);
 		barn->nr_empty--;
 		barn->nr_full++;
+	} else {
+		empty = ERR_PTR(-ENOMEM);
 	}
 
 	spin_unlock_irqrestore(&barn->lock, flags);

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (3 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 04/14] slab: determine barn status racily outside of lock Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-22 16:28   ` Suren Baghdasaryan
  2025-07-23 13:34 ` [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure Vlastimil Babka
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Make testing work for the slab and rcu changes that have come in with
the sheaves work.

This only works with one kmem_cache, and only the first one used.
Subsequent setting of kmem_cache will not update the active kmem_cache
and will be silently dropped because there are other tests which happen
after the kmem_cache of interest is set.

The saved active kmem_cache is used in the rcu callback, which passes
the object to be freed.

The rcu call takes the rcu_head, which is passed in as a field of the
struct (in this case the rcu field of the maple tree node) and is located
by pointer math.  The field's offset is saved (in a global variable) so
the node pointer can be restored in the callback after the rcu grace
period expires.

Don't use any of this outside of testing, please.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/include/linux/slab.h            | 41 ++++++++++++++++++++++++++++++++---
 tools/testing/shared/linux.c          | 24 ++++++++++++++++----
 tools/testing/shared/linux/rcupdate.h | 22 +++++++++++++++++++
 3 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index c87051e2b26f5a7fee0362697fae067076b8e84d..d1444e79f2685edb828adbce8b3fbb500c0f8844 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -23,6 +23,12 @@ enum slab_state {
 	FULL
 };
 
+struct kmem_cache_args {
+	unsigned int align;
+	unsigned int sheaf_capacity;
+	void (*ctor)(void *);
+};
+
 static inline void *kzalloc(size_t size, gfp_t gfp)
 {
 	return kmalloc(size, gfp | __GFP_ZERO);
@@ -37,9 +43,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
 }
 void kmem_cache_free(struct kmem_cache *cachep, void *objp);
 
-struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
-			unsigned int align, unsigned int flags,
-			void (*ctor)(void *));
+
+struct kmem_cache *
+__kmem_cache_create_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags);
+
+/* If NULL is passed for @args, use this variant with default arguments. */
+static inline struct kmem_cache *
+__kmem_cache_default_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags)
+{
+	struct kmem_cache_args kmem_default_args = {};
+
+	return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
+}
+
+static inline struct kmem_cache *
+__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
+		unsigned int flags, void (*ctor)(void *))
+{
+	struct kmem_cache_args kmem_args = {
+		.align	= align,
+		.ctor	= ctor,
+	};
+
+	return __kmem_cache_create_args(name, size, &kmem_args, flags);
+}
+
+#define kmem_cache_create(__name, __object_size, __args, ...)           \
+	_Generic((__args),                                              \
+		struct kmem_cache_args *: __kmem_cache_create_args,	\
+		void *: __kmem_cache_default_args,			\
+		default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
 
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 0f97fb0d19e19c327aa4843a35b45cc086f4f366..f998555a1b2af4a899a468a652b04622df459ed3 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -20,6 +20,7 @@ struct kmem_cache {
 	pthread_mutex_t lock;
 	unsigned int size;
 	unsigned int align;
+	unsigned int sheaf_capacity;
 	int nr_objs;
 	void *objs;
 	void (*ctor)(void *);
@@ -31,6 +32,8 @@ struct kmem_cache {
 	void *private;
 };
 
+static struct kmem_cache *kmem_active = NULL;
+
 void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
 {
 	cachep->callback = callback;
@@ -147,6 +150,14 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
 	pthread_mutex_unlock(&cachep->lock);
 }
 
+void kmem_cache_free_active(void *objp)
+{
+	if (!kmem_active)
+		printf("WARNING: No active kmem_cache\n");
+
+	kmem_cache_free(kmem_active, objp);
+}
+
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
 {
 	if (kmalloc_verbose)
@@ -234,23 +245,28 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 }
 
 struct kmem_cache *
-kmem_cache_create(const char *name, unsigned int size, unsigned int align,
-		unsigned int flags, void (*ctor)(void *))
+__kmem_cache_create_args(const char *name, unsigned int size,
+			  struct kmem_cache_args *args,
+			  unsigned int flags)
 {
 	struct kmem_cache *ret = malloc(sizeof(*ret));
 
 	pthread_mutex_init(&ret->lock, NULL);
 	ret->size = size;
-	ret->align = align;
+	ret->align = args->align;
+	ret->sheaf_capacity = args->sheaf_capacity;
 	ret->nr_objs = 0;
 	ret->nr_allocated = 0;
 	ret->nr_tallocated = 0;
 	ret->objs = NULL;
-	ret->ctor = ctor;
+	ret->ctor = args->ctor;
 	ret->non_kernel = 0;
 	ret->exec_callback = false;
 	ret->callback = NULL;
 	ret->private = NULL;
+	if (!kmem_active)
+		kmem_active = ret;
+
 	return ret;
 }
 
diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
--- a/tools/testing/shared/linux/rcupdate.h
+++ b/tools/testing/shared/linux/rcupdate.h
@@ -9,4 +9,26 @@
 #define rcu_dereference_check(p, cond) rcu_dereference(p)
 #define RCU_INIT_POINTER(p, v)	do { (p) = (v); } while (0)
 
+void kmem_cache_free_active(void *objp);
+static unsigned long kfree_cb_offset = 0;
+
+static inline void kfree_rcu_cb(struct rcu_head *head)
+{
+	void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
+
+	kmem_cache_free_active(objp);
+}
+
+#ifndef offsetof
+#define offsetof(TYPE, MEMBER)	__builtin_offsetof(TYPE, MEMBER)
+#endif
+
+#define kfree_rcu(ptr, rhv)						\
+do {									\
+	if (!kfree_cb_offset)						\
+		kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);	\
+									\
+	call_rcu(&ptr->rhv, kfree_rcu_cb);				\
+} while (0)
+
 #endif

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (4 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-22 16:56   ` Suren Baghdasaryan
  2025-07-23 13:34 ` [PATCH v5 07/14] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Allocate a sheaf and fill it with the requested number of objects.  It is
not filled to the sheaf limit, so that incorrect allocation requests can
be detected.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/include/linux/slab.h   | 24 +++++++++++++
 tools/testing/shared/linux.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index d1444e79f2685edb828adbce8b3fbb500c0f8844..1962d7f1abee154e1cda5dba28aef213088dd198 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -23,6 +23,13 @@ enum slab_state {
 	FULL
 };
 
+struct slab_sheaf {
+	struct kmem_cache *cache;
+	unsigned int size;
+	unsigned int capacity;
+	void *objects[];
+};
+
 struct kmem_cache_args {
 	unsigned int align;
 	unsigned int sheaf_capacity;
@@ -80,4 +87,21 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+static inline unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
+
 #endif		/* _TOOLS_SLAB_H */
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index f998555a1b2af4a899a468a652b04622df459ed3..e0255f53159bd3a1325d49192283dd6790a5e3b8 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -181,6 +181,12 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 	if (kmalloc_verbose)
 		pr_debug("Bulk alloc %zu\n", size);
 
+	if (cachep->exec_callback) {
+		if (cachep->callback)
+			cachep->callback(cachep->private);
+		cachep->exec_callback = false;
+	}
+
 	pthread_mutex_lock(&cachep->lock);
 	if (cachep->nr_objs >= size) {
 		struct radix_tree_node *node;
@@ -270,6 +276,84 @@ __kmem_cache_create_args(const char *name, unsigned int size,
 	return ret;
 }
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+	unsigned int capacity;
+
+	if (size > s->sheaf_capacity)
+		capacity = size;
+	else
+		capacity = s->sheaf_capacity;
+
+	sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * capacity);
+	if (!sheaf) {
+		return NULL;
+	}
+
+	memset(sheaf, 0, sizeof(*sheaf) + sizeof(void *) * capacity);
+	sheaf->cache = s;
+	sheaf->capacity = capacity;
+	sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
+	if (!sheaf->size) {
+		free(sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf = *sheafp;
+	int refill;
+
+	if (sheaf->size >= size)
+		return 0;
+
+	if (size > sheaf->capacity) {
+		sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+		if (!sheaf)
+			return -ENOMEM;
+
+		kmem_cache_return_sheaf(s, gfp, *sheafp);
+		*sheafp = sheaf;
+		return 0;
+	}
+
+	refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
+				       &sheaf->objects[sheaf->size]);
+	if (!refill)
+		return -ENOMEM;
+
+	sheaf->size += refill;
+	return 0;
+}
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf *sheaf)
+{
+	if (sheaf->size) {
+		//s->non_kernel += sheaf->size;
+		kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+	}
+	free(sheaf);
+}
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf)
+{
+	if (sheaf->size == 0) {
+		printf("Nothing left in sheaf!\n");
+		return NULL;
+	}
+
+	return sheaf->objects[--sheaf->size];
+}
+
 /*
  * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
  */

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 07/14] maple_tree: use percpu sheaves for maple_node_cache
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (5 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 08/14] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Set up the maple_node_cache with percpu sheaves of size 32 to hopefully
improve its performance. Change the single node rcu freeing in
ma_free_rcu() to use kfree_rcu() instead of the custom callback, which
allows the rcu_free sheaf batching to be used. Note there are other
users of mt_free_rcu() where larger parts of maple tree are submitted to
call_rcu() as a whole, and that cannot use the rcu_free sheaf. But it's
still possible for maple nodes freed this way to be reused via the barn,
even if only some cpus are allowed to process rcu callbacks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index affe979bd14d30b96f8e012ff03dfd2fda6eec0b..82f39fe29a462aa3c779789a28efdd6cdef64c79 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -208,7 +208,7 @@ static void mt_free_rcu(struct rcu_head *head)
 static void ma_free_rcu(struct maple_node *node)
 {
 	WARN_ON(node->parent != ma_parent_ptr(node));
-	call_rcu(&node->rcu, mt_free_rcu);
+	kfree_rcu(node, rcu);
 }
 
 static void mt_set_height(struct maple_tree *mt, unsigned char height)
@@ -6285,9 +6285,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	struct kmem_cache_args args = {
+		.align  = sizeof(struct maple_node),
+		.sheaf_capacity = 32,
+	};
+
 	maple_node_cache = kmem_cache_create("maple_node",
-			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			sizeof(struct maple_node), &args,
+			SLAB_PANIC);
 }
 
 /**

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 08/14] mm, vma: use percpu sheaves for vm_area_struct cache
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (6 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 07/14] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Create the vm_area_struct cache with percpu sheaves of size 32 to
improve its performance.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vma_init.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vma_init.c b/mm/vma_init.c
index 8e53c7943561e7324e7992946b4065dec1149b82..52c6b55fac4519e0da39ca75ad018e14449d1d95 100644
--- a/mm/vma_init.c
+++ b/mm/vma_init.c
@@ -16,6 +16,7 @@ void __init vma_state_init(void)
 	struct kmem_cache_args args = {
 		.use_freeptr_offset = true,
 		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+		.sheaf_capacity = 32,
 	};
 
 	vm_area_cachep = kmem_cache_create("vm_area_struct",

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (7 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 08/14] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-25  5:22   ` Harry Yoo
  2025-07-23 13:34 ` [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Since we don't control the NUMA locality of objects in percpu sheaves,
allocations with node restrictions bypass them. Allocations without
restrictions may however still expect to get local objects with high
probability, and the introduction of sheaves can decrease it due to
objects freed on a remote node ending up in percpu sheaves.

The fraction of such remote frees seems low (5% on an 8-node machine)
but it can be expected that some cache or workload specific corner cases
exist. We can either conclude that this is not a problem due to the low
fraction, or we can make remote frees bypass percpu sheaves and go
directly to their slabs. This will make the remote frees more expensive,
but if it's only a small fraction, most frees will still benefit from
the lower overhead of percpu sheaves.

This patch thus makes remote object freeing bypass percpu sheaves,
including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
it's not intended to be a 100% guarantee that percpu sheaves will only
contain local objects. The refill from slabs does not provide that
guarantee in the first place, and there might be cpu migrations
happening when we need to unlock the local_lock. Avoiding all that would
be possible but complicated, so we leave it for later investigation of
whether it would be worth it. It can be expected that the more selective
freeing will itself prevent accumulation of remote objects in percpu
sheaves so any such violations would have only short-term effects.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab_common.c |  7 +++++--
 mm/slub.c        | 42 ++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 2d806e02568532a1000fd3912db6978e945dcfa8..f466f68a5bd82030a987baf849a98154cd48ef23 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1623,8 +1623,11 @@ static bool kfree_rcu_sheaf(void *obj)
 
 	slab = folio_slab(folio);
 	s = slab->slab_cache;
-	if (s->cpu_sheaves)
-		return __kfree_rcu_sheaf(s, obj);
+	if (s->cpu_sheaves) {
+		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
+			   slab_nid(slab) == numa_node_id()))
+			return __kfree_rcu_sheaf(s, obj);
+	}
 
 	return false;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 339d91c6ea29be99a14a8914117fab0e3e6ed26b..50fc35b8fc9b3101821c338e9469c134677ded51 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -455,6 +455,7 @@ struct slab_sheaf {
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
+	int node; /* only used for rcu_sheaf */
 	void *objects[];
 };
 
@@ -5682,7 +5683,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
 	 */
 	__rcu_free_sheaf_prepare(s, sheaf);
 
-	barn = get_node(s, numa_mem_id())->barn;
+	barn = get_node(s, sheaf->node)->barn;
 
 	/* due to slab_free_hook() */
 	if (unlikely(sheaf->size == 0))
@@ -5765,10 +5766,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
-	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
-	else
+	} else {
 		pcs->rcu_free = NULL;
+		rcu_sheaf->node = numa_mem_id();
+	}
 
 	local_unlock(&s->cpu_sheaves->lock);
 
@@ -5794,7 +5797,11 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	struct slab_sheaf *main, *empty;
 	bool init = slab_want_init_on_free(s);
 	unsigned int batch, i = 0;
+	void *remote_objects[PCS_BATCH_MAX];
+	unsigned int remote_nr = 0;
+	int node = numa_mem_id();
 
+next_remote_batch:
 	while (i < size) {
 		struct slab *slab = virt_to_slab(p[i]);
 
@@ -5804,7 +5811,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
 			p[i] = p[--size];
 			if (!size)
-				return;
+				goto flush_remote;
+			continue;
+		}
+
+		if (unlikely(IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)) {
+			remote_objects[remote_nr] = p[i];
+			p[i] = p[--size];
+			if (++remote_nr >= PCS_BATCH_MAX)
+				goto flush_remote;
 			continue;
 		}
 
@@ -5872,6 +5887,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	 */
 fallback:
 	__kmem_cache_free_bulk(s, size, p);
+
+flush_remote:
+	if (remote_nr) {
+		__kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
+		if (i < size) {
+			remote_nr = 0;
+			goto next_remote_batch;
+		}
+	}
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5963,8 +5987,14 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (!s->cpu_sheaves || !free_to_pcs(s, object))
-		do_slab_free(s, slab, object, object, 1, addr);
+	if (s->cpu_sheaves && likely(!IS_ENABLED(CONFIG_NUMA) ||
+				     slab_nid(slab) == numa_mem_id())) {
+		if (likely(free_to_pcs(s, object))) {
+			return;
+		}
+	}
+
+	do_slab_free(s, slab, object, object, 1, addr);
 }
 
 #ifdef CONFIG_MEMCG

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (8 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-22 19:58   ` Suren Baghdasaryan
  2025-08-25  6:52   ` Harry Yoo
  2025-07-23 13:34 ` [PATCH v5 11/14] testing/radix-tree/maple: Increase readers and reduce delay for faster machines Vlastimil Babka
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Currently allocations asking for a specific node explicitly or via
mempolicy in strict_numa mode bypass percpu sheaves. Since sheaves
contain mostly local objects, we can try allocating from them if the
local node happens to be the requested node or allowed by the mempolicy.
If we find the object from percpu sheaves is not from the expected node,
we skip the sheaves - this should be rare.
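
For example, a node-specific allocation such as the following (foo_cache
standing in for any sheaf-enabled cache):

	obj = kmem_cache_alloc_node(foo_cache, GFP_KERNEL, nid);

can now be served from the percpu sheaves when nid happens to be the
local node; before this patch such allocations always bypassed the
sheaves for sheaf-enabled caches.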

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 50fc35b8fc9b3101821c338e9469c134677ded51..b98983b8d2e3e04ea256d91efcf0215ff0ae7e38 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4765,18 +4765,42 @@ __pcs_handle_empty(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t
 }
 
 static __fastpath_inline
-void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
 {
 	struct slub_percpu_sheaves *pcs;
 	void *object;
 
 #ifdef CONFIG_NUMA
-	if (static_branch_unlikely(&strict_numa)) {
-		if (current->mempolicy)
-			return NULL;
+	if (static_branch_unlikely(&strict_numa) &&
+			 node == NUMA_NO_NODE) {
+
+		struct mempolicy *mpol = current->mempolicy;
+
+		if (mpol) {
+			/*
+			 * Special BIND rule support. If the local node
+			 * is in permitted set then do not redirect
+			 * to a particular node.
+			 * Otherwise we apply the memory policy to get
+			 * the node we need to allocate on.
+			 */
+			if (mpol->mode != MPOL_BIND ||
+					!node_isset(numa_mem_id(), mpol->nodes))
+
+				node = mempolicy_slab_node();
+		}
 	}
 #endif
 
+	if (unlikely(node != NUMA_NO_NODE)) {
+		/*
+		 * We assume the percpu sheaves contain only local objects
+		 * although it's not completely guaranteed, so we verify later.
+		 */
+		if (node != numa_mem_id())
+			return NULL;
+	}
+
 	if (!local_trylock(&s->cpu_sheaves->lock))
 		return NULL;
 
@@ -4788,7 +4812,21 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
 			return NULL;
 	}
 
-	object = pcs->main->objects[--pcs->main->size];
+	object = pcs->main->objects[pcs->main->size - 1];
+
+	if (unlikely(node != NUMA_NO_NODE)) {
+		/*
+		 * Verify that the object was from the node we want. This could
+		 * be false because of cpu migration during an unlocked part of
+		 * the current allocation or previous freeing process.
+		 */
+		if (folio_nid(virt_to_folio(object)) != node) {
+			local_unlock(&s->cpu_sheaves->lock);
+			return NULL;
+		}
+	}
+
+	pcs->main->size--;
 
 	local_unlock(&s->cpu_sheaves->lock);
 
@@ -4888,8 +4926,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	if (s->cpu_sheaves && node == NUMA_NO_NODE)
-		object = alloc_from_pcs(s, gfpflags);
+	if (s->cpu_sheaves)
+		object = alloc_from_pcs(s, gfpflags, node);
 
 	if (!object)
 		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 11/14] testing/radix-tree/maple: Increase readers and reduce delay for faster machines
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (9 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-07-23 13:34 ` [PATCH v5 12/14] maple_tree: Sheaf conversion Vlastimil Babka
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <howlett@gmail.com>

Add more threads and reduce the delays of the readers to increase the
possibility of catching the rcu changes.  The test does not pass unless
the reader is seen.

Signed-off-by: Liam R. Howlett <howlett@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 tools/testing/radix-tree/maple.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 2c0b3830125336af760768597d39ed07a2f8e92b..f6f923c9dc1039997953a94ec184c560b225c2d4 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35062,7 +35062,7 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals)
 
 	int i;
 	void *(*function)(void *);
-	pthread_t readers[20];
+	pthread_t readers[30];
 	unsigned int index = vals->index;
 
 	mt_set_in_rcu(mt);
@@ -35080,14 +35080,14 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals)
 		}
 	}
 
-	usleep(5); /* small yield to ensure all threads are at least started. */
+	usleep(3); /* small yield to ensure all threads are at least started. */
 
 	while (index <= vals->last) {
 		mtree_store(mt, index,
 			    (index % 2 ? vals->entry2 : vals->entry3),
 			    GFP_KERNEL);
 		index++;
-		usleep(5);
+		usleep(2);
 	}
 
 	while (i--)
@@ -35098,6 +35098,7 @@ void run_check_rcu_slowread(struct maple_tree *mt, struct rcu_test_struct *vals)
 	MT_BUG_ON(mt, !vals->seen_entry3);
 	MT_BUG_ON(mt, !vals->seen_both);
 }
+
 static noinline void __init check_rcu_simulated(struct maple_tree *mt)
 {
 	unsigned long i, nr_entries = 1000;

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 12/14] maple_tree: Sheaf conversion
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (10 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 11/14] testing/radix-tree/maple: Increase readers and reduce delay for faster machines Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-22 20:18   ` Suren Baghdasaryan
  2025-07-23 13:34 ` [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state Vlastimil Babka
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

From: "Liam R. Howlett" <Liam.Howlett@oracle.com>

Use sheaves instead of bulk allocations.  This should speed up the
allocations and the return path of unused allocations.

Remove the push/pop of nodes from the maple state.  Remove unnecessary
testing and ifdef out other testing that will probably be deleted.  Fix
the testcase for testing the race condition and move some testing around
within the same commit.
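
For reviewers who have not followed the sheaf API from earlier in this
series, the conversion boils down to the prefill/pop/return cycle
sketched below.  This is an orientation sketch only, not part of the
patch: "sheaf_cycle_sketch" and "needed" are made-up names, with
"needed" standing in for whatever worst-case node count the operation
computed, and error handling is kept to a minimum.

   static int sheaf_cycle_sketch(unsigned int needed, gfp_t gfp)
   {
           struct slab_sheaf *sheaf;
           struct maple_node *node;

           /* Prefill a sheaf holding at least "needed" nodes. */
           sheaf = kmem_cache_prefill_sheaf(maple_node_cache, gfp, needed);
           if (!sheaf)
                   return -ENOMEM;

           /* Pops from a prefilled sheaf are not expected to fail. */
           node = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT,
                                              sheaf);
           /* ... link "node" into the tree, pop more as needed ... */

           /* Hand the sheaf back together with any unused nodes. */
           kmem_cache_return_sheaf(maple_node_cache, GFP_KERNEL, sheaf);
           return 0;
   }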

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/maple_tree.h       |   6 +-
 lib/maple_tree.c                 | 331 ++++----------------
 lib/test_maple_tree.c            |   8 +
 tools/testing/radix-tree/maple.c | 632 +++++++--------------------------------
 tools/testing/shared/linux.c     |   8 +-
 5 files changed, 185 insertions(+), 800 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 9ef1290382249462d73ae72435dada7ce4b0622c..3cf1ae9dde7ce43fa20ae400c01fefad048c302e 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -442,7 +442,8 @@ struct ma_state {
 	struct maple_enode *node;	/* The node containing this entry */
 	unsigned long min;		/* The minimum index of this node - implied pivot min */
 	unsigned long max;		/* The maximum index of this node - implied pivot max */
-	struct maple_alloc *alloc;	/* Allocated nodes for this operation */
+	struct slab_sheaf *sheaf;	/* Allocated nodes for this operation */
+	unsigned long node_request;
 	enum maple_status status;	/* The status of the state (active, start, none, etc) */
 	unsigned char depth;		/* depth of tree descent during write */
 	unsigned char offset;
@@ -490,7 +491,8 @@ struct ma_wr_state {
 		.status = ma_start,					\
 		.min = 0,						\
 		.max = ULONG_MAX,					\
-		.alloc = NULL,						\
+		.node_request= 0,					\
+		.sheaf = NULL,						\
 		.mas_flags = 0,						\
 		.store_type = wr_invalid,				\
 	}
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 82f39fe29a462aa3c779789a28efdd6cdef64c79..3c3c14a76d98ded3b619c178d64099b464a2ca23 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -198,6 +198,22 @@ static void mt_free_rcu(struct rcu_head *head)
 	kmem_cache_free(maple_node_cache, node);
 }
 
+static void mt_return_sheaf(struct slab_sheaf *sheaf)
+{
+	kmem_cache_return_sheaf(maple_node_cache, GFP_KERNEL, sheaf);
+}
+
+static struct slab_sheaf *mt_get_sheaf(gfp_t gfp, int count)
+{
+	return kmem_cache_prefill_sheaf(maple_node_cache, gfp, count);
+}
+
+static int mt_refill_sheaf(gfp_t gfp, struct slab_sheaf **sheaf,
+		unsigned int size)
+{
+	return kmem_cache_refill_sheaf(maple_node_cache, gfp, sheaf, size);
+}
+
 /*
  * ma_free_rcu() - Use rcu callback to free a maple node
  * @node: The node to free
@@ -590,67 +606,6 @@ static __always_inline bool mte_dead_node(const struct maple_enode *enode)
 	return ma_dead_node(node);
 }
 
-/*
- * mas_allocated() - Get the number of nodes allocated in a maple state.
- * @mas: The maple state
- *
- * The ma_state alloc member is overloaded to hold a pointer to the first
- * allocated node or to the number of requested nodes to allocate.  If bit 0 is
- * set, then the alloc contains the number of requested nodes.  If there is an
- * allocated node, then the total allocated nodes is in that node.
- *
- * Return: The total number of nodes allocated
- */
-static inline unsigned long mas_allocated(const struct ma_state *mas)
-{
-	if (!mas->alloc || ((unsigned long)mas->alloc & 0x1))
-		return 0;
-
-	return mas->alloc->total;
-}
-
-/*
- * mas_set_alloc_req() - Set the requested number of allocations.
- * @mas: the maple state
- * @count: the number of allocations.
- *
- * The requested number of allocations is either in the first allocated node,
- * located in @mas->alloc->request_count, or directly in @mas->alloc if there is
- * no allocated node.  Set the request either in the node or do the necessary
- * encoding to store in @mas->alloc directly.
- */
-static inline void mas_set_alloc_req(struct ma_state *mas, unsigned long count)
-{
-	if (!mas->alloc || ((unsigned long)mas->alloc & 0x1)) {
-		if (!count)
-			mas->alloc = NULL;
-		else
-			mas->alloc = (struct maple_alloc *)(((count) << 1U) | 1U);
-		return;
-	}
-
-	mas->alloc->request_count = count;
-}
-
-/*
- * mas_alloc_req() - get the requested number of allocations.
- * @mas: The maple state
- *
- * The alloc count is either stored directly in @mas, or in
- * @mas->alloc->request_count if there is at least one node allocated.  Decode
- * the request count if it's stored directly in @mas->alloc.
- *
- * Return: The allocation request count.
- */
-static inline unsigned int mas_alloc_req(const struct ma_state *mas)
-{
-	if ((unsigned long)mas->alloc & 0x1)
-		return (unsigned long)(mas->alloc) >> 1;
-	else if (mas->alloc)
-		return mas->alloc->request_count;
-	return 0;
-}
-
 /*
  * ma_pivots() - Get a pointer to the maple node pivots.
  * @node: the maple node
@@ -1148,77 +1103,15 @@ static int mas_ascend(struct ma_state *mas)
  */
 static inline struct maple_node *mas_pop_node(struct ma_state *mas)
 {
-	struct maple_alloc *ret, *node = mas->alloc;
-	unsigned long total = mas_allocated(mas);
-	unsigned int req = mas_alloc_req(mas);
+	struct maple_node *ret;
 
-	/* nothing or a request pending. */
-	if (WARN_ON(!total))
+	if (WARN_ON_ONCE(!mas->sheaf))
 		return NULL;
 
-	if (total == 1) {
-		/* single allocation in this ma_state */
-		mas->alloc = NULL;
-		ret = node;
-		goto single_node;
-	}
-
-	if (node->node_count == 1) {
-		/* Single allocation in this node. */
-		mas->alloc = node->slot[0];
-		mas->alloc->total = node->total - 1;
-		ret = node;
-		goto new_head;
-	}
-	node->total--;
-	ret = node->slot[--node->node_count];
-	node->slot[node->node_count] = NULL;
-
-single_node:
-new_head:
-	if (req) {
-		req++;
-		mas_set_alloc_req(mas, req);
-	}
-
+	ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
 	memset(ret, 0, sizeof(*ret));
-	return (struct maple_node *)ret;
-}
-
-/*
- * mas_push_node() - Push a node back on the maple state allocation.
- * @mas: The maple state
- * @used: The used maple node
- *
- * Stores the maple node back into @mas->alloc for reuse.  Updates allocated and
- * requested node count as necessary.
- */
-static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
-{
-	struct maple_alloc *reuse = (struct maple_alloc *)used;
-	struct maple_alloc *head = mas->alloc;
-	unsigned long count;
-	unsigned int requested = mas_alloc_req(mas);
-
-	count = mas_allocated(mas);
 
-	reuse->request_count = 0;
-	reuse->node_count = 0;
-	if (count) {
-		if (head->node_count < MAPLE_ALLOC_SLOTS) {
-			head->slot[head->node_count++] = reuse;
-			head->total++;
-			goto done;
-		}
-		reuse->slot[0] = head;
-		reuse->node_count = 1;
-	}
-
-	reuse->total = count + 1;
-	mas->alloc = reuse;
-done:
-	if (requested > 1)
-		mas_set_alloc_req(mas, requested - 1);
+	return ret;
 }
 
 /*
@@ -1228,75 +1121,32 @@ static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
  */
 static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 {
-	struct maple_alloc *node;
-	unsigned long allocated = mas_allocated(mas);
-	unsigned int requested = mas_alloc_req(mas);
-	unsigned int count;
-	void **slots = NULL;
-	unsigned int max_req = 0;
-
-	if (!requested)
-		return;
+	if (unlikely(mas->sheaf)) {
+		unsigned long refill = mas->node_request;
 
-	mas_set_alloc_req(mas, 0);
-	if (mas->mas_flags & MA_STATE_PREALLOC) {
-		if (allocated)
+		if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
+			mas->node_request = 0;
 			return;
-		WARN_ON(!allocated);
-	}
-
-	if (!allocated || mas->alloc->node_count == MAPLE_ALLOC_SLOTS) {
-		node = (struct maple_alloc *)mt_alloc_one(gfp);
-		if (!node)
-			goto nomem_one;
-
-		if (allocated) {
-			node->slot[0] = mas->alloc;
-			node->node_count = 1;
-		} else {
-			node->node_count = 0;
 		}
 
-		mas->alloc = node;
-		node->total = ++allocated;
-		node->request_count = 0;
-		requested--;
-	}
+		if (mt_refill_sheaf(gfp, &mas->sheaf, refill))
+			goto error;
 
-	node = mas->alloc;
-	while (requested) {
-		max_req = MAPLE_ALLOC_SLOTS - node->node_count;
-		slots = (void **)&node->slot[node->node_count];
-		max_req = min(requested, max_req);
-		count = mt_alloc_bulk(gfp, max_req, slots);
-		if (!count)
-			goto nomem_bulk;
-
-		if (node->node_count == 0) {
-			node->slot[0]->node_count = 0;
-			node->slot[0]->request_count = 0;
-		}
+		mas->node_request = 0;
+		return;
+	}
 
-		node->node_count += count;
-		allocated += count;
-		/* find a non-full node*/
-		do {
-			node = node->slot[0];
-		} while (unlikely(node->node_count == MAPLE_ALLOC_SLOTS));
-		requested -= count;
+	mas->sheaf = mt_get_sheaf(gfp, mas->node_request);
+	if (likely(mas->sheaf)) {
+		mas->node_request = 0;
+		return;
 	}
-	mas->alloc->total = allocated;
-	return;
 
-nomem_bulk:
-	/* Clean up potential freed allocations on bulk failure */
-	memset(slots, 0, max_req * sizeof(unsigned long));
-	mas->alloc->total = allocated;
-nomem_one:
-	mas_set_alloc_req(mas, requested);
+error:
 	mas_set_err(mas, -ENOMEM);
 }
 
+
 /*
  * mas_free() - Free an encoded maple node
  * @mas: The maple state
@@ -1307,42 +1157,7 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
  */
 static inline void mas_free(struct ma_state *mas, struct maple_enode *used)
 {
-	struct maple_node *tmp = mte_to_node(used);
-
-	if (mt_in_rcu(mas->tree))
-		ma_free_rcu(tmp);
-	else
-		mas_push_node(mas, tmp);
-}
-
-/*
- * mas_node_count_gfp() - Check if enough nodes are allocated and request more
- * if there is not enough nodes.
- * @mas: The maple state
- * @count: The number of nodes needed
- * @gfp: the gfp flags
- */
-static void mas_node_count_gfp(struct ma_state *mas, int count, gfp_t gfp)
-{
-	unsigned long allocated = mas_allocated(mas);
-
-	if (allocated < count) {
-		mas_set_alloc_req(mas, count - allocated);
-		mas_alloc_nodes(mas, gfp);
-	}
-}
-
-/*
- * mas_node_count() - Check if enough nodes are allocated and request more if
- * there is not enough nodes.
- * @mas: The maple state
- * @count: The number of nodes needed
- *
- * Note: Uses GFP_NOWAIT | __GFP_NOWARN for gfp flags.
- */
-static void mas_node_count(struct ma_state *mas, int count)
-{
-	return mas_node_count_gfp(mas, count, GFP_NOWAIT | __GFP_NOWARN);
+	ma_free_rcu(mte_to_node(used));
 }
 
 /*
@@ -2517,10 +2332,7 @@ static inline void mas_topiary_node(struct ma_state *mas,
 	enode = tmp_mas->node;
 	tmp = mte_to_node(enode);
 	mte_set_node_dead(enode);
-	if (in_rcu)
-		ma_free_rcu(tmp);
-	else
-		mas_push_node(mas, tmp);
+	ma_free_rcu(tmp);
 }
 
 /*
@@ -4168,7 +3980,7 @@ static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas)
  *
  * Return: Number of nodes required for preallocation.
  */
-static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
+static inline void mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
 {
 	struct ma_state *mas = wr_mas->mas;
 	unsigned char height = mas_mt_height(mas);
@@ -4214,7 +4026,7 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
 		WARN_ON_ONCE(1);
 	}
 
-	return ret;
+	mas->node_request = ret;
 }
 
 /*
@@ -4275,15 +4087,15 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
  */
 static inline void mas_wr_preallocate(struct ma_wr_state *wr_mas, void *entry)
 {
-	int request;
+	struct ma_state *mas = wr_mas->mas;
 
 	mas_wr_prealloc_setup(wr_mas);
-	wr_mas->mas->store_type = mas_wr_store_type(wr_mas);
-	request = mas_prealloc_calc(wr_mas, entry);
-	if (!request)
+	mas->store_type = mas_wr_store_type(wr_mas);
+	mas_prealloc_calc(wr_mas, entry);
+	if (!mas->node_request)
 		return;
 
-	mas_node_count(wr_mas->mas, request);
+	mas_alloc_nodes(mas, GFP_NOWAIT | __GFP_NOWARN);
 }
 
 /**
@@ -5398,7 +5210,6 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
  */
 void *mas_store(struct ma_state *mas, void *entry)
 {
-	int request;
 	MA_WR_STATE(wr_mas, mas, entry);
 
 	trace_ma_write(__func__, mas, 0, entry);
@@ -5428,11 +5239,11 @@ void *mas_store(struct ma_state *mas, void *entry)
 		return wr_mas.content;
 	}
 
-	request = mas_prealloc_calc(&wr_mas, entry);
-	if (!request)
+	mas_prealloc_calc(&wr_mas, entry);
+	if (!mas->node_request)
 		goto store;
 
-	mas_node_count(mas, request);
+	mas_alloc_nodes(mas, GFP_NOWAIT | __GFP_NOWARN);
 	if (mas_is_err(mas))
 		return NULL;
 
@@ -5520,26 +5331,25 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
 int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 {
 	MA_WR_STATE(wr_mas, mas, entry);
-	int ret = 0;
-	int request;
 
 	mas_wr_prealloc_setup(&wr_mas);
 	mas->store_type = mas_wr_store_type(&wr_mas);
-	request = mas_prealloc_calc(&wr_mas, entry);
-	if (!request)
-		return ret;
+	mas_prealloc_calc(&wr_mas, entry);
+	if (!mas->node_request)
+		return 0;
 
-	mas_node_count_gfp(mas, request, gfp);
+	mas_alloc_nodes(mas, gfp);
 	if (mas_is_err(mas)) {
-		mas_set_alloc_req(mas, 0);
-		ret = xa_err(mas->node);
+		int ret = xa_err(mas->node);
+
+		mas->node_request = 0;
 		mas_destroy(mas);
 		mas_reset(mas);
 		return ret;
 	}
 
 	mas->mas_flags |= MA_STATE_PREALLOC;
-	return ret;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(mas_preallocate);
 
@@ -5553,9 +5363,6 @@ EXPORT_SYMBOL_GPL(mas_preallocate);
  */
 void mas_destroy(struct ma_state *mas)
 {
-	struct maple_alloc *node;
-	unsigned long total;
-
 	/*
 	 * When using mas_for_each() to insert an expected number of elements,
 	 * it is possible that the number inserted is less than the expected
@@ -5576,21 +5383,11 @@ void mas_destroy(struct ma_state *mas)
 	}
 	mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
 
-	total = mas_allocated(mas);
-	while (total) {
-		node = mas->alloc;
-		mas->alloc = node->slot[0];
-		if (node->node_count > 1) {
-			size_t count = node->node_count - 1;
-
-			mt_free_bulk(count, (void __rcu **)&node->slot[1]);
-			total -= count;
-		}
-		mt_free_one(ma_mnode_ptr(node));
-		total--;
-	}
+	mas->node_request = 0;
+	if (mas->sheaf)
+		mt_return_sheaf(mas->sheaf);
 
-	mas->alloc = NULL;
+	mas->sheaf = NULL;
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
@@ -5640,7 +5437,8 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
 	/* Internal nodes */
 	nr_nodes += DIV_ROUND_UP(nr_nodes, nonleaf_cap);
 	/* Add working room for split (2 nodes) + new parents */
-	mas_node_count_gfp(mas, nr_nodes + 3, GFP_KERNEL);
+	mas->node_request = nr_nodes + 3;
+	mas_alloc_nodes(mas, GFP_KERNEL);
 
 	/* Detect if allocations run out */
 	mas->mas_flags |= MA_STATE_PREALLOC;
@@ -6276,7 +6074,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 		mas_alloc_nodes(mas, gfp);
 	}
 
-	if (!mas_allocated(mas))
+	if (!mas->sheaf)
 		return false;
 
 	mas->status = ma_start;
@@ -7671,8 +7469,9 @@ void mas_dump(const struct ma_state *mas)
 
 	pr_err("[%u/%u] index=%lx last=%lx\n", mas->offset, mas->end,
 	       mas->index, mas->last);
-	pr_err("     min=%lx max=%lx alloc=" PTR_FMT ", depth=%u, flags=%x\n",
-	       mas->min, mas->max, mas->alloc, mas->depth, mas->mas_flags);
+	pr_err("     min=%lx max=%lx sheaf=" PTR_FMT ", request %lu depth=%u, flags=%x\n",
+	       mas->min, mas->max, mas->sheaf, mas->node_request, mas->depth,
+	       mas->mas_flags);
 	if (mas->index > mas->last)
 		pr_err("Check index & last\n");
 }
diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 13e2a10d7554d6b1de5ffbda59f3a5bc4039a8c8..5549eb4200c7974e3bb457e0fd054c434e4b85da 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -2746,6 +2746,7 @@ static noinline void __init check_fuzzer(struct maple_tree *mt)
 	mtree_test_erase(mt, ULONG_MAX - 10);
 }
 
+#if 0
 /* duplicate the tree with a specific gap */
 static noinline void __init check_dup_gaps(struct maple_tree *mt,
 				    unsigned long nr_entries, bool zero_start,
@@ -2770,6 +2771,7 @@ static noinline void __init check_dup_gaps(struct maple_tree *mt,
 		mtree_store_range(mt, i*10, (i+1)*10 - gap,
 				  xa_mk_value(i), GFP_KERNEL);
 
+	mt_dump(mt, mt_dump_dec);
 	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
 	mt_set_non_kernel(99999);
 	down_write(&newmt_lock);
@@ -2779,9 +2781,12 @@ static noinline void __init check_dup_gaps(struct maple_tree *mt,
 
 	rcu_read_lock();
 	mas_for_each(&mas, tmp, ULONG_MAX) {
+		printk("%lu nodes %lu\n", mas.index,
+		       kmem_cache_sheaf_count(newmas.sheaf));
 		newmas.index = mas.index;
 		newmas.last = mas.last;
 		mas_store(&newmas, tmp);
+		mt_dump(&newmt, mt_dump_dec);
 	}
 	rcu_read_unlock();
 	mas_destroy(&newmas);
@@ -2878,6 +2883,7 @@ static noinline void __init check_dup(struct maple_tree *mt)
 		cond_resched();
 	}
 }
+#endif
 
 static noinline void __init check_bnode_min_spanning(struct maple_tree *mt)
 {
@@ -4045,9 +4051,11 @@ static int __init maple_tree_seed(void)
 	check_fuzzer(&tree);
 	mtree_destroy(&tree);
 
+#if 0
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 	check_dup(&tree);
 	mtree_destroy(&tree);
+#endif
 
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 	check_bnode_min_spanning(&tree);
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index f6f923c9dc1039997953a94ec184c560b225c2d4..1bd789191f232385d69f2dd3e900bac99d8919ff 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -63,430 +63,6 @@ struct rcu_reader_struct {
 	struct rcu_test_struct2 *test;
 };
 
-static int get_alloc_node_count(struct ma_state *mas)
-{
-	int count = 1;
-	struct maple_alloc *node = mas->alloc;
-
-	if (!node || ((unsigned long)node & 0x1))
-		return 0;
-	while (node->node_count) {
-		count += node->node_count;
-		node = node->slot[0];
-	}
-	return count;
-}
-
-static void check_mas_alloc_node_count(struct ma_state *mas)
-{
-	mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 1, GFP_KERNEL);
-	mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 3, GFP_KERNEL);
-	MT_BUG_ON(mas->tree, get_alloc_node_count(mas) != mas->alloc->total);
-	mas_destroy(mas);
-}
-
-/*
- * check_new_node() - Check the creation of new nodes and error path
- * verification.
- */
-static noinline void __init check_new_node(struct maple_tree *mt)
-{
-
-	struct maple_node *mn, *mn2, *mn3;
-	struct maple_alloc *smn;
-	struct maple_node *nodes[100];
-	int i, j, total;
-
-	MA_STATE(mas, mt, 0, 0);
-
-	check_mas_alloc_node_count(&mas);
-
-	/* Try allocating 3 nodes */
-	mtree_lock(mt);
-	mt_set_non_kernel(0);
-	/* request 3 nodes to be allocated. */
-	mas_node_count(&mas, 3);
-	/* Allocation request of 3. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 3);
-	/* Allocate failed. */
-	MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, mas.alloc == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
-	mas_push_node(&mas, mn);
-	mas_reset(&mas);
-	mas_destroy(&mas);
-	mtree_unlock(mt);
-
-
-	/* Try allocating 1 node, then 2 more */
-	mtree_lock(mt);
-	/* Set allocation request to 1. */
-	mas_set_alloc_req(&mas, 1);
-	/* Check Allocation request of 1. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
-	mas_set_err(&mas, -ENOMEM);
-	/* Validate allocation request. */
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	/* Eat the requested node. */
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, mn->slot[0] != NULL);
-	MT_BUG_ON(mt, mn->slot[1] != NULL);
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-
-	mn->parent = ma_parent_ptr(mn);
-	ma_free_rcu(mn);
-	mas.status = ma_start;
-	mas_destroy(&mas);
-	/* Allocate 3 nodes, will fail. */
-	mas_node_count(&mas, 3);
-	/* Drop the lock and allocate 3 nodes. */
-	mas_nomem(&mas, GFP_KERNEL);
-	/* Ensure 3 are allocated. */
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	/* Allocation request of 0. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 0);
-
-	MT_BUG_ON(mt, mas.alloc == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
-	/* Ensure we counted 3. */
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	/* Free. */
-	mas_reset(&mas);
-	mas_destroy(&mas);
-
-	/* Set allocation request to 1. */
-	mas_set_alloc_req(&mas, 1);
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
-	mas_set_err(&mas, -ENOMEM);
-	/* Validate allocation request. */
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_allocated(&mas) != 1);
-	/* Check the node is only one node. */
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, mn->slot[0] != NULL);
-	MT_BUG_ON(mt, mn->slot[1] != NULL);
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas_allocated(&mas) != 1);
-	MT_BUG_ON(mt, mas.alloc->node_count);
-
-	mas_set_alloc_req(&mas, 2); /* request 2 more. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 2);
-	mas_set_err(&mas, -ENOMEM);
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_allocated(&mas) != 3);
-	MT_BUG_ON(mt, mas.alloc == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
-	MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
-	for (i = 2; i >= 0; i--) {
-		mn = mas_pop_node(&mas);
-		MT_BUG_ON(mt, mas_allocated(&mas) != i);
-		MT_BUG_ON(mt, !mn);
-		MT_BUG_ON(mt, not_empty(mn));
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-	}
-
-	total = 64;
-	mas_set_alloc_req(&mas, total); /* request 2 more. */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != total);
-	mas_set_err(&mas, -ENOMEM);
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	for (i = total; i > 0; i--) {
-		unsigned int e = 0; /* expected node_count */
-
-		if (!MAPLE_32BIT) {
-			if (i >= 35)
-				e = i - 34;
-			else if (i >= 5)
-				e = i - 4;
-			else if (i >= 2)
-				e = i - 1;
-		} else {
-			if (i >= 4)
-				e = i - 3;
-			else if (i >= 1)
-				e = i - 1;
-			else
-				e = 0;
-		}
-
-		MT_BUG_ON(mt, mas.alloc->node_count != e);
-		mn = mas_pop_node(&mas);
-		MT_BUG_ON(mt, not_empty(mn));
-		MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
-		MT_BUG_ON(mt, !mn);
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-	}
-
-	total = 100;
-	for (i = 1; i < total; i++) {
-		mas_set_alloc_req(&mas, i);
-		mas_set_err(&mas, -ENOMEM);
-		MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-		for (j = i; j > 0; j--) {
-			mn = mas_pop_node(&mas);
-			MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
-			MT_BUG_ON(mt, !mn);
-			MT_BUG_ON(mt, not_empty(mn));
-			mas_push_node(&mas, mn);
-			MT_BUG_ON(mt, mas_allocated(&mas) != j);
-			mn = mas_pop_node(&mas);
-			MT_BUG_ON(mt, not_empty(mn));
-			MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-
-		mas_set_alloc_req(&mas, i);
-		mas_set_err(&mas, -ENOMEM);
-		MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-		for (j = 0; j <= i/2; j++) {
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
-			nodes[j] = mas_pop_node(&mas);
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
-		}
-
-		while (j) {
-			j--;
-			mas_push_node(&mas, nodes[j]);
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != i);
-		for (j = 0; j <= i/2; j++) {
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
-			mn = mas_pop_node(&mas);
-			MT_BUG_ON(mt, not_empty(mn));
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-			MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
-		}
-		mas_reset(&mas);
-		MT_BUG_ON(mt, mas_nomem(&mas, GFP_KERNEL));
-		mas_destroy(&mas);
-
-	}
-
-	/* Set allocation request. */
-	total = 500;
-	mas_node_count(&mas, total);
-	/* Drop the lock and allocate the nodes. */
-	mas_nomem(&mas, GFP_KERNEL);
-	MT_BUG_ON(mt, !mas.alloc);
-	i = 1;
-	smn = mas.alloc;
-	while (i < total) {
-		for (j = 0; j < MAPLE_ALLOC_SLOTS; j++) {
-			i++;
-			MT_BUG_ON(mt, !smn->slot[j]);
-			if (i == total)
-				break;
-		}
-		smn = smn->slot[0]; /* next. */
-	}
-	MT_BUG_ON(mt, mas_allocated(&mas) != total);
-	mas_reset(&mas);
-	mas_destroy(&mas); /* Free. */
-
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	for (i = 1; i < 128; i++) {
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
-		for (j = i; j > 0; j--) { /*Free the requests */
-			mn = mas_pop_node(&mas); /* get the next node. */
-			MT_BUG_ON(mt, mn == NULL);
-			MT_BUG_ON(mt, not_empty(mn));
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	}
-
-	for (i = 1; i < MAPLE_NODE_MASK + 1; i++) {
-		MA_STATE(mas2, mt, 0, 0);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
-		for (j = 1; j <= i; j++) { /* Move the allocations to mas2 */
-			mn = mas_pop_node(&mas); /* get the next node. */
-			MT_BUG_ON(mt, mn == NULL);
-			MT_BUG_ON(mt, not_empty(mn));
-			mas_push_node(&mas2, mn);
-			MT_BUG_ON(mt, mas_allocated(&mas2) != j);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-		MT_BUG_ON(mt, mas_allocated(&mas2) != i);
-
-		for (j = i; j > 0; j--) { /*Free the requests */
-			MT_BUG_ON(mt, mas_allocated(&mas2) != j);
-			mn = mas_pop_node(&mas2); /* get the next node. */
-			MT_BUG_ON(mt, mn == NULL);
-			MT_BUG_ON(mt, not_empty(mn));
-			mn->parent = ma_parent_ptr(mn);
-			ma_free_rcu(mn);
-		}
-		MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
-	}
-
-
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
-	MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
-
-	mn = mas_pop_node(&mas); /* get the next node. */
-	MT_BUG_ON(mt, mn == NULL);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS);
-	MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS - 1);
-
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
-
-	/* Check the limit of pop/push/pop */
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 2); /* Request */
-	MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
-	MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
-	MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
-	MT_BUG_ON(mt, mas_alloc_req(&mas));
-	MT_BUG_ON(mt, mas.alloc->node_count != 1);
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	MT_BUG_ON(mt, mas.alloc->node_count  != MAPLE_ALLOC_SLOTS);
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas.alloc->node_count != 1);
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, not_empty(mn));
-	mn->parent = ma_parent_ptr(mn);
-	ma_free_rcu(mn);
-	for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
-		mn = mas_pop_node(&mas);
-		MT_BUG_ON(mt, not_empty(mn));
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-	}
-	MT_BUG_ON(mt, mas_allocated(&mas) != 0);
-
-
-	for (i = 3; i < MAPLE_NODE_MASK * 3; i++) {
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mas_push_node(&mas, mn); /* put it back */
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn2 = mas_pop_node(&mas); /* get the next node. */
-		mas_push_node(&mas, mn); /* put them back */
-		mas_push_node(&mas, mn2);
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn2 = mas_pop_node(&mas); /* get the next node. */
-		mn3 = mas_pop_node(&mas); /* get the next node. */
-		mas_push_node(&mas, mn); /* put them back */
-		mas_push_node(&mas, mn2);
-		mas_push_node(&mas, mn3);
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mas_destroy(&mas);
-
-		mas.node = MA_ERROR(-ENOMEM);
-		mas_node_count(&mas, i); /* Request */
-		mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mn = mas_pop_node(&mas); /* get the next node. */
-		mn->parent = ma_parent_ptr(mn);
-		ma_free_rcu(mn);
-		mas_destroy(&mas);
-	}
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, 5); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != 5);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, 10); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != 10);
-	mas_destroy(&mas);
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS - 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS - 1);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, 10 + MAPLE_ALLOC_SLOTS - 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != 10 + MAPLE_ALLOC_SLOTS - 1);
-	mas_destroy(&mas);
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 2); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 2);
-	mas_destroy(&mas);
-
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 1); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 1);
-	mas.node = MA_ERROR(-ENOMEM);
-	mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 3 + 2); /* Request */
-	mas_nomem(&mas, GFP_KERNEL); /* Fill request */
-	mas.status = ma_start;
-	MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 3 + 2);
-	mas_destroy(&mas);
-
-	mtree_unlock(mt);
-}
-
 /*
  * Check erasing including RCU.
  */
@@ -35458,8 +35034,7 @@ static void check_dfs_preorder(struct maple_tree *mt)
 	mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
 	mas_reset(&mas);
 	mt_zero_nr_tallocated();
-	mt_set_non_kernel(200);
-	mas_expected_entries(&mas, max);
+	mt_set_non_kernel(1000);
 	for (count = 0; count <= max; count++) {
 		mas.index = mas.last = count;
 		mas_store(&mas, xa_mk_value(count));
@@ -35524,6 +35099,13 @@ static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry)
 	return vacant_height;
 }
 
+static int mas_allocated(struct ma_state *mas)
+{
+	if (mas->sheaf)
+		return kmem_cache_sheaf_size(mas->sheaf);
+
+	return 0;
+}
 /* Preallocation testing */
 static noinline void __init check_prealloc(struct maple_tree *mt)
 {
@@ -35533,8 +35115,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 	unsigned char vacant_height;
 	struct maple_node *mn;
 	void *ptr = check_prealloc;
+	struct ma_wr_state wr_mas;
 	MA_STATE(mas, mt, 10, 20);
-	MA_WR_STATE(wr_mas, &mas, ptr);
 
 	mt_set_non_kernel(1000);
 	for (i = 0; i <= max; i++)
@@ -35542,7 +35124,11 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 
 	/* Spanning store */
 	mas_set_range(&mas, 470, 500);
-	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
+	wr_mas.mas = &mas;
+
+	mas_wr_preallocate(&wr_mas, ptr);
+	MT_BUG_ON(mt, mas.store_type != wr_spanning_store);
+	MT_BUG_ON(mt, mas_is_err(&mas));
 	allocated = mas_allocated(&mas);
 	height = mas_mt_height(&mas);
 	vacant_height = get_vacant_height(&wr_mas, ptr);
@@ -35552,6 +35138,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 	allocated = mas_allocated(&mas);
 	MT_BUG_ON(mt, allocated != 0);
 
+	mas_wr_preallocate(&wr_mas, ptr);
 	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
 	allocated = mas_allocated(&mas);
 	height = mas_mt_height(&mas);
@@ -35592,20 +35179,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
 	mn->parent = ma_parent_ptr(mn);
 	ma_free_rcu(mn);
 
-	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
-	allocated = mas_allocated(&mas);
-	height = mas_mt_height(&mas);
-	vacant_height = get_vacant_height(&wr_mas, ptr);
-	MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3);
-	mn = mas_pop_node(&mas);
-	MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
-	mas_push_node(&mas, mn);
-	MT_BUG_ON(mt, mas_allocated(&mas) != allocated);
-	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
-	mas_destroy(&mas);
-	allocated = mas_allocated(&mas);
-	MT_BUG_ON(mt, allocated != 0);
-
 	MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
 	allocated = mas_allocated(&mas);
 	height = mas_mt_height(&mas);
@@ -36394,11 +35967,17 @@ static void check_nomem_writer_race(struct maple_tree *mt)
 	check_load(mt, 6, xa_mk_value(0xC));
 	mtree_unlock(mt);
 
+	mt_set_non_kernel(0);
 	/* test for the same race but with mas_store_gfp() */
 	mtree_store_range(mt, 0, 5, xa_mk_value(0xA), GFP_KERNEL);
 	mtree_store_range(mt, 6, 10, NULL, GFP_KERNEL);
 
 	mas_set_range(&mas, 0, 5);
+
+	/* setup writer 2 that will trigger the race condition */
+	mt_set_private(mt);
+	mt_set_callback(writer2);
+
 	mtree_lock(mt);
 	mas_store_gfp(&mas, NULL, GFP_KERNEL);
 
@@ -36435,7 +36014,6 @@ static inline int check_vma_modification(struct maple_tree *mt)
 	__mas_set_range(&mas, 0x7ffde4ca2000, 0x7ffffffff000 - 1);
 	mas_preallocate(&mas, NULL, GFP_KERNEL);
 	mas_store_prealloc(&mas, NULL);
-	mt_dump(mt, mt_dump_hex);
 
 	mas_destroy(&mas);
 	mtree_unlock(mt);
@@ -36453,6 +36031,8 @@ static inline void check_bulk_rebalance(struct maple_tree *mt)
 
 	build_full_tree(mt, 0, 2);
 
+
+	mtree_lock(mt);
 	/* erase every entry in the tree */
 	do {
 		/* set up bulk store mode */
@@ -36462,6 +36042,85 @@ static inline void check_bulk_rebalance(struct maple_tree *mt)
 	} while (mas_prev(&mas, 0) != NULL);
 
 	mas_destroy(&mas);
+	mtree_unlock(mt);
+}
+
+static unsigned long get_last_index(struct ma_state *mas)
+{
+	struct maple_node *node = mas_mn(mas);
+	enum maple_type mt = mte_node_type(mas->node);
+	unsigned long *pivots = ma_pivots(node, mt);
+	unsigned long last_index = mas_data_end(mas);
+
+	BUG_ON(last_index == 0);
+
+	return pivots[last_index - 1] + 1;
+}
+
+/*
+ * Assert that we handle spanning stores that consume the entirety of the right
+ * leaf node correctly.
+ */
+static void test_spanning_store_regression(void)
+{
+	unsigned long from = 0, to = 0;
+	DEFINE_MTREE(tree);
+	MA_STATE(mas, &tree, 0, 0);
+
+	/*
+	 * Build a 3-level tree. We require a parent node below the root node
+	 * and 2 leaf nodes under it, so we can span the entirety of the right
+	 * hand node.
+	 */
+	build_full_tree(&tree, 0, 3);
+
+	/* Descend into position at depth 2. */
+	mas_reset(&mas);
+	mas_start(&mas);
+	mas_descend(&mas);
+	mas_descend(&mas);
+
+	/*
+	 * We need to establish a tree like the below.
+	 *
+	 * Then we can try a store in [from, to] which results in a spanned
+	 * store across nodes B and C, with the maple state at the time of the
+	 * write being such that only the subtree at A and below is considered.
+	 *
+	 * Height
+	 *  0                              Root Node
+	 *                                  /      \
+	 *                    pivot = to   /        \ pivot = ULONG_MAX
+	 *                                /          \
+	 *   1                       A [-----]       ...
+	 *                              /   \
+	 *                pivot = from /     \ pivot = to
+	 *                            /       \
+	 *   2 (LEAVES)          B [-----]  [-----] C
+	 *                                       ^--- Last pivot to.
+	 */
+	while (true) {
+		unsigned long tmp = get_last_index(&mas);
+
+		if (mas_next_sibling(&mas)) {
+			from = tmp;
+			to = mas.max;
+		} else {
+			break;
+		}
+	}
+
+	BUG_ON(from == 0 && to == 0);
+
+	/* Perform the store. */
+	mas_set_range(&mas, from, to);
+	mas_store_gfp(&mas, xa_mk_value(0xdead), GFP_KERNEL);
+
+	/* If the regression occurs, the validation will fail. */
+	mt_validate(&tree);
+
+	/* Cleanup. */
+	__mt_destroy(&tree);
 }
 
 void farmer_tests(void)
@@ -36525,6 +36184,7 @@ void farmer_tests(void)
 	check_collapsing_rebalance(&tree);
 	mtree_destroy(&tree);
 
+
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 	check_null_expand(&tree);
 	mtree_destroy(&tree);
@@ -36538,10 +36198,6 @@ void farmer_tests(void)
 	check_erase_testset(&tree);
 	mtree_destroy(&tree);
 
-	mt_init_flags(&tree, 0);
-	check_new_node(&tree);
-	mtree_destroy(&tree);
-
 	if (!MAPLE_32BIT) {
 		mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 		check_rcu_simulated(&tree);
@@ -36563,95 +36219,13 @@ void farmer_tests(void)
 
 	/* No memory handling */
 	check_nomem(&tree);
-}
-
-static unsigned long get_last_index(struct ma_state *mas)
-{
-	struct maple_node *node = mas_mn(mas);
-	enum maple_type mt = mte_node_type(mas->node);
-	unsigned long *pivots = ma_pivots(node, mt);
-	unsigned long last_index = mas_data_end(mas);
-
-	BUG_ON(last_index == 0);
 
-	return pivots[last_index - 1] + 1;
-}
-
-/*
- * Assert that we handle spanning stores that consume the entirety of the right
- * leaf node correctly.
- */
-static void test_spanning_store_regression(void)
-{
-	unsigned long from = 0, to = 0;
-	DEFINE_MTREE(tree);
-	MA_STATE(mas, &tree, 0, 0);
-
-	/*
-	 * Build a 3-level tree. We require a parent node below the root node
-	 * and 2 leaf nodes under it, so we can span the entirety of the right
-	 * hand node.
-	 */
-	build_full_tree(&tree, 0, 3);
-
-	/* Descend into position at depth 2. */
-	mas_reset(&mas);
-	mas_start(&mas);
-	mas_descend(&mas);
-	mas_descend(&mas);
-
-	/*
-	 * We need to establish a tree like the below.
-	 *
-	 * Then we can try a store in [from, to] which results in a spanned
-	 * store across nodes B and C, with the maple state at the time of the
-	 * write being such that only the subtree at A and below is considered.
-	 *
-	 * Height
-	 *  0                              Root Node
-	 *                                  /      \
-	 *                    pivot = to   /        \ pivot = ULONG_MAX
-	 *                                /          \
-	 *   1                       A [-----]       ...
-	 *                              /   \
-	 *                pivot = from /     \ pivot = to
-	 *                            /       \
-	 *   2 (LEAVES)          B [-----]  [-----] C
-	 *                                       ^--- Last pivot to.
-	 */
-	while (true) {
-		unsigned long tmp = get_last_index(&mas);
-
-		if (mas_next_sibling(&mas)) {
-			from = tmp;
-			to = mas.max;
-		} else {
-			break;
-		}
-	}
-
-	BUG_ON(from == 0 && to == 0);
-
-	/* Perform the store. */
-	mas_set_range(&mas, from, to);
-	mas_store_gfp(&mas, xa_mk_value(0xdead), GFP_KERNEL);
-
-	/* If the regression occurs, the validation will fail. */
-	mt_validate(&tree);
-
-	/* Cleanup. */
-	__mt_destroy(&tree);
-}
-
-static void regression_tests(void)
-{
 	test_spanning_store_regression();
 }
 
 void maple_tree_tests(void)
 {
 #if !defined(BENCH)
-	regression_tests();
 	farmer_tests();
 #endif
 	maple_tree_seed();
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index e0255f53159bd3a1325d49192283dd6790a5e3b8..6a15665fc8315168c718e6810c7deaeed13a3a6a 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -82,7 +82,8 @@ void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
 
 	if (!(gfp & __GFP_DIRECT_RECLAIM)) {
 		if (!cachep->non_kernel) {
-			cachep->exec_callback = true;
+			if (cachep->callback)
+				cachep->exec_callback = true;
 			return NULL;
 		}
 
@@ -236,6 +237,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 		for (i = 0; i < size; i++)
 			__kmem_cache_free_locked(cachep, p[i]);
 		pthread_mutex_unlock(&cachep->lock);
+		if (cachep->callback)
+			cachep->exec_callback = true;
 		return 0;
 	}
 
@@ -288,9 +291,8 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 		capacity = s->sheaf_capacity;
 
 	sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * s->sheaf_capacity * capacity);
-	if (!sheaf) {
+	if (!sheaf)
 		return NULL;
-	}
 
 	memset(sheaf, 0, size);
 	sheaf->cache = s;

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (11 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 12/14] maple_tree: Sheaf conversion Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-22 20:25   ` Suren Baghdasaryan
  2025-07-23 13:34 ` [PATCH v5 14/14] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
  2025-08-15 22:53 ` [PATCH v5 00/14] SLUB percpu sheaves Sudarsan Mahendran
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The fast path through a write requires replacing only a single node in
the tree.  Using a sheaf (32 nodes) is too heavy for this fast path, so
special-case the node store operation by allocating just one node in the
maple state.
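
Condensed from the mas_alloc_nodes() hunk below (this is a sketch, not
the verbatim code), the special case amounts to the following decision;
mt_alloc_one() is the existing single-node allocation helper:

   if (mas->node_request == 1 && !mas->sheaf && !mas->alloc) {
           /* Fast path: one replacement node, no sheaf involved. */
           mas->alloc = mt_alloc_one(gfp);
           if (!mas->alloc)
                   mas_set_err(mas, -ENOMEM);
           else
                   mas->node_request = 0;
           return;
   }
   /* Larger requests keep using a (re)filled sheaf as before. */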

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/maple_tree.h |  4 +++-
 lib/maple_tree.c           | 47 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 3cf1ae9dde7ce43fa20ae400c01fefad048c302e..61eb5e7d09ad0133978e3ac4b2af66710421e769 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -443,6 +443,7 @@ struct ma_state {
 	unsigned long min;		/* The minimum index of this node - implied pivot min */
 	unsigned long max;		/* The maximum index of this node - implied pivot max */
 	struct slab_sheaf *sheaf;	/* Allocated nodes for this operation */
+	struct maple_node *alloc;	/* allocated nodes */
 	unsigned long node_request;
 	enum maple_status status;	/* The status of the state (active, start, none, etc) */
 	unsigned char depth;		/* depth of tree descent during write */
@@ -491,8 +492,9 @@ struct ma_wr_state {
 		.status = ma_start,					\
 		.min = 0,						\
 		.max = ULONG_MAX,					\
-		.node_request= 0,					\
 		.sheaf = NULL,						\
+		.alloc = NULL,						\
+		.node_request= 0,					\
 		.mas_flags = 0,						\
 		.store_type = wr_invalid,				\
 	}
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 3c3c14a76d98ded3b619c178d64099b464a2ca23..9aa782b1497f224e7366ebbd65f997523ee0c8ab 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1101,16 +1101,23 @@ static int mas_ascend(struct ma_state *mas)
  *
  * Return: A pointer to a maple node.
  */
-static inline struct maple_node *mas_pop_node(struct ma_state *mas)
+static __always_inline struct maple_node *mas_pop_node(struct ma_state *mas)
 {
 	struct maple_node *ret;
 
+	if (mas->alloc) {
+		ret = mas->alloc;
+		mas->alloc = NULL;
+		goto out;
+	}
+
 	if (WARN_ON_ONCE(!mas->sheaf))
 		return NULL;
 
 	ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
-	memset(ret, 0, sizeof(*ret));
 
+out:
+	memset(ret, 0, sizeof(*ret));
 	return ret;
 }
 
@@ -1121,9 +1128,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
  */
 static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 {
-	if (unlikely(mas->sheaf)) {
-		unsigned long refill = mas->node_request;
+	if (!mas->node_request)
+		return;
+
+	if (mas->node_request == 1) {
+		if (mas->sheaf)
+			goto use_sheaf;
+
+		if (mas->alloc)
+			return;
 
+		mas->alloc = mt_alloc_one(gfp);
+		if (!mas->alloc)
+			goto error;
+
+		mas->node_request = 0;
+		return;
+	}
+
+use_sheaf:
+	if (unlikely(mas->alloc)) {
+		mt_free_one(mas->alloc);
+		mas->alloc = NULL;
+	}
+
+	if (mas->sheaf) {
+		unsigned long refill;
+
+		refill = mas->node_request;
 		if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
 			mas->node_request = 0;
 			return;
@@ -5386,8 +5418,11 @@ void mas_destroy(struct ma_state *mas)
 	mas->node_request = 0;
 	if (mas->sheaf)
 		mt_return_sheaf(mas->sheaf);
-
 	mas->sheaf = NULL;
+
+	if (mas->alloc)
+		mt_free_one(mas->alloc);
+	mas->alloc = NULL;
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
@@ -6074,7 +6109,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 		mas_alloc_nodes(mas, gfp);
 	}
 
-	if (!mas->sheaf)
+	if (!mas->sheaf && !mas->alloc)
 		return false;
 
 	mas->status = ma_start;

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 14/14] maple_tree: Convert forking to use the sheaf interface
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (12 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state Vlastimil Babka
@ 2025-07-23 13:34 ` Vlastimil Babka
  2025-08-22 20:29   ` Suren Baghdasaryan
  2025-08-15 22:53 ` [PATCH v5 00/14] SLUB percpu sheaves Sudarsan Mahendran
  14 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-23 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Use the generic sheaf interface, which should result in fewer bulk
allocations during forking.

Part of this is abstracting the freeing of the sheaf or maple state
allocations into its own function, so that mas_destroy() and the tree
duplication code can use the same functionality to return any unused
resources.
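
In mas_dup_alloc(), the single mt_alloc_bulk() call that used to fill
the child slot array is replaced by the same request/pop pattern used
elsewhere in the tree code.  Roughly, as a simplified sketch of the hunk
below (the node type bits OR'd into each slot are elided here):

   count = mas->node_request = mas_data_end(mas) + 1;
   mas_alloc_nodes(mas, gfp);
   if (unlikely(mas_is_err(mas)))
           return;

   for (i = 0; i < count; i++)
           new_slots[i] = ma_mnode_ptr(mas_pop_node(mas));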

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 lib/maple_tree.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 9aa782b1497f224e7366ebbd65f997523ee0c8ab..180d5e2ea49440248aaae04a066276406b2537ed 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1178,6 +1178,19 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
 	mas_set_err(mas, -ENOMEM);
 }
 
+static inline void mas_empty_nodes(struct ma_state *mas)
+{
+	mas->node_request = 0;
+	if (mas->sheaf) {
+		mt_return_sheaf(mas->sheaf);
+		mas->sheaf = NULL;
+	}
+
+	if (mas->alloc) {
+		mt_free_one(mas->alloc);
+		mas->alloc = NULL;
+	}
+}
 
 /*
  * mas_free() - Free an encoded maple node
@@ -5414,15 +5427,7 @@ void mas_destroy(struct ma_state *mas)
 		mas->mas_flags &= ~MA_STATE_REBALANCE;
 	}
 	mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
-
-	mas->node_request = 0;
-	if (mas->sheaf)
-		mt_return_sheaf(mas->sheaf);
-	mas->sheaf = NULL;
-
-	if (mas->alloc)
-		mt_free_one(mas->alloc);
-	mas->alloc = NULL;
+	mas_empty_nodes(mas);
 }
 EXPORT_SYMBOL_GPL(mas_destroy);
 
@@ -6499,7 +6504,7 @@ static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
 	struct maple_node *node = mte_to_node(mas->node);
 	struct maple_node *new_node = mte_to_node(new_mas->node);
 	enum maple_type type;
-	unsigned char request, count, i;
+	unsigned char count, i;
 	void __rcu **slots;
 	void __rcu **new_slots;
 	unsigned long val;
@@ -6507,20 +6512,17 @@ static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
 	/* Allocate memory for child nodes. */
 	type = mte_node_type(mas->node);
 	new_slots = ma_slots(new_node, type);
-	request = mas_data_end(mas) + 1;
-	count = mt_alloc_bulk(gfp, request, (void **)new_slots);
-	if (unlikely(count < request)) {
-		memset(new_slots, 0, request * sizeof(void *));
-		mas_set_err(mas, -ENOMEM);
+	count = mas->node_request = mas_data_end(mas) + 1;
+	mas_alloc_nodes(mas, gfp);
+	if (unlikely(mas_is_err(mas)))
 		return;
-	}
 
-	/* Restore node type information in slots. */
 	slots = ma_slots(node, type);
 	for (i = 0; i < count; i++) {
 		val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
 		val &= MAPLE_NODE_MASK;
-		((unsigned long *)new_slots)[i] |= val;
+		new_slots[i] = ma_mnode_ptr((unsigned long)mas_pop_node(mas) |
+					    val);
 	}
 }
 
@@ -6574,7 +6576,7 @@ static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
 			/* Only allocate child nodes for non-leaf nodes. */
 			mas_dup_alloc(mas, new_mas, gfp);
 			if (unlikely(mas_is_err(mas)))
-				return;
+				goto empty_mas;
 		} else {
 			/*
 			 * This is the last leaf node and duplication is
@@ -6607,6 +6609,8 @@ static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
 	/* Make them the same height */
 	new_mas->tree->ma_flags = mas->tree->ma_flags;
 	rcu_assign_pointer(new_mas->tree->ma_root, root);
+empty_mas:
+	mas_empty_nodes(mas);
 }
 
 /**

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations
  2025-07-23 13:34 ` [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-07-23 16:39   ` Uladzislau Rezki
  2025-07-24 14:30     ` Vlastimil Babka
  0 siblings, 1 reply; 45+ messages in thread
From: Uladzislau Rezki @ 2025-07-23 16:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Uladzislau Rezki,
	linux-mm, linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 03:34:35PM +0200, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
> 
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab.h        |   2 +
>  mm/slab_common.c |  24 +++++++
>  mm/slub.c        | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 214 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 1980330c2fcb4a4613a7e4f7efc78b349993fd89..44c9b70eaabbd87c06fb39b79dfb791d515acbde 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -459,6 +459,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>  
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
> +
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
>  			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index e2b197e47866c30acdbd1fee4159f262a751c5a7..2d806e02568532a1000fd3912db6978e945dcfa8 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
>  		kvfree_rcu_list(head);
>  }
>  
> +static bool kfree_rcu_sheaf(void *obj)
> +{
> +	struct kmem_cache *s;
> +	struct folio *folio;
> +	struct slab *slab;
> +
> +	if (is_vmalloc_addr(obj))
> +		return false;
> +
> +	folio = virt_to_folio(obj);
> +	if (unlikely(!folio_test_slab(folio)))
> +		return false;
> +
> +	slab = folio_slab(folio);
> +	s = slab->slab_cache;
> +	if (s->cpu_sheaves)
> +		return __kfree_rcu_sheaf(s, obj);
> +
> +	return false;
> +}
> +
>  static bool
>  need_offload_krc(struct kfree_rcu_cpu *krcp)
>  {
> @@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	if (!head)
>  		might_sleep();
>  
> +	if (kfree_rcu_sheaf(ptr))
> +		return;
> +
>
I have a question here. kfree_rcu_sheaf(ptr) tries to revert freeing
an object over one more newly introduced path. This patch adds infra
for such purpose whereas we already have a main path over which we
free memory.

Why do not we use existing logic? As i see you can do:

   if (unlikely(!slab_free_hook(s, p[i], init, true))) {
        p[i] = p[--sheaf->size];
        continue;
   }

in the kfree_rcu_work() function where we process all ready to free objects.
I mean, for slab objects we can replace kfree_bulk() and scan all pointers
and free them over slab_free_hook().

Also we do use a pooled API and other improvements to speed up freeing.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations
  2025-07-23 16:39   ` Uladzislau Rezki
@ 2025-07-24 14:30     ` Vlastimil Babka
  2025-07-24 17:36       ` Uladzislau Rezki
  0 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-07-24 14:30 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, linux-mm, linux-kernel,
	rcu, maple-tree

On 7/23/25 18:39, Uladzislau Rezki wrote:
> On Wed, Jul 23, 2025 at 03:34:35PM +0200, Vlastimil Babka wrote:
>>  static bool
>>  need_offload_krc(struct kfree_rcu_cpu *krcp)
>>  {
>> @@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>>  	if (!head)
>>  		might_sleep();
>>  
>> +	if (kfree_rcu_sheaf(ptr))
>> +		return;
>> +
>>
> I have a question here. kfree_rcu_sheaf(ptr) tries to divert the freeing
> of an object onto one more newly introduced path. This patch adds infra
> for that purpose, whereas we already have a main path over which we
> free memory.
> 
> Why don't we use the existing logic? As I see it, you can do:
> 
>    if (unlikely(!slab_free_hook(s, p[i], init, true))) {
>         p[i] = p[--sheaf->size];
>         continue;
>    }
> 
> in the kfree_rcu_work() function where we process all ready to free objects.

I'm not sure I understand. In kfree_rcu_work() we process individual
objects. There is no sheaf that you reference in the code above?
Or are you suggesting we add e.g. a "channel" of sheaves to process in
addition to the existing channels of objects?

> I mean, for slab objects we can replace kfree_bulk() and scan all pointers
> and free them over slab_free_hook().

The desired outcome after __rcu_free_sheaf_prepare() is to take the whole
sheaf and have it reused, not to free individual objects. So we call
slab_free_hook() in __rcu_free_sheaf_prepare() but don't actually free
the individual objects as we normally would.
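
Roughly, the idea is something like this (just a sketch, not the literal
__rcu_free_sheaf_prepare() code, and the helper name below is made up):

	static void rcu_sheaf_prepare_sketch(struct kmem_cache *s,
					     struct slab_sheaf *sheaf)
	{
		bool init = slab_want_init_on_free(s);
		void **p = &sheaf->objects[0];
		unsigned int i = 0;

		while (i < sheaf->size) {
			/* the hook took over the object (e.g. kasan quarantine) */
			if (unlikely(!slab_free_hook(s, p[i], init, true))) {
				p[i] = p[--sheaf->size];
				continue;
			}
			i++;
		}
		/* the remaining objects stay in the sheaf, ready for reuse */
	}

The sheaf with the surviving objects can then be recycled as a whole after
the grace period instead of being flushed to slabs and refilled object by
object.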

> Also we do use a pooled API and other improvements to speed up freeing.

It could be useful to know the details, as in Suren's measurements there are
issues with kfree_rcu() using sheaves when lazy RCU is used. Is the
kfree_rcu() infra avoiding being too lazy somehow? We could use the same
techniques for sheaves.

> Thanks!
> 
> --
> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations
  2025-07-24 14:30     ` Vlastimil Babka
@ 2025-07-24 17:36       ` Uladzislau Rezki
  0 siblings, 0 replies; 45+ messages in thread
From: Uladzislau Rezki @ 2025-07-24 17:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Uladzislau Rezki, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	linux-mm, linux-kernel, rcu, maple-tree

On Thu, Jul 24, 2025 at 04:30:49PM +0200, Vlastimil Babka wrote:
> On 7/23/25 18:39, Uladzislau Rezki wrote:
> > On Wed, Jul 23, 2025 at 03:34:35PM +0200, Vlastimil Babka wrote:
> >>  static bool
> >>  need_offload_krc(struct kfree_rcu_cpu *krcp)
> >>  {
> >> @@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
> >>  	if (!head)
> >>  		might_sleep();
> >>  
> >> +	if (kfree_rcu_sheaf(ptr))
> >> +		return;
> >> +
> >>
> > I have a question here. kfree_rcu_sheaf(ptr) tries to divert the freeing
> > of an object onto one more newly introduced path. This patch adds infra
> > for that purpose, whereas we already have a main path over which we
> > free memory.
> > 
> > Why don't we use the existing logic? As I see it, you can do:
> > 
> >    if (unlikely(!slab_free_hook(s, p[i], init, true))) {
> >         p[i] = p[--sheaf->size];
> >         continue;
> >    }
> > 
> > in the kfree_rcu_work() function where we process all ready to free objects.
> 
> I'm not sure I understand. In kfree_rcu_work() we process individual
> objects. There is no sheaf that you reference in the code above?
> Or are you suggesting we add e.g. a "channel" of sheaves to process in
> addition to the existing channels of objects?
> 
The "sheaf" code above does not exist; I put it there just for reference.
I suggested putting such objects into the regular existing channels and
processing them. But for that purpose we need to check each SLAB object,
because currently we free them via kfree_bulk().

A separate channel could also be maintained; it would add more logic
on top, but at least it would consolidate the freeing path and use one
RCU machinery.

On the other hand, what else can we free? You have this code in your patch:

	if (is_vmalloc_addr(obj))
		return false;

	folio = virt_to_folio(obj);
	if (unlikely(!folio_test_slab(folio)))
		return false;

vmalloc pointers go their own way, the others are SLAB objects. What else can it be?
I.e. folio_test_slab() checks whether the object's folio belongs to a slab.
Can it return zero?

> > I mean, for slab objects we can replace kfree_bulk() and scan all pointers
> > and free them over slab_free_hook().
> 
> The desired outcome after __rcu_free_sheaf_prepare() is to take the whole
> sheaf and have it reused, not to free individual objects. So we call
> slab_free_hook() in __rcu_free_sheaf_prepare() but don't actually free
> the individual objects as we normally would.
> 
I see.

> > Also we do use a pooled API and other improvements to speed up freeing.
> 
> It could be useful to know the details, as in Suren's measurements there are
> issues with kfree_rcu() using sheaves when lazy RCU is used. Is the
> kfree_rcu() infra avoiding being too lazy somehow? We could use the same
> techniques for sheaves.
> 
I think it is because your patch uses call_rcu() and not call_rcu_hurry().
There is one more tricky part: how long the rcu_free_sheaf() callback
takes to execute, because there are other callbacks in the queue that
have to wait their turn.

The kfree_rcu() infra does not use the call_rcu() chain because it can be slow.
We can delay processing of freed objects while the array of pointers is not
yet full. When the first object is added we arm a timer to kick off
processing in 5 seconds. Once the array becomes full, the logic switches into
a fast mode and reprograms the timer to trigger processing asap.
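
In pseudo-code the scheme is roughly the following (completely made-up
names, and delayed work is used here just as a stand-in for the timer; the
real logic lives in the kvfree_call_rcu() batching in mm/slab_common.c):

	static void krc_queue_ptr_sketch(struct my_krc *krc, void *ptr)
	{
		/* locking omitted for brevity */
		krc->ptrs[krc->nr++] = ptr;

		if (krc->nr == 1)
			/* first object: lazy mode, drain in ~5 seconds */
			queue_delayed_work(system_wq, &krc->drain_work, 5 * HZ);
		else if (krc->nr == ARRAY_SIZE(krc->ptrs))
			/* the array became full: fast mode, drain asap */
			mod_delayed_work(system_wq, &krc->drain_work, 0);
	}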

Also, this patch creates a collision because it goes its own way. We
have a kvfree_rcu_barrier() which becomes broken if this patch is applied?

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/14] SLUB percpu sheaves
  2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
                   ` (13 preceding siblings ...)
  2025-07-23 13:34 ` [PATCH v5 14/14] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
@ 2025-08-15 22:53 ` Sudarsan Mahendran
  2025-08-16  8:05   ` Harry Yoo
  14 siblings, 1 reply; 45+ messages in thread
From: Sudarsan Mahendran @ 2025-08-15 22:53 UTC (permalink / raw)
  To: vbabka
  Cc: Liam.Howlett, cl, harry.yoo, howlett, linux-kernel, linux-mm,
	maple-tree, rcu, rientjes, roman.gushchin, surenb, urezki

Hi Vlastimil,

I ported this patch series on top of v6.17.
I had to resolve some merge conflicts because of 
fba46a5d83ca8decb338722fb4899026d8d9ead2

The conflict resolution looks like:

@@ -5524,20 +5335,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
 int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 {
        MA_WR_STATE(wr_mas, mas, entry);
-       int ret = 0;
-       int request;

        mas_wr_prealloc_setup(&wr_mas);
        mas->store_type = mas_wr_store_type(&wr_mas);
-       request = mas_prealloc_calc(&wr_mas, entry);
-       if (!request)
+       mas_prealloc_calc(&wr_mas, entry);
+       if (!mas->node_request)
                goto set_flag;

        mas->mas_flags &= ~MA_STATE_PREALLOC;
-       mas_node_count_gfp(mas, request, gfp);
+       mas_alloc_nodes(mas, gfp);
        if (mas_is_err(mas)) {
-               mas_set_alloc_req(mas, 0);
-               ret = xa_err(mas->node);
+               int ret = xa_err(mas->node);
+
+               mas->node_request = 0;
                mas_destroy(mas);
                mas_reset(mas);
                return ret;
@@ -5545,7 +5355,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)

 set_flag:
        mas->mas_flags |= MA_STATE_PREALLOC;
-       return ret;
+       return 0;
 }
 EXPORT_SYMBOL_GPL(mas_preallocate);



When I try to boot this kernel, I see kernel panic
with rcu_free_sheaf() doing recursion into __kmem_cache_free_bulk()

Stack trace:

[    1.583673] Oops: stack guard page: 0000 [#1] SMP NOPTI
[    1.583676] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.17.0-smp-sheaves2 #1 NONE
[    1.583679] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
[    1.583684] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
[    1.583685] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
[    1.583687] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
[    1.583687] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
[    1.583688] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
[    1.583689] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
[    1.583690] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
[    1.583690] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
[    1.583691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.583692] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
[    1.583692] PKRU: 55555554
[    1.583693] Call Trace:
[    1.583694]  <IRQ>
[    1.583696]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583698]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583700]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583702]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583703]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583705]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583707]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583708]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583710]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583711]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583713]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583715]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583716]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583718]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583719]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583721]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583723]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583724]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583726]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583727]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583729]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583731]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583732]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583734]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583735]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583737]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583739]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583740]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583742]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583743]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583745]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583747]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583748]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583750]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583751]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583753]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583755]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583756]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583758]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583759]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583761]  ? update_group_capacity+0xad/0x1f0
[    1.583763]  ? sched_balance_rq+0x4f6/0x1e80
[    1.583765]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583767]  ? update_irq_load_avg+0x35/0x480
[    1.583768]  ? __pfx_rcu_free_sheaf+0x10/0x10
[    1.583769]  rcu_free_sheaf+0x86/0x110
[    1.583771]  rcu_do_batch+0x245/0x750
[    1.583772]  rcu_core+0x13a/0x260
[    1.583773]  handle_softirqs+0xcb/0x270
[    1.583775]  __irq_exit_rcu+0x48/0xf0
[    1.583776]  sysvec_apic_timer_interrupt+0x74/0x80
[    1.583778]  </IRQ>
[    1.583778]  <TASK>
[    1.583779]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[    1.583780] RIP: 0010:cpuidle_enter_state+0x101/0x290
[    1.583781] Code: 85 f4 ff ff 49 89 c4 8b 73 04 bf ff ff ff ff e8 d5 44 d4 ff 31 ff e8 9e c7 37 ff 80 7c 24 04 00 74 05 e8 12 45 d4 ff fb 85 ed <0f> 88 ba 00 00 00 89 e9 48 6b f9 68 4c 8b 44 24 08 49 8b 54 38 30
[    1.583782] RSP: 0018:ff40dbc4809afe80 EFLAGS: 00000202
[    1.583782] RAX: ff2aff31ba00b000 RBX: ff2afe75614b0800 RCX: 000000005e64b52b
[    1.583783] RDX: 000000005e73f761 RSI: 0000000000000067 RDI: 0000000000000000
[    1.583783] RBP: 0000000000000002 R08: fffffffffffffff6 R09: 0000000000000000
[    1.583784] R10: 0000000000000380 R11: ffffffff908c38d0 R12: 000000005e64b535
[    1.583784] R13: 000000005e5580da R14: ffffffff92890b10 R15: 0000000000000002
[    1.583784]  ? __pfx_read_tsc+0x10/0x10
[    1.583787]  cpuidle_enter+0x2c/0x40
[    1.583788]  do_idle+0x1a7/0x240
[    1.583790]  cpu_startup_entry+0x2a/0x30
[    1.583791]  start_secondary+0x95/0xa0
[    1.583794]  common_startup_64+0x13e/0x140
[    1.583796]  </TASK>
[    1.583796] Modules linked in:
[    1.583798] ---[ end trace 0000000000000000 ]---
[    1.583798] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
[    1.583800] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
[    1.583800] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
[    1.583801] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
[    1.583801] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
[    1.583801] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
[    1.583802] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
[    1.583802] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
[    1.583802] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
[    1.583803] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.583803] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
[    1.583803] PKRU: 55555554
[    1.583804] Kernel panic - not syncing: Fatal exception in interrupt
[    1.584659] Kernel Offset: 0xf600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/14] SLUB percpu sheaves
  2025-08-15 22:53 ` [PATCH v5 00/14] SLUB percpu sheaves Sudarsan Mahendran
@ 2025-08-16  8:05   ` Harry Yoo
       [not found]     ` <CAA9mObAiQbAYvzhW---VoqDA6Zsb152p5ePMvbco0xgwyvaB2Q@mail.gmail.com>
  0 siblings, 1 reply; 45+ messages in thread
From: Harry Yoo @ 2025-08-16  8:05 UTC (permalink / raw)
  To: Sudarsan Mahendran
  Cc: vbabka, Liam.Howlett, cl, howlett, linux-kernel, linux-mm,
	maple-tree, rcu, rientjes, roman.gushchin, surenb, urezki

On Fri, Aug 15, 2025 at 03:53:00PM -0700, Sudarsan Mahendran wrote:
> Hi Vlastimil,
> 
> I ported this patch series on top of v6.17.
> I had to resolve some merge conflicts because of 
> fba46a5d83ca8decb338722fb4899026d8d9ead2
> 
> The conflict resolution looks like:
> 
> @@ -5524,20 +5335,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>  int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>  {
>         MA_WR_STATE(wr_mas, mas, entry);
> -       int ret = 0;
> -       int request;
> 
>         mas_wr_prealloc_setup(&wr_mas);
>         mas->store_type = mas_wr_store_type(&wr_mas);
> -       request = mas_prealloc_calc(&wr_mas, entry);
> -       if (!request)
> +       mas_prealloc_calc(&wr_mas, entry);
> +       if (!mas->node_request)
>                 goto set_flag;
> 
>         mas->mas_flags &= ~MA_STATE_PREALLOC;
> -       mas_node_count_gfp(mas, request, gfp);
> +       mas_alloc_nodes(mas, gfp);
>         if (mas_is_err(mas)) {
> -               mas_set_alloc_req(mas, 0);
> -               ret = xa_err(mas->node);
> +               int ret = xa_err(mas->node);
> +
> +               mas->node_request = 0;
>                 mas_destroy(mas);
>                 mas_reset(mas);
>                 return ret;
> @@ -5545,7 +5355,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> 
>  set_flag:
>         mas->mas_flags |= MA_STATE_PREALLOC;
> -       return ret;
> +       return 0;
>  }
>  EXPORT_SYMBOL_GPL(mas_preallocate);
> 
> 
> 
> When I try to boot this kernel, I see kernel panic
> with rcu_free_sheaf() doing recursion into __kmem_cache_free_bulk()
> 
> Stack trace:
> 
> [    1.583673] Oops: stack guard page: 0000 [#1] SMP NOPTI
> [    1.583676] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.17.0-smp-sheaves2 #1 NONE
> [    1.583679] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
> [    1.583684] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
> [    1.583685] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
> [    1.583687] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
> [    1.583687] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
> [    1.583688] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
> [    1.583689] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
> [    1.583690] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
> [    1.583690] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
> [    1.583691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.583692] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
> [    1.583692] PKRU: 55555554
> [    1.583693] Call Trace:
> [    1.583694]  <IRQ>
> [    1.583696]  __kmem_cache_free_bulk+0x2c7/0x540

[..]

> [    1.583759]  __kmem_cache_free_bulk+0x2c7/0x540

Hi Sudarsan, thanks for the report.

I'm not really sure how __kmem_cache_free_bulk() can call itself.
There's no recursion of __kmem_cache_free_bulk() in the code.

As v6.17-rc1 is known to cause a few surprising bugs, could you please
rebase on top of mm-hotfixes-unstable and check if it still reproduces?

> [    1.583761]  ? update_group_capacity+0xad/0x1f0
> [    1.583763]  ? sched_balance_rq+0x4f6/0x1e80
> [    1.583765]  __kmem_cache_free_bulk+0x2c7/0x540
> [    1.583767]  ? update_irq_load_avg+0x35/0x480
> [    1.583768]  ? __pfx_rcu_free_sheaf+0x10/0x10
> [    1.583769]  rcu_free_sheaf+0x86/0x110
> [    1.583771]  rcu_do_batch+0x245/0x750
> [    1.583772]  rcu_core+0x13a/0x260
> [    1.583773]  handle_softirqs+0xcb/0x270
> [    1.583775]  __irq_exit_rcu+0x48/0xf0
> [    1.583776]  sysvec_apic_timer_interrupt+0x74/0x80
> [    1.583778]  </IRQ>
> [    1.583778]  <TASK>
> [    1.583779]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [    1.583780] RIP: 0010:cpuidle_enter_state+0x101/0x290
> [    1.583781] Code: 85 f4 ff ff 49 89 c4 8b 73 04 bf ff ff ff ff e8 d5 44 d4 ff 31 ff e8 9e c7 37 ff 80 7c 24 04 00 74 05 e8 12 45 d4 ff fb 85 ed <0f> 88 ba 00 00 00 89 e9 48 6b f9 68 4c 8b 44 24 08 49 8b 54 38 30
> [    1.583782] RSP: 0018:ff40dbc4809afe80 EFLAGS: 00000202
> [    1.583782] RAX: ff2aff31ba00b000 RBX: ff2afe75614b0800 RCX: 000000005e64b52b
> [    1.583783] RDX: 000000005e73f761 RSI: 0000000000000067 RDI: 0000000000000000
> [    1.583783] RBP: 0000000000000002 R08: fffffffffffffff6 R09: 0000000000000000
> [    1.583784] R10: 0000000000000380 R11: ffffffff908c38d0 R12: 000000005e64b535
> [    1.583784] R13: 000000005e5580da R14: ffffffff92890b10 R15: 0000000000000002
> [    1.583784]  ? __pfx_read_tsc+0x10/0x10
> [    1.583787]  cpuidle_enter+0x2c/0x40
> [    1.583788]  do_idle+0x1a7/0x240
> [    1.583790]  cpu_startup_entry+0x2a/0x30
> [    1.583791]  start_secondary+0x95/0xa0
> [    1.583794]  common_startup_64+0x13e/0x140
> [    1.583796]  </TASK>
> [    1.583796] Modules linked in:
> [    1.583798] ---[ end trace 0000000000000000 ]---
> [    1.583798] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
> [    1.583800] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
> [    1.583800] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
> [    1.583801] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
> [    1.583801] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
> [    1.583801] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
> [    1.583802] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
> [    1.583802] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
> [    1.583802] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
> [    1.583803] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.583803] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
> [    1.583803] PKRU: 55555554
> [    1.583804] Kernel panic - not syncing: Fatal exception in interrupt
> [    1.584659] Kernel Offset: 0xf600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/14] SLUB percpu sheaves
       [not found]     ` <CAA9mObAiQbAYvzhW---VoqDA6Zsb152p5ePMvbco0xgwyvaB2Q@mail.gmail.com>
@ 2025-08-16 18:31       ` Vlastimil Babka
  2025-08-16 18:33         ` Vlastimil Babka
  0 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-16 18:31 UTC (permalink / raw)
  To: Sudarsan Mahendran, Harry Yoo
  Cc: Liam.Howlett, cl, howlett, linux-kernel, linux-mm, maple-tree,
	rcu, rientjes, roman.gushchin, surenb, urezki, Greg Thelen

On 8/16/25 7:35 PM, Sudarsan Mahendran wrote:
> 
> 
> On Sat, Aug 16, 2025 at 1:06 AM Harry Yoo <harry.yoo@oracle.com
> <mailto:harry.yoo@oracle.com>> wrote:
>>
>> On Fri, Aug 15, 2025 at 03:53:00PM -0700, Sudarsan Mahendran wrote:
>> > Hi Vlastimil,
>> >
>> > I ported this patch series on top of v6.17.
>> > I had to resolve some merge conflicts because of
>> > fba46a5d83ca8decb338722fb4899026d8d9ead2
>> >
>> > The conflict resolution looks like:
>> >
>> > @@ -5524,20 +5335,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>> >  int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>> >  {
>> >         MA_WR_STATE(wr_mas, mas, entry);
>> > -       int ret = 0;
>> > -       int request;
>> >
>> >         mas_wr_prealloc_setup(&wr_mas);
>> >         mas->store_type = mas_wr_store_type(&wr_mas);
>> > -       request = mas_prealloc_calc(&wr_mas, entry);
>> > -       if (!request)
>> > +       mas_prealloc_calc(&wr_mas, entry);
>> > +       if (!mas->node_request)
>> >                 goto set_flag;
>> >
>> >         mas->mas_flags &= ~MA_STATE_PREALLOC;
>> > -       mas_node_count_gfp(mas, request, gfp);
>> > +       mas_alloc_nodes(mas, gfp);
>> >         if (mas_is_err(mas)) {
>> > -               mas_set_alloc_req(mas, 0);
>> > -               ret = xa_err(mas->node);
>> > +               int ret = xa_err(mas->node);
>> > +
>> > +               mas->node_request = 0;
>> >                 mas_destroy(mas);
>> >                 mas_reset(mas);
>> >                 return ret;
>> > @@ -5545,7 +5355,7 @@ int mas_preallocate(struct ma_state *mas, void
> *entry, gfp_t gfp)
>> >
>> >  set_flag:
>> >         mas->mas_flags |= MA_STATE_PREALLOC;
>> > -       return ret;
>> > +       return 0;
>> >  }
>> >  EXPORT_SYMBOL_GPL(mas_preallocate);
>> >
>> >
>> >
>> > When I try to boot this kernel, I see kernel panic
>> > with rcu_free_sheaf() doing recursion into __kmem_cache_free_bulk()
>> >
>> > Stack trace:
>> >
>> > [    1.583673] Oops: stack guard page: 0000 [#1] SMP NOPTI
>> > [    1.583676] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted
> 6.17.0-smp-sheaves2 #1 NONE
>> > [    1.583679] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
>> > [    1.583684] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48
> 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00
> 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
>> > [    1.583685] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
>> > [    1.583687] RAX: 0000000000000000 RBX: 0000000000000012 RCX:
> ffffffff939e8640
>> > [    1.583687] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI:
> ff2afe750004ad00
>> > [    1.583688] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09:
> ff2afe75368c3b00
>> > [    1.583689] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12:
> ff2aff31ba00b000
>> > [    1.583690] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15:
> ff2afe750004ad00
>> > [    1.583690] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000)
> knlGS:0000000000000000
>> > [    1.583691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [    1.583692] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4:
> 0000000000771ef0
>> > [    1.583692] PKRU: 55555554
>> > [    1.583693] Call Trace:
>> > [    1.583694]  <IRQ>
>> > [    1.583696]  __kmem_cache_free_bulk+0x2c7/0x540
>>
>> [..]
>>
>> > [    1.583759]  __kmem_cache_free_bulk+0x2c7/0x540
>>
>> Hi Sudarsan, thanks for the report.
>>
>> I'm not really sure how __kmem_cache_free_bulk() can call itself.
>> There's no recursion of __kmem_cache_free_bulk() in the code.
> Hi Harry,
> 
> I assume somehow the free_to_pcs_bulk() fallback case is taken, thus
> calling __kmem_cache_free_bulk(), which calls free_to_pcs_bulk() ad nauseam.

Could it be a rebase gone wrong? Mine to 6.17-rc1 is here (but untested)

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/
> free_to_pcs_bulk()
> {
> ...
> fallback:
>         __kmem_cache_free_bulk(s, size, p);
> ...
> }
> 
> static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size,
> void **p)

I don't have this; this code seems to correspond to my
kmem_cache_free_bulk(), while __kmem_cache_free_bulk() is just
build_detached_freelist() and do_slab_free() with no sheaves involved.

> {
>         if (!size)
>                 return;
> 
>         /*
>          * freeing to sheaves is so incompatible with the detached
> freelist so
>          * once we go that way, we have to do everything differently
>          */
>         if (s && s->cpu_sheaves) {
>                 free_to_pcs_bulk(s, size, p);
>                 return;
>         }
> ...
> 
> Thanks Greg for pointing this out.
> 
> 
>> As v6.17-rc1 is known to cause a few surprising bugs, could you please
>> rebase on top of mm-hotfixes-unstable and check if it still reproduces?
>>
>> > [    1.583761]  ? update_group_capacity+0xad/0x1f0
>> > [    1.583763]  ? sched_balance_rq+0x4f6/0x1e80
>> > [    1.583765]  __kmem_cache_free_bulk+0x2c7/0x540
>> > [    1.583767]  ? update_irq_load_avg+0x35/0x480
>> > [    1.583768]  ? __pfx_rcu_free_sheaf+0x10/0x10
>> > [    1.583769]  rcu_free_sheaf+0x86/0x110
>> > [    1.583771]  rcu_do_batch+0x245/0x750
>> > [    1.583772]  rcu_core+0x13a/0x260
>> > [    1.583773]  handle_softirqs+0xcb/0x270
>> > [    1.583775]  __irq_exit_rcu+0x48/0xf0
>> > [    1.583776]  sysvec_apic_timer_interrupt+0x74/0x80
>> > [    1.583778]  </IRQ>
>> > [    1.583778]  <TASK>
>> > [    1.583779]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> > [    1.583780] RIP: 0010:cpuidle_enter_state+0x101/0x290
>> > [    1.583781] Code: 85 f4 ff ff 49 89 c4 8b 73 04 bf ff ff ff ff e8
> d5 44 d4 ff 31 ff e8 9e c7 37 ff 80 7c 24 04 00 74 05 e8 12 45 d4 ff fb
> 85 ed <0f> 88 ba 00 00 00 89 e9 48 6b f9 68 4c 8b 44 24 08 49 8b 54 38 30
>> > [    1.583782] RSP: 0018:ff40dbc4809afe80 EFLAGS: 00000202
>> > [    1.583782] RAX: ff2aff31ba00b000 RBX: ff2afe75614b0800 RCX:
> 000000005e64b52b
>> > [    1.583783] RDX: 000000005e73f761 RSI: 0000000000000067 RDI:
> 0000000000000000
>> > [    1.583783] RBP: 0000000000000002 R08: fffffffffffffff6 R09:
> 0000000000000000
>> > [    1.583784] R10: 0000000000000380 R11: ffffffff908c38d0 R12:
> 000000005e64b535
>> > [    1.583784] R13: 000000005e5580da R14: ffffffff92890b10 R15:
> 0000000000000002
>> > [    1.583784]  ? __pfx_read_tsc+0x10/0x10
>> > [    1.583787]  cpuidle_enter+0x2c/0x40
>> > [    1.583788]  do_idle+0x1a7/0x240
>> > [    1.583790]  cpu_startup_entry+0x2a/0x30
>> > [    1.583791]  start_secondary+0x95/0xa0
>> > [    1.583794]  common_startup_64+0x13e/0x140
>> > [    1.583796]  </TASK>
>> > [    1.583796] Modules linked in:
>> > [    1.583798] ---[ end trace 0000000000000000 ]---
>> > [    1.583798] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
>> > [    1.583800] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48
> 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00
> 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
>> > [    1.583800] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
>> > [    1.583801] RAX: 0000000000000000 RBX: 0000000000000012 RCX:
> ffffffff939e8640
>> > [    1.583801] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI:
> ff2afe750004ad00
>> > [    1.583801] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09:
> ff2afe75368c3b00
>> > [    1.583802] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12:
> ff2aff31ba00b000
>> > [    1.583802] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15:
> ff2afe750004ad00
>> > [    1.583802] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000)
> knlGS:0000000000000000
>> > [    1.583803] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [    1.583803] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4:
> 0000000000771ef0
>> > [    1.583803] PKRU: 55555554
>> > [    1.583804] Kernel panic - not syncing: Fatal exception in interrupt
>> > [    1.584659] Kernel Offset: 0xf600000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> >
>> >


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/14] SLUB percpu sheaves
  2025-08-16 18:31       ` Vlastimil Babka
@ 2025-08-16 18:33         ` Vlastimil Babka
  2025-08-17  4:28           ` Sudarsan Mahendran
  0 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-16 18:33 UTC (permalink / raw)
  To: Sudarsan Mahendran, Harry Yoo
  Cc: Liam.Howlett, cl, howlett, linux-kernel, linux-mm, maple-tree,
	rcu, rientjes, roman.gushchin, surenb, urezki, Greg Thelen

On 8/16/25 8:31 PM, Vlastimil Babka wrote:
>>
>> I assume somehow the free_to_pcs_bulk() fallback case is taken, thus
>> calling __kmem_cache_free_bulk(), which calls free_to_pcs_bulk() ad nauseam.
> Could it be a rebase gone wrong? Mine to 6.17-rc1 is here (but untested)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/

This branch specifically
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slub-percpu-sheaves

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 00/14] SLUB percpu sheaves
  2025-08-16 18:33         ` Vlastimil Babka
@ 2025-08-17  4:28           ` Sudarsan Mahendran
  0 siblings, 0 replies; 45+ messages in thread
From: Sudarsan Mahendran @ 2025-08-17  4:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Harry Yoo, Liam.Howlett, cl, howlett, linux-kernel, linux-mm,
	maple-tree, rcu, rientjes, roman.gushchin, surenb, urezki,
	Greg Thelen

On Sat, Aug 16, 2025 at 11:31 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/16/25 8:31 PM, Vlastimil Babka wrote:
> >>
> >> I assume somehow the free_to_pcs_bulk() fallback case is taken, thus
> >> calling __kmem_cache_free_bulk(), which calls free_to_pcs_bulk() ad nauseam.
> > Could it be a rebase gone wrong? Mine to 6.17-rc1 is here (but untested)

Yes Vlastimil,

You're right. It is a rebase gone wrong. Thanks for catching this.

I ported this patch series on top of v6.17-rc1 using the b4 command:

b4 am -o - 20250723-slub-percpu-caches-v5-0-b792cd830f5d@suse.cz | git am --reject

For some reason b4 merging yielded me this:

git show 893ee67b5c75e7411e4e3c6ddaa8d0765985423e
slab: add opt-in caching layer of percpu sheaves

@@ -5252,6 +6133,15 @@ static void __kmem_cache_free_bulk(struct
kmem_cache *s, size_t size, void **p)
        if (!size)
                return;

+       /*
+        * freeing to sheaves is so incompatible with the detached freelist so
+        * once we go that way, we have to do everything differently
+        */
+       if (s && s->cpu_sheaves) {
+               free_to_pcs_bulk(s, size, p);
+               return;
+       }
+
        do {


Whereas the original patch [1] had this instead:

@@ -5033,6 +5801,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s,
size_t size, void **p)
        if (!size)
                return;

+       /*
+        * freeing to sheaves is so incompatible with the detached freelist so
+        * once we go that way, we have to do everything differently
+        */
+       if (s && s->cpu_sheaves) {
+               free_to_pcs_bulk(s, size, p);
+               return;
+       }
+

I have no idea why b4 got confused between kmem_cache_free_bulk() and
__kmem_cache_free_bulk().

After I fixed this issue, I'm able to boot the kernel successfully.

[1] https://lore.kernel.org/all/20250214-slub-percpu-caches-v2-1-88592ee0966a@suse.cz/

> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/
>
> This branch specifically
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slub-percpu-sheaves

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 01/14] slab: add opt-in caching layer of percpu sheaves
  2025-07-23 13:34 ` [PATCH v5 01/14] slab: add opt-in caching layer of " Vlastimil Babka
@ 2025-08-18 10:09   ` Harry Yoo
  2025-08-26  8:03     ` Vlastimil Babka
  2025-08-19  4:19   ` Suren Baghdasaryan
  1 sibling, 1 reply; 45+ messages in thread
From: Harry Yoo @ 2025-08-18 10:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 03:34:34PM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were get from or put to the barn, the _fail variants
> count how many such requests could not be satisfied mainly  because the
> barn was either empty or full. While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate own
> counters.  Finally, there are sheaf_alloc/sheaf_free counters.
> 
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fallback. Each per-NUMA-node
> barn has a spin_lock.
> 
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed. Similarly, we ignore it
> with CONFIG_SLUB_TINY which prefers low memory usage to performance.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1101 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1092 insertions(+), 47 deletions(-)
> 
> @@ -4554,6 +5164,274 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * pcs is locked. We should have get rid of the spare sheaf and obtained an
> + * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
> + * as a main sheaf, and make the current main sheaf a spare sheaf.
> + *
> + * However due to having relinquished the cpu_sheaves lock when obtaining
> + * the empty sheaf, we need to handle some unlikely but possible cases.
> + *
> + * If we put any sheaf to barn here, it's because we were interrupted or have
> + * been migrated to a different cpu, which should be rare enough so just ignore
> + * the barn's limits to simplify the handling.
> + *
> + * An alternative scenario that gets us here is when we fail
> + * barn_replace_full_sheaf(), because there's no empty sheaf available in the
> + * barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
> + * limit on full sheaves was not exceeded, we assume it didn't change and just
> + * put the full sheaf there.
> + */
> +static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> +		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
> +{
> +	/* This is what we expect to find if nobody interrupted us. */
> +	if (likely(!pcs->spare)) {
> +		pcs->spare = pcs->main;
> +		pcs->main = empty;
> +		return;
> +	}
> +
> +	/*
> +	 * Unlikely because if the main sheaf had space, we would have just
> +	 * freed to it. Get rid of our empty sheaf.
> +	 */
> +	if (pcs->main->size < s->sheaf_capacity) {
> +		barn_put_empty_sheaf(pcs->barn, empty);
> +		return;
> +	}
> +
> +	/* Also unlikely for the same reason/ */

nit: unnecessary '/'

> +	if (pcs->spare->size < s->sheaf_capacity) {
> +		swap(pcs->main, pcs->spare);
> +		barn_put_empty_sheaf(pcs->barn, empty);
> +		return;
> +	}
> +
> +	/*
> +	 * We probably failed barn_replace_full_sheaf() due to no empty sheaf
> +	 * available there, but we allocated one, so finish the job.
> +	 */
> +	barn_put_full_sheaf(pcs->barn, pcs->main);
> +	stat(s, BARN_PUT);
> +	pcs->main = empty;
> +}

> +static struct slub_percpu_sheaves *
> +__pcs_handle_full(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)
> +{
> +	struct slab_sheaf *empty;
> +	bool put_fail;
> +
> +restart:
> +	put_fail = false;
> +
> +	if (!pcs->spare) {
> +		empty = barn_get_empty_sheaf(pcs->barn);
> +		if (empty) {
> +			pcs->spare = pcs->main;
> +			pcs->main = empty;
> +			return pcs;
> +		}
> +		goto alloc_empty;
> +	}
> +
> +	if (pcs->spare->size < s->sheaf_capacity) {
> +		swap(pcs->main, pcs->spare);
> +		return pcs;
> +	}
> +
> +	empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +	if (!IS_ERR(empty)) {
> +		stat(s, BARN_PUT);
> +		pcs->main = empty;
> +		return pcs;
> +	}
> +
> +	if (PTR_ERR(empty) == -E2BIG) {
> +		/* Since we got here, spare exists and is full */
> +		struct slab_sheaf *to_flush = pcs->spare;
> +
> +		stat(s, BARN_PUT_FAIL);
> +
> +		pcs->spare = NULL;
> +		local_unlock(&s->cpu_sheaves->lock);
> +
> +		sheaf_flush_unused(s, to_flush);
> +		empty = to_flush;
> +		goto got_empty;
> +	}
> +
> +	/*
> +	 * We could not replace full sheaf because barn had no empty
> +	 * sheaves. We can still allocate it and put the full sheaf in
> +	 * __pcs_install_empty_sheaf(), but if we fail to allocate it,
> +	 * make sure to count the fail.
> +	 */
> +	put_fail = true;
> +
> +alloc_empty:
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +	if (empty)
> +		goto got_empty;
> +
> +	if (put_fail)
> +		 stat(s, BARN_PUT_FAIL);
> +
> +	if (!sheaf_flush_main(s))
> +		return NULL;
> +
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return NULL;
> +
> +	/*
> +	 * we flushed the main sheaf so it should be empty now,
> +	 * but in case we got preempted or migrated, we need to
> +	 * check again
> +	 */
> +	if (pcs->main->size == s->sheaf_capacity)
> +		goto restart;

I think it's missing:

pcs = this_cpu_ptr(s->cpu_sheaves);

between local_trylock() and reading pcs->main->size.
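
I.e. something along these lines, just to illustrate the placement (it
mirrors what the got_empty: path below already does after taking the lock):

	if (!local_trylock(&s->cpu_sheaves->lock))
		return NULL;

	/* we may have been migrated to a different cpu in the meantime */
	pcs = this_cpu_ptr(s->cpu_sheaves);

	if (pcs->main->size == s->sheaf_capacity)
		goto restart;

	return pcs;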

> +
> +	return pcs;
> +
> +got_empty:
> +	if (!local_trylock(&s->cpu_sheaves->lock)) {
> +		barn_put_empty_sheaf(pcs->barn, empty);
> +		return NULL;
> +	}
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +	__pcs_install_empty_sheaf(s, pcs, empty);
> +
> +	return pcs;
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -6481,7 +7464,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>  		__kmem_cache_release(s);
>  	return err;
>  }
> -

nit: unnecessary removal of a newline?

Otherwise looks good to me.

>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 01/14] slab: add opt-in caching layer of percpu sheaves
  2025-07-23 13:34 ` [PATCH v5 01/14] slab: add opt-in caching layer of " Vlastimil Babka
  2025-08-18 10:09   ` Harry Yoo
@ 2025-08-19  4:19   ` Suren Baghdasaryan
  2025-08-26  8:51     ` Vlastimil Babka
  1 sibling, 1 reply; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-19  4:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
>
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
>
> The sheaf_capacity value is exported in sysfs for observability.
>
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were get from or put to the barn, the _fail variants
> count how many such requests could not be satisfied mainly  because the
> barn was either empty or full. While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate own
> counters.  Finally, there are sheaf_alloc/sheaf_free counters.
>
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fallback. Each per-NUMA-node
> barn has a spin_lock.
>
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed. Similarly, we ignore it
> with CONFIG_SLUB_TINY which prefers low memory usage to performance.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1101 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1092 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..6cfd085907afb8fc6e502ff7a1a1830c52ff9125 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
>          * %NULL means no constructor.
>          */
>         void (*ctor)(void *);
> +       /**
> +        * @sheaf_capacity: Enable sheaves of given capacity for the cache.
> +        *
> +        * With a non-zero value, allocations from the cache go through caching
> +        * arrays called sheaves. Each cpu has a main sheaf that's always
> +        * present, and a spare sheaf thay may be not present. When both become

s/thay/that


> +        * empty, there's an attempt to replace an empty sheaf with a full sheaf
> +        * from the per-node barn.
> +        *
> +        * When no full sheaf is available, and gfp flags allow blocking, a
> +        * sheaf is allocated and filled from slab(s) using bulk allocation.
> +        * Otherwise the allocation falls back to the normal operation
> +        * allocating a single object from a slab.
> +        *
> +        * Analogically when freeing and both percpu sheaves are full, the barn
> +        * may replace it with an empty sheaf, unless it's over capacity. In
> +        * that case a sheaf is bulk freed to slab pages.
> +        *
> +        * The sheaves do not enforce NUMA placement of objects, so allocations
> +        * via kmem_cache_alloc_node() with a node specified other than
> +        * NUMA_NO_NODE will bypass them.
> +        *
> +        * Bulk allocation and free operations also try to use the cpu sheaves
> +        * and barn, but fallback to using slab pages directly.
> +        *
> +        * When slub_debug is enabled for the cache, the sheaf_capacity argument
> +        * is ignored.
> +        *
> +        * %0 means no sheaves will be created.
> +        */
> +       unsigned int sheaf_capacity;
>  };
>
>  struct kmem_cache *__kmem_cache_create_args(const char *name,
> diff --git a/mm/slab.h b/mm/slab.h
> index 05a21dc796e095e8db934564d559494cd81746ec..1980330c2fcb4a4613a7e4f7efc78b349993fd89 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -259,6 +259,7 @@ struct kmem_cache {
>  #ifndef CONFIG_SLUB_TINY
>         struct kmem_cache_cpu __percpu *cpu_slab;
>  #endif
> +       struct slub_percpu_sheaves __percpu *cpu_sheaves;
>         /* Used for retrieving partial slabs, etc. */
>         slab_flags_t flags;
>         unsigned long min_partial;
> @@ -272,6 +273,7 @@ struct kmem_cache {
>         /* Number of per cpu partial slabs to keep around */
>         unsigned int cpu_partial_slabs;
>  #endif
> +       unsigned int sheaf_capacity;
>         struct kmem_cache_order_objects oo;
>
>         /* Allocation and freeing of slabs */
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index bfe7c40eeee1a01c175766935c1e3c0304434a53..e2b197e47866c30acdbd1fee4159f262a751c5a7 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
>                 return 1;
>  #endif
>
> +       if (s->cpu_sheaves)
> +               return 1;
> +
>         /*
>          * We may have set a slab to be unmergeable during bootstrap.
>          */
> @@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
>                     object_size - args->usersize < args->useroffset))
>                 args->usersize = args->useroffset = 0;
>
> -       if (!args->usersize)
> +       if (!args->usersize && !args->sheaf_capacity)
>                 s = __kmem_cache_alias(name, object_size, args->align, flags,
>                                        args->ctor);
>         if (s)
> diff --git a/mm/slub.c b/mm/slub.c
> index 31e11ef256f90ad8a21d6b090f810f4c991a68d6..6543aaade60b0adaab232b2256d65c1042c62e1c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
>  #endif
>
>  enum stat_item {
> +       ALLOC_PCS,              /* Allocation from percpu sheaf */
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> +       FREE_PCS,               /* Free to percpu sheaf */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -372,6 +374,14 @@ enum stat_item {
>         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
>         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
>         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> +       SHEAF_FLUSH,            /* Objects flushed from a sheaf */
> +       SHEAF_REFILL,           /* Objects refilled to a sheaf */
> +       SHEAF_ALLOC,            /* Allocation of an empty sheaf */
> +       SHEAF_FREE,             /* Freeing of an empty sheaf */
> +       BARN_GET,               /* Got full sheaf from barn */
> +       BARN_GET_FAIL,          /* Failed to get full sheaf from barn */
> +       BARN_PUT,               /* Put full sheaf to barn */
> +       BARN_PUT_FAIL,          /* Failed to put full sheaf to barn */
>         NR_SLUB_STAT_ITEMS
>  };
>
> @@ -418,6 +428,33 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
>  #endif
>  }
>
> +#define MAX_FULL_SHEAVES       10
> +#define MAX_EMPTY_SHEAVES      10
> +
> +struct node_barn {
> +       spinlock_t lock;
> +       struct list_head sheaves_full;
> +       struct list_head sheaves_empty;
> +       unsigned int nr_full;
> +       unsigned int nr_empty;
> +};
> +
> +struct slab_sheaf {
> +       union {
> +               struct rcu_head rcu_head;
> +               struct list_head barn_list;
> +       };
> +       unsigned int size;
> +       void *objects[];
> +};
> +
> +struct slub_percpu_sheaves {
> +       local_trylock_t lock;
> +       struct slab_sheaf *main; /* never NULL when unlocked */
> +       struct slab_sheaf *spare; /* empty or full, may be NULL */
> +       struct node_barn *barn;
> +};
> +
>  /*
>   * The slab lists for all objects.
>   */
> @@ -430,6 +467,7 @@ struct kmem_cache_node {
>         atomic_long_t total_objects;
>         struct list_head full;
>  #endif
> +       struct node_barn *barn;
>  };
>
>  static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> @@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>   */
>  static nodemask_t slab_nodes;
>
> -#ifndef CONFIG_SLUB_TINY
>  /*
>   * Workqueue used for flush_cpu_slab().
>   */
>  static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> +       struct work_struct work;
> +       struct kmem_cache *s;
> +       bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>
>  /********************************************************************
>   *                     Core slab cache functions
> @@ -2437,6 +2482,359 @@ static void *setup_object(struct kmem_cache *s, void *object)
>         return object;
>  }
>
> +static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
> +                                       s->sheaf_capacity), gfp);
> +
> +       if (unlikely(!sheaf))
> +               return NULL;
> +
> +       stat(s, SHEAF_ALLOC);
> +
> +       return sheaf;
> +}
> +
> +static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +       kfree(sheaf);
> +
> +       stat(s, SHEAF_FREE);
> +}
> +
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> +                                  size_t size, void **p);
> +
> +
> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> +                        gfp_t gfp)
> +{
> +       int to_fill = s->sheaf_capacity - sheaf->size;
> +       int filled;
> +
> +       if (!to_fill)
> +               return 0;
> +
> +       filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> +                                        &sheaf->objects[sheaf->size]);
> +
> +       sheaf->size += filled;
> +
> +       stat_add(s, SHEAF_REFILL, filled);
> +
> +       if (filled < to_fill)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +
> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
> +
> +       if (!sheaf)
> +               return NULL;
> +
> +       if (refill_sheaf(s, sheaf, gfp)) {
> +               free_empty_sheaf(s, sheaf);
> +               return NULL;
> +       }
> +
> +       return sheaf;
> +}
> +
> +/*
> + * Maximum number of objects freed during a single flush of main pcs sheaf.
> + * Translates directly to an on-stack array size.
> + */
> +#define PCS_BATCH_MAX  32U
> +
> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
> +
> +/*
> + * Free all objects from the main sheaf. In order to perform
> + * __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
> + * object pointers are moved to a on-stack array under the lock. To bound the
> + * stack usage, limit each batch to PCS_BATCH_MAX.
> + *
> + * returns true if at least partially flushed
> + */
> +static bool sheaf_flush_main(struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       unsigned int batch, remaining;
> +       void *objects[PCS_BATCH_MAX];
> +       struct slab_sheaf *sheaf;
> +       bool ret = false;
> +
> +next_batch:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return ret;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +       sheaf = pcs->main;
> +
> +       batch = min(PCS_BATCH_MAX, sheaf->size);
> +
> +       sheaf->size -= batch;
> +       memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
> +
> +       remaining = sheaf->size;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> +
> +       stat_add(s, SHEAF_FLUSH, batch);
> +
> +       ret = true;
> +
> +       if (remaining)
> +               goto next_batch;
> +
> +       return ret;
> +}
> +
> +/*
> + * Free all objects from a sheaf that's unused, i.e. not linked to any
> + * cpu_sheaves, so we need no locking or batching. The locking is also not
> + * necessary when flushing cpu's sheaves (both spare and main) during cpu
> + * hotremove as the cpu is not executing anymore.
> + */
> +static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +       if (!sheaf->size)
> +               return;
> +
> +       stat_add(s, SHEAF_FLUSH, sheaf->size);
> +
> +       __kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
> +
> +       sheaf->size = 0;
> +}
> +
> +/*
> + * Caller needs to make sure migration is disabled in order to fully flush
> + * a single cpu's sheaves
> + *
> + * must not be called from an irq
> + *
> + * flushing operations are rare so let's keep it simple and flush to slabs
> + * directly, skipping the barn
> + */
> +static void pcs_flush_all(struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *spare;
> +
> +       local_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       spare = pcs->spare;
> +       pcs->spare = NULL;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       if (spare) {
> +               sheaf_flush_unused(s, spare);
> +               free_empty_sheaf(s, spare);
> +       }
> +
> +       sheaf_flush_main(s);
> +}
> +
> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +       /* The cpu is not executing anymore so we don't need pcs->lock */
> +       sheaf_flush_unused(s, pcs->main);
> +       if (pcs->spare) {
> +               sheaf_flush_unused(s, pcs->spare);
> +               free_empty_sheaf(s, pcs->spare);
> +               pcs->spare = NULL;
> +       }
> +}
> +
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_sheaves *pcs;
> +
> +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +               /* can happen when unwinding failed create */
> +               if (!pcs->main)
> +                       continue;
> +
> +               /*
> +                * We have already passed __kmem_cache_shutdown() so everything
> +                * was flushed and there should be no objects allocated from
> +                * slabs, otherwise kmem_cache_destroy() would have aborted.
> +                * Therefore something would have to be really wrong if the
> +                * warnings here trigger, and we should rather leave objects and
> +                * sheaves to leak in that case.
> +                */
> +
> +               WARN_ON(pcs->spare);
> +
> +               if (!WARN_ON(pcs->main->size)) {
> +                       free_empty_sheaf(s, pcs->main);
> +                       pcs->main = NULL;
> +               }
> +       }
> +
> +       free_percpu(s->cpu_sheaves);
> +       s->cpu_sheaves = NULL;
> +}
> +
> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
> +{
> +       struct slab_sheaf *empty = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_empty) {
> +               empty = list_first_entry(&barn->sheaves_empty,
> +                                        struct slab_sheaf, barn_list);
> +               list_del(&empty->barn_list);
> +               barn->nr_empty--;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return empty;
> +}
> +
> +/*
> + * The following two functions are used mainly in cases where we have to undo an
> + * intended action due to a race or cpu migration. Thus they do not check the
> + * empty or full sheaf limits for simplicity.
> + */
> +
> +static void barn_put_empty_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_add(&sheaf->barn_list, &barn->sheaves_empty);
> +       barn->nr_empty++;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +}
> +
> +static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_add(&sheaf->barn_list, &barn->sheaves_full);
> +       barn->nr_full++;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +}
> +
> +/*
> + * If a full sheaf is available, return it and put the supplied empty one to
> + * the barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> + * change.
> + */
> +static struct slab_sheaf *
> +barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> +{
> +       struct slab_sheaf *full = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full) {
> +               full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> +                                       barn_list);
> +               list_del(&full->barn_list);
> +               list_add(&empty->barn_list, &barn->sheaves_empty);
> +               barn->nr_full--;
> +               barn->nr_empty++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return full;
> +}

nit: missing blank line between the two functions

> +/*
> + * If an empty sheaf is available, return it and put the supplied full one to
> + * the barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> +{
> +       struct slab_sheaf *empty;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full >= MAX_FULL_SHEAVES) {
> +               empty = ERR_PTR(-E2BIG);
> +       } else if (!barn->nr_empty) {
> +               empty = ERR_PTR(-ENOMEM);
> +       } else {
> +               empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
> +                                        barn_list);
> +               list_del(&empty->barn_list);
> +               list_add(&full->barn_list, &barn->sheaves_full);
> +               barn->nr_empty--;
> +               barn->nr_full++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return empty;
> +}
> +
> +static void barn_init(struct node_barn *barn)
> +{
> +       spin_lock_init(&barn->lock);
> +       INIT_LIST_HEAD(&barn->sheaves_full);
> +       INIT_LIST_HEAD(&barn->sheaves_empty);
> +       barn->nr_full = 0;
> +       barn->nr_empty = 0;
> +}
> +
> +static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
> +{
> +       struct list_head empty_list;
> +       struct list_head full_list;
> +       struct slab_sheaf *sheaf, *sheaf2;
> +       unsigned long flags;
> +
> +       INIT_LIST_HEAD(&empty_list);
> +       INIT_LIST_HEAD(&full_list);
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_splice_init(&barn->sheaves_full, &full_list);
> +       barn->nr_full = 0;
> +       list_splice_init(&barn->sheaves_empty, &empty_list);
> +       barn->nr_empty = 0;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
> +               sheaf_flush_unused(s, sheaf);
> +               free_empty_sheaf(s, sheaf);
> +       }
> +
> +       list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
> +               free_empty_sheaf(s, sheaf);
> +}
> +
>  /*
>   * Slab allocation and freeing
>   */
> @@ -3312,11 +3710,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>         put_partials_cpu(s, c);
>  }
>
> -struct slub_flush_work {
> -       struct work_struct work;
> -       struct kmem_cache *s;
> -       bool skip;
> -};
> +static inline void flush_this_cpu_slab(struct kmem_cache *s)
> +{
> +       struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> +
> +       if (c->slab)
> +               flush_slab(s, c);
> +
> +       put_partials(s);
> +}
> +
> +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> +{
> +       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +
> +       return c->slab || slub_percpu_partial(c);
> +}
> +
> +#else /* CONFIG_SLUB_TINY */
> +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> +static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
> +static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
> +#endif /* CONFIG_SLUB_TINY */
> +
> +static bool has_pcs_used(int cpu, struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       if (!s->cpu_sheaves)
> +               return false;
> +
> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +       return (pcs->spare || pcs->main->size);
> +}
> +
> +static void pcs_flush_all(struct kmem_cache *s);
>
>  /*
>   * Flush cpu slab.
> @@ -3326,30 +3755,18 @@ struct slub_flush_work {
>  static void flush_cpu_slab(struct work_struct *w)
>  {
>         struct kmem_cache *s;
> -       struct kmem_cache_cpu *c;
>         struct slub_flush_work *sfw;
>
>         sfw = container_of(w, struct slub_flush_work, work);
>
>         s = sfw->s;
> -       c = this_cpu_ptr(s->cpu_slab);
>
> -       if (c->slab)
> -               flush_slab(s, c);
> +       if (s->cpu_sheaves)
> +               pcs_flush_all(s);
>
> -       put_partials(s);
> +       flush_this_cpu_slab(s);
>  }
>
> -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> -{
> -       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> -
> -       return c->slab || slub_percpu_partial(c);
> -}
> -
> -static DEFINE_MUTEX(flush_lock);
> -static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> -
>  static void flush_all_cpus_locked(struct kmem_cache *s)
>  {
>         struct slub_flush_work *sfw;
> @@ -3360,7 +3777,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
>
>         for_each_online_cpu(cpu) {
>                 sfw = &per_cpu(slub_flush, cpu);
> -               if (!has_cpu_slab(cpu, s)) {
> +               if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
>                         sfw->skip = true;
>                         continue;
>                 }
> @@ -3396,19 +3813,15 @@ static int slub_cpu_dead(unsigned int cpu)
>         struct kmem_cache *s;
>
>         mutex_lock(&slab_mutex);
> -       list_for_each_entry(s, &slab_caches, list)
> +       list_for_each_entry(s, &slab_caches, list) {
>                 __flush_cpu_slab(s, cpu);
> +               if (s->cpu_sheaves)
> +                       __pcs_flush_all_cpu(s, cpu);
> +       }
>         mutex_unlock(&slab_mutex);
>         return 0;
>  }
>
> -#else /* CONFIG_SLUB_TINY */
> -static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
> -static inline void flush_all(struct kmem_cache *s) { }
> -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> -static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
> -#endif /* CONFIG_SLUB_TINY */
> -
>  /*
>   * Check if the objects in a per cpu structure fit numa
>   * locality expectations.
> @@ -4158,6 +4571,199 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>         return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
>  }
>
> +static struct slub_percpu_sheaves *
> +__pcs_handle_empty(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)

The naming is a bit ambiguous IMO. Maybe __pcs_replace_empty_main() ?

> +{
> +       struct slab_sheaf *empty = NULL;
> +       struct slab_sheaf *full;
> +       bool can_alloc;
> +

Can we add lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock))?
More for documentation purposes than anything else.

> +       if (pcs->spare && pcs->spare->size > 0) {
> +               swap(pcs->main, pcs->spare);
> +               return pcs;
> +       }
> +
> +       full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> +
> +       if (full) {
> +               stat(s, BARN_GET);
> +               pcs->main = full;
> +               return pcs;
> +       }
> +
> +       stat(s, BARN_GET_FAIL);
> +
> +       can_alloc = gfpflags_allow_blocking(gfp);
> +
> +       if (can_alloc) {
> +               if (pcs->spare) {
> +                       empty = pcs->spare;
> +                       pcs->spare = NULL;
> +               } else {
> +                       empty = barn_get_empty_sheaf(pcs->barn);
> +               }
> +       }
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       if (!can_alloc)
> +               return NULL;
> +
> +       if (empty) {
> +               if (!refill_sheaf(s, empty, gfp)) {
> +                       full = empty;
> +               } else {
> +                       /*
> +                        * we must be very low on memory so don't bother
> +                        * with the barn
> +                        */
> +                       free_empty_sheaf(s, empty);
> +               }
> +       } else {
> +               full = alloc_full_sheaf(s, gfp);
> +       }
> +
> +       if (!full)
> +               return NULL;
> +
> +       /*
> +        * we can reach here only when gfpflags_allow_blocking
> +        * so this must not be an irq
> +        */
> +       local_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       /*
> +        * If we are returning an empty sheaf, we either got it from the
> +        * barn or had to allocate one. If we are returning a full
> +        * sheaf, it's due to racing or being migrated to a different
> +        * cpu. Breaching the barn's sheaf limits should thus be rare
> +        * enough, so just ignore them to simplify the recovery.
> +        */
> +
> +       if (pcs->main->size == 0) {
> +               barn_put_empty_sheaf(pcs->barn, pcs->main);
> +               pcs->main = full;
> +               return pcs;
> +       }
> +
> +       if (!pcs->spare) {
> +               pcs->spare = full;
> +               return pcs;
> +       }
> +
> +       if (pcs->spare->size == 0) {
> +               barn_put_empty_sheaf(pcs->barn, pcs->spare);
> +               pcs->spare = full;
> +               return pcs;
> +       }
> +
> +       barn_put_full_sheaf(pcs->barn, full);
> +       stat(s, BARN_PUT);
> +
> +       return pcs;
> +}
> +
> +static __fastpath_inline
> +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       void *object;
> +
> +#ifdef CONFIG_NUMA
> +       if (static_branch_unlikely(&strict_numa)) {
> +               if (current->mempolicy)
> +                       return NULL;
> +       }
> +#endif
> +
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return NULL;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == 0)) {
> +               pcs = __pcs_handle_empty(s, pcs, gfp);
> +               if (unlikely(!pcs))
> +                       return NULL;
> +       }
> +
> +       object = pcs->main->objects[--pcs->main->size];
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat(s, ALLOC_PCS);
> +
> +       return object;
> +}
> +
> +static __fastpath_inline
> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *main;
> +       unsigned int allocated = 0;
> +       unsigned int batch;
> +
> +next_batch:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return allocated;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == 0)) {
> +
> +               struct slab_sheaf *full;
> +
> +               if (pcs->spare && pcs->spare->size > 0) {
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_alloc;
> +               }
> +
> +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> +
> +               if (full) {
> +                       stat(s, BARN_GET);
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               stat(s, BARN_GET_FAIL);
> +
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               /*
> +                * Once full sheaves in the barn are depleted, let the bulk
> +                * allocation continue from slab pages, otherwise we would just
> +                * be copying arrays of pointers twice.
> +                */
> +               return allocated;
> +       }
> +
> +do_alloc:
> +
> +       main = pcs->main;
> +       batch = min(size, main->size);
> +
> +       main->size -= batch;
> +       memcpy(p, main->objects + main->size, batch * sizeof(void *));
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat_add(s, ALLOC_PCS, batch);
> +
> +       allocated += batch;
> +
> +       if (batch < size) {
> +               p += batch;
> +               size -= batch;
> +               goto next_batch;
> +       }
> +
> +       return allocated;
> +}
> +
> +
>  /*
>   * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
>   * have the fastpath folded into their functions. So no function call
> @@ -4182,7 +4788,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>         if (unlikely(object))
>                 goto out;
>
> -       object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> +       if (s->cpu_sheaves && node == NUMA_NO_NODE)
> +               object = alloc_from_pcs(s, gfpflags);
> +
> +       if (!object)
> +               object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
>         maybe_wipe_obj_freeptr(s, object);
>         init = slab_want_init_on_alloc(gfpflags, s);
> @@ -4554,6 +5164,274 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>         discard_slab(s, slab);
>  }
>
> +/*
> + * pcs is locked. We should have gotten rid of the spare sheaf and obtained an
> + * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
> + * as a main sheaf, and make the current main sheaf a spare sheaf.
> + *
> + * However due to having relinquished the cpu_sheaves lock when obtaining
> + * the empty sheaf, we need to handle some unlikely but possible cases.
> + *
> + * If we put any sheaf to the barn here, it's because we were interrupted or have
> + * been migrated to a different cpu, which should be rare enough so just ignore
> + * the barn's limits to simplify the handling.
> + *
> + * An alternative scenario that gets us here is when we fail
> + * barn_replace_full_sheaf(), because there's no empty sheaf available in the
> + * barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
> + * limit on full sheaves was not exceeded, we assume it didn't change and just
> + * put the full sheaf there.
> + */
> +static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> +               struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
> +{

lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock))

> +       /* This is what we expect to find if nobody interrupted us. */
> +       if (likely(!pcs->spare)) {
> +               pcs->spare = pcs->main;
> +               pcs->main = empty;
> +               return;
> +       }
> +
> +       /*
> +        * Unlikely because if the main sheaf had space, we would have just
> +        * freed to it. Get rid of our empty sheaf.
> +        */
> +       if (pcs->main->size < s->sheaf_capacity) {
> +               barn_put_empty_sheaf(pcs->barn, empty);
> +               return;
> +       }
> +
> +       /* Also unlikely for the same reason. */
> +       if (pcs->spare->size < s->sheaf_capacity) {
> +               swap(pcs->main, pcs->spare);
> +               barn_put_empty_sheaf(pcs->barn, empty);
> +               return;
> +       }
> +
> +       /*
> +        * We probably failed barn_replace_full_sheaf() due to no empty sheaf
> +        * available there, but we allocated one, so finish the job.
> +        */
> +       barn_put_full_sheaf(pcs->barn, pcs->main);
> +       stat(s, BARN_PUT);
> +       pcs->main = empty;
> +}
> +

IIUC s->cpu_sheaves->lock is locked when we enter __pcs_handle_full(),
it is still locked if this function returns a valid pcs pointer and it's
unlocked if NULL gets returned. A comment clarifying these locking
rules would be great.
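
Something like this perhaps (wording is just a suggestion):

	/*
	 * Must be called with s->cpu_sheaves->lock held (local_trylock).
	 * Returns with the lock still held when a valid pcs pointer is
	 * returned; returns NULL with the lock already released.
	 */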

> +static struct slub_percpu_sheaves *
> +__pcs_handle_full(struct kmem_cache *s, struct slub_percpu_sheaves *pcs)

__pcs_replace_full_main() ?

> +{
> +       struct slab_sheaf *empty;
> +       bool put_fail;

lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock))

> +
> +restart:
> +       put_fail = false;
> +
> +       if (!pcs->spare) {
> +               empty = barn_get_empty_sheaf(pcs->barn);
> +               if (empty) {
> +                       pcs->spare = pcs->main;
> +                       pcs->main = empty;
> +                       return pcs;
> +               }
> +               goto alloc_empty;
> +       }
> +
> +       if (pcs->spare->size < s->sheaf_capacity) {
> +               swap(pcs->main, pcs->spare);
> +               return pcs;
> +       }
> +
> +       empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +       if (!IS_ERR(empty)) {
> +               stat(s, BARN_PUT);
> +               pcs->main = empty;
> +               return pcs;
> +       }
> +
> +       if (PTR_ERR(empty) == -E2BIG) {
> +               /* Since we got here, spare exists and is full */
> +               struct slab_sheaf *to_flush = pcs->spare;
> +
> +               stat(s, BARN_PUT_FAIL);
> +
> +               pcs->spare = NULL;
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               sheaf_flush_unused(s, to_flush);
> +               empty = to_flush;
> +               goto got_empty;
> +       }
> +
> +       /*
> +        * We could not replace the full sheaf because the barn had no empty
> +        * sheaves. We can still allocate one and put the full sheaf in
> +        * __pcs_install_empty_sheaf(), but if we fail to allocate it,
> +        * make sure to count the fail.
> +        */
> +       put_fail = true;
> +
> +alloc_empty:
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +       if (empty)
> +               goto got_empty;
> +
> +       if (put_fail)
> +                stat(s, BARN_PUT_FAIL);
> +
> +       if (!sheaf_flush_main(s))
> +               return NULL;
> +
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return NULL;
> +
> +       /*
> +        * we flushed the main sheaf so it should be empty now,
> +        * but in case we got preempted or migrated, we need to
> +        * check again
> +        */
> +       if (pcs->main->size == s->sheaf_capacity)
> +               goto restart;
> +
> +       return pcs;
> +
> +got_empty:
> +       if (!local_trylock(&s->cpu_sheaves->lock)) {
> +               barn_put_empty_sheaf(pcs->barn, empty);
> +               return NULL;
> +       }
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +       __pcs_install_empty_sheaf(s, pcs, empty);
> +
> +       return pcs;
> +}
> +
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return false;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +               pcs = __pcs_handle_full(s, pcs);
> +               if (unlikely(!pcs))
> +                       return false;
> +       }
> +
> +       pcs->main->objects[pcs->main->size++] = object;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat(s, FREE_PCS);
> +
> +       return true;
> +}
> +
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *main, *empty;
> +       unsigned int batch, i = 0;
> +       bool init;
> +
> +       init = slab_want_init_on_free(s);
> +
> +       while (i < size) {
> +               struct slab *slab = virt_to_slab(p[i]);
> +
> +               memcg_slab_free_hook(s, slab, p + i, 1);
> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> +
> +               if (unlikely(!slab_free_hook(s, p[i], init, false))) {
> +                       p[i] = p[--size];
> +                       if (!size)
> +                               return;
> +                       continue;
> +               }
> +
> +               i++;
> +       }
> +
> +next_batch:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               goto fallback;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (likely(pcs->main->size < s->sheaf_capacity))
> +               goto do_free;
> +
> +       if (!pcs->spare) {
> +               empty = barn_get_empty_sheaf(pcs->barn);
> +               if (!empty)
> +                       goto no_empty;
> +
> +               pcs->spare = pcs->main;
> +               pcs->main = empty;
> +               goto do_free;
> +       }
> +
> +       if (pcs->spare->size < s->sheaf_capacity) {
> +               swap(pcs->main, pcs->spare);
> +               goto do_free;
> +       }
> +
> +       empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +       if (IS_ERR(empty)) {
> +               stat(s, BARN_PUT_FAIL);
> +               goto no_empty;
> +       }
> +
> +       stat(s, BARN_PUT);
> +       pcs->main = empty;
> +
> +do_free:
> +       main = pcs->main;
> +       batch = min(size, s->sheaf_capacity - main->size);
> +
> +       memcpy(main->objects + main->size, p, batch * sizeof(void *));
> +       main->size += batch;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat_add(s, FREE_PCS, batch);
> +
> +       if (batch < size) {
> +               p += batch;
> +               size -= batch;
> +               goto next_batch;
> +       }
> +
> +       return;
> +
> +no_empty:
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       /*
> +        * if we depleted all empty sheaves in the barn or there are too
> +        * many full sheaves, free the rest to slab pages
> +        */
> +fallback:
> +       __kmem_cache_free_bulk(s, size, p);
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -4640,7 +5518,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>         memcg_slab_free_hook(s, slab, &object, 1);
>         alloc_tagging_slab_free_hook(s, slab, &object, 1);
>
> -       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +       if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +               return;
> +
> +       if (!s->cpu_sheaves || !free_to_pcs(s, object))
>                 do_slab_free(s, slab, object, object, 1, addr);
>  }
>
> @@ -5236,6 +6117,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>         if (!size)
>                 return;
>
> +       /*
> +        * freeing to sheaves is so incompatible with the detached freelist that
> +        * once we go that way, we have to do everything differently
> +        */
> +       if (s && s->cpu_sheaves) {
> +               free_to_pcs_bulk(s, size, p);
> +               return;
> +       }
> +
>         do {
>                 struct detached_freelist df;
>
> @@ -5354,7 +6244,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>  int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>                                  void **p)
>  {
> -       int i;
> +       unsigned int i = 0;
>
>         if (!size)
>                 return 0;
> @@ -5363,9 +6253,20 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>         if (unlikely(!s))
>                 return 0;
>
> -       i = __kmem_cache_alloc_bulk(s, flags, size, p);
> -       if (unlikely(i == 0))
> -               return 0;
> +       if (s->cpu_sheaves)
> +               i = alloc_from_pcs_bulk(s, size, p);
> +
> +       if (i < size) {
> +               /*
> +                * If we ran out of memory, don't bother with freeing back to
> +                * the percpu sheaves, we have bigger problems.
> +                */
> +               if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0)) {
> +                       if (i > 0)
> +                               __kmem_cache_free_bulk(s, i, p);
> +                       return 0;
> +               }
> +       }
>
>         /*
>          * memcg and kmem_cache debug support and memory initialization.
> @@ -5375,11 +6276,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>                     slab_want_init_on_alloc(flags, s), s->object_size))) {
>                 return 0;
>         }
> -       return i;
> +
> +       return size;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
>
> -
>  /*
>   * Object placement in a slab is made very easy because we always start at
>   * offset 0. If we tune the size of the object to the alignment then we can
> @@ -5513,7 +6414,7 @@ static inline int calculate_order(unsigned int size)
>  }
>
>  static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
>  {
>         n->nr_partial = 0;
>         spin_lock_init(&n->list_lock);
> @@ -5523,6 +6424,9 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>         atomic_long_set(&n->total_objects, 0);
>         INIT_LIST_HEAD(&n->full);
>  #endif
> +       n->barn = barn;
> +       if (barn)
> +               barn_init(barn);
>  }
>
>  #ifndef CONFIG_SLUB_TINY
> @@ -5553,6 +6457,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>  }
>  #endif /* CONFIG_SLUB_TINY */
>
> +static int init_percpu_sheaves(struct kmem_cache *s)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_sheaves *pcs;
> +               int nid;
> +
> +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +               local_trylock_init(&pcs->lock);
> +
> +               nid = cpu_to_mem(cpu);
> +
> +               pcs->barn = get_node(s, nid)->barn;
> +               pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
> +
> +               if (!pcs->main)
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
>  static struct kmem_cache *kmem_cache_node;
>
>  /*
> @@ -5588,7 +6516,7 @@ static void early_kmem_cache_node_alloc(int node)
>         slab->freelist = get_freepointer(kmem_cache_node, n);
>         slab->inuse = 1;
>         kmem_cache_node->node[node] = n;
> -       init_kmem_cache_node(n);
> +       init_kmem_cache_node(n, NULL);
>         inc_slabs_node(kmem_cache_node, node, slab->objects);
>
>         /*
> @@ -5604,6 +6532,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
>         struct kmem_cache_node *n;
>
>         for_each_kmem_cache_node(s, node, n) {
> +               if (n->barn) {
> +                       WARN_ON(n->barn->nr_full);
> +                       WARN_ON(n->barn->nr_empty);
> +                       kfree(n->barn);
> +                       n->barn = NULL;
> +               }
> +
>                 s->node[node] = NULL;
>                 kmem_cache_free(kmem_cache_node, n);
>         }
> @@ -5612,6 +6547,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
>  void __kmem_cache_release(struct kmem_cache *s)
>  {
>         cache_random_seq_destroy(s);
> +       if (s->cpu_sheaves)
> +               pcs_destroy(s);
>  #ifndef CONFIG_SLUB_TINY
>         free_percpu(s->cpu_slab);
>  #endif
> @@ -5624,20 +6561,29 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>
>         for_each_node_mask(node, slab_nodes) {
>                 struct kmem_cache_node *n;
> +               struct node_barn *barn = NULL;
>
>                 if (slab_state == DOWN) {
>                         early_kmem_cache_node_alloc(node);
>                         continue;
>                 }
> +
> +               if (s->cpu_sheaves) {
> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> +                       if (!barn)
> +                               return 0;
> +               }
> +
>                 n = kmem_cache_alloc_node(kmem_cache_node,
>                                                 GFP_KERNEL, node);
> -
>                 if (!n) {
> -                       free_kmem_cache_nodes(s);

Why do you skip free_kmem_cache_nodes() here?


> +                       kfree(barn);
>                         return 0;
>                 }
>
> -               init_kmem_cache_node(n);
> +               init_kmem_cache_node(n, barn);
> +
>                 s->node[node] = n;
>         }
>         return 1;
> @@ -5894,6 +6840,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>         flush_all_cpus_locked(s);
>         /* Attempt to free all objects */
>         for_each_kmem_cache_node(s, node, n) {
> +               if (n->barn)
> +                       barn_shrink(s, n->barn);
>                 free_partial(s, n);
>                 if (n->nr_partial || node_nr_slabs(n))
>                         return 1;
> @@ -6097,6 +7045,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
>                 for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
>                         INIT_LIST_HEAD(promote + i);
>
> +               if (n->barn)
> +                       barn_shrink(s, n->barn);
> +
>                 spin_lock_irqsave(&n->list_lock, flags);
>
>                 /*
> @@ -6209,12 +7160,24 @@ static int slab_mem_going_online_callback(void *arg)
>          */
>         mutex_lock(&slab_mutex);
>         list_for_each_entry(s, &slab_caches, list) {
> +               struct node_barn *barn = NULL;
> +
>                 /*
>                  * The structure may already exist if the node was previously
>                  * onlined and offlined.
>                  */
>                 if (get_node(s, nid))
>                         continue;
> +
> +               if (s->cpu_sheaves) {
> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> +                       if (!barn) {
> +                               ret = -ENOMEM;
> +                               goto out;
> +                       }
> +               }
> +
>                 /*
>                  * XXX: kmem_cache_alloc_node will fallback to other nodes
>                  *      since memory is not yet available from the node that
> @@ -6222,10 +7185,13 @@ static int slab_mem_going_online_callback(void *arg)
>                  */
>                 n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
>                 if (!n) {
> +                       kfree(barn);
>                         ret = -ENOMEM;
>                         goto out;
>                 }
> -               init_kmem_cache_node(n);
> +
> +               init_kmem_cache_node(n, barn);
> +
>                 s->node[nid] = n;
>         }
>         /*
> @@ -6444,6 +7410,17 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
>         set_cpu_partial(s);
>
> +       if (args->sheaf_capacity && !IS_ENABLED(CONFIG_SLUB_TINY)
> +                                       && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +               s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> +               if (!s->cpu_sheaves) {
> +                       err = -ENOMEM;
> +                       goto out;
> +               }
> +               // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +               s->sheaf_capacity = args->sheaf_capacity;
> +       }
> +
>  #ifdef CONFIG_NUMA
>         s->remote_node_defrag_ratio = 1000;
>  #endif
> @@ -6460,6 +7437,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>         if (!alloc_kmem_cache_cpus(s))
>                 goto out;
>
> +       if (s->cpu_sheaves) {
> +               err = init_percpu_sheaves(s);
> +               if (err)
> +                       goto out;
> +       }
> +
>         err = 0;
>
>         /* Mutex is not taken during early boot */
> @@ -6481,7 +7464,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>                 __kmem_cache_release(s);
>         return err;
>  }
> -
>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {
> @@ -6912,6 +7894,12 @@ static ssize_t order_show(struct kmem_cache *s, char *buf)
>  }
>  SLAB_ATTR_RO(order);
>
> +static ssize_t sheaf_capacity_show(struct kmem_cache *s, char *buf)
> +{
> +       return sysfs_emit(buf, "%u\n", s->sheaf_capacity);
> +}
> +SLAB_ATTR_RO(sheaf_capacity);
> +
>  static ssize_t min_partial_show(struct kmem_cache *s, char *buf)
>  {
>         return sysfs_emit(buf, "%lu\n", s->min_partial);
> @@ -7259,8 +8247,10 @@ static ssize_t text##_store(struct kmem_cache *s,                \
>  }                                                              \
>  SLAB_ATTR(text);                                               \
>
> +STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> +STAT_ATTR(FREE_PCS, free_cpu_sheaf);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -7285,6 +8275,14 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
>  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
>  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
>  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> +STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
> +STAT_ATTR(SHEAF_REFILL, sheaf_refill);
> +STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
> +STAT_ATTR(SHEAF_FREE, sheaf_free);
> +STAT_ATTR(BARN_GET, barn_get);
> +STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
> +STAT_ATTR(BARN_PUT, barn_put);
> +STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
>  #endif /* CONFIG_SLUB_STATS */
>
>  #ifdef CONFIG_KFENCE
> @@ -7315,6 +8313,7 @@ static struct attribute *slab_attrs[] = {
>         &object_size_attr.attr,
>         &objs_per_slab_attr.attr,
>         &order_attr.attr,
> +       &sheaf_capacity_attr.attr,
>         &min_partial_attr.attr,
>         &cpu_partial_attr.attr,
>         &objects_partial_attr.attr,
> @@ -7346,8 +8345,10 @@ static struct attribute *slab_attrs[] = {
>         &remote_node_defrag_ratio_attr.attr,
>  #endif
>  #ifdef CONFIG_SLUB_STATS
> +       &alloc_cpu_sheaf_attr.attr,
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
> +       &free_cpu_sheaf_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
> @@ -7372,6 +8373,14 @@ static struct attribute *slab_attrs[] = {
>         &cpu_partial_free_attr.attr,
>         &cpu_partial_node_attr.attr,
>         &cpu_partial_drain_attr.attr,
> +       &sheaf_flush_attr.attr,
> +       &sheaf_refill_attr.attr,
> +       &sheaf_alloc_attr.attr,
> +       &sheaf_free_attr.attr,
> +       &barn_get_attr.attr,
> +       &barn_get_fail_attr.attr,
> +       &barn_put_attr.attr,
> +       &barn_put_fail_attr.attr,
>  #endif
>  #ifdef CONFIG_FAILSLAB
>         &failslab_attr.attr,
>
> --
> 2.50.1
>


* Re: [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves
  2025-07-23 13:34 ` [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
@ 2025-08-22 16:28   ` Suren Baghdasaryan
  2025-08-26  9:32     ` Vlastimil Babka
  0 siblings, 1 reply; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-22 16:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Make testing work for the slab and rcu changes that have come in with
> the sheaves work.
>
> This only works with one kmem_cache, and only the first one used.
> Subsequent setting of kmem_cache will not update the active kmem_cache
> and will be silently dropped because there are other tests which happen
> after the kmem_cache of interest is set.
>
> The saved active kmem_cache is used in the rcu callback, which passes
> the object to be freed.
>
> The rcu call takes the rcu_head, which is passed in as the field in the
> struct (in this case rcu in the maple tree node). The offset of that field
> is computed by pointer math and saved (in a global variable) so the node
> pointer can be restored in the callback after the rcu grace period
> expires.
>
> Don't use any of this outside of testing, please.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Couple nits but otherwise LGTM.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  tools/include/linux/slab.h            | 41 ++++++++++++++++++++++++++++++++---
>  tools/testing/shared/linux.c          | 24 ++++++++++++++++----
>  tools/testing/shared/linux/rcupdate.h | 22 +++++++++++++++++++
>  3 files changed, 80 insertions(+), 7 deletions(-)
>
> diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
> index c87051e2b26f5a7fee0362697fae067076b8e84d..d1444e79f2685edb828adbce8b3fbb500c0f8844 100644
> --- a/tools/include/linux/slab.h
> +++ b/tools/include/linux/slab.h
> @@ -23,6 +23,12 @@ enum slab_state {
>         FULL
>  };
>
> +struct kmem_cache_args {
> +       unsigned int align;
> +       unsigned int sheaf_capacity;
> +       void (*ctor)(void *);
> +};
> +
>  static inline void *kzalloc(size_t size, gfp_t gfp)
>  {
>         return kmalloc(size, gfp | __GFP_ZERO);
> @@ -37,9 +43,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
>  }
>  void kmem_cache_free(struct kmem_cache *cachep, void *objp);
>
> -struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
> -                       unsigned int align, unsigned int flags,
> -                       void (*ctor)(void *));
> +
> +struct kmem_cache *
> +__kmem_cache_create_args(const char *name, unsigned int size,
> +               struct kmem_cache_args *args, unsigned int flags);
> +
> +/* If NULL is passed for @args, use this variant with default arguments. */
> +static inline struct kmem_cache *
> +__kmem_cache_default_args(const char *name, unsigned int size,
> +               struct kmem_cache_args *args, unsigned int flags)
> +{
> +       struct kmem_cache_args kmem_default_args = {};
> +
> +       return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
> +}
> +
> +static inline struct kmem_cache *
> +__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> +               unsigned int flags, void (*ctor)(void *))
> +{
> +       struct kmem_cache_args kmem_args = {
> +               .align  = align,
> +               .ctor   = ctor,
> +       };
> +
> +       return __kmem_cache_create_args(name, size, &kmem_args, flags);
> +}
> +
> +#define kmem_cache_create(__name, __object_size, __args, ...)           \
> +       _Generic((__args),                                              \
> +               struct kmem_cache_args *: __kmem_cache_create_args,     \
> +               void *: __kmem_cache_default_args,                      \
> +               default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
>
>  void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
>  int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index 0f97fb0d19e19c327aa4843a35b45cc086f4f366..f998555a1b2af4a899a468a652b04622df459ed3 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -20,6 +20,7 @@ struct kmem_cache {
>         pthread_mutex_t lock;
>         unsigned int size;
>         unsigned int align;
> +       unsigned int sheaf_capacity;
>         int nr_objs;
>         void *objs;
>         void (*ctor)(void *);
> @@ -31,6 +32,8 @@ struct kmem_cache {
>         void *private;
>  };
>
> +static struct kmem_cache *kmem_active = NULL;
> +
>  void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
>  {
>         cachep->callback = callback;
> @@ -147,6 +150,14 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
>         pthread_mutex_unlock(&cachep->lock);
>  }
>
> +void kmem_cache_free_active(void *objp)
> +{
> +       if (!kmem_active)
> +               printf("WARNING: No active kmem_cache\n");
> +
> +       kmem_cache_free(kmem_active, objp);
> +}
> +
>  void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
>  {
>         if (kmalloc_verbose)
> @@ -234,23 +245,28 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>  }
>
>  struct kmem_cache *
> -kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> -               unsigned int flags, void (*ctor)(void *))
> +__kmem_cache_create_args(const char *name, unsigned int size,
> +                         struct kmem_cache_args *args,
> +                         unsigned int flags)
>  {
>         struct kmem_cache *ret = malloc(sizeof(*ret));
>
>         pthread_mutex_init(&ret->lock, NULL);
>         ret->size = size;
> -       ret->align = align;
> +       ret->align = args->align;
> +       ret->sheaf_capacity = args->sheaf_capacity;
>         ret->nr_objs = 0;
>         ret->nr_allocated = 0;
>         ret->nr_tallocated = 0;
>         ret->objs = NULL;
> -       ret->ctor = ctor;
> +       ret->ctor = args->ctor;
>         ret->non_kernel = 0;
>         ret->exec_callback = false;
>         ret->callback = NULL;
>         ret->private = NULL;
> +       if (!kmem_active)
> +               kmem_active = ret;
> +
>         return ret;
>  }
>
> diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
> index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
> --- a/tools/testing/shared/linux/rcupdate.h
> +++ b/tools/testing/shared/linux/rcupdate.h
> @@ -9,4 +9,26 @@
>  #define rcu_dereference_check(p, cond) rcu_dereference(p)
>  #define RCU_INIT_POINTER(p, v) do { (p) = (v); } while (0)
>
> +void kmem_cache_free_active(void *objp);
> +static unsigned long kfree_cb_offset = 0;
> +
> +static inline void kfree_rcu_cb(struct rcu_head *head)
> +{
> +       void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
> +
> +       kmem_cache_free_active(objp);
> +}
> +
> +#ifndef offsetof
> +#define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
> +#endif
> +

We need a comment here that concurrent kfree_rcu() calls are not
supported because they would overwrite each other's kfree_cb_offset.
Kinda obvious but I think unusual limitations should be explicitly
called out.
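
Maybe something like (wording up to you):

	/*
	 * Note: only a single rcu_head offset is supported; concurrent
	 * kfree_rcu() users with different struct layouts would overwrite
	 * kfree_cb_offset. Good enough for testing, not for general use.
	 */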

> +#define kfree_rcu(ptr, rhv)                                            \
> +do {                                                                   \
> +       if (!kfree_cb_offset)                                           \
> +               kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);        \
> +                                                                       \
> +       call_rcu(&ptr->rhv, kfree_rcu_cb);                              \
> +} while (0)

Any specific reason kfree_rcu() is a macro and not a static inline function?

> +
>  #endif
>
> --
> 2.50.1
>


* Re: [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure
  2025-07-23 13:34 ` [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure Vlastimil Babka
@ 2025-08-22 16:56   ` Suren Baghdasaryan
  2025-08-26  9:59     ` Vlastimil Babka
  0 siblings, 1 reply; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-22 16:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Allocate a sheaf and fill it to the requested count. It does not fill to
> the sheaf limit, so that incorrect allocation requests can be detected.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  tools/include/linux/slab.h   | 24 +++++++++++++
>  tools/testing/shared/linux.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 108 insertions(+)
>
> diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
> index d1444e79f2685edb828adbce8b3fbb500c0f8844..1962d7f1abee154e1cda5dba28aef213088dd198 100644
> --- a/tools/include/linux/slab.h
> +++ b/tools/include/linux/slab.h
> @@ -23,6 +23,13 @@ enum slab_state {
>         FULL
>  };
>
> +struct slab_sheaf {
> +       struct kmem_cache *cache;
> +       unsigned int size;
> +       unsigned int capacity;
> +       void *objects[];
> +};
> +
>  struct kmem_cache_args {
>         unsigned int align;
>         unsigned int sheaf_capacity;
> @@ -80,4 +87,21 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
>  int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>                           void **list);
>
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
> +
> +void *
> +kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf *sheaf);
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf *sheaf);
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf **sheafp, unsigned int size);
> +
> +static inline unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
> +{
> +       return sheaf->size;
> +}
> +
>  #endif         /* _TOOLS_SLAB_H */
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index f998555a1b2af4a899a468a652b04622df459ed3..e0255f53159bd3a1325d49192283dd6790a5e3b8 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -181,6 +181,12 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>         if (kmalloc_verbose)
>                 pr_debug("Bulk alloc %zu\n", size);
>
> +       if (cachep->exec_callback) {
> +               if (cachep->callback)
> +                       cachep->callback(cachep->private);
> +               cachep->exec_callback = false;
> +       }
> +
>         pthread_mutex_lock(&cachep->lock);
>         if (cachep->nr_objs >= size) {
>                 struct radix_tree_node *node;
> @@ -270,6 +276,84 @@ __kmem_cache_create_args(const char *name, unsigned int size,
>         return ret;
>  }
>
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
> +{
> +       struct slab_sheaf *sheaf;
> +       unsigned int capacity;
> +
> +       if (size > s->sheaf_capacity)
> +               capacity = size;
> +       else
> +               capacity = s->sheaf_capacity;

nit:
capacity = max(size, s->sheaf_capacity);

> +
> +       sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * s->sheaf_capacity * capacity);

Should this really be `sizeof(void *) * s->sheaf_capacity * capacity`
or just `sizeof(void *) * capacity` ?
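
i.e. presumably:

	sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * capacity);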


> +       if (!sheaf) {
> +               return NULL;
> +       }
> +
> +       memset(sheaf, 0, size);
> +       sheaf->cache = s;
> +       sheaf->capacity = capacity;
> +       sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
> +       if (!sheaf->size) {
> +               free(sheaf);
> +               return NULL;
> +       }
> +
> +       return sheaf;
> +}
> +
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                struct slab_sheaf **sheafp, unsigned int size)
> +{
> +       struct slab_sheaf *sheaf = *sheafp;
> +       int refill;
> +
> +       if (sheaf->size >= size)
> +               return 0;
> +
> +       if (size > sheaf->capacity) {
> +               sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
> +               if (!sheaf)
> +                       return -ENOMEM;
> +
> +               kmem_cache_return_sheaf(s, gfp, *sheafp);
> +               *sheafp = sheaf;
> +               return 0;
> +       }
> +
> +       refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
> +                                      &sheaf->objects[sheaf->size]);
> +       if (!refill)
> +               return -ENOMEM;
> +
> +       sheaf->size += refill;
> +       return 0;
> +}
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                struct slab_sheaf *sheaf)
> +{
> +       if (sheaf->size) {
> +               //s->non_kernel += sheaf->size;

Above comment seems obsolete.

> +               kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
> +       }
> +       free(sheaf);
> +}
> +
> +void *
> +kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf *sheaf)
> +{
> +       if (sheaf->size == 0) {
> +               printf("Nothing left in sheaf!\n");
> +               return NULL;
> +       }
> +

Should we clear sheaf->objects[sheaf->size] for additional safety?
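
e.g. something like (untested):

	void *objp = sheaf->objects[--sheaf->size];

	sheaf->objects[sheaf->size] = NULL;
	return objp;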

> +       return sheaf->objects[--sheaf->size];
> +}
> +
>  /*
>   * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
>   */
>
> --
> 2.50.1
>


* Re: [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves
  2025-07-23 13:34 ` [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
@ 2025-08-22 19:58   ` Suren Baghdasaryan
  2025-08-25  6:52   ` Harry Yoo
  1 sibling, 0 replies; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-22 19:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Currently allocations asking for a specific node explicitly or via
> mempolicy in strict_numa node bypass percpu sheaves. Since sheaves
> contain mostly local objects, we can try allocating from them if the
> local node happens to be the requested node or allowed by the mempolicy.
> If we find the object from percpu sheaves is not from the expected node,
> we skip the sheaves - this should be rare.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 50fc35b8fc9b3101821c338e9469c134677ded51..b98983b8d2e3e04ea256d91efcf0215ff0ae7e38 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4765,18 +4765,42 @@ __pcs_handle_empty(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t
>  }
>
>  static __fastpath_inline
> -void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>  {
>         struct slub_percpu_sheaves *pcs;
>         void *object;
>
>  #ifdef CONFIG_NUMA
> -       if (static_branch_unlikely(&strict_numa)) {
> -               if (current->mempolicy)
> -                       return NULL;
> +       if (static_branch_unlikely(&strict_numa) &&
> +                        node == NUMA_NO_NODE) {
> +
> +               struct mempolicy *mpol = current->mempolicy;
> +
> +               if (mpol) {
> +                       /*
> +                        * Special BIND rule support. If the local node
> +                        * is in permitted set then do not redirect
> +                        * to a particular node.
> +                        * Otherwise we apply the memory policy to get
> +                        * the node we need to allocate on.
> +                        */
> +                       if (mpol->mode != MPOL_BIND ||
> +                                       !node_isset(numa_mem_id(), mpol->nodes))
> +
> +                               node = mempolicy_slab_node();
> +               }
>         }
>  #endif
>
> +       if (unlikely(node != NUMA_NO_NODE)) {

Should this and the later (node != NUMA_NO_NODE) checks still be under
#ifdef CONFIG_NUMA?
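
If we want !CONFIG_NUMA builds to drop these branches, one alternative to
adding more #ifdefs would be IS_ENABLED() - not part of the patch, just an
untested sketch:

	if (IS_ENABLED(CONFIG_NUMA) && unlikely(node != NUMA_NO_NODE)) {
		/* sheaves are expected to hold mostly local objects */
		if (node != numa_mem_id())
			return NULL;
	}

and similarly for the later folio_nid() check, so non-NUMA configs compile
both of them out.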

> +               /*
> +                * We assume the percpu sheaves contain only local objects
> +                * although it's not completely guaranteed, so we verify later.
> +                */
> +               if (node != numa_mem_id())
> +                       return NULL;
> +       }
> +
>         if (!local_trylock(&s->cpu_sheaves->lock))
>                 return NULL;
>
> @@ -4788,7 +4812,21 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>                         return NULL;
>         }
>
> -       object = pcs->main->objects[--pcs->main->size];
> +       object = pcs->main->objects[pcs->main->size - 1];
> +
> +       if (unlikely(node != NUMA_NO_NODE)) {
> +               /*
> +                * Verify that the object was from the node we want. This could
> +                * be false because of cpu migration during an unlocked part of
> +                * the current allocation or previous freeing process.
> +                */
> +               if (folio_nid(virt_to_folio(object)) != node) {
> +                       local_unlock(&s->cpu_sheaves->lock);
> +                       return NULL;
> +               }
> +       }
> +
> +       pcs->main->size--;
>
>         local_unlock(&s->cpu_sheaves->lock);
>
> @@ -4888,8 +4926,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>         if (unlikely(object))
>                 goto out;
>
> -       if (s->cpu_sheaves && node == NUMA_NO_NODE)
> -               object = alloc_from_pcs(s, gfpflags);
> +       if (s->cpu_sheaves)
> +               object = alloc_from_pcs(s, gfpflags, node);
>
>         if (!object)
>                 object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 12/14] maple_tree: Sheaf conversion
  2025-07-23 13:34 ` [PATCH v5 12/14] maple_tree: Sheaf conversion Vlastimil Babka
@ 2025-08-22 20:18   ` Suren Baghdasaryan
  2025-08-26 14:22     ` Liam R. Howlett
  0 siblings, 1 reply; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-22 20:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>
> Use sheaves instead of bulk allocations.  This should speed up the
> allocations and the return path of unused allocations.

Nice cleanup!

>
> Remove push/pop of nodes from maple state.
> Remove unnecessary testing
> ifdef out other testing that probably will be deleted

Should we simply remove the ifdef'd-out tests now if they are unused?

> Fix testcase for testing race
> Move some testing around in the same commit.

Would it be possible to split the test changes from the kernel changes
into a separate patch? The kernel part looks good to me, but I don't know
enough about these tests to vote on them.

>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/maple_tree.h       |   6 +-
>  lib/maple_tree.c                 | 331 ++++----------------
>  lib/test_maple_tree.c            |   8 +
>  tools/testing/radix-tree/maple.c | 632 +++++++--------------------------------
>  tools/testing/shared/linux.c     |   8 +-
>  5 files changed, 185 insertions(+), 800 deletions(-)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index 9ef1290382249462d73ae72435dada7ce4b0622c..3cf1ae9dde7ce43fa20ae400c01fefad048c302e 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -442,7 +442,8 @@ struct ma_state {
>         struct maple_enode *node;       /* The node containing this entry */
>         unsigned long min;              /* The minimum index of this node - implied pivot min */
>         unsigned long max;              /* The maximum index of this node - implied pivot max */
> -       struct maple_alloc *alloc;      /* Allocated nodes for this operation */
> +       struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> +       unsigned long node_request;
>         enum maple_status status;       /* The status of the state (active, start, none, etc) */
>         unsigned char depth;            /* depth of tree descent during write */
>         unsigned char offset;
> @@ -490,7 +491,8 @@ struct ma_wr_state {
>                 .status = ma_start,                                     \
>                 .min = 0,                                               \
>                 .max = ULONG_MAX,                                       \
> -               .alloc = NULL,                                          \
> +               .node_request= 0,                                       \
> +               .sheaf = NULL,                                          \
>                 .mas_flags = 0,                                         \
>                 .store_type = wr_invalid,                               \
>         }
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 82f39fe29a462aa3c779789a28efdd6cdef64c79..3c3c14a76d98ded3b619c178d64099b464a2ca23 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -198,6 +198,22 @@ static void mt_free_rcu(struct rcu_head *head)
>         kmem_cache_free(maple_node_cache, node);
>  }
>
> +static void mt_return_sheaf(struct slab_sheaf *sheaf)
> +{
> +       kmem_cache_return_sheaf(maple_node_cache, GFP_KERNEL, sheaf);
> +}
> +
> +static struct slab_sheaf *mt_get_sheaf(gfp_t gfp, int count)
> +{
> +       return kmem_cache_prefill_sheaf(maple_node_cache, gfp, count);
> +}
> +
> +static int mt_refill_sheaf(gfp_t gfp, struct slab_sheaf **sheaf,
> +               unsigned int size)
> +{
> +       return kmem_cache_refill_sheaf(maple_node_cache, gfp, sheaf, size);
> +}
> +
>  /*
>   * ma_free_rcu() - Use rcu callback to free a maple node
>   * @node: The node to free
> @@ -590,67 +606,6 @@ static __always_inline bool mte_dead_node(const struct maple_enode *enode)
>         return ma_dead_node(node);
>  }
>
> -/*
> - * mas_allocated() - Get the number of nodes allocated in a maple state.
> - * @mas: The maple state
> - *
> - * The ma_state alloc member is overloaded to hold a pointer to the first
> - * allocated node or to the number of requested nodes to allocate.  If bit 0 is
> - * set, then the alloc contains the number of requested nodes.  If there is an
> - * allocated node, then the total allocated nodes is in that node.
> - *
> - * Return: The total number of nodes allocated
> - */
> -static inline unsigned long mas_allocated(const struct ma_state *mas)
> -{
> -       if (!mas->alloc || ((unsigned long)mas->alloc & 0x1))
> -               return 0;
> -
> -       return mas->alloc->total;
> -}
> -
> -/*
> - * mas_set_alloc_req() - Set the requested number of allocations.
> - * @mas: the maple state
> - * @count: the number of allocations.
> - *
> - * The requested number of allocations is either in the first allocated node,
> - * located in @mas->alloc->request_count, or directly in @mas->alloc if there is
> - * no allocated node.  Set the request either in the node or do the necessary
> - * encoding to store in @mas->alloc directly.
> - */
> -static inline void mas_set_alloc_req(struct ma_state *mas, unsigned long count)
> -{
> -       if (!mas->alloc || ((unsigned long)mas->alloc & 0x1)) {
> -               if (!count)
> -                       mas->alloc = NULL;
> -               else
> -                       mas->alloc = (struct maple_alloc *)(((count) << 1U) | 1U);
> -               return;
> -       }
> -
> -       mas->alloc->request_count = count;
> -}
> -
> -/*
> - * mas_alloc_req() - get the requested number of allocations.
> - * @mas: The maple state
> - *
> - * The alloc count is either stored directly in @mas, or in
> - * @mas->alloc->request_count if there is at least one node allocated.  Decode
> - * the request count if it's stored directly in @mas->alloc.
> - *
> - * Return: The allocation request count.
> - */
> -static inline unsigned int mas_alloc_req(const struct ma_state *mas)
> -{
> -       if ((unsigned long)mas->alloc & 0x1)
> -               return (unsigned long)(mas->alloc) >> 1;
> -       else if (mas->alloc)
> -               return mas->alloc->request_count;
> -       return 0;
> -}
> -
>  /*
>   * ma_pivots() - Get a pointer to the maple node pivots.
>   * @node: the maple node
> @@ -1148,77 +1103,15 @@ static int mas_ascend(struct ma_state *mas)
>   */
>  static inline struct maple_node *mas_pop_node(struct ma_state *mas)
>  {
> -       struct maple_alloc *ret, *node = mas->alloc;
> -       unsigned long total = mas_allocated(mas);
> -       unsigned int req = mas_alloc_req(mas);
> +       struct maple_node *ret;
>
> -       /* nothing or a request pending. */
> -       if (WARN_ON(!total))
> +       if (WARN_ON_ONCE(!mas->sheaf))
>                 return NULL;
>
> -       if (total == 1) {
> -               /* single allocation in this ma_state */
> -               mas->alloc = NULL;
> -               ret = node;
> -               goto single_node;
> -       }
> -
> -       if (node->node_count == 1) {
> -               /* Single allocation in this node. */
> -               mas->alloc = node->slot[0];
> -               mas->alloc->total = node->total - 1;
> -               ret = node;
> -               goto new_head;
> -       }
> -       node->total--;
> -       ret = node->slot[--node->node_count];
> -       node->slot[node->node_count] = NULL;
> -
> -single_node:
> -new_head:
> -       if (req) {
> -               req++;
> -               mas_set_alloc_req(mas, req);
> -       }
> -
> +       ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
>         memset(ret, 0, sizeof(*ret));
> -       return (struct maple_node *)ret;
> -}
> -
> -/*
> - * mas_push_node() - Push a node back on the maple state allocation.
> - * @mas: The maple state
> - * @used: The used maple node
> - *
> - * Stores the maple node back into @mas->alloc for reuse.  Updates allocated and
> - * requested node count as necessary.
> - */
> -static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
> -{
> -       struct maple_alloc *reuse = (struct maple_alloc *)used;
> -       struct maple_alloc *head = mas->alloc;
> -       unsigned long count;
> -       unsigned int requested = mas_alloc_req(mas);
> -
> -       count = mas_allocated(mas);
>
> -       reuse->request_count = 0;
> -       reuse->node_count = 0;
> -       if (count) {
> -               if (head->node_count < MAPLE_ALLOC_SLOTS) {
> -                       head->slot[head->node_count++] = reuse;
> -                       head->total++;
> -                       goto done;
> -               }
> -               reuse->slot[0] = head;
> -               reuse->node_count = 1;
> -       }
> -
> -       reuse->total = count + 1;
> -       mas->alloc = reuse;
> -done:
> -       if (requested > 1)
> -               mas_set_alloc_req(mas, requested - 1);
> +       return ret;
>  }
>
>  /*
> @@ -1228,75 +1121,32 @@ static inline void mas_push_node(struct ma_state *mas, struct maple_node *used)
>   */
>  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>  {
> -       struct maple_alloc *node;
> -       unsigned long allocated = mas_allocated(mas);
> -       unsigned int requested = mas_alloc_req(mas);
> -       unsigned int count;
> -       void **slots = NULL;
> -       unsigned int max_req = 0;
> -
> -       if (!requested)
> -               return;
> +       if (unlikely(mas->sheaf)) {
> +               unsigned long refill = mas->node_request;
>
> -       mas_set_alloc_req(mas, 0);
> -       if (mas->mas_flags & MA_STATE_PREALLOC) {
> -               if (allocated)
> +               if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
> +                       mas->node_request = 0;
>                         return;
> -               WARN_ON(!allocated);
> -       }
> -
> -       if (!allocated || mas->alloc->node_count == MAPLE_ALLOC_SLOTS) {
> -               node = (struct maple_alloc *)mt_alloc_one(gfp);
> -               if (!node)
> -                       goto nomem_one;
> -
> -               if (allocated) {
> -                       node->slot[0] = mas->alloc;
> -                       node->node_count = 1;
> -               } else {
> -                       node->node_count = 0;
>                 }
>
> -               mas->alloc = node;
> -               node->total = ++allocated;
> -               node->request_count = 0;
> -               requested--;
> -       }
> +               if (mt_refill_sheaf(gfp, &mas->sheaf, refill))
> +                       goto error;
>
> -       node = mas->alloc;
> -       while (requested) {
> -               max_req = MAPLE_ALLOC_SLOTS - node->node_count;
> -               slots = (void **)&node->slot[node->node_count];
> -               max_req = min(requested, max_req);
> -               count = mt_alloc_bulk(gfp, max_req, slots);
> -               if (!count)
> -                       goto nomem_bulk;
> -
> -               if (node->node_count == 0) {
> -                       node->slot[0]->node_count = 0;
> -                       node->slot[0]->request_count = 0;
> -               }
> +               mas->node_request = 0;
> +               return;
> +       }
>
> -               node->node_count += count;
> -               allocated += count;
> -               /* find a non-full node*/
> -               do {
> -                       node = node->slot[0];
> -               } while (unlikely(node->node_count == MAPLE_ALLOC_SLOTS));
> -               requested -= count;
> +       mas->sheaf = mt_get_sheaf(gfp, mas->node_request);
> +       if (likely(mas->sheaf)) {
> +               mas->node_request = 0;
> +               return;
>         }
> -       mas->alloc->total = allocated;
> -       return;
>
> -nomem_bulk:
> -       /* Clean up potential freed allocations on bulk failure */
> -       memset(slots, 0, max_req * sizeof(unsigned long));
> -       mas->alloc->total = allocated;
> -nomem_one:
> -       mas_set_alloc_req(mas, requested);
> +error:
>         mas_set_err(mas, -ENOMEM);
>  }
>
> +
>  /*
>   * mas_free() - Free an encoded maple node
>   * @mas: The maple state
> @@ -1307,42 +1157,7 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>   */
>  static inline void mas_free(struct ma_state *mas, struct maple_enode *used)
>  {
> -       struct maple_node *tmp = mte_to_node(used);
> -
> -       if (mt_in_rcu(mas->tree))
> -               ma_free_rcu(tmp);
> -       else
> -               mas_push_node(mas, tmp);
> -}
> -
> -/*
> - * mas_node_count_gfp() - Check if enough nodes are allocated and request more
> - * if there is not enough nodes.
> - * @mas: The maple state
> - * @count: The number of nodes needed
> - * @gfp: the gfp flags
> - */
> -static void mas_node_count_gfp(struct ma_state *mas, int count, gfp_t gfp)
> -{
> -       unsigned long allocated = mas_allocated(mas);
> -
> -       if (allocated < count) {
> -               mas_set_alloc_req(mas, count - allocated);
> -               mas_alloc_nodes(mas, gfp);
> -       }
> -}
> -
> -/*
> - * mas_node_count() - Check if enough nodes are allocated and request more if
> - * there is not enough nodes.
> - * @mas: The maple state
> - * @count: The number of nodes needed
> - *
> - * Note: Uses GFP_NOWAIT | __GFP_NOWARN for gfp flags.
> - */
> -static void mas_node_count(struct ma_state *mas, int count)
> -{
> -       return mas_node_count_gfp(mas, count, GFP_NOWAIT | __GFP_NOWARN);
> +       ma_free_rcu(mte_to_node(used));
>  }
>
>  /*
> @@ -2517,10 +2332,7 @@ static inline void mas_topiary_node(struct ma_state *mas,
>         enode = tmp_mas->node;
>         tmp = mte_to_node(enode);
>         mte_set_node_dead(enode);
> -       if (in_rcu)
> -               ma_free_rcu(tmp);
> -       else
> -               mas_push_node(mas, tmp);
> +       ma_free_rcu(tmp);
>  }
>
>  /*
> @@ -4168,7 +3980,7 @@ static inline void mas_wr_prealloc_setup(struct ma_wr_state *wr_mas)
>   *
>   * Return: Number of nodes required for preallocation.
>   */
> -static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
> +static inline void mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
>  {
>         struct ma_state *mas = wr_mas->mas;
>         unsigned char height = mas_mt_height(mas);
> @@ -4214,7 +4026,7 @@ static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
>                 WARN_ON_ONCE(1);
>         }
>
> -       return ret;
> +       mas->node_request = ret;
>  }
>
>  /*
> @@ -4275,15 +4087,15 @@ static inline enum store_type mas_wr_store_type(struct ma_wr_state *wr_mas)
>   */
>  static inline void mas_wr_preallocate(struct ma_wr_state *wr_mas, void *entry)
>  {
> -       int request;
> +       struct ma_state *mas = wr_mas->mas;
>
>         mas_wr_prealloc_setup(wr_mas);
> -       wr_mas->mas->store_type = mas_wr_store_type(wr_mas);
> -       request = mas_prealloc_calc(wr_mas, entry);
> -       if (!request)
> +       mas->store_type = mas_wr_store_type(wr_mas);
> +       mas_prealloc_calc(wr_mas, entry);
> +       if (!mas->node_request)
>                 return;
>
> -       mas_node_count(wr_mas->mas, request);
> +       mas_alloc_nodes(mas, GFP_NOWAIT | __GFP_NOWARN);
>  }
>
>  /**
> @@ -5398,7 +5210,6 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
>   */
>  void *mas_store(struct ma_state *mas, void *entry)
>  {
> -       int request;
>         MA_WR_STATE(wr_mas, mas, entry);
>
>         trace_ma_write(__func__, mas, 0, entry);
> @@ -5428,11 +5239,11 @@ void *mas_store(struct ma_state *mas, void *entry)
>                 return wr_mas.content;
>         }
>
> -       request = mas_prealloc_calc(&wr_mas, entry);
> -       if (!request)
> +       mas_prealloc_calc(&wr_mas, entry);
> +       if (!mas->node_request)
>                 goto store;
>
> -       mas_node_count(mas, request);
> +       mas_alloc_nodes(mas, GFP_NOWAIT | __GFP_NOWARN);
>         if (mas_is_err(mas))
>                 return NULL;
>
> @@ -5520,26 +5331,25 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>  int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>  {
>         MA_WR_STATE(wr_mas, mas, entry);
> -       int ret = 0;
> -       int request;
>
>         mas_wr_prealloc_setup(&wr_mas);
>         mas->store_type = mas_wr_store_type(&wr_mas);
> -       request = mas_prealloc_calc(&wr_mas, entry);
> -       if (!request)
> -               return ret;
> +       mas_prealloc_calc(&wr_mas, entry);
> +       if (!mas->node_request)
> +               return 0;
>
> -       mas_node_count_gfp(mas, request, gfp);
> +       mas_alloc_nodes(mas, gfp);
>         if (mas_is_err(mas)) {
> -               mas_set_alloc_req(mas, 0);
> -               ret = xa_err(mas->node);
> +               int ret = xa_err(mas->node);
> +
> +               mas->node_request = 0;
>                 mas_destroy(mas);
>                 mas_reset(mas);
>                 return ret;
>         }
>
>         mas->mas_flags |= MA_STATE_PREALLOC;
> -       return ret;
> +       return 0;
>  }
>  EXPORT_SYMBOL_GPL(mas_preallocate);
>
> @@ -5553,9 +5363,6 @@ EXPORT_SYMBOL_GPL(mas_preallocate);
>   */
>  void mas_destroy(struct ma_state *mas)
>  {
> -       struct maple_alloc *node;
> -       unsigned long total;
> -
>         /*
>          * When using mas_for_each() to insert an expected number of elements,
>          * it is possible that the number inserted is less than the expected
> @@ -5576,21 +5383,11 @@ void mas_destroy(struct ma_state *mas)
>         }
>         mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
>
> -       total = mas_allocated(mas);
> -       while (total) {
> -               node = mas->alloc;
> -               mas->alloc = node->slot[0];
> -               if (node->node_count > 1) {
> -                       size_t count = node->node_count - 1;
> -
> -                       mt_free_bulk(count, (void __rcu **)&node->slot[1]);
> -                       total -= count;
> -               }
> -               mt_free_one(ma_mnode_ptr(node));
> -               total--;
> -       }
> +       mas->node_request = 0;
> +       if (mas->sheaf)
> +               mt_return_sheaf(mas->sheaf);
>
> -       mas->alloc = NULL;
> +       mas->sheaf = NULL;
>  }
>  EXPORT_SYMBOL_GPL(mas_destroy);
>
> @@ -5640,7 +5437,8 @@ int mas_expected_entries(struct ma_state *mas, unsigned long nr_entries)
>         /* Internal nodes */
>         nr_nodes += DIV_ROUND_UP(nr_nodes, nonleaf_cap);
>         /* Add working room for split (2 nodes) + new parents */
> -       mas_node_count_gfp(mas, nr_nodes + 3, GFP_KERNEL);
> +       mas->node_request = nr_nodes + 3;
> +       mas_alloc_nodes(mas, GFP_KERNEL);
>
>         /* Detect if allocations run out */
>         mas->mas_flags |= MA_STATE_PREALLOC;
> @@ -6276,7 +6074,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>                 mas_alloc_nodes(mas, gfp);
>         }
>
> -       if (!mas_allocated(mas))
> +       if (!mas->sheaf)
>                 return false;
>
>         mas->status = ma_start;
> @@ -7671,8 +7469,9 @@ void mas_dump(const struct ma_state *mas)
>
>         pr_err("[%u/%u] index=%lx last=%lx\n", mas->offset, mas->end,
>                mas->index, mas->last);
> -       pr_err("     min=%lx max=%lx alloc=" PTR_FMT ", depth=%u, flags=%x\n",
> -              mas->min, mas->max, mas->alloc, mas->depth, mas->mas_flags);
> +       pr_err("     min=%lx max=%lx sheaf=" PTR_FMT ", request %lu depth=%u, flags=%x\n",
> +              mas->min, mas->max, mas->sheaf, mas->node_request, mas->depth,
> +              mas->mas_flags);
>         if (mas->index > mas->last)
>                 pr_err("Check index & last\n");
>  }
> diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
> index 13e2a10d7554d6b1de5ffbda59f3a5bc4039a8c8..5549eb4200c7974e3bb457e0fd054c434e4b85da 100644
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -2746,6 +2746,7 @@ static noinline void __init check_fuzzer(struct maple_tree *mt)
>         mtree_test_erase(mt, ULONG_MAX - 10);
>  }
>
> +#if 0
>  /* duplicate the tree with a specific gap */
>  static noinline void __init check_dup_gaps(struct maple_tree *mt,
>                                     unsigned long nr_entries, bool zero_start,
> @@ -2770,6 +2771,7 @@ static noinline void __init check_dup_gaps(struct maple_tree *mt,
>                 mtree_store_range(mt, i*10, (i+1)*10 - gap,
>                                   xa_mk_value(i), GFP_KERNEL);
>
> +       mt_dump(mt, mt_dump_dec);
>         mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
>         mt_set_non_kernel(99999);
>         down_write(&newmt_lock);
> @@ -2779,9 +2781,12 @@ static noinline void __init check_dup_gaps(struct maple_tree *mt,
>
>         rcu_read_lock();
>         mas_for_each(&mas, tmp, ULONG_MAX) {
> +               printk("%lu nodes %lu\n", mas.index,
> +                      kmem_cache_sheaf_count(newmas.sheaf));
>                 newmas.index = mas.index;
>                 newmas.last = mas.last;
>                 mas_store(&newmas, tmp);
> +               mt_dump(&newmt, mt_dump_dec);
>         }
>         rcu_read_unlock();
>         mas_destroy(&newmas);
> @@ -2878,6 +2883,7 @@ static noinline void __init check_dup(struct maple_tree *mt)
>                 cond_resched();
>         }
>  }
> +#endif
>
>  static noinline void __init check_bnode_min_spanning(struct maple_tree *mt)
>  {
> @@ -4045,9 +4051,11 @@ static int __init maple_tree_seed(void)
>         check_fuzzer(&tree);
>         mtree_destroy(&tree);
>
> +#if 0
>         mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>         check_dup(&tree);
>         mtree_destroy(&tree);
> +#endif
>
>         mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>         check_bnode_min_spanning(&tree);
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index f6f923c9dc1039997953a94ec184c560b225c2d4..1bd789191f232385d69f2dd3e900bac99d8919ff 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -63,430 +63,6 @@ struct rcu_reader_struct {
>         struct rcu_test_struct2 *test;
>  };
>
> -static int get_alloc_node_count(struct ma_state *mas)
> -{
> -       int count = 1;
> -       struct maple_alloc *node = mas->alloc;
> -
> -       if (!node || ((unsigned long)node & 0x1))
> -               return 0;
> -       while (node->node_count) {
> -               count += node->node_count;
> -               node = node->slot[0];
> -       }
> -       return count;
> -}
> -
> -static void check_mas_alloc_node_count(struct ma_state *mas)
> -{
> -       mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 1, GFP_KERNEL);
> -       mas_node_count_gfp(mas, MAPLE_ALLOC_SLOTS + 3, GFP_KERNEL);
> -       MT_BUG_ON(mas->tree, get_alloc_node_count(mas) != mas->alloc->total);
> -       mas_destroy(mas);
> -}
> -
> -/*
> - * check_new_node() - Check the creation of new nodes and error path
> - * verification.
> - */
> -static noinline void __init check_new_node(struct maple_tree *mt)
> -{
> -
> -       struct maple_node *mn, *mn2, *mn3;
> -       struct maple_alloc *smn;
> -       struct maple_node *nodes[100];
> -       int i, j, total;
> -
> -       MA_STATE(mas, mt, 0, 0);
> -
> -       check_mas_alloc_node_count(&mas);
> -
> -       /* Try allocating 3 nodes */
> -       mtree_lock(mt);
> -       mt_set_non_kernel(0);
> -       /* request 3 nodes to be allocated. */
> -       mas_node_count(&mas, 3);
> -       /* Allocation request of 3. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 3);
> -       /* Allocate failed. */
> -       MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, mas.alloc == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
> -       mas_push_node(&mas, mn);
> -       mas_reset(&mas);
> -       mas_destroy(&mas);
> -       mtree_unlock(mt);
> -
> -
> -       /* Try allocating 1 node, then 2 more */
> -       mtree_lock(mt);
> -       /* Set allocation request to 1. */
> -       mas_set_alloc_req(&mas, 1);
> -       /* Check Allocation request of 1. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
> -       mas_set_err(&mas, -ENOMEM);
> -       /* Validate allocation request. */
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       /* Eat the requested node. */
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, mn->slot[0] != NULL);
> -       MT_BUG_ON(mt, mn->slot[1] != NULL);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -
> -       mn->parent = ma_parent_ptr(mn);
> -       ma_free_rcu(mn);
> -       mas.status = ma_start;
> -       mas_destroy(&mas);
> -       /* Allocate 3 nodes, will fail. */
> -       mas_node_count(&mas, 3);
> -       /* Drop the lock and allocate 3 nodes. */
> -       mas_nomem(&mas, GFP_KERNEL);
> -       /* Ensure 3 are allocated. */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       /* Allocation request of 0. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 0);
> -
> -       MT_BUG_ON(mt, mas.alloc == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
> -       /* Ensure we counted 3. */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       /* Free. */
> -       mas_reset(&mas);
> -       mas_destroy(&mas);
> -
> -       /* Set allocation request to 1. */
> -       mas_set_alloc_req(&mas, 1);
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
> -       mas_set_err(&mas, -ENOMEM);
> -       /* Validate allocation request. */
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 1);
> -       /* Check the node is only one node. */
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, mn->slot[0] != NULL);
> -       MT_BUG_ON(mt, mn->slot[1] != NULL);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count);
> -
> -       mas_set_alloc_req(&mas, 2); /* request 2 more. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 2);
> -       mas_set_err(&mas, -ENOMEM);
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 3);
> -       MT_BUG_ON(mt, mas.alloc == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[0] == NULL);
> -       MT_BUG_ON(mt, mas.alloc->slot[1] == NULL);
> -       for (i = 2; i >= 0; i--) {
> -               mn = mas_pop_node(&mas);
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i);
> -               MT_BUG_ON(mt, !mn);
> -               MT_BUG_ON(mt, not_empty(mn));
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -       }
> -
> -       total = 64;
> -       mas_set_alloc_req(&mas, total); /* request 2 more. */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != total);
> -       mas_set_err(&mas, -ENOMEM);
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       for (i = total; i > 0; i--) {
> -               unsigned int e = 0; /* expected node_count */
> -
> -               if (!MAPLE_32BIT) {
> -                       if (i >= 35)
> -                               e = i - 34;
> -                       else if (i >= 5)
> -                               e = i - 4;
> -                       else if (i >= 2)
> -                               e = i - 1;
> -               } else {
> -                       if (i >= 4)
> -                               e = i - 3;
> -                       else if (i >= 1)
> -                               e = i - 1;
> -                       else
> -                               e = 0;
> -               }
> -
> -               MT_BUG_ON(mt, mas.alloc->node_count != e);
> -               mn = mas_pop_node(&mas);
> -               MT_BUG_ON(mt, not_empty(mn));
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i - 1);
> -               MT_BUG_ON(mt, !mn);
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -       }
> -
> -       total = 100;
> -       for (i = 1; i < total; i++) {
> -               mas_set_alloc_req(&mas, i);
> -               mas_set_err(&mas, -ENOMEM);
> -               MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -               for (j = i; j > 0; j--) {
> -                       mn = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
> -                       MT_BUG_ON(mt, !mn);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mas_push_node(&mas, mn);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != j);
> -                       mn = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != j - 1);
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -
> -               mas_set_alloc_req(&mas, i);
> -               mas_set_err(&mas, -ENOMEM);
> -               MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -               for (j = 0; j <= i/2; j++) {
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
> -                       nodes[j] = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
> -               }
> -
> -               while (j) {
> -                       j--;
> -                       mas_push_node(&mas, nodes[j]);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i);
> -               for (j = 0; j <= i/2; j++) {
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j);
> -                       mn = mas_pop_node(&mas);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -                       MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1);
> -               }
> -               mas_reset(&mas);
> -               MT_BUG_ON(mt, mas_nomem(&mas, GFP_KERNEL));
> -               mas_destroy(&mas);
> -
> -       }
> -
> -       /* Set allocation request. */
> -       total = 500;
> -       mas_node_count(&mas, total);
> -       /* Drop the lock and allocate the nodes. */
> -       mas_nomem(&mas, GFP_KERNEL);
> -       MT_BUG_ON(mt, !mas.alloc);
> -       i = 1;
> -       smn = mas.alloc;
> -       while (i < total) {
> -               for (j = 0; j < MAPLE_ALLOC_SLOTS; j++) {
> -                       i++;
> -                       MT_BUG_ON(mt, !smn->slot[j]);
> -                       if (i == total)
> -                               break;
> -               }
> -               smn = smn->slot[0]; /* next. */
> -       }
> -       MT_BUG_ON(mt, mas_allocated(&mas) != total);
> -       mas_reset(&mas);
> -       mas_destroy(&mas); /* Free. */
> -
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       for (i = 1; i < 128; i++) {
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
> -               for (j = i; j > 0; j--) { /*Free the requests */
> -                       mn = mas_pop_node(&mas); /* get the next node. */
> -                       MT_BUG_ON(mt, mn == NULL);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       }
> -
> -       for (i = 1; i < MAPLE_NODE_MASK + 1; i++) {
> -               MA_STATE(mas2, mt, 0, 0);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               MT_BUG_ON(mt, mas_allocated(&mas) != i); /* check request filled */
> -               for (j = 1; j <= i; j++) { /* Move the allocations to mas2 */
> -                       mn = mas_pop_node(&mas); /* get the next node. */
> -                       MT_BUG_ON(mt, mn == NULL);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mas_push_node(&mas2, mn);
> -                       MT_BUG_ON(mt, mas_allocated(&mas2) != j);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -               MT_BUG_ON(mt, mas_allocated(&mas2) != i);
> -
> -               for (j = i; j > 0; j--) { /*Free the requests */
> -                       MT_BUG_ON(mt, mas_allocated(&mas2) != j);
> -                       mn = mas_pop_node(&mas2); /* get the next node. */
> -                       MT_BUG_ON(mt, mn == NULL);
> -                       MT_BUG_ON(mt, not_empty(mn));
> -                       mn->parent = ma_parent_ptr(mn);
> -                       ma_free_rcu(mn);
> -               }
> -               MT_BUG_ON(mt, mas_allocated(&mas2) != 0);
> -       }
> -
> -
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
> -       MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
> -
> -       mn = mas_pop_node(&mas); /* get the next node. */
> -       MT_BUG_ON(mt, mn == NULL);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS);
> -       MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS - 1);
> -
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count != MAPLE_ALLOC_SLOTS);
> -
> -       /* Check the limit of pop/push/pop */
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 2); /* Request */
> -       MT_BUG_ON(mt, mas_alloc_req(&mas) != 1);
> -       MT_BUG_ON(mt, mas.node != MA_ERROR(-ENOMEM));
> -       MT_BUG_ON(mt, !mas_nomem(&mas, GFP_KERNEL));
> -       MT_BUG_ON(mt, mas_alloc_req(&mas));
> -       MT_BUG_ON(mt, mas.alloc->node_count != 1);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       MT_BUG_ON(mt, mas.alloc->node_count  != MAPLE_ALLOC_SLOTS);
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas.alloc->node_count != 1);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 2);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, not_empty(mn));
> -       mn->parent = ma_parent_ptr(mn);
> -       ma_free_rcu(mn);
> -       for (i = 1; i <= MAPLE_ALLOC_SLOTS + 1; i++) {
> -               mn = mas_pop_node(&mas);
> -               MT_BUG_ON(mt, not_empty(mn));
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -       }
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 0);
> -
> -
> -       for (i = 3; i < MAPLE_NODE_MASK * 3; i++) {
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mas_push_node(&mas, mn); /* put it back */
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn2 = mas_pop_node(&mas); /* get the next node. */
> -               mas_push_node(&mas, mn); /* put them back */
> -               mas_push_node(&mas, mn2);
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn2 = mas_pop_node(&mas); /* get the next node. */
> -               mn3 = mas_pop_node(&mas); /* get the next node. */
> -               mas_push_node(&mas, mn); /* put them back */
> -               mas_push_node(&mas, mn2);
> -               mas_push_node(&mas, mn3);
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mas_destroy(&mas);
> -
> -               mas.node = MA_ERROR(-ENOMEM);
> -               mas_node_count(&mas, i); /* Request */
> -               mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mn = mas_pop_node(&mas); /* get the next node. */
> -               mn->parent = ma_parent_ptr(mn);
> -               ma_free_rcu(mn);
> -               mas_destroy(&mas);
> -       }
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, 5); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 5);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, 10); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 10);
> -       mas_destroy(&mas);
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS - 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS - 1);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, 10 + MAPLE_ALLOC_SLOTS - 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != 10 + MAPLE_ALLOC_SLOTS - 1);
> -       mas_destroy(&mas);
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS + 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS + 1);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 2); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 2);
> -       mas_destroy(&mas);
> -
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 2 + 1); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 2 + 1);
> -       mas.node = MA_ERROR(-ENOMEM);
> -       mas_node_count(&mas, MAPLE_ALLOC_SLOTS * 3 + 2); /* Request */
> -       mas_nomem(&mas, GFP_KERNEL); /* Fill request */
> -       mas.status = ma_start;
> -       MT_BUG_ON(mt, mas_allocated(&mas) != MAPLE_ALLOC_SLOTS * 3 + 2);
> -       mas_destroy(&mas);
> -
> -       mtree_unlock(mt);
> -}
> -
>  /*
>   * Check erasing including RCU.
>   */
> @@ -35458,8 +35034,7 @@ static void check_dfs_preorder(struct maple_tree *mt)
>         mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
>         mas_reset(&mas);
>         mt_zero_nr_tallocated();
> -       mt_set_non_kernel(200);
> -       mas_expected_entries(&mas, max);
> +       mt_set_non_kernel(1000);
>         for (count = 0; count <= max; count++) {
>                 mas.index = mas.last = count;
>                 mas_store(&mas, xa_mk_value(count));
> @@ -35524,6 +35099,13 @@ static unsigned char get_vacant_height(struct ma_wr_state *wr_mas, void *entry)
>         return vacant_height;
>  }
>
> +static int mas_allocated(struct ma_state *mas)
> +{
> +       if (mas->sheaf)
> +               return kmem_cache_sheaf_size(mas->sheaf);
> +
> +       return 0;
> +}
>  /* Preallocation testing */
>  static noinline void __init check_prealloc(struct maple_tree *mt)
>  {
> @@ -35533,8 +35115,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>         unsigned char vacant_height;
>         struct maple_node *mn;
>         void *ptr = check_prealloc;
> +       struct ma_wr_state wr_mas;
>         MA_STATE(mas, mt, 10, 20);
> -       MA_WR_STATE(wr_mas, &mas, ptr);
>
>         mt_set_non_kernel(1000);
>         for (i = 0; i <= max; i++)
> @@ -35542,7 +35124,11 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>
>         /* Spanning store */
>         mas_set_range(&mas, 470, 500);
> -       MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
> +       wr_mas.mas = &mas;
> +
> +       mas_wr_preallocate(&wr_mas, ptr);
> +       MT_BUG_ON(mt, mas.store_type != wr_spanning_store);
> +       MT_BUG_ON(mt, mas_is_err(&mas));
>         allocated = mas_allocated(&mas);
>         height = mas_mt_height(&mas);
>         vacant_height = get_vacant_height(&wr_mas, ptr);
> @@ -35552,6 +35138,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>         allocated = mas_allocated(&mas);
>         MT_BUG_ON(mt, allocated != 0);
>
> +       mas_wr_preallocate(&wr_mas, ptr);
>         MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
>         allocated = mas_allocated(&mas);
>         height = mas_mt_height(&mas);
> @@ -35592,20 +35179,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
>         mn->parent = ma_parent_ptr(mn);
>         ma_free_rcu(mn);
>
> -       MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
> -       allocated = mas_allocated(&mas);
> -       height = mas_mt_height(&mas);
> -       vacant_height = get_vacant_height(&wr_mas, ptr);
> -       MT_BUG_ON(mt, allocated != 1 + (height - vacant_height) * 3);
> -       mn = mas_pop_node(&mas);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
> -       mas_push_node(&mas, mn);
> -       MT_BUG_ON(mt, mas_allocated(&mas) != allocated);
> -       MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
> -       mas_destroy(&mas);
> -       allocated = mas_allocated(&mas);
> -       MT_BUG_ON(mt, allocated != 0);
> -
>         MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
>         allocated = mas_allocated(&mas);
>         height = mas_mt_height(&mas);
> @@ -36394,11 +35967,17 @@ static void check_nomem_writer_race(struct maple_tree *mt)
>         check_load(mt, 6, xa_mk_value(0xC));
>         mtree_unlock(mt);
>
> +       mt_set_non_kernel(0);
>         /* test for the same race but with mas_store_gfp() */
>         mtree_store_range(mt, 0, 5, xa_mk_value(0xA), GFP_KERNEL);
>         mtree_store_range(mt, 6, 10, NULL, GFP_KERNEL);
>
>         mas_set_range(&mas, 0, 5);
> +
> +       /* setup writer 2 that will trigger the race condition */
> +       mt_set_private(mt);
> +       mt_set_callback(writer2);
> +
>         mtree_lock(mt);
>         mas_store_gfp(&mas, NULL, GFP_KERNEL);
>
> @@ -36435,7 +36014,6 @@ static inline int check_vma_modification(struct maple_tree *mt)
>         __mas_set_range(&mas, 0x7ffde4ca2000, 0x7ffffffff000 - 1);
>         mas_preallocate(&mas, NULL, GFP_KERNEL);
>         mas_store_prealloc(&mas, NULL);
> -       mt_dump(mt, mt_dump_hex);
>
>         mas_destroy(&mas);
>         mtree_unlock(mt);
> @@ -36453,6 +36031,8 @@ static inline void check_bulk_rebalance(struct maple_tree *mt)
>
>         build_full_tree(mt, 0, 2);
>
> +
> +       mtree_lock(mt);
>         /* erase every entry in the tree */
>         do {
>                 /* set up bulk store mode */
> @@ -36462,6 +36042,85 @@ static inline void check_bulk_rebalance(struct maple_tree *mt)
>         } while (mas_prev(&mas, 0) != NULL);
>
>         mas_destroy(&mas);
> +       mtree_unlock(mt);
> +}
> +
> +static unsigned long get_last_index(struct ma_state *mas)
> +{
> +       struct maple_node *node = mas_mn(mas);
> +       enum maple_type mt = mte_node_type(mas->node);
> +       unsigned long *pivots = ma_pivots(node, mt);
> +       unsigned long last_index = mas_data_end(mas);
> +
> +       BUG_ON(last_index == 0);
> +
> +       return pivots[last_index - 1] + 1;
> +}
> +
> +/*
> + * Assert that we handle spanning stores that consume the entirety of the right
> + * leaf node correctly.
> + */
> +static void test_spanning_store_regression(void)
> +{
> +       unsigned long from = 0, to = 0;
> +       DEFINE_MTREE(tree);
> +       MA_STATE(mas, &tree, 0, 0);
> +
> +       /*
> +        * Build a 3-level tree. We require a parent node below the root node
> +        * and 2 leaf nodes under it, so we can span the entirety of the right
> +        * hand node.
> +        */
> +       build_full_tree(&tree, 0, 3);
> +
> +       /* Descend into position at depth 2. */
> +       mas_reset(&mas);
> +       mas_start(&mas);
> +       mas_descend(&mas);
> +       mas_descend(&mas);
> +
> +       /*
> +        * We need to establish a tree like the below.
> +        *
> +        * Then we can try a store in [from, to] which results in a spanned
> +        * store across nodes B and C, with the maple state at the time of the
> +        * write being such that only the subtree at A and below is considered.
> +        *
> +        * Height
> +        *  0                              Root Node
> +        *                                  /      \
> +        *                    pivot = to   /        \ pivot = ULONG_MAX
> +        *                                /          \
> +        *   1                       A [-----]       ...
> +        *                              /   \
> +        *                pivot = from /     \ pivot = to
> +        *                            /       \
> +        *   2 (LEAVES)          B [-----]  [-----] C
> +        *                                       ^--- Last pivot to.
> +        */
> +       while (true) {
> +               unsigned long tmp = get_last_index(&mas);
> +
> +               if (mas_next_sibling(&mas)) {
> +                       from = tmp;
> +                       to = mas.max;
> +               } else {
> +                       break;
> +               }
> +       }
> +
> +       BUG_ON(from == 0 && to == 0);
> +
> +       /* Perform the store. */
> +       mas_set_range(&mas, from, to);
> +       mas_store_gfp(&mas, xa_mk_value(0xdead), GFP_KERNEL);
> +
> +       /* If the regression occurs, the validation will fail. */
> +       mt_validate(&tree);
> +
> +       /* Cleanup. */
> +       __mt_destroy(&tree);
>  }
>
>  void farmer_tests(void)
> @@ -36525,6 +36184,7 @@ void farmer_tests(void)
>         check_collapsing_rebalance(&tree);
>         mtree_destroy(&tree);
>
> +
>         mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>         check_null_expand(&tree);
>         mtree_destroy(&tree);
> @@ -36538,10 +36198,6 @@ void farmer_tests(void)
>         check_erase_testset(&tree);
>         mtree_destroy(&tree);
>
> -       mt_init_flags(&tree, 0);
> -       check_new_node(&tree);
> -       mtree_destroy(&tree);
> -
>         if (!MAPLE_32BIT) {
>                 mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>                 check_rcu_simulated(&tree);
> @@ -36563,95 +36219,13 @@ void farmer_tests(void)
>
>         /* No memory handling */
>         check_nomem(&tree);
> -}
> -
> -static unsigned long get_last_index(struct ma_state *mas)
> -{
> -       struct maple_node *node = mas_mn(mas);
> -       enum maple_type mt = mte_node_type(mas->node);
> -       unsigned long *pivots = ma_pivots(node, mt);
> -       unsigned long last_index = mas_data_end(mas);
> -
> -       BUG_ON(last_index == 0);
>
> -       return pivots[last_index - 1] + 1;
> -}
> -
> -/*
> - * Assert that we handle spanning stores that consume the entirety of the right
> - * leaf node correctly.
> - */
> -static void test_spanning_store_regression(void)
> -{
> -       unsigned long from = 0, to = 0;
> -       DEFINE_MTREE(tree);
> -       MA_STATE(mas, &tree, 0, 0);
> -
> -       /*
> -        * Build a 3-level tree. We require a parent node below the root node
> -        * and 2 leaf nodes under it, so we can span the entirety of the right
> -        * hand node.
> -        */
> -       build_full_tree(&tree, 0, 3);
> -
> -       /* Descend into position at depth 2. */
> -       mas_reset(&mas);
> -       mas_start(&mas);
> -       mas_descend(&mas);
> -       mas_descend(&mas);
> -
> -       /*
> -        * We need to establish a tree like the below.
> -        *
> -        * Then we can try a store in [from, to] which results in a spanned
> -        * store across nodes B and C, with the maple state at the time of the
> -        * write being such that only the subtree at A and below is considered.
> -        *
> -        * Height
> -        *  0                              Root Node
> -        *                                  /      \
> -        *                    pivot = to   /        \ pivot = ULONG_MAX
> -        *                                /          \
> -        *   1                       A [-----]       ...
> -        *                              /   \
> -        *                pivot = from /     \ pivot = to
> -        *                            /       \
> -        *   2 (LEAVES)          B [-----]  [-----] C
> -        *                                       ^--- Last pivot to.
> -        */
> -       while (true) {
> -               unsigned long tmp = get_last_index(&mas);
> -
> -               if (mas_next_sibling(&mas)) {
> -                       from = tmp;
> -                       to = mas.max;
> -               } else {
> -                       break;
> -               }
> -       }
> -
> -       BUG_ON(from == 0 && to == 0);
> -
> -       /* Perform the store. */
> -       mas_set_range(&mas, from, to);
> -       mas_store_gfp(&mas, xa_mk_value(0xdead), GFP_KERNEL);
> -
> -       /* If the regression occurs, the validation will fail. */
> -       mt_validate(&tree);
> -
> -       /* Cleanup. */
> -       __mt_destroy(&tree);
> -}
> -
> -static void regression_tests(void)
> -{
>         test_spanning_store_regression();
>  }
>
>  void maple_tree_tests(void)
>  {
>  #if !defined(BENCH)
> -       regression_tests();
>         farmer_tests();
>  #endif
>         maple_tree_seed();
> diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
> index e0255f53159bd3a1325d49192283dd6790a5e3b8..6a15665fc8315168c718e6810c7deaeed13a3a6a 100644
> --- a/tools/testing/shared/linux.c
> +++ b/tools/testing/shared/linux.c
> @@ -82,7 +82,8 @@ void *kmem_cache_alloc_lru(struct kmem_cache *cachep, struct list_lru *lru,
>
>         if (!(gfp & __GFP_DIRECT_RECLAIM)) {
>                 if (!cachep->non_kernel) {
> -                       cachep->exec_callback = true;
> +                       if (cachep->callback)
> +                               cachep->exec_callback = true;
>                         return NULL;
>                 }
>
> @@ -236,6 +237,8 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
>                 for (i = 0; i < size; i++)
>                         __kmem_cache_free_locked(cachep, p[i]);
>                 pthread_mutex_unlock(&cachep->lock);
> +               if (cachep->callback)
> +                       cachep->exec_callback = true;
>                 return 0;
>         }
>
> @@ -288,9 +291,8 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
>                 capacity = s->sheaf_capacity;
>
>         sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * s->sheaf_capacity * capacity);
> -       if (!sheaf) {
> +       if (!sheaf)
>                 return NULL;
> -       }
>
>         memset(sheaf, 0, size);
>         sheaf->cache = s;
>
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state
  2025-07-23 13:34 ` [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state Vlastimil Babka
@ 2025-08-22 20:25   ` Suren Baghdasaryan
  2025-08-26 15:10     ` Liam R. Howlett
  0 siblings, 1 reply; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-22 20:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The fast path through a write will require replacing a single node in
> the tree.  Using a sheaf (32 nodes) is too heavy for the fast path, so
> special case the node store operation by just allocating one node in the
> maple state.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/maple_tree.h |  4 +++-
>  lib/maple_tree.c           | 47 ++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index 3cf1ae9dde7ce43fa20ae400c01fefad048c302e..61eb5e7d09ad0133978e3ac4b2af66710421e769 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -443,6 +443,7 @@ struct ma_state {
>         unsigned long min;              /* The minimum index of this node - implied pivot min */
>         unsigned long max;              /* The maximum index of this node - implied pivot max */
>         struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> +       struct maple_node *alloc;       /* allocated nodes */
>         unsigned long node_request;
>         enum maple_status status;       /* The status of the state (active, start, none, etc) */
>         unsigned char depth;            /* depth of tree descent during write */
> @@ -491,8 +492,9 @@ struct ma_wr_state {
>                 .status = ma_start,                                     \
>                 .min = 0,                                               \
>                 .max = ULONG_MAX,                                       \
> -               .node_request= 0,                                       \
>                 .sheaf = NULL,                                          \
> +               .alloc = NULL,                                          \
> +               .node_request= 0,                                       \
>                 .mas_flags = 0,                                         \
>                 .store_type = wr_invalid,                               \
>         }
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 3c3c14a76d98ded3b619c178d64099b464a2ca23..9aa782b1497f224e7366ebbd65f997523ee0c8ab 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -1101,16 +1101,23 @@ static int mas_ascend(struct ma_state *mas)
>   *
>   * Return: A pointer to a maple node.
>   */
> -static inline struct maple_node *mas_pop_node(struct ma_state *mas)
> +static __always_inline struct maple_node *mas_pop_node(struct ma_state *mas)
>  {
>         struct maple_node *ret;
>
> +       if (mas->alloc) {
> +               ret = mas->alloc;
> +               mas->alloc = NULL;
> +               goto out;
> +       }
> +
>         if (WARN_ON_ONCE(!mas->sheaf))
>                 return NULL;
>
>         ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
> -       memset(ret, 0, sizeof(*ret));
>
> +out:
> +       memset(ret, 0, sizeof(*ret));
>         return ret;
>  }
>
> @@ -1121,9 +1128,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
>   */
>  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>  {
> -       if (unlikely(mas->sheaf)) {
> -               unsigned long refill = mas->node_request;
> +       if (!mas->node_request)
> +               return;
> +
> +       if (mas->node_request == 1) {
> +               if (mas->sheaf)
> +                       goto use_sheaf;
> +
> +               if (mas->alloc)
> +                       return;
>
> +               mas->alloc = mt_alloc_one(gfp);
> +               if (!mas->alloc)
> +                       goto error;
> +
> +               mas->node_request = 0;
> +               return;
> +       }
> +
> +use_sheaf:
> +       if (unlikely(mas->alloc)) {

When would this condition happen? Do we really need to free mas->alloc
here or it can be reused for the next 1-node allocation?

> +               mt_free_one(mas->alloc);
> +               mas->alloc = NULL;
> +       }
> +
> +       if (mas->sheaf) {
> +               unsigned long refill;
> +
> +               refill = mas->node_request;
>                 if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
>                         mas->node_request = 0;
>                         return;
> @@ -5386,8 +5418,11 @@ void mas_destroy(struct ma_state *mas)
>         mas->node_request = 0;
>         if (mas->sheaf)
>                 mt_return_sheaf(mas->sheaf);
> -
>         mas->sheaf = NULL;
> +
> +       if (mas->alloc)
> +               mt_free_one(mas->alloc);
> +       mas->alloc = NULL;
>  }
>  EXPORT_SYMBOL_GPL(mas_destroy);
>
> @@ -6074,7 +6109,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
>                 mas_alloc_nodes(mas, gfp);
>         }
>
> -       if (!mas->sheaf)
> +       if (!mas->sheaf && !mas->alloc)
>                 return false;
>
>         mas->status = ma_start;
>
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 14/14] maple_tree: Convert forking to use the sheaf interface
  2025-07-23 13:34 ` [PATCH v5 14/14] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
@ 2025-08-22 20:29   ` Suren Baghdasaryan
  0 siblings, 0 replies; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-22 20:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Use the generic interface which should result in less bulk allocations
> during a forking.
>
> A part of this is to abstract the freeing of the sheaf or maple state
> allocations into its own function so mas_destroy() and the tree
> duplication code can use the same functionality to return any unused
> resources.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  lib/maple_tree.c | 42 +++++++++++++++++++++++-------------------
>  1 file changed, 23 insertions(+), 19 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 9aa782b1497f224e7366ebbd65f997523ee0c8ab..180d5e2ea49440248aaae04a066276406b2537ed 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -1178,6 +1178,19 @@ static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
>         mas_set_err(mas, -ENOMEM);
>  }
>
> +static inline void mas_empty_nodes(struct ma_state *mas)
> +{
> +       mas->node_request = 0;
> +       if (mas->sheaf) {
> +               mt_return_sheaf(mas->sheaf);
> +               mas->sheaf = NULL;
> +       }
> +
> +       if (mas->alloc) {
> +               mt_free_one(mas->alloc);
> +               mas->alloc = NULL;
> +       }
> +}
>
>  /*
>   * mas_free() - Free an encoded maple node
> @@ -5414,15 +5427,7 @@ void mas_destroy(struct ma_state *mas)
>                 mas->mas_flags &= ~MA_STATE_REBALANCE;
>         }
>         mas->mas_flags &= ~(MA_STATE_BULK|MA_STATE_PREALLOC);
> -
> -       mas->node_request = 0;
> -       if (mas->sheaf)
> -               mt_return_sheaf(mas->sheaf);
> -       mas->sheaf = NULL;
> -
> -       if (mas->alloc)
> -               mt_free_one(mas->alloc);
> -       mas->alloc = NULL;
> +       mas_empty_nodes(mas);
>  }
>  EXPORT_SYMBOL_GPL(mas_destroy);
>
> @@ -6499,7 +6504,7 @@ static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
>         struct maple_node *node = mte_to_node(mas->node);
>         struct maple_node *new_node = mte_to_node(new_mas->node);
>         enum maple_type type;
> -       unsigned char request, count, i;
> +       unsigned char count, i;
>         void __rcu **slots;
>         void __rcu **new_slots;
>         unsigned long val;
> @@ -6507,20 +6512,17 @@ static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
>         /* Allocate memory for child nodes. */
>         type = mte_node_type(mas->node);
>         new_slots = ma_slots(new_node, type);
> -       request = mas_data_end(mas) + 1;
> -       count = mt_alloc_bulk(gfp, request, (void **)new_slots);
> -       if (unlikely(count < request)) {
> -               memset(new_slots, 0, request * sizeof(void *));
> -               mas_set_err(mas, -ENOMEM);
> +       count = mas->node_request = mas_data_end(mas) + 1;
> +       mas_alloc_nodes(mas, gfp);
> +       if (unlikely(mas_is_err(mas)))
>                 return;
> -       }
>
> -       /* Restore node type information in slots. */
>         slots = ma_slots(node, type);
>         for (i = 0; i < count; i++) {
>                 val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
>                 val &= MAPLE_NODE_MASK;
> -               ((unsigned long *)new_slots)[i] |= val;
> +               new_slots[i] = ma_mnode_ptr((unsigned long)mas_pop_node(mas) |
> +                                           val);
>         }
>  }
>
> @@ -6574,7 +6576,7 @@ static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>                         /* Only allocate child nodes for non-leaf nodes. */
>                         mas_dup_alloc(mas, new_mas, gfp);
>                         if (unlikely(mas_is_err(mas)))
> -                               return;
> +                               goto empty_mas;
>                 } else {
>                         /*
>                          * This is the last leaf node and duplication is
> @@ -6607,6 +6609,8 @@ static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>         /* Make them the same height */
>         new_mas->tree->ma_flags = mas->tree->ma_flags;
>         rcu_assign_pointer(new_mas->tree->ma_root, root);
> +empty_mas:
> +       mas_empty_nodes(mas);
>  }
>
>  /**
>
> --
> 2.50.1
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing
  2025-07-23 13:34 ` [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
@ 2025-08-25  5:22   ` Harry Yoo
  2025-08-26 10:11     ` Vlastimil Babka
  0 siblings, 1 reply; 45+ messages in thread
From: Harry Yoo @ 2025-08-25  5:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 03:34:42PM +0200, Vlastimil Babka wrote:
> Since we don't control the NUMA locality of objects in percpu sheaves,
> allocations with node restrictions bypass them. Allocations without
> restrictions may however still expect to get local objects with high
> probability, and the introduction of sheaves can decrease it due to
> freed object from a remote node ending up in percpu sheaves.
> 
> The fraction of such remote frees seems low (5% on an 8-node machine)
> but it can be expected that some cache or workload specific corner cases
> exist. We can either conclude that this is not a problem due to the low
> fraction, or we can make remote frees bypass percpu sheaves and go
> directly to their slabs. This will make the remote frees more expensive,
> but if if's only a small fraction, most frees will still benefit from
> the lower overhead of percpu sheaves.
> 
> This patch thus makes remote object freeing bypass percpu sheaves,
> including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
> it's not intended to be 100% guarantee that percpu sheaves will only
> contain local objects. The refill from slabs does not provide that
> guarantee in the first place, and there might be cpu migrations
> happening when we need to unlock the local_lock. Avoiding all that could
> be possible but complicated so we can leave it for later investigation
> whether it would be worth it. It can be expected that the more selective
> freeing will itself prevent accumulation of remote objects in percpu
> sheaves so any such violations would have only short-term effects.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab_common.c |  7 +++++--
>  mm/slub.c        | 42 ++++++++++++++++++++++++++++++++++++------
>  2 files changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 2d806e02568532a1000fd3912db6978e945dcfa8..f466f68a5bd82030a987baf849a98154cd48ef23 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1623,8 +1623,11 @@ static bool kfree_rcu_sheaf(void *obj)
>  
>  	slab = folio_slab(folio);
>  	s = slab->slab_cache;
> -	if (s->cpu_sheaves)
> -		return __kfree_rcu_sheaf(s, obj);
> +	if (s->cpu_sheaves) {
> +		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
> +			   slab_nid(slab) == numa_node_id()))
> +			return __kfree_rcu_sheaf(s, obj);
> +	}

This should be numa_mem_id() to handle memory-less NUMA nodes as
Christoph mentioned [1]?

I saw you addressed this in most of places but not this one.

With that addressed, please feel free to add:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

[1] https://lore.kernel.org/linux-mm/c60ae681-6027-0626-8d4e-5833982bf1f0@gentwo.org

>  
>  	return false;
>  }

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves
  2025-07-23 13:34 ` [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
  2025-08-22 19:58   ` Suren Baghdasaryan
@ 2025-08-25  6:52   ` Harry Yoo
  2025-08-26 10:49     ` Vlastimil Babka
  1 sibling, 1 reply; 45+ messages in thread
From: Harry Yoo @ 2025-08-25  6:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, Jul 23, 2025 at 03:34:43PM +0200, Vlastimil Babka wrote:
> Currently allocations asking for a specific node explicitly or via
> mempolicy in strict_numa node bypass percpu sheaves. Since sheaves
> contain mostly local objects, we can try allocating from them if the
> local node happens to be the requested node or allowed by the mempolicy.
> If we find the object from percpu sheaves is not from the expected node,
> we skip the sheaves - this should be rare.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

With or without ifdeffery suggested by Suren
(or probably IS_ENABLED(CONFIG_NUMA) && node != NUMA_NO_NODE?),

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

>  mm/slub.c | 52 +++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 50fc35b8fc9b3101821c338e9469c134677ded51..b98983b8d2e3e04ea256d91efcf0215ff0ae7e38 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4765,18 +4765,42 @@ __pcs_handle_empty(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t
>  }
>  
>  static __fastpath_inline
> -void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>  {
>  	struct slub_percpu_sheaves *pcs;
>  	void *object;
>  
>  #ifdef CONFIG_NUMA
> -	if (static_branch_unlikely(&strict_numa)) {
> -		if (current->mempolicy)
> -			return NULL;
> +	if (static_branch_unlikely(&strict_numa) &&
> +			 node == NUMA_NO_NODE) {
> +
> +		struct mempolicy *mpol = current->mempolicy;
> +
> +		if (mpol) {
> +			/*
> +			 * Special BIND rule support. If the local node
> +			 * is in permitted set then do not redirect
> +			 * to a particular node.
> +			 * Otherwise we apply the memory policy to get
> +			 * the node we need to allocate on.
> +			 */
> +			if (mpol->mode != MPOL_BIND ||
> +					!node_isset(numa_mem_id(), mpol->nodes))
> +
> +				node = mempolicy_slab_node();
> +		}
>  	}
>  #endif
>  
> +	if (unlikely(node != NUMA_NO_NODE)) {
> +		/*
> +		 * We assume the percpu sheaves contain only local objects
> +		 * although it's not completely guaranteed, so we verify later.
> +		 */
> +		if (node != numa_mem_id())
> +			return NULL;
> +	}
> +
>  	if (!local_trylock(&s->cpu_sheaves->lock))
>  		return NULL;
>  
> @@ -4788,7 +4812,21 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
>  			return NULL;
>  	}
>  
> -	object = pcs->main->objects[--pcs->main->size];
> +	object = pcs->main->objects[pcs->main->size - 1];
> +
> +	if (unlikely(node != NUMA_NO_NODE)) {
> +		/*
> +		 * Verify that the object was from the node we want. This could
> +		 * be false because of cpu migration during an unlocked part of
> +		 * the current allocation or previous freeing process.
> +		 */
> +		if (folio_nid(virt_to_folio(object)) != node) {
> +			local_unlock(&s->cpu_sheaves->lock);
> +			return NULL;
> +		}
> +	}
> +
> +	pcs->main->size--;
>  
>  	local_unlock(&s->cpu_sheaves->lock);
>  
> @@ -4888,8 +4926,8 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>  	if (unlikely(object))
>  		goto out;
>  
> -	if (s->cpu_sheaves && node == NUMA_NO_NODE)
> -		object = alloc_from_pcs(s, gfpflags);
> +	if (s->cpu_sheaves)
> +		object = alloc_from_pcs(s, gfpflags, node);
>  
>  	if (!object)
>  		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> 
> -- 
> 2.50.1
> 

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 01/14] slab: add opt-in caching layer of percpu sheaves
  2025-08-18 10:09   ` Harry Yoo
@ 2025-08-26  8:03     ` Vlastimil Babka
  0 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-26  8:03 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 8/18/25 12:09, Harry Yoo wrote:
>> +alloc_empty:
>> +	local_unlock(&s->cpu_sheaves->lock);
>> +
>> +	empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>> +	if (empty)
>> +		goto got_empty;
>> +
>> +	if (put_fail)
>> +		 stat(s, BARN_PUT_FAIL);
>> +
>> +	if (!sheaf_flush_main(s))
>> +		return NULL;
>> +
>> +	if (!local_trylock(&s->cpu_sheaves->lock))
>> +		return NULL;
>> +
>> +	/*
>> +	 * we flushed the main sheaf so it should be empty now,
>> +	 * but in case we got preempted or migrated, we need to
>> +	 * check again
>> +	 */
>> +	if (pcs->main->size == s->sheaf_capacity)
>> +		goto restart;
> 
> I think it's missing:
> 
> pcs = this_cpu_ptr(&s->cpu_sheaves);
> 
> between local_trylock() and reading pcs->main->size().

Oops, yes, thanks!
Also fixed up the other things you pointed out.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 01/14] slab: add opt-in caching layer of percpu sheaves
  2025-08-19  4:19   ` Suren Baghdasaryan
@ 2025-08-26  8:51     ` Vlastimil Babka
  0 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-26  8:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 8/19/25 06:19, Suren Baghdasaryan wrote:
>> @@ -5624,20 +6561,29 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>>
>>         for_each_node_mask(node, slab_nodes) {
>>                 struct kmem_cache_node *n;
>> +               struct node_barn *barn = NULL;
>>
>>                 if (slab_state == DOWN) {
>>                         early_kmem_cache_node_alloc(node);
>>                         continue;
>>                 }
>> +
>> +               if (s->cpu_sheaves) {
>> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
>> +
>> +                       if (!barn)
>> +                               return 0;
>> +               }
>> +
>>                 n = kmem_cache_alloc_node(kmem_cache_node,
>>                                                 GFP_KERNEL, node);
>> -
>>                 if (!n) {
>> -                       free_kmem_cache_nodes(s);
> 
> Why do you skip free_kmem_cache_nodes() here?

It's not necessary, as the caller will perform __kmem_cache_release(),
which calls free_kmem_cache_nodes().

I have incorporated your other suggestions, thanks!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves
  2025-08-22 16:28   ` Suren Baghdasaryan
@ 2025-08-26  9:32     ` Vlastimil Babka
  2025-08-27  0:19       ` Suren Baghdasaryan
  0 siblings, 1 reply; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-26  9:32 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 8/22/25 18:28, Suren Baghdasaryan wrote:
>> diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
>> index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
>> --- a/tools/testing/shared/linux/rcupdate.h
>> +++ b/tools/testing/shared/linux/rcupdate.h
>> @@ -9,4 +9,26 @@
>>  #define rcu_dereference_check(p, cond) rcu_dereference(p)
>>  #define RCU_INIT_POINTER(p, v) do { (p) = (v); } while (0)
>>
>> +void kmem_cache_free_active(void *objp);
>> +static unsigned long kfree_cb_offset = 0;
>> +
>> +static inline void kfree_rcu_cb(struct rcu_head *head)
>> +{
>> +       void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
>> +
>> +       kmem_cache_free_active(objp);
>> +}
>> +
>> +#ifndef offsetof
>> +#define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
>> +#endif
>> +
> 
> We need a comment here that concurrent kfree_rcu() calls are not
> supported because they would override each other's kfree_cb_offset.

I think it's a bit more complex and related to the commit log sentence "This
only works with one kmem_cache, and only the first one used." The first
call to kfree_rcu() sets kfree_cb_offset (but what if the rhv offset is
actually 0?) so later calls won't update it. So concurrent calls will work
as long as they come from the same cache and thus use the same offset. But
I'd like Liam's confirmation and the comment text, if possible :)

> Kinda obvious but I think unusual limitations should be explicitly
> called out.
> 
>> +#define kfree_rcu(ptr, rhv)                                            \
>> +do {                                                                   \
>> +       if (!kfree_cb_offset)                                           \
>> +               kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);        \
>> +                                                                       \
>> +       call_rcu(&ptr->rhv, kfree_rcu_cb);                              \
>> +} while (0)
> 
> Any specific reason kfree_rcu() is a macro and not a static inline function?

Think it's needed for the typeof() to work. The kernel's kfree_rcu() is
similar in this aspect.
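
FWIW, a minimal sketch (my own, not from the series) of how the shim could
record the offset even when the rcu_head happens to sit at offset 0, while
keeping the single-cache limitation spelled out; kfree_rcu_cb() and
kmem_cache_free_active() stay as in the hunk above, and I'm assuming bool
is available in these test headers:

/*
 * Track "offset recorded" separately so an rcu_head at offset 0 still
 * works; all users must still share a single kmem_cache (and thus a
 * single offset).
 */
static unsigned long kfree_cb_offset;
static bool kfree_cb_offset_set;

#define kfree_rcu(ptr, rhv)						\
do {									\
	if (!kfree_cb_offset_set) {					\
		kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);	\
		kfree_cb_offset_set = true;				\
	}								\
	call_rcu(&(ptr)->rhv, kfree_rcu_cb);				\
} while (0)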

>> +
>>  #endif
>>
>> --
>> 2.50.1
>>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure
  2025-08-22 16:56   ` Suren Baghdasaryan
@ 2025-08-26  9:59     ` Vlastimil Babka
  0 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-26  9:59 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 8/22/25 18:56, Suren Baghdasaryan wrote:
>> @@ -270,6 +276,84 @@ __kmem_cache_create_args(const char *name, unsigned int size,
>>         return ret;
>>  }
>>
>> +struct slab_sheaf *
>> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
>> +{
>> +       struct slab_sheaf *sheaf;
>> +       unsigned int capacity;
>> +
>> +       if (size > s->sheaf_capacity)
>> +               capacity = size;
>> +       else
>> +               capacity = s->sheaf_capacity;
> 
> nit:
> capacity = max(size, s->sheaf_capacity);

OK

>> +
>> +       sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * s->sheaf_capacity * capacity);
> 
> Should this really be `sizeof(void *) * s->sheaf_capacity * capacity`
> or just `sizeof(void *) * capacity` ?

Right, so the whole thing should be:
sizeof(*sheaf) + sizeof(void *) * capacity

> 
>> +       if (!sheaf) {
>> +               return NULL;
>> +       }
>> +
>> +       memset(sheaf, 0, size);

This is also wrong, so I'm changing it to calloc(1, ...) to get the zeroing
there.
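
FWIW, putting the two fixes together, a sketch of roughly how the corrected
allocation should look (so the separate memset can go away; exact form up
to the next revision):

	sheaf = calloc(1, sizeof(*sheaf) + sizeof(void *) * capacity);
	if (!sheaf)
		return NULL;

	sheaf->cache = s;
	sheaf->capacity = capacity;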

>> +       sheaf->cache = s;
>> +       sheaf->capacity = capacity;
>> +       sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
>> +       if (!sheaf->size) {
>> +               free(sheaf);
>> +               return NULL;
>> +       }
>> +
>> +       return sheaf;
>> +}
>> +
>> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
>> +                struct slab_sheaf **sheafp, unsigned int size)
>> +{
>> +       struct slab_sheaf *sheaf = *sheafp;
>> +       int refill;
>> +
>> +       if (sheaf->size >= size)
>> +               return 0;
>> +
>> +       if (size > sheaf->capacity) {
>> +               sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
>> +               if (!sheaf)
>> +                       return -ENOMEM;
>> +
>> +               kmem_cache_return_sheaf(s, gfp, *sheafp);
>> +               *sheafp = sheaf;
>> +               return 0;
>> +       }
>> +
>> +       refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
>> +                                      &sheaf->objects[sheaf->size]);
>> +       if (!refill)
>> +               return -ENOMEM;
>> +
>> +       sheaf->size += refill;
>> +       return 0;
>> +}
>> +
>> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
>> +                struct slab_sheaf *sheaf)
>> +{
>> +       if (sheaf->size) {
>> +               //s->non_kernel += sheaf->size;
> 
> Above comment seems obsolete.

Ack.

> 
>> +               kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
>> +       }
>> +       free(sheaf);
>> +}
>> +
>> +void *
>> +kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
>> +               struct slab_sheaf *sheaf)
>> +{
>> +       if (sheaf->size == 0) {
>> +               printf("Nothing left in sheaf!\n");
>> +               return NULL;
>> +       }
>> +
> 
> Should we clear sheaf->objects[sheaf->size] for additional safety?

OK.
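
FWIW, a sketch of the pop with that clearing added (testing-shim code,
exact form up to the next revision):

void *
kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
		struct slab_sheaf *sheaf)
{
	void *object;

	if (sheaf->size == 0) {
		printf("Nothing left in sheaf!\n");
		return NULL;
	}

	object = sheaf->objects[--sheaf->size];
	/* clear the vacated slot for additional safety */
	sheaf->objects[sheaf->size] = NULL;

	return object;
}
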
>> +       return sheaf->objects[--sheaf->size];
>> +}
>> +
>>  /*
>>   * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
>>   */
>>
>> --
>> 2.50.1
>>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing
  2025-08-25  5:22   ` Harry Yoo
@ 2025-08-26 10:11     ` Vlastimil Babka
  0 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-26 10:11 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 8/25/25 07:22, Harry Yoo wrote:
> On Wed, Jul 23, 2025 at 03:34:42PM +0200, Vlastimil Babka wrote:
>> Since we don't control the NUMA locality of objects in percpu sheaves,
>> allocations with node restrictions bypass them. Allocations without
>> restrictions may however still expect to get local objects with high
>> probability, and the introduction of sheaves can decrease it due to
>> freed object from a remote node ending up in percpu sheaves.
>> 
>> The fraction of such remote frees seems low (5% on an 8-node machine)
>> but it can be expected that some cache or workload specific corner cases
>> exist. We can either conclude that this is not a problem due to the low
>> fraction, or we can make remote frees bypass percpu sheaves and go
>> directly to their slabs. This will make the remote frees more expensive,
>> but if it's only a small fraction, most frees will still benefit from
>> the lower overhead of percpu sheaves.
>> 
>> This patch thus makes remote object freeing bypass percpu sheaves,
>> including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
>> it's not intended to be 100% guarantee that percpu sheaves will only
>> contain local objects. The refill from slabs does not provide that
>> guarantee in the first place, and there might be cpu migrations
>> happening when we need to unlock the local_lock. Avoiding all that could
>> be possible but complicated so we can leave it for later investigation
>> whether it would be worth it. It can be expected that the more selective
>> freeing will itself prevent accumulation of remote objects in percpu
>> sheaves so any such violations would have only short-term effects.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/slab_common.c |  7 +++++--
>>  mm/slub.c        | 42 ++++++++++++++++++++++++++++++++++++------
>>  2 files changed, 41 insertions(+), 8 deletions(-)
>> 
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index 2d806e02568532a1000fd3912db6978e945dcfa8..f466f68a5bd82030a987baf849a98154cd48ef23 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -1623,8 +1623,11 @@ static bool kfree_rcu_sheaf(void *obj)
>>  
>>  	slab = folio_slab(folio);
>>  	s = slab->slab_cache;
>> -	if (s->cpu_sheaves)
>> -		return __kfree_rcu_sheaf(s, obj);
>> +	if (s->cpu_sheaves) {
>> +		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
>> +			   slab_nid(slab) == numa_node_id()))
>> +			return __kfree_rcu_sheaf(s, obj);
>> +	}
> 
> This should be numa_mem_id() to handle memory-less NUMA nodes as
> Christoph mentioned [1]?
> 
> I saw you addressed this in most of places but not this one.

Oops, right.
> With that addressed, please feel free to add:
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Thanks!

> [1] https://lore.kernel.org/linux-mm/c60ae681-6027-0626-8d4e-5833982bf1f0@gentwo.org
> 
>>  
>>  	return false;
>>  }
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves
  2025-08-25  6:52   ` Harry Yoo
@ 2025-08-26 10:49     ` Vlastimil Babka
  0 siblings, 0 replies; 45+ messages in thread
From: Vlastimil Babka @ 2025-08-26 10:49 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 8/25/25 08:52, Harry Yoo wrote:
> On Wed, Jul 23, 2025 at 03:34:43PM +0200, Vlastimil Babka wrote:
>> Currently allocations asking for a specific node explicitly or via
>> mempolicy in strict_numa node bypass percpu sheaves. Since sheaves
>> contain mostly local objects, we can try allocating from them if the
>> local node happens to be the requested node or allowed by the mempolicy.
>> If we find the object from percpu sheaves is not from the expected node,
>> we skip the sheaves - this should be rare.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
> 
> With or without ifdeffery suggested by Suren
> (or probably IS_ENABLED(CONFIG_NUMA) && node != NUMA_NO_NODE?),
> 
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Thanks both, I've extracted IS_ENABLED(CONFIG_NUMA) && node != NUMA_NO_NODE
into a local bool variable.
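
For illustration, the extraction in alloc_from_pcs() could look roughly
like this (the variable name is a guess, not from the actual fixup):

	bool node_requested = IS_ENABLED(CONFIG_NUMA) && node != NUMA_NO_NODE;

	if (unlikely(node_requested)) {
		/*
		 * We assume the percpu sheaves contain only local objects
		 * although it's not completely guaranteed, so we verify later.
		 */
		if (node != numa_mem_id())
			return NULL;
	}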


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 12/14] maple_tree: Sheaf conversion
  2025-08-22 20:18   ` Suren Baghdasaryan
@ 2025-08-26 14:22     ` Liam R. Howlett
  2025-08-27  2:07       ` Suren Baghdasaryan
  0 siblings, 1 reply; 45+ messages in thread
From: Liam R. Howlett @ 2025-08-26 14:22 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

* Suren Baghdasaryan <surenb@google.com> [250822 16:18]:
> On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >
> > Use sheaves instead of bulk allocations.  This should speed up the
> > allocations and the return path of unused allocations.
> 
> Nice cleanup!
> 
> >
> > Remove push/pop of nodes from maple state.
> > Remove unnecessary testing
> > ifdef out other testing that probably will be deleted
> 
> Should we simply remove them if they are unused?

Yes, I think it's time to drop them.

> 
> > Fix testcase for testing race
> > Move some testing around in the same commit.
> 
> Would it be possible to separate test changes from kernel changes into
> another patch? Kernel part looks good to me but I don't know enough
> about these tests to vote on that.

Yes.  I'll do that.

I'll drop testing first then the feature so that testing will continue
to pass on bisection.

I will also stop moving tests around in this change.

> 
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/maple_tree.h       |   6 +-
> >  lib/maple_tree.c                 | 331 ++++----------------
> >  lib/test_maple_tree.c            |   8 +
> >  tools/testing/radix-tree/maple.c | 632 +++++++--------------------------------
> >  tools/testing/shared/linux.c     |   8 +-
> >  5 files changed, 185 insertions(+), 800 deletions(-)

...

I didn't see any changes in the code block, but please let me know if I
missed them.


Thanks,
Liam

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state
  2025-08-22 20:25   ` Suren Baghdasaryan
@ 2025-08-26 15:10     ` Liam R. Howlett
  2025-08-27  2:03       ` Suren Baghdasaryan
  0 siblings, 1 reply; 45+ messages in thread
From: Liam R. Howlett @ 2025-08-26 15:10 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

* Suren Baghdasaryan <surenb@google.com> [250822 16:25]:
> On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The fast path through a write will require replacing a single node in
> > the tree.  Using a sheaf (32 nodes) is too heavy for the fast path, so
> > special case the node store operation by just allocating one node in the
> > maple state.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > ---
> >  include/linux/maple_tree.h |  4 +++-
> >  lib/maple_tree.c           | 47 ++++++++++++++++++++++++++++++++++++++++------
> >  2 files changed, 44 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > index 3cf1ae9dde7ce43fa20ae400c01fefad048c302e..61eb5e7d09ad0133978e3ac4b2af66710421e769 100644
> > --- a/include/linux/maple_tree.h
> > +++ b/include/linux/maple_tree.h
> > @@ -443,6 +443,7 @@ struct ma_state {
> >         unsigned long min;              /* The minimum index of this node - implied pivot min */
> >         unsigned long max;              /* The maximum index of this node - implied pivot max */
> >         struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> > +       struct maple_node *alloc;       /* allocated nodes */
> >         unsigned long node_request;
> >         enum maple_status status;       /* The status of the state (active, start, none, etc) */
> >         unsigned char depth;            /* depth of tree descent during write */
> > @@ -491,8 +492,9 @@ struct ma_wr_state {
> >                 .status = ma_start,                                     \
> >                 .min = 0,                                               \
> >                 .max = ULONG_MAX,                                       \
> > -               .node_request= 0,                                       \
> >                 .sheaf = NULL,                                          \
> > +               .alloc = NULL,                                          \
> > +               .node_request= 0,                                       \
> >                 .mas_flags = 0,                                         \
> >                 .store_type = wr_invalid,                               \
> >         }
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index 3c3c14a76d98ded3b619c178d64099b464a2ca23..9aa782b1497f224e7366ebbd65f997523ee0c8ab 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -1101,16 +1101,23 @@ static int mas_ascend(struct ma_state *mas)
> >   *
> >   * Return: A pointer to a maple node.
> >   */
> > -static inline struct maple_node *mas_pop_node(struct ma_state *mas)
> > +static __always_inline struct maple_node *mas_pop_node(struct ma_state *mas)
> >  {
> >         struct maple_node *ret;
> >
> > +       if (mas->alloc) {
> > +               ret = mas->alloc;
> > +               mas->alloc = NULL;
> > +               goto out;
> > +       }
> > +
> >         if (WARN_ON_ONCE(!mas->sheaf))
> >                 return NULL;
> >
> >         ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
> > -       memset(ret, 0, sizeof(*ret));
> >
> > +out:
> > +       memset(ret, 0, sizeof(*ret));
> >         return ret;
> >  }
> >
> > @@ -1121,9 +1128,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
> >   */
> >  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
> >  {
> > -       if (unlikely(mas->sheaf)) {
> > -               unsigned long refill = mas->node_request;
> > +       if (!mas->node_request)
> > +               return;
> > +
> > +       if (mas->node_request == 1) {
> > +               if (mas->sheaf)
> > +                       goto use_sheaf;
> > +
> > +               if (mas->alloc)
> > +                       return;
> >
> > +               mas->alloc = mt_alloc_one(gfp);
> > +               if (!mas->alloc)
> > +                       goto error;
> > +
> > +               mas->node_request = 0;
> > +               return;
> > +       }
> > +
> > +use_sheaf:
> > +       if (unlikely(mas->alloc)) {
> 
> When would this condition happen?


This would be the case if we have one node allocated and requested more
than one node.  That is, a chained request for nodes that ends up having
the alloc set and requesting a sheaf.

> Do we really need to free mas->alloc
> here or it can be reused for the next 1-node allocation?

Most calls end in mas_destroy() so that won't happen today.

We could reduce the number of allocations requested from the sheaf and let
the code find the mas->alloc first and use that.

But remember, we are getting into this situation because code did a
mas_preallocate(), then figured it needed to do something else (error
recovery, or changed the vma flags and now it can merge..) and will now
need additional nodes.  So this is a rare case, and I figured just
freeing it was the safest thing.
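
To illustrate the chained request (a hypothetical sequence showing only
the maple-state transitions, not a real caller):

	mas->node_request = 1;
	mas_alloc_nodes(mas, gfp);	/* fast path: mas->alloc gets one node */

	/* plans change (error recovery, a merge, ...), more nodes needed */
	mas->node_request = 3;
	mas_alloc_nodes(mas, gfp);	/* use_sheaf: drops mas->alloc, fills mas->sheaf */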


> > +               mt_free_one(mas->alloc);
> > +               mas->alloc = NULL;
> > +       }
> > +
> > +       if (mas->sheaf) {
> > +               unsigned long refill;
> > +
> > +               refill = mas->node_request;
> >                 if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
> >                         mas->node_request = 0;
> >                         return;
> > @@ -5386,8 +5418,11 @@ void mas_destroy(struct ma_state *mas)
> >         mas->node_request = 0;
> >         if (mas->sheaf)
> >                 mt_return_sheaf(mas->sheaf);
> > -
> >         mas->sheaf = NULL;
> > +
> > +       if (mas->alloc)
> > +               mt_free_one(mas->alloc);
> > +       mas->alloc = NULL;
> >  }
> >  EXPORT_SYMBOL_GPL(mas_destroy);
> >
> > @@ -6074,7 +6109,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
> >                 mas_alloc_nodes(mas, gfp);
> >         }
> >
> > -       if (!mas->sheaf)
> > +       if (!mas->sheaf && !mas->alloc)
> >                 return false;
> >
> >         mas->status = ma_start;
> >
> > --
> > 2.50.1
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves
  2025-08-26  9:32     ` Vlastimil Babka
@ 2025-08-27  0:19       ` Suren Baghdasaryan
  0 siblings, 0 replies; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27  0:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Tue, Aug 26, 2025 at 2:32 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/22/25 18:28, Suren Baghdasaryan wrote:
> >> diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
> >> index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
> >> --- a/tools/testing/shared/linux/rcupdate.h
> >> +++ b/tools/testing/shared/linux/rcupdate.h
> >> @@ -9,4 +9,26 @@
> >>  #define rcu_dereference_check(p, cond) rcu_dereference(p)
> >>  #define RCU_INIT_POINTER(p, v) do { (p) = (v); } while (0)
> >>
> >> +void kmem_cache_free_active(void *objp);
> >> +static unsigned long kfree_cb_offset = 0;
> >> +
> >> +static inline void kfree_rcu_cb(struct rcu_head *head)
> >> +{
> >> +       void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
> >> +
> >> +       kmem_cache_free_active(objp);
> >> +}
> >> +
> >> +#ifndef offsetof
> >> +#define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER)
> >> +#endif
> >> +
> >
> > We need a comment here that concurrent kfree_rcu() calls are not
> > supported because they would override each other's kfree_cb_offset.
>
> I think it's a bit more complex and related to the commit log sentence "This
> only works with one kmem_cache, and only the first one used." The first
> call to kfree_rcu() sets kfree_cb_offset (but what if the rhv offset is
> actually 0?) so later calls won't update it. So concurrent calls will work
> as long as they come from the same cache and thus use the same offset. But
> I'd like Liam's confirmation and the comment text, if possible :)
>
> > Kinda obvious but I think unusual limitations should be explicitly
> > called out.
> >
> >> +#define kfree_rcu(ptr, rhv)                                            \
> >> +do {                                                                   \
> >> +       if (!kfree_cb_offset)                                           \
> >> +               kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);        \
> >> +                                                                       \
> >> +       call_rcu(&ptr->rhv, kfree_rcu_cb);                              \
> >> +} while (0)
> >
> > Any specific reason kfree_rcu() is a macro and not a static inline function?
>
> Think it's needed for the typeof() to work. The kernel's kfree_rcu() is
> similar in this aspect.

Ah, got it. Thanks!

>
> >> +
> >>  #endif
> >>
> >> --
> >> 2.50.1
> >>
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state
  2025-08-26 15:10     ` Liam R. Howlett
@ 2025-08-27  2:03       ` Suren Baghdasaryan
  0 siblings, 0 replies; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27  2:03 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Vlastimil Babka,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree

On Tue, Aug 26, 2025 at 8:11 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250822 16:25]:
> > On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > The fast path through a write will require replacing a single node in
> > > the tree.  Using a sheaf (32 nodes) is too heavy for the fast path, so
> > > special case the node store operation by just allocating one node in the
> > > maple state.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  include/linux/maple_tree.h |  4 +++-
> > >  lib/maple_tree.c           | 47 ++++++++++++++++++++++++++++++++++++++++------
> > >  2 files changed, 44 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index 3cf1ae9dde7ce43fa20ae400c01fefad048c302e..61eb5e7d09ad0133978e3ac4b2af66710421e769 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -443,6 +443,7 @@ struct ma_state {
> > >         unsigned long min;              /* The minimum index of this node - implied pivot min */
> > >         unsigned long max;              /* The maximum index of this node - implied pivot max */
> > >         struct slab_sheaf *sheaf;       /* Allocated nodes for this operation */
> > > +       struct maple_node *alloc;       /* allocated nodes */
> > >         unsigned long node_request;
> > >         enum maple_status status;       /* The status of the state (active, start, none, etc) */
> > >         unsigned char depth;            /* depth of tree descent during write */
> > > @@ -491,8 +492,9 @@ struct ma_wr_state {
> > >                 .status = ma_start,                                     \
> > >                 .min = 0,                                               \
> > >                 .max = ULONG_MAX,                                       \
> > > -               .node_request= 0,                                       \
> > >                 .sheaf = NULL,                                          \
> > > +               .alloc = NULL,                                          \
> > > +               .node_request= 0,                                       \
> > >                 .mas_flags = 0,                                         \
> > >                 .store_type = wr_invalid,                               \
> > >         }
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index 3c3c14a76d98ded3b619c178d64099b464a2ca23..9aa782b1497f224e7366ebbd65f997523ee0c8ab 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -1101,16 +1101,23 @@ static int mas_ascend(struct ma_state *mas)
> > >   *
> > >   * Return: A pointer to a maple node.
> > >   */
> > > -static inline struct maple_node *mas_pop_node(struct ma_state *mas)
> > > +static __always_inline struct maple_node *mas_pop_node(struct ma_state *mas)
> > >  {
> > >         struct maple_node *ret;
> > >
> > > +       if (mas->alloc) {
> > > +               ret = mas->alloc;
> > > +               mas->alloc = NULL;
> > > +               goto out;
> > > +       }
> > > +
> > >         if (WARN_ON_ONCE(!mas->sheaf))
> > >                 return NULL;
> > >
> > >         ret = kmem_cache_alloc_from_sheaf(maple_node_cache, GFP_NOWAIT, mas->sheaf);
> > > -       memset(ret, 0, sizeof(*ret));
> > >
> > > +out:
> > > +       memset(ret, 0, sizeof(*ret));
> > >         return ret;
> > >  }
> > >
> > > @@ -1121,9 +1128,34 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
> > >   */
> > >  static inline void mas_alloc_nodes(struct ma_state *mas, gfp_t gfp)
> > >  {
> > > -       if (unlikely(mas->sheaf)) {
> > > -               unsigned long refill = mas->node_request;
> > > +       if (!mas->node_request)
> > > +               return;
> > > +
> > > +       if (mas->node_request == 1) {
> > > +               if (mas->sheaf)
> > > +                       goto use_sheaf;
> > > +
> > > +               if (mas->alloc)
> > > +                       return;
> > >
> > > +               mas->alloc = mt_alloc_one(gfp);
> > > +               if (!mas->alloc)
> > > +                       goto error;
> > > +
> > > +               mas->node_request = 0;
> > > +               return;
> > > +       }
> > > +
> > > +use_sheaf:
> > > +       if (unlikely(mas->alloc)) {
> >
> > When would this condition happen?
>
>
> This would be the case if we have one node allocated and requested more
> than one node.  That is, a chained request for nodes that ends up having
> the alloc set and requesting a sheaf.

Ah, ok. So this is also a recovery case where we thought we needed only
one node and then the situation changed and we need more than one?

>
> > Do we really need to free mas->alloc
> > here or it can be reused for the next 1-node allocation?
>
> Most calls end in mas_destroy() so that won't happen today.
>
> We could reduce the number of allocations requested from the sheaf and let
> the code find the mas->alloc first and use that.
>
> But remember, we are getting into this situation because code did a
> mas_preallocate(), then figured it needed to do something else (error
> recovery, or changed the vma flags and now it can merge..) and will now
> need additional nodes.  So this is a rare case, and I figured just
> freeing it was the safest thing.

Ok, got it. Both situations would be part of the unusual recovery
case. Makes sense then. Thanks!


>
>
> > > +               mt_free_one(mas->alloc);
> > > +               mas->alloc = NULL;
> > > +       }
> > > +
> > > +       if (mas->sheaf) {
> > > +               unsigned long refill;
> > > +
> > > +               refill = mas->node_request;
> > >                 if(kmem_cache_sheaf_size(mas->sheaf) >= refill) {
> > >                         mas->node_request = 0;
> > >                         return;
> > > @@ -5386,8 +5418,11 @@ void mas_destroy(struct ma_state *mas)
> > >         mas->node_request = 0;
> > >         if (mas->sheaf)
> > >                 mt_return_sheaf(mas->sheaf);
> > > -
> > >         mas->sheaf = NULL;
> > > +
> > > +       if (mas->alloc)
> > > +               mt_free_one(mas->alloc);
> > > +       mas->alloc = NULL;
> > >  }
> > >  EXPORT_SYMBOL_GPL(mas_destroy);
> > >
> > > @@ -6074,7 +6109,7 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
> > >                 mas_alloc_nodes(mas, gfp);
> > >         }
> > >
> > > -       if (!mas->sheaf)
> > > +       if (!mas->sheaf && !mas->alloc)
> > >                 return false;
> > >
> > >         mas->status = ma_start;
> > >
> > > --
> > > 2.50.1
> > >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 12/14] maple_tree: Sheaf conversion
  2025-08-26 14:22     ` Liam R. Howlett
@ 2025-08-27  2:07       ` Suren Baghdasaryan
  2025-08-28 14:27         ` Liam R. Howlett
  0 siblings, 1 reply; 45+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27  2:07 UTC (permalink / raw)
  To: Liam R. Howlett, Suren Baghdasaryan, Vlastimil Babka,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Uladzislau Rezki, linux-mm, linux-kernel, rcu, maple-tree

On Tue, Aug 26, 2025 at 7:22 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Suren Baghdasaryan <surenb@google.com> [250822 16:18]:
> > On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > >
> > > Use sheaves instead of bulk allocations.  This should speed up the
> > > allocations and the return path of unused allocations.
> >
> > Nice cleanup!
> >
> > >
> > > Remove push/pop of nodes from maple state.
> > > Remove unnecessary testing
> > > ifdef out other testing that probably will be deleted
> >
> > Should we simply remove them if they are unused?
>
> Yes, I think it's time to drop them.
>
> >
> > > Fix testcase for testing race
> > > Move some testing around in the same commit.
> >
> > Would it be possible to separate test changes from kernel changes into
> > another patch? Kernel part looks good to me but I don't know enough
> > about these tests to vote on that.
>
> Yes.  I'll do that.
>
> I'll drop testing first then the feature so that testing will continue
> to pass on bisection.
>
> I will also stop moving tests around in this change.
>
> >
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  include/linux/maple_tree.h       |   6 +-
> > >  lib/maple_tree.c                 | 331 ++++----------------
> > >  lib/test_maple_tree.c            |   8 +
> > >  tools/testing/radix-tree/maple.c | 632 +++++++--------------------------------
> > >  tools/testing/shared/linux.c     |   8 +-
> > >  5 files changed, 185 insertions(+), 800 deletions(-)
>
> ...
>
> I didn't see any changes in the code block, but please let me know if I
> missed them.

I was referring to the changes in include/linux/maple_tree.h and
lib/maple_tree.c as kernel changes and the rest as test changes.

>
>
> Thanks,
> Liam

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 12/14] maple_tree: Sheaf conversion
  2025-08-27  2:07       ` Suren Baghdasaryan
@ 2025-08-28 14:27         ` Liam R. Howlett
  0 siblings, 0 replies; 45+ messages in thread
From: Liam R. Howlett @ 2025-08-28 14:27 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Vlastimil Babka, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

* Suren Baghdasaryan <surenb@google.com> [250826 22:07]:
> On Tue, Aug 26, 2025 at 7:22 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Suren Baghdasaryan <surenb@google.com> [250822 16:18]:
> > > On Wed, Jul 23, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >
> > > > From: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > >
> > > > Use sheaves instead of bulk allocations.  This should speed up the
> > > > allocations and the return path of unused allocations.
> > >
> > > Nice cleanup!
> > >
> > > >
> > > > Remove push/pop of nodes from maple state.
> > > > Remove unnecessary testing
> > > > ifdef out other testing that probably will be deleted
> > >
> > > Should we simply remove them if they are unused?
> >
> > Yes, I think it's time to drop them.
> >
> > >
> > > > Fix testcase for testing race
> > > > Move some testing around in the same commit.
> > >
> > > Would it be possible to separate test changes from kernel changes into
> > > another patch? Kernel part looks good to me but I don't know enough
> > > about these tests to vote on that.
> >
> > Yes.  I'll do that.
> >
> > I'll drop testing first then the feature so that testing will continue
> > to pass on bisection.
> >
> > I will also stop moving tests around in this change.

I cannot easily change the tests without the corresponding code and keep
the tests passing.

I can limit the changes to what is necessary though.  It looks like I've
moved some things around and I'll do that another time.

Thanks,
Liam

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2025-08-28 14:28 UTC | newest]

Thread overview: 45+ messages
2025-07-23 13:34 [PATCH v5 00/14] SLUB percpu sheaves Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 01/14] slab: add opt-in caching layer of " Vlastimil Babka
2025-08-18 10:09   ` Harry Yoo
2025-08-26  8:03     ` Vlastimil Babka
2025-08-19  4:19   ` Suren Baghdasaryan
2025-08-26  8:51     ` Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 02/14] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
2025-07-23 16:39   ` Uladzislau Rezki
2025-07-24 14:30     ` Vlastimil Babka
2025-07-24 17:36       ` Uladzislau Rezki
2025-07-23 13:34 ` [PATCH v5 03/14] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 04/14] slab: determine barn status racily outside of lock Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 05/14] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
2025-08-22 16:28   ` Suren Baghdasaryan
2025-08-26  9:32     ` Vlastimil Babka
2025-08-27  0:19       ` Suren Baghdasaryan
2025-07-23 13:34 ` [PATCH v5 06/14] tools: Add sheaves support to testing infrastructure Vlastimil Babka
2025-08-22 16:56   ` Suren Baghdasaryan
2025-08-26  9:59     ` Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 07/14] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 08/14] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 09/14] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
2025-08-25  5:22   ` Harry Yoo
2025-08-26 10:11     ` Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 10/14] mm, slab: allow NUMA restricted allocations to use percpu sheaves Vlastimil Babka
2025-08-22 19:58   ` Suren Baghdasaryan
2025-08-25  6:52   ` Harry Yoo
2025-08-26 10:49     ` Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 11/14] testing/radix-tree/maple: Increase readers and reduce delay for faster machines Vlastimil Babka
2025-07-23 13:34 ` [PATCH v5 12/14] maple_tree: Sheaf conversion Vlastimil Babka
2025-08-22 20:18   ` Suren Baghdasaryan
2025-08-26 14:22     ` Liam R. Howlett
2025-08-27  2:07       ` Suren Baghdasaryan
2025-08-28 14:27         ` Liam R. Howlett
2025-07-23 13:34 ` [PATCH v5 13/14] maple_tree: Add single node allocation support to maple state Vlastimil Babka
2025-08-22 20:25   ` Suren Baghdasaryan
2025-08-26 15:10     ` Liam R. Howlett
2025-08-27  2:03       ` Suren Baghdasaryan
2025-07-23 13:34 ` [PATCH v5 14/14] maple_tree: Convert forking to use the sheaf interface Vlastimil Babka
2025-08-22 20:29   ` Suren Baghdasaryan
2025-08-15 22:53 ` [PATCH v5 00/14] SLUB percpu sheaves Sudarsan Mahendran
2025-08-16  8:05   ` Harry Yoo
     [not found]     ` <CAA9mObAiQbAYvzhW---VoqDA6Zsb152p5ePMvbco0xgwyvaB2Q@mail.gmail.com>
2025-08-16 18:31       ` Vlastimil Babka
2025-08-16 18:33         ` Vlastimil Babka
2025-08-17  4:28           ` Sudarsan Mahendran
