[PATCH v4 0/9] SLUB percpu sheaves

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/9] SLUB percpu sheaves
@ 2025-04-25  8:27 Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

Hi,

This is the v4 and first non-RFC series to add an opt-in percpu
array-based caching layer to SLUB, following the LSF/MM discussions.
Since v3 I've also made changes to achieve full compatibility with
slub_debug, and IRC discussions led to the last patch intended to
improve NUMA locality (the patch remains separate for evaluation
purposes).

Harry's RFC [1] also prompted me to reconsider the stat counters and as
a result I removed some that seemed unnecessary and added others that
were missing to evaluate how effective the barns and sheaf preffiling
are.

I have also addressed the RFC v3 feedback by Suren and Harry - thanks!

Note the name "sheaf" was invented by Matthew so we don't call the
arrays magazine like the original Bonwick paper. The per-NUMA-node cache
of sheaves is thus called "barn".

This caching may seem similar to the arrays in SLAB, but there are some
important differences:

- opt-in, not used for every cache
- does not distinguish NUMA locality, thus there are no per-node
  "shared" arrays (with possible lock contention) and no "alien" arrays
  that would need periodical flushing
  - NUMA restricted allocations and strict_numa mode is still honoured,
    the percpu sheaves are bypassed for those allocations
  - a later patch (for separate evaluation) makes freeing remote objects
    bypass sheaves so sheaves contain mostly (not strictly) local objects
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
  for guaranteed and efficient allocations in a restricted context, when
  the upper bound for needed objects is known but rarely reached

The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why VMA and
maple nodes caches are sheaf-enabled in the patchset, but for maple tree
it's not a full conversion that would benefit from the improved
preallocation API.

Some performance benefits were measured by Suren and Liam in previous
versions. Suren's results in [2] looked quite promising.

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, instead of local double cmpxchg,
  thanks to local_trylock() it becomes a preempt_disable() and no atomic
  operations. Same for freeing, which is normally a local double cmpxchg
  only for short term allocations (so the same slab is still active on the
  same cpu when freeing the object) and a more costly locked double
  cmpxchg otherwise.

  There is a possible downside with a larger fraction of
  non-NUMA-restricted allocations to get remote objects. The last patch
  changes it by making remote frees bypass sheaves. Some very preliminary
  measurements suggest only 5% frees are remote, but whether this is a net
  improvement has to be evaluated.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
  separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.

- Preallocation support. A prefilled sheaf can be privately borrowed to
  perform a short term operation that is not allowed to block in the
  middle and may need to allocate some objects. If an upper bound (worst
  case) for the number of allocations is known, but only much fewer
  allocations actually needed on average, borrowing and returning a sheaf
  is much more efficient then a bulk allocation for the worst case
  followed by a bulk free of the many unused objects. Maple tree write
  operations should benefit from this.

- Compatibility with slub_debug. When slub_debug is enabled for a cache,
  we simply don't create the percpu sheaves so that the debugging hooks
  (at the node partial list slowpaths) are reached as before. Sheaf
  preallocation still works by reusing the (ineffective) paths for
  requests exceeding the cache's sheaf_capacity. This is in line with the
  existing approach where debugging bypasses the fast paths.

GIT TREES:

this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v4r2

It is based on post-6.15-rc3 commit 82efd569a890 ("locking/local_lock:
fix _Generic() matching of local_trylock_t") as it definitely needs
local_trylock_t to work properly.

I have tried to rebase the full maple tree conversion, but there were
conflicts due to 6.15 changes and I don't know the code well enought to
resolve them confidently.

Vlastimil

[1] https://lore.kernel.org/all/20250407041810.13861-1-harry.yoo@oracle.com/
[2] https://lore.kernel.org/all/CAJuCfpFVopL%2BsMdU4bLRxs%2BHS_WPCmFZBdCmwE8qV2Dpa5WZnA@mail.gmail.com/

---
Changes in v4:
- slub_debug disables sheaves for the cache in order to work properly
- strict_numa mode works as intended
- added a separate patch to make freeing remote objects skip sheaves
- various code refactoring suggested by Suren and Harry
- removed less useful stat counters and added missing ones for barn
  and prefilled sheaf events
- Link to v3: https://lore.kernel.org/r/20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz

Changes in v3:
- Squash localtry_lock conversion so it's used immediately.
- Incorporate feedback and add tags from Suren and Harry - thanks!
  - Mostly adding comments and some refactoring.
  - Fixes for kfree_rcu_sheaf() vmalloc handling, cpu hotremove
    flushing.
  - Fix wrong condition in kmem_cache_return_sheaf() that may have
    affected performance negatively.
  - Refactoring of free_to_pcs()
- Link to v2: https://lore.kernel.org/r/20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz

Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
  anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
  implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
  nodes.
- Userspace Testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
  prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

---
Liam R. Howlett (2):
      tools: Add testing support for changes to rcu and slab for sheaves
      tools: Add sheaves support to testing infrastructure

Vlastimil Babka (7):
      slab: add opt-in caching layer of percpu sheaves
      slab: add sheaf support for batching kfree_rcu() operations
      slab: sheaf prefilling for guaranteed allocations
      slab: determine barn status racily outside of lock
      maple_tree: use percpu sheaves for maple_node_cache
      mm, vma: use percpu sheaves for vm_area_struct cache
      mm, slub: skip percpu sheaves for remote object freeing

 include/linux/slab.h                  |   47 +
 kernel/fork.c                         |    1 +
 lib/maple_tree.c                      |   11 +-
 mm/slab.h                             |    5 +
 mm/slab_common.c                      |   32 +-
 mm/slub.c                             | 1573 +++++++++++++++++++++++++++++++--
 tools/include/linux/slab.h            |   65 +-
 tools/testing/shared/linux.c          |  108 ++-
 tools/testing/shared/linux/rcupdate.h |   22 +
 9 files changed, 1791 insertions(+), 73 deletions(-)
---
base-commit: 82efd569a8909f2b13140c1b3de88535aea0b051
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-25 17:31   ` Christoph Lameter (Ampere)
                     ` (2 more replies)
  2025-04-25  8:27 ` [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Specifying a non-zero value for a new struct kmem_cache_args field
sheaf_capacity will setup a caching layer of percpu arrays called
sheaves of given capacity for the created cache.

Allocations from the cache will allocate via the percpu sheaves (main or
spare) as long as they have no NUMA node preference. Frees will also
put the object back into one of the sheaves.

When both percpu sheaves are found empty during an allocation, an empty
sheaf may be replaced with a full one from the per-node barn. If none
are available and the allocation is allowed to block, an empty sheaf is
refilled from slab(s) by an internal bulk alloc operation. When both
percpu sheaves are full during freeing, the barn can replace a full one
with an empty one, unless over a full sheaves limit. In that case a
sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
sheaves and barns is also wired to the existing cpu flushing and cache
shrinking operations.

The sheaves do not distinguish NUMA locality of the cached objects. If
an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
the sheaves are bypassed.

The bulk operations exposed to slab users also try to utilize the
sheaves as long as the necessary (full or empty) sheaves are available
on the cpu or in the barn. Once depleted, they will fallback to bulk
alloc/free to slabs directly to avoid double copying.

The sheaf_capacity value is exported in sysfs for observability.

Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
count objects allocated or freed using the sheaves (and thus not
counting towards the other alloc/free path counters). Counters
sheaf_refill and sheaf_flush count objects filled or flushed from or to
slab pages, and can be used to assess how effective the caching is. The
refill and flush operations will also count towards the usual
alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
the backing slabs.  For barn operations, barn_get and barn_put count how
many full sheaves were get from or put to the barn, the _fail variants
count how many such requests could not be satisfied mainly  because the
barn was either empty or full. While the barn also holds empty sheaves
to make some operations easier, these are not as critical to mandate own
counters.  Finally, there are sheaf_alloc/sheaf_free counters.

Access to the percpu sheaves is protected by local_trylock() when
potential callers include irq context, and local_lock() otherwise (such
as when we already know the gfp flags allow blocking). The trylock
failures should be rare and we can easily fallback. Each per-NUMA-node
barn has a spin_lock.

When slub_debug is enabled for a cache with sheaf_capacity also
specified, the latter is ignored so that allocations and frees reach the
slow path where debugging hooks are processed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slab.h |   31 ++
 mm/slab.h            |    2 +
 mm/slab_common.c     |    5 +-
 mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 1044 insertions(+), 47 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -335,6 +335,37 @@ struct kmem_cache_args {
 	 * %NULL means no constructor.
 	 */
 	void (*ctor)(void *);
+	/**
+	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
+	 *
+	 * With a non-zero value, allocations from the cache go through caching
+	 * arrays called sheaves. Each cpu has a main sheaf that's always
+	 * present, and a spare sheaf thay may be not present. When both become
+	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
+	 * from the per-node barn.
+	 *
+	 * When no full sheaf is available, and gfp flags allow blocking, a
+	 * sheaf is allocated and filled from slab(s) using bulk allocation.
+	 * Otherwise the allocation falls back to the normal operation
+	 * allocating a single object from a slab.
+	 *
+	 * Analogically when freeing and both percpu sheaves are full, the barn
+	 * may replace it with an empty sheaf, unless it's over capacity. In
+	 * that case a sheaf is bulk freed to slab pages.
+	 *
+	 * The sheaves do not enforce NUMA placement of objects, so allocations
+	 * via kmem_cache_alloc_node() with a node specified other than
+	 * NUMA_NO_NODE will bypass them.
+	 *
+	 * Bulk allocation and free operations also try to use the cpu sheaves
+	 * and barn, but fallback to using slab pages directly.
+	 *
+	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
+	 * is ignored.
+	 *
+	 * %0 means no sheaves will be created
+	 */
+	unsigned int sheaf_capacity;
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e095e8db934564d559494cd81746ec..1980330c2fcb4a4613a7e4f7efc78b349993fd89 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -259,6 +259,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_sheaves __percpu *cpu_sheaves;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
@@ -272,6 +273,7 @@ struct kmem_cache {
 	/* Number of per cpu partial slabs to keep around */
 	unsigned int cpu_partial_slabs;
 #endif
+	unsigned int sheaf_capacity;
 	struct kmem_cache_order_objects oo;
 
 	/* Allocation and freeing of slabs */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 5be257e03c7c930b5ca16dd92f790604cc5767ac..4f295bdd2d42355af6311a799955301005f8a532 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
 		return 1;
 #endif
 
+	if (s->cpu_sheaves)
+		return 1;
+
 	/*
 	 * We may have set a slab to be unmergeable during bootstrap.
 	 */
@@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
-	if (!args->usersize)
+	if (!args->usersize && !args->sheaf_capacity)
 		s = __kmem_cache_alias(name, object_size, args->align, flags,
 				       args->ctor);
 	if (s)
diff --git a/mm/slub.c b/mm/slub.c
index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
 #endif
 
 enum stat_item {
+	ALLOC_PCS,		/* Allocation from percpu sheaf */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PCS,		/* Free to percpu sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -372,6 +374,14 @@ enum stat_item {
 	CPU_PARTIAL_FREE,	/* Refill cpu partial on free */
 	CPU_PARTIAL_NODE,	/* Refill cpu partial from node partial */
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
+	SHEAF_FLUSH,		/* Objects flushed from a sheaf */
+	SHEAF_REFILL,		/* Objects refilled to a sheaf */
+	SHEAF_ALLOC,		/* Allocation of an empty sheaf */
+	SHEAF_FREE,		/* Freeing of an empty sheaf */
+	BARN_GET,		/* Got full sheaf from barn */
+	BARN_GET_FAIL,		/* Failed to get full sheaf from barn */
+	BARN_PUT,		/* Put full sheaf to barn */
+	BARN_PUT_FAIL,		/* Failed to put full sheaf to barn */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -418,6 +428,33 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
 #endif
 }
 
+#define MAX_FULL_SHEAVES	10
+#define MAX_EMPTY_SHEAVES	10
+
+struct node_barn {
+	spinlock_t lock;
+	struct list_head sheaves_full;
+	struct list_head sheaves_empty;
+	unsigned int nr_full;
+	unsigned int nr_empty;
+};
+
+struct slab_sheaf {
+	union {
+		struct rcu_head rcu_head;
+		struct list_head barn_list;
+	};
+	unsigned int size;
+	void *objects[];
+};
+
+struct slub_percpu_sheaves {
+	local_trylock_t lock;
+	struct slab_sheaf *main; /* never NULL when unlocked */
+	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct node_barn *barn;
+};
+
 /*
  * The slab lists for all objects.
  */
@@ -430,6 +467,7 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
+	struct node_barn *barn;
 };
 
 static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
@@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
  */
 static nodemask_t slab_nodes;
 
-#ifndef CONFIG_SLUB_TINY
 /*
  * Workqueue used for flush_cpu_slab().
  */
 static struct workqueue_struct *flushwq;
-#endif
+
+struct slub_flush_work {
+	struct work_struct work;
+	struct kmem_cache *s;
+	bool skip;
+};
+
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
 
 /********************************************************************
  * 			Core slab cache functions
@@ -2454,6 +2499,359 @@ static void *setup_object(struct kmem_cache *s, void *object)
 	return object;
 }
 
+static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
+					s->sheaf_capacity), gfp);
+
+	if (unlikely(!sheaf))
+		return NULL;
+
+	stat(s, SHEAF_ALLOC);
+
+	return sheaf;
+}
+
+static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	kfree(sheaf);
+
+	stat(s, SHEAF_FREE);
+}
+
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+				   size_t size, void **p);
+
+
+static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
+			 gfp_t gfp)
+{
+	int to_fill = s->sheaf_capacity - sheaf->size;
+	int filled;
+
+	if (!to_fill)
+		return 0;
+
+	filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
+					 &sheaf->objects[sheaf->size]);
+
+	sheaf->size += filled;
+
+	stat_add(s, SHEAF_REFILL, filled);
+
+	if (filled < to_fill)
+		return -ENOMEM;
+
+	return 0;
+}
+
+
+static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (!sheaf)
+		return NULL;
+
+	if (refill_sheaf(s, sheaf, gfp)) {
+		free_empty_sheaf(s, sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+/*
+ * Maximum number of objects freed during a single flush of main pcs sheaf.
+ * Translates directly to an on-stack array size.
+ */
+#define PCS_BATCH_MAX	32U
+
+static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
+
+/*
+ * Free all objects from the main sheaf. In order to perform
+ * __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
+ * object pointers are moved to a on-stack array under the lock. To bound the
+ * stack usage, limit each batch to PCS_BATCH_MAX.
+ *
+ * returns true if at least partially flushed
+ */
+static bool sheaf_flush_main(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	unsigned int batch, remaining;
+	void *objects[PCS_BATCH_MAX];
+	struct slab_sheaf *sheaf;
+	bool ret = false;
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return ret;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+	sheaf = pcs->main;
+
+	batch = min(PCS_BATCH_MAX, sheaf->size);
+
+	sheaf->size -= batch;
+	memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
+
+	remaining = sheaf->size;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	__kmem_cache_free_bulk(s, batch, &objects[0]);
+
+	stat_add(s, SHEAF_FLUSH, batch);
+
+	ret = true;
+
+	if (remaining)
+		goto next_batch;
+
+	return ret;
+}
+
+/*
+ * Free all objects from a sheaf that's unused, i.e. not linked to any
+ * cpu_sheaves, so we need no locking and batching. The locking is also not
+ * necessary when flushing cpu's sheaves (both spare and main) during cpu
+ * hotremove as the cpu is not executing anymore.
+ */
+static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	if (!sheaf->size)
+		return;
+
+	stat_add(s, SHEAF_FLUSH, sheaf->size);
+
+	__kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+
+	sheaf->size = 0;
+}
+
+/*
+ * Caller needs to make sure migration is disabled in order to fully flush
+ * single cpu's sheaves
+ *
+ * must not be called from an irq
+ *
+ * flushing operations are rare so let's keep it simple and flush to slabs
+ * directly, skipping the barn
+ */
+static void pcs_flush_all(struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *spare;
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	spare = pcs->spare;
+	pcs->spare = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (spare) {
+		sheaf_flush_unused(s, spare);
+		free_empty_sheaf(s, spare);
+	}
+
+	sheaf_flush_main(s);
+}
+
+static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	/* The cpu is not executing anymore so we don't need pcs->lock */
+	sheaf_flush_unused(s, pcs->main);
+	if (pcs->spare) {
+		sheaf_flush_unused(s, pcs->spare);
+		free_empty_sheaf(s, pcs->spare);
+		pcs->spare = NULL;
+	}
+}
+
+static void pcs_destroy(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		/* can happen when unwinding failed create */
+		if (!pcs->main)
+			continue;
+
+		/*
+		 * We have already passed __kmem_cache_shutdown() so everything
+		 * was flushed and there should be no objects allocated from
+		 * slabs, otherwise kmem_cache_destroy() would have aborted.
+		 * Therefore something would have to be really wrong if the
+		 * warnings here trigger, and we should rather leave bojects and
+		 * sheaves to leak in that case.
+		 */
+
+		WARN_ON(pcs->spare);
+
+		if (!WARN_ON(pcs->main->size)) {
+			free_empty_sheaf(s, pcs->main);
+			pcs->main = NULL;
+		}
+	}
+
+	free_percpu(s->cpu_sheaves);
+	s->cpu_sheaves = NULL;
+}
+
+static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *empty = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_empty) {
+		empty = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&empty->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+/*
+ * The following two functions are used mainly in cases where we have to undo an
+ * intended action due to a race or cpu migration. Thus they do not check the
+ * empty or full sheaf limits for simplicity.
+ */
+
+static void barn_put_empty_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_add(&sheaf->barn_list, &barn->sheaves_empty);
+	barn->nr_empty++;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+}
+
+static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_add(&sheaf->barn_list, &barn->sheaves_full);
+	barn->nr_full++;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+}
+
+/*
+ * If a full sheaf is available, return it and put the supplied empty one to
+ * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
+ * change.
+ */
+static struct slab_sheaf *
+barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
+{
+	struct slab_sheaf *full = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&full->barn_list);
+		list_add(&empty->barn_list, &barn->sheaves_empty);
+		barn->nr_full--;
+		barn->nr_empty++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return full;
+}
+/*
+ * If a empty sheaf is available, return it and put the supplied full one to
+ * barn. But if there are too many full sheaves, reject this with -E2BIG.
+ */
+static struct slab_sheaf *
+barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
+{
+	struct slab_sheaf *empty;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full >= MAX_FULL_SHEAVES) {
+		empty = ERR_PTR(-E2BIG);
+	} else if (!barn->nr_empty) {
+		empty = ERR_PTR(-ENOMEM);
+	} else {
+		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
+					 barn_list);
+		list_del(&empty->barn_list);
+		list_add(&full->barn_list, &barn->sheaves_full);
+		barn->nr_empty--;
+		barn->nr_full++;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return empty;
+}
+
+static void barn_init(struct node_barn *barn)
+{
+	spin_lock_init(&barn->lock);
+	INIT_LIST_HEAD(&barn->sheaves_full);
+	INIT_LIST_HEAD(&barn->sheaves_empty);
+	barn->nr_full = 0;
+	barn->nr_empty = 0;
+}
+
+static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
+{
+	struct list_head empty_list;
+	struct list_head full_list;
+	struct slab_sheaf *sheaf, *sheaf2;
+	unsigned long flags;
+
+	INIT_LIST_HEAD(&empty_list);
+	INIT_LIST_HEAD(&full_list);
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	list_splice_init(&barn->sheaves_full, &full_list);
+	barn->nr_full = 0;
+	list_splice_init(&barn->sheaves_empty, &empty_list);
+	barn->nr_empty = 0;
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
+		sheaf_flush_unused(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+	}
+
+	list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
+		free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Slab allocation and freeing
  */
@@ -3325,11 +3723,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 	put_partials_cpu(s, c);
 }
 
-struct slub_flush_work {
-	struct work_struct work;
-	struct kmem_cache *s;
-	bool skip;
-};
+static inline void flush_this_cpu_slab(struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+
+	if (c->slab)
+		flush_slab(s, c);
+
+	put_partials(s);
+}
+
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
+{
+	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+
+	return c->slab || slub_percpu_partial(c);
+}
+
+#else /* CONFIG_SLUB_TINY */
+static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
+static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
+static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
+#endif /* CONFIG_SLUB_TINY */
+
+static bool has_pcs_used(int cpu, struct kmem_cache *s)
+{
+	struct slub_percpu_sheaves *pcs;
+
+	if (!s->cpu_sheaves)
+		return false;
+
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+	return (pcs->spare || pcs->main->size);
+}
+
+static void pcs_flush_all(struct kmem_cache *s);
 
 /*
  * Flush cpu slab.
@@ -3339,30 +3768,18 @@ struct slub_flush_work {
 static void flush_cpu_slab(struct work_struct *w)
 {
 	struct kmem_cache *s;
-	struct kmem_cache_cpu *c;
 	struct slub_flush_work *sfw;
 
 	sfw = container_of(w, struct slub_flush_work, work);
 
 	s = sfw->s;
-	c = this_cpu_ptr(s->cpu_slab);
 
-	if (c->slab)
-		flush_slab(s, c);
+	if (s->cpu_sheaves)
+		pcs_flush_all(s);
 
-	put_partials(s);
-}
-
-static bool has_cpu_slab(int cpu, struct kmem_cache *s)
-{
-	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
-
-	return c->slab || slub_percpu_partial(c);
+	flush_this_cpu_slab(s);
 }
 
-static DEFINE_MUTEX(flush_lock);
-static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
-
 static void flush_all_cpus_locked(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
@@ -3373,7 +3790,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		if (!has_cpu_slab(cpu, s)) {
+		if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
 			sfw->skip = true;
 			continue;
 		}
@@ -3409,19 +3826,15 @@ static int slub_cpu_dead(unsigned int cpu)
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
+	list_for_each_entry(s, &slab_caches, list) {
 		__flush_cpu_slab(s, cpu);
+		if (s->cpu_sheaves)
+			__pcs_flush_all_cpu(s, cpu);
+	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
 
-#else /* CONFIG_SLUB_TINY */
-static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
-static inline void flush_all(struct kmem_cache *s) { }
-static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
-static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
-#endif /* CONFIG_SLUB_TINY */
-
 /*
  * Check if the objects in a per cpu structure fit numa
  * locality expectations.
@@ -4171,6 +4584,191 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
 }
 
+static __fastpath_inline
+void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
+{
+	struct slub_percpu_sheaves *pcs;
+	void *object;
+
+#ifdef CONFIG_NUMA
+	if (static_branch_unlikely(&strict_numa)) {
+		if (current->mempolicy)
+			return NULL;
+	}
+#endif
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return NULL;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *empty = NULL;
+		struct slab_sheaf *full;
+		bool can_alloc;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			stat(s, BARN_GET);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		stat(s, BARN_GET_FAIL);
+
+		can_alloc = gfpflags_allow_blocking(gfp);
+
+		if (can_alloc) {
+			if (pcs->spare) {
+				empty = pcs->spare;
+				pcs->spare = NULL;
+			} else {
+				empty = barn_get_empty_sheaf(pcs->barn);
+			}
+		}
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		if (!can_alloc)
+			return NULL;
+
+		if (empty) {
+			if (!refill_sheaf(s, empty, gfp)) {
+				full = empty;
+			} else {
+				/*
+				 * we must be very low on memory so don't bother
+				 * with the barn
+				 */
+				free_empty_sheaf(s, empty);
+			}
+		} else {
+			full = alloc_full_sheaf(s, gfp);
+		}
+
+		if (!full)
+			return NULL;
+
+		/*
+		 * we can reach here only when gfpflags_allow_blocking
+		 * so this must not be an irq
+		 */
+		local_lock(&s->cpu_sheaves->lock);
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		/*
+		 * If we are returning empty sheaf, we either got it from the
+		 * barn or had to allocate one. If we are returning a full
+		 * sheaf, it's due to racing or being migrated to a different
+		 * cpu. Breaching the barn's sheaf limits should be thus rare
+		 * enough so just ignore them to simplify the recovery.
+		 */
+
+		if (pcs->main->size == 0) {
+			barn_put_empty_sheaf(pcs->barn, pcs->main);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		if (!pcs->spare) {
+			pcs->spare = full;
+			goto do_alloc;
+		}
+
+		if (pcs->spare->size == 0) {
+			barn_put_empty_sheaf(pcs->barn, pcs->spare);
+			pcs->spare = full;
+			goto do_alloc;
+		}
+
+		barn_put_full_sheaf(pcs->barn, full);
+		stat(s, BARN_PUT);
+	}
+
+do_alloc:
+	object = pcs->main->objects[--pcs->main->size];
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, ALLOC_PCS);
+
+	return object;
+}
+
+static __fastpath_inline
+unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main;
+	unsigned int allocated = 0;
+	unsigned int batch;
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return allocated;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == 0)) {
+
+		struct slab_sheaf *full;
+
+		if (pcs->spare && pcs->spare->size > 0) {
+			swap(pcs->main, pcs->spare);
+			goto do_alloc;
+		}
+
+		full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
+
+		if (full) {
+			stat(s, BARN_GET);
+			pcs->main = full;
+			goto do_alloc;
+		}
+
+		stat(s, BARN_GET_FAIL);
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		/*
+		 * Once full sheaves in barn are depleted, let the bulk
+		 * allocation continue from slab pages, otherwise we would just
+		 * be copying arrays of pointers twice.
+		 */
+		return allocated;
+	}
+
+do_alloc:
+
+	main = pcs->main;
+	batch = min(size, main->size);
+
+	main->size -= batch;
+	memcpy(p, main->objects + main->size, batch * sizeof(void *));
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat_add(s, ALLOC_PCS, batch);
+
+	allocated += batch;
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return allocated;
+}
+
+
 /*
  * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
  * have the fastpath folded into their functions. So no function call
@@ -4195,7 +4793,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_sheaves && node == NUMA_NO_NODE)
+		object = alloc_from_pcs(s, gfpflags);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+/*
+ * pcs is locked. We should have get rid of the spare sheaf and obtained an
+ * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
+ * as a main sheaf, and make the current main sheaf a spare sheaf.
+ *
+ * However due to having relinquished the cpu_sheaves lock when obtaining
+ * the empty sheaf, we need to handle some unlikely but possible cases.
+ *
+ * If we put any sheaf to barn here, it's because we were interrupted or have
+ * been migrated to a different cpu, which should be rare enough so just ignore
+ * the barn's limits to simplify the handling.
+ */
+static void __pcs_install_empty_sheaf(struct kmem_cache *s,
+		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
+{
+	/* this is what we expect to find if nobody interrupted us */
+	if (likely(!pcs->spare)) {
+		pcs->spare = pcs->main;
+		pcs->main = empty;
+		return;
+	}
+
+	/*
+	 * Unlikely because if the main sheaf had space, we would have just
+	 * freed to it. Get rid of our empty sheaf.
+	 */
+	if (pcs->main->size < s->sheaf_capacity) {
+		barn_put_empty_sheaf(pcs->barn, empty);
+		return;
+	}
+
+	/* Also unlikely for the same reason */
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		barn_put_empty_sheaf(pcs->barn, empty);
+		return;
+	}
+
+	barn_put_full_sheaf(pcs->barn, pcs->main);
+	stat(s, BARN_PUT);
+	pcs->main = empty;
+}
+
+/*
+ * Free an object to the percpu sheaves.
+ * The object is expected to have passed slab_free_hook() already.
+ */
+static __fastpath_inline
+bool free_to_pcs(struct kmem_cache *s, void *object)
+{
+	struct slub_percpu_sheaves *pcs;
+
+restart:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		return false;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
+
+		struct slab_sheaf *empty;
+
+		if (!pcs->spare) {
+			empty = barn_get_empty_sheaf(pcs->barn);
+			if (empty) {
+				pcs->spare = pcs->main;
+				pcs->main = empty;
+				goto do_free;
+			}
+			goto alloc_empty;
+		}
+
+		if (pcs->spare->size < s->sheaf_capacity) {
+			swap(pcs->main, pcs->spare);
+			goto do_free;
+		}
+
+		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+
+		if (!IS_ERR(empty)) {
+			stat(s, BARN_PUT);
+			pcs->main = empty;
+			goto do_free;
+		}
+
+		if (PTR_ERR(empty) == -E2BIG) {
+			/* Since we got here, spare exists and is full */
+			struct slab_sheaf *to_flush = pcs->spare;
+
+			stat(s, BARN_PUT_FAIL);
+
+			pcs->spare = NULL;
+			local_unlock(&s->cpu_sheaves->lock);
+
+			sheaf_flush_unused(s, to_flush);
+			empty = to_flush;
+			goto got_empty;
+		}
+
+alloc_empty:
+		local_unlock(&s->cpu_sheaves->lock);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty) {
+			if (sheaf_flush_main(s))
+				goto restart;
+			else
+				return false;
+		}
+
+got_empty:
+		if (!local_trylock(&s->cpu_sheaves->lock)) {
+			struct node_barn *barn;
+
+			barn = get_node(s, numa_mem_id())->barn;
+
+			barn_put_empty_sheaf(barn, empty);
+			return false;
+		}
+
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+		__pcs_install_empty_sheaf(s, pcs, empty);
+	}
+
+do_free:
+	pcs->main->objects[pcs->main->size++] = object;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat(s, FREE_PCS);
+
+	return true;
+}
+
+/*
+ * Bulk free objects to the percpu sheaves.
+ * Unlike free_to_pcs() this includes the calls to all necessary hooks
+ * and the fallback to freeing to slab pages.
+ */
+static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *main, *empty;
+	unsigned int batch, i = 0;
+	bool init;
+
+	init = slab_want_init_on_free(s);
+
+	while (i < size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
+			p[i] = p[--size];
+			if (!size)
+				return;
+			continue;
+		}
+
+		i++;
+	}
+
+next_batch:
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fallback;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (likely(pcs->main->size < s->sheaf_capacity))
+		goto do_free;
+
+	if (!pcs->spare) {
+		empty = barn_get_empty_sheaf(pcs->barn);
+		if (!empty)
+			goto no_empty;
+
+		pcs->spare = pcs->main;
+		pcs->main = empty;
+		goto do_free;
+	}
+
+	if (pcs->spare->size < s->sheaf_capacity) {
+		swap(pcs->main, pcs->spare);
+		goto do_free;
+	}
+
+	empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
+	if (IS_ERR(empty)) {
+		stat(s, BARN_PUT_FAIL);
+		goto no_empty;
+	}
+
+	stat(s, BARN_PUT);
+	pcs->main = empty;
+
+do_free:
+	main = pcs->main;
+	batch = min(size, s->sheaf_capacity - main->size);
+
+	memcpy(main->objects + main->size, p, batch * sizeof(void *));
+	main->size += batch;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	stat_add(s, FREE_PCS, batch);
+
+	if (batch < size) {
+		p += batch;
+		size -= batch;
+		goto next_batch;
+	}
+
+	return;
+
+no_empty:
+	local_unlock(&s->cpu_sheaves->lock);
+
+	/*
+	 * if we depleted all empty sheaves in the barn or there are too
+	 * many full sheaves, free the rest to slab pages
+	 */
+fallback:
+	__kmem_cache_free_bulk(s, size, p);
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4653,7 +5483,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	memcg_slab_free_hook(s, slab, &object, 1);
 	alloc_tagging_slab_free_hook(s, slab, &object, 1);
 
-	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
+		return;
+
+	if (!s->cpu_sheaves || !free_to_pcs(s, object))
 		do_slab_free(s, slab, object, object, 1, addr);
 }
 
@@ -5247,6 +6080,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
 	if (!size)
 		return;
 
+	/*
+	 * freeing to sheaves is so incompatible with the detached freelist so
+	 * once we go that way, we have to do everything differently
+	 */
+	if (s && s->cpu_sheaves) {
+		free_to_pcs_bulk(s, size, p);
+		return;
+	}
+
 	do {
 		struct detached_freelist df;
 
@@ -5365,7 +6207,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
 int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 				 void **p)
 {
-	int i;
+	unsigned int i = 0;
 
 	if (!size)
 		return 0;
@@ -5374,9 +6216,21 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 	if (unlikely(!s))
 		return 0;
 
-	i = __kmem_cache_alloc_bulk(s, flags, size, p);
-	if (unlikely(i == 0))
-		return 0;
+	if (s->cpu_sheaves)
+		i = alloc_from_pcs_bulk(s, size, p);
+
+	if (i < size) {
+		unsigned int j = __kmem_cache_alloc_bulk(s, flags, size - i, p + i);
+		/*
+		 * If we ran out of memory, don't bother with freeing back to
+		 * the percpu sheaves, we have bigger problems.
+		 */
+		if (unlikely(j == 0)) {
+			if (i > 0)
+				__kmem_cache_free_bulk(s, i, p);
+			return 0;
+		}
+	}
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -5386,11 +6240,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
 		    slab_want_init_on_alloc(flags, s), s->object_size))) {
 		return 0;
 	}
-	return i;
+
+	return size;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
 
-
 /*
  * Object placement in a slab is made very easy because we always start at
  * offset 0. If we tune the size of the object to the alignment then we can
@@ -5524,7 +6378,7 @@ static inline int calculate_order(unsigned int size)
 }
 
 static void
-init_kmem_cache_node(struct kmem_cache_node *n)
+init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
@@ -5534,6 +6388,9 @@ init_kmem_cache_node(struct kmem_cache_node *n)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
+	n->barn = barn;
+	if (barn)
+		barn_init(barn);
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5564,6 +6421,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 }
 #endif /* CONFIG_SLUB_TINY */
 
+static int init_percpu_sheaves(struct kmem_cache *s)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_sheaves *pcs;
+		int nid;
+
+		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
+
+		local_trylock_init(&pcs->lock);
+
+		nid = cpu_to_mem(cpu);
+
+		pcs->barn = get_node(s, nid)->barn;
+		pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+
+		if (!pcs->main)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static struct kmem_cache *kmem_cache_node;
 
 /*
@@ -5599,7 +6480,7 @@ static void early_kmem_cache_node_alloc(int node)
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
 	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n);
+	init_kmem_cache_node(n, NULL);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
 
 	/*
@@ -5615,6 +6496,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn) {
+			WARN_ON(n->barn->nr_full);
+			WARN_ON(n->barn->nr_empty);
+			kfree(n->barn);
+			n->barn = NULL;
+		}
+
 		s->node[node] = NULL;
 		kmem_cache_free(kmem_cache_node, n);
 	}
@@ -5623,6 +6511,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 void __kmem_cache_release(struct kmem_cache *s)
 {
 	cache_random_seq_destroy(s);
+	if (s->cpu_sheaves)
+		pcs_destroy(s);
 #ifndef CONFIG_SLUB_TINY
 	free_percpu(s->cpu_slab);
 #endif
@@ -5635,20 +6525,29 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_node_mask(node, slab_nodes) {
 		struct kmem_cache_node *n;
+		struct node_barn *barn = NULL;
 
 		if (slab_state == DOWN) {
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+			if (!barn)
+				return 0;
+		}
+
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
-
 		if (!n) {
-			free_kmem_cache_nodes(s);
+			kfree(barn);
 			return 0;
 		}
 
-		init_kmem_cache_node(n);
+		init_kmem_cache_node(n, barn);
+
 		s->node[node] = n;
 	}
 	return 1;
@@ -5905,6 +6804,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
+		if (n->barn)
+			barn_shrink(s, n->barn);
 		free_partial(s, n);
 		if (n->nr_partial || node_nr_slabs(n))
 			return 1;
@@ -6108,6 +7009,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 			INIT_LIST_HEAD(promote + i);
 
+		if (n->barn)
+			barn_shrink(s, n->barn);
+
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -6220,12 +7124,24 @@ static int slab_mem_going_online_callback(void *arg)
 	 */
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
+		struct node_barn *barn = NULL;
+
 		/*
 		 * The structure may already exist if the node was previously
 		 * onlined and offlined.
 		 */
 		if (get_node(s, nid))
 			continue;
+
+		if (s->cpu_sheaves) {
+			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
+
+			if (!barn) {
+				ret = -ENOMEM;
+				goto out;
+			}
+		}
+
 		/*
 		 * XXX: kmem_cache_alloc_node will fallback to other nodes
 		 *      since memory is not yet available from the node that
@@ -6233,10 +7149,13 @@ static int slab_mem_going_online_callback(void *arg)
 		 */
 		n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
 		if (!n) {
+			kfree(barn);
 			ret = -ENOMEM;
 			goto out;
 		}
-		init_kmem_cache_node(n);
+
+		init_kmem_cache_node(n, barn);
+
 		s->node[nid] = n;
 	}
 	/*
@@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 
 	set_cpu_partial(s);
 
+	if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
+		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
+		if (!s->cpu_sheaves) {
+			err = -ENOMEM;
+			goto out;
+		}
+		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
+		s->sheaf_capacity = args->sheaf_capacity;
+	}
+
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
 #endif
@@ -6471,6 +7400,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 	if (!alloc_kmem_cache_cpus(s))
 		goto out;
 
+	if (s->cpu_sheaves) {
+		err = init_percpu_sheaves(s);
+		if (err)
+			goto out;
+	}
+
 	err = 0;
 
 	/* Mutex is not taken during early boot */
@@ -6492,7 +7427,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 		__kmem_cache_release(s);
 	return err;
 }
-
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -6923,6 +7857,12 @@ static ssize_t order_show(struct kmem_cache *s, char *buf)
 }
 SLAB_ATTR_RO(order);
 
+static ssize_t sheaf_capacity_show(struct kmem_cache *s, char *buf)
+{
+	return sysfs_emit(buf, "%u\n", s->sheaf_capacity);
+}
+SLAB_ATTR_RO(sheaf_capacity);
+
 static ssize_t min_partial_show(struct kmem_cache *s, char *buf)
 {
 	return sysfs_emit(buf, "%lu\n", s->min_partial);
@@ -7270,8 +8210,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PCS, free_cpu_sheaf);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -7296,6 +8238,14 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
+STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
+STAT_ATTR(SHEAF_REFILL, sheaf_refill);
+STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
+STAT_ATTR(SHEAF_FREE, sheaf_free);
+STAT_ATTR(BARN_GET, barn_get);
+STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
+STAT_ATTR(BARN_PUT, barn_put);
+STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -7326,6 +8276,7 @@ static struct attribute *slab_attrs[] = {
 	&object_size_attr.attr,
 	&objs_per_slab_attr.attr,
 	&order_attr.attr,
+	&sheaf_capacity_attr.attr,
 	&min_partial_attr.attr,
 	&cpu_partial_attr.attr,
 	&objects_partial_attr.attr,
@@ -7357,8 +8308,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_sheaf_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_sheaf_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
@@ -7383,6 +8336,14 @@ static struct attribute *slab_attrs[] = {
 	&cpu_partial_free_attr.attr,
 	&cpu_partial_node_attr.attr,
 	&cpu_partial_drain_attr.attr,
+	&sheaf_flush_attr.attr,
+	&sheaf_refill_attr.attr,
+	&sheaf_alloc_attr.attr,
+	&sheaf_free_attr.attr,
+	&barn_get_attr.attr,
+	&barn_get_fail_attr.attr,
+	&barn_put_attr.attr,
+	&barn_put_fail_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
@ 2025-04-25 17:31   ` Christoph Lameter (Ampere)
  2025-04-28  7:01     ` Vlastimil Babka
  2025-04-29  1:08   ` Harry Yoo
  2025-05-06 23:14   ` Suren Baghdasaryan
  2 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-04-25 17:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, 25 Apr 2025, Vlastimil Babka wrote:

> @@ -4195,7 +4793,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>  	if (unlikely(object))
>  		goto out;
>
> -	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> +	if (s->cpu_sheaves && node == NUMA_NO_NODE)
> +		object = alloc_from_pcs(s, gfpflags);

The node to use is determined in __slab_alloc_node() only based on the
memory policy etc. NUMA_NO_NODE allocations can be redirected by memory
policies and this check disables it.


> @@ -4653,7 +5483,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>  	memcg_slab_free_hook(s, slab, &object, 1);
>  	alloc_tagging_slab_free_hook(s, slab, &object, 1);
>
> -	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +		return;
> +
> +	if (!s->cpu_sheaves || !free_to_pcs(s, object))
>  		do_slab_free(s, slab, object, object, 1, addr);
>  }

We free to pcs even if the object is remote?



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-25 17:31   ` Christoph Lameter (Ampere)
@ 2025-04-28  7:01     ` Vlastimil Babka
  2025-05-06 17:32       ` Suren Baghdasaryan
  0 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-28  7:01 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Suren Baghdasaryan, Liam R. Howlett, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 4/25/25 19:31, Christoph Lameter (Ampere) wrote:
> On Fri, 25 Apr 2025, Vlastimil Babka wrote:
> 
>> @@ -4195,7 +4793,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>>  	if (unlikely(object))
>>  		goto out;
>>
>> -	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>> +	if (s->cpu_sheaves && node == NUMA_NO_NODE)
>> +		object = alloc_from_pcs(s, gfpflags);
> 
> The node to use is determined in __slab_alloc_node() only based on the
> memory policy etc. NUMA_NO_NODE allocations can be redirected by memory
> policies and this check disables it.

To handle that, alloc_from_pcs() contains this:

#ifdef CONFIG_NUMA
        if (static_branch_unlikely(&strict_numa)) {
                if (current->mempolicy)
                        return NULL;
        }
#endif

And so there will be a fallback. It doesn't (currently) try to evaluate if
the local node is compatible as this is before taking the local lock (and
thus preventing migration).


>> @@ -4653,7 +5483,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>>  	memcg_slab_free_hook(s, slab, &object, 1);
>>  	alloc_tagging_slab_free_hook(s, slab, &object, 1);
>>
>> -	if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>> +	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>> +		return;
>> +
>> +	if (!s->cpu_sheaves || !free_to_pcs(s, object))
>>  		do_slab_free(s, slab, object, object, 1, addr);
>>  }
> 
> We free to pcs even if the object is remote?
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-28  7:01     ` Vlastimil Babka
@ 2025-05-06 17:32       ` Suren Baghdasaryan
  2025-05-06 23:11         ` Suren Baghdasaryan
  0 siblings, 1 reply; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-06 17:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter (Ampere), Liam R. Howlett, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Mon, Apr 28, 2025 at 12:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/25/25 19:31, Christoph Lameter (Ampere) wrote:
> > On Fri, 25 Apr 2025, Vlastimil Babka wrote:
> >
> >> @@ -4195,7 +4793,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
> >>      if (unlikely(object))
> >>              goto out;
> >>
> >> -    object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> >> +    if (s->cpu_sheaves && node == NUMA_NO_NODE)
> >> +            object = alloc_from_pcs(s, gfpflags);
> >
> > The node to use is determined in __slab_alloc_node() only based on the
> > memory policy etc. NUMA_NO_NODE allocations can be redirected by memory
> > policies and this check disables it.
>
> To handle that, alloc_from_pcs() contains this:
>
> #ifdef CONFIG_NUMA
>         if (static_branch_unlikely(&strict_numa)) {
>                 if (current->mempolicy)
>                         return NULL;
>         }
> #endif
>
> And so there will be a fallback. It doesn't (currently) try to evaluate if
> the local node is compatible as this is before taking the local lock (and
> thus preventing migration).
>
>
> >> @@ -4653,7 +5483,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> >>      memcg_slab_free_hook(s, slab, &object, 1);
> >>      alloc_tagging_slab_free_hook(s, slab, &object, 1);
> >>
> >> -    if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> >> +    if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> >> +            return;
> >> +
> >> +    if (!s->cpu_sheaves || !free_to_pcs(s, object))
> >>              do_slab_free(s, slab, object, object, 1, addr);
> >>  }
> >
> > We free to pcs even if the object is remote?

Overall the patch LGTM but I would like to hear the answer to this
question too, please.

> >
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-05-06 17:32       ` Suren Baghdasaryan
@ 2025-05-06 23:11         ` Suren Baghdasaryan
  0 siblings, 0 replies; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-06 23:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Christoph Lameter (Ampere), Liam R. Howlett, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Tue, May 6, 2025 at 10:32 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Apr 28, 2025 at 12:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 4/25/25 19:31, Christoph Lameter (Ampere) wrote:
> > > On Fri, 25 Apr 2025, Vlastimil Babka wrote:
> > >
> > >> @@ -4195,7 +4793,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
> > >>      if (unlikely(object))
> > >>              goto out;
> > >>
> > >> -    object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> > >> +    if (s->cpu_sheaves && node == NUMA_NO_NODE)
> > >> +            object = alloc_from_pcs(s, gfpflags);
> > >
> > > The node to use is determined in __slab_alloc_node() only based on the
> > > memory policy etc. NUMA_NO_NODE allocations can be redirected by memory
> > > policies and this check disables it.
> >
> > To handle that, alloc_from_pcs() contains this:
> >
> > #ifdef CONFIG_NUMA
> >         if (static_branch_unlikely(&strict_numa)) {
> >                 if (current->mempolicy)
> >                         return NULL;
> >         }
> > #endif
> >
> > And so there will be a fallback. It doesn't (currently) try to evaluate if
> > the local node is compatible as this is before taking the local lock (and
> > thus preventing migration).
> >
> >
> > >> @@ -4653,7 +5483,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
> > >>      memcg_slab_free_hook(s, slab, &object, 1);
> > >>      alloc_tagging_slab_free_hook(s, slab, &object, 1);
> > >>
> > >> -    if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > >> +    if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> > >> +            return;
> > >> +
> > >> +    if (!s->cpu_sheaves || !free_to_pcs(s, object))
> > >>              do_slab_free(s, slab, object, object, 1, addr);
> > >>  }
> > >
> > > We free to pcs even if the object is remote?
>
> Overall the patch LGTM but I would like to hear the answer to this
> question too, please.

Ah, I reached the last patch and found the answer there:
https://lore.kernel.org/all/c60ae681-6027-0626-8d4e-5833982bf1f0@gentwo.org/

>
> > >
> >


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
  2025-04-25 17:31   ` Christoph Lameter (Ampere)
@ 2025-04-29  1:08   ` Harry Yoo
  2025-05-13 16:08     ` Vlastimil Babka
  2025-05-06 23:14   ` Suren Baghdasaryan
  2 siblings, 1 reply; 35+ messages in thread
From: Harry Yoo @ 2025-04-29  1:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
> 
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
> 
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
> 
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
> 
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
> 
> The sheaf_capacity value is exported in sysfs for observability.
> 
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were get from or put to the barn, the _fail variants
> count how many such requests could not be satisfied mainly  because the
> barn was either empty or full.

> While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate own
> counters.  Finally, there are sheaf_alloc/sheaf_free counters.

I initially thought we need counters for empty sheaves to see how many times
it grabs empty sheaves from the barn, but looks like barn_put
("put full sheaves to the barn") is effectively a proxy for that, right?

> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fallback. Each per-NUMA-node
> barn has a spin_lock.
> 
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

LGTM, with a few nits:

>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1044 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
> 	 * %NULL means no constructor.
> 	 */
> 	void (*ctor)(void *);
>	/**
>	 * @sheaf_capacity: Enable sheaves of given capacity for the cache.
>	 *
>	 * With a non-zero value, allocations from the cache go through caching
>	 * arrays called sheaves. Each cpu has a main sheaf that's always
>	 * present, and a spare sheaf thay may be not present. When both become
>	 * empty, there's an attempt to replace an empty sheaf with a full sheaf
>	 * from the per-node barn.
>	 *
>	 * When no full sheaf is available, and gfp flags allow blocking, a
>	 * sheaf is allocated and filled from slab(s) using bulk allocation.
>	 * Otherwise the allocation falls back to the normal operation
>	 * allocating a single object from a slab.
>	 *
>	 * Analogically when freeing and both percpu sheaves are full, the barn
>	 * may replace it with an empty sheaf, unless it's over capacity. In
>	 * that case a sheaf is bulk freed to slab pages.
>	 *
>	 * The sheaves do not enforce NUMA placement of objects, so allocations
>	 * via kmem_cache_alloc_node() with a node specified other than
>	 * NUMA_NO_NODE will bypass them.
>	 *
>	 * Bulk allocation and free operations also try to use the cpu sheaves
>	 * and barn, but fallback to using slab pages directly.
>	 *
>	 * When slub_debug is enabled for the cache, the sheaf_capacity argument
>	 * is ignored.
>	 *
>	 * %0 means no sheaves will be created

nit: created -> created. (with a full stop)

>	 */
>	unsigned int sheaf_capacity;

> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct slub_percpu_sheaves *pcs;
> +
> +		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +		/* can happen when unwinding failed create */
> +		if (!pcs->main)
> +			continue;
> +
> +		/*
> +		 * We have already passed __kmem_cache_shutdown() so everything
> +		 * was flushed and there should be no objects allocated from
> +		 * slabs, otherwise kmem_cache_destroy() would have aborted.
> +		 * Therefore something would have to be really wrong if the
> +		 * warnings here trigger, and we should rather leave bojects and

nit: bojects -> objects

> +		 * sheaves to leak in that case.
> +		 */
> +
> +		WARN_ON(pcs->spare);
> +
> +		if (!WARN_ON(pcs->main->size)) {
> +			free_empty_sheaf(s, pcs->main);
> +			pcs->main = NULL;
> +		}
> +	}
> +
> +	free_percpu(s->cpu_sheaves);
> +	s->cpu_sheaves = NULL;
> +}
> +
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to

nit: a empty sheaf -> an empty sheaf

> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
>
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  	discard_slab(s, slab);
>  }
>  
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +
> +restart:
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		return false;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +		struct slab_sheaf *empty;
> +
> +		if (!pcs->spare) {
> +			empty = barn_get_empty_sheaf(pcs->barn);
> +			if (empty) {
> +				pcs->spare = pcs->main;
> +				pcs->main = empty;
> +				goto do_free;
> +			}
> +			goto alloc_empty;
> +		}
> +
> +		if (pcs->spare->size < s->sheaf_capacity) {
> +			swap(pcs->main, pcs->spare);
> +			goto do_free;
> +		}
> +
> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +		if (!IS_ERR(empty)) {
> +			stat(s, BARN_PUT);
> +			pcs->main = empty;
> +			goto do_free;
> +		}

nit: stat(s, BARN_PUT_FAIL); should probably be here instead?

> +
> +		if (PTR_ERR(empty) == -E2BIG) {
> +			/* Since we got here, spare exists and is full */
> +			struct slab_sheaf *to_flush = pcs->spare;
> +
> +			stat(s, BARN_PUT_FAIL);
> +
> +			pcs->spare = NULL;
> +			local_unlock(&s->cpu_sheaves->lock);
> +
> +			sheaf_flush_unused(s, to_flush);
> +			empty = to_flush;
> +			goto got_empty;
> +		}

> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>  
>  	set_cpu_partial(s);
>  
> +	if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +		s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);

nit: Probably you want to disable sheaves on CONFIG_SLUB_TINY=y too?

> +		if (!s->cpu_sheaves) {
> +			err = -ENOMEM;
> +			goto out;
> +		}
> +		// TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +		s->sheaf_capacity = args->sheaf_capacity;
> +	}
> +

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-29  1:08   ` Harry Yoo
@ 2025-05-13 16:08     ` Vlastimil Babka
  0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-13 16:08 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 4/29/25 03:08, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:21AM +0200, Vlastimil Babka wrote:
>> Specifying a non-zero value for a new struct kmem_cache_args field
>> sheaf_capacity will setup a caching layer of percpu arrays called
>> sheaves of given capacity for the created cache.
>> 
>> Allocations from the cache will allocate via the percpu sheaves (main or
>> spare) as long as they have no NUMA node preference. Frees will also
>> put the object back into one of the sheaves.
>> 
>> When both percpu sheaves are found empty during an allocation, an empty
>> sheaf may be replaced with a full one from the per-node barn. If none
>> are available and the allocation is allowed to block, an empty sheaf is
>> refilled from slab(s) by an internal bulk alloc operation. When both
>> percpu sheaves are full during freeing, the barn can replace a full one
>> with an empty one, unless over a full sheaves limit. In that case a
>> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
>> sheaves and barns is also wired to the existing cpu flushing and cache
>> shrinking operations.
>> 
>> The sheaves do not distinguish NUMA locality of the cached objects. If
>> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
>> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
>> the sheaves are bypassed.
>> 
>> The bulk operations exposed to slab users also try to utilize the
>> sheaves as long as the necessary (full or empty) sheaves are available
>> on the cpu or in the barn. Once depleted, they will fallback to bulk
>> alloc/free to slabs directly to avoid double copying.
>> 
>> The sheaf_capacity value is exported in sysfs for observability.
>> 
>> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
>> count objects allocated or freed using the sheaves (and thus not
>> counting towards the other alloc/free path counters). Counters
>> sheaf_refill and sheaf_flush count objects filled or flushed from or to
>> slab pages, and can be used to assess how effective the caching is. The
>> refill and flush operations will also count towards the usual
>> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
>> the backing slabs.  For barn operations, barn_get and barn_put count how
>> many full sheaves were get from or put to the barn, the _fail variants
>> count how many such requests could not be satisfied mainly  because the
>> barn was either empty or full.
> 
>> While the barn also holds empty sheaves
>> to make some operations easier, these are not as critical to mandate own
>> counters.  Finally, there are sheaf_alloc/sheaf_free counters.
> 
> I initially thought we need counters for empty sheaves to see how many times
> it grabs empty sheaves from the barn, but looks like barn_put
> ("put full sheaves to the barn") is effectively a proxy for that, right?

Mostly yes, the free sheaves in barn is mainly to make the "replace full
with empty" easy, but if that fails because there's no empty sheaves, the
fallback with allocating an empty sheaf should still be successful enough
that tracking it in detail doesn't seem that useful.

>> Access to the percpu sheaves is protected by local_trylock() when
>> potential callers include irq context, and local_lock() otherwise (such
>> as when we already know the gfp flags allow blocking). The trylock
>> failures should be rare and we can easily fallback. Each per-NUMA-node
>> barn has a spin_lock.
>> 
>> When slub_debug is enabled for a cache with sheaf_capacity also
>> specified, the latter is ignored so that allocations and frees reach the
>> slow path where debugging hooks are processed.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
> 
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Thanks!

> LGTM, with a few nits:

I've applied them, thanks. Responding only to one that needs it:

>> +static __fastpath_inline
>> +bool free_to_pcs(struct kmem_cache *s, void *object)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +
>> +restart:
>> +	if (!local_trylock(&s->cpu_sheaves->lock))
>> +		return false;
>> +
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (unlikely(pcs->main->size == s->sheaf_capacity)) {
>> +
>> +		struct slab_sheaf *empty;
>> +
>> +		if (!pcs->spare) {
>> +			empty = barn_get_empty_sheaf(pcs->barn);
>> +			if (empty) {
>> +				pcs->spare = pcs->main;
>> +				pcs->main = empty;
>> +				goto do_free;
>> +			}
>> +			goto alloc_empty;
>> +		}
>> +
>> +		if (pcs->spare->size < s->sheaf_capacity) {
>> +			swap(pcs->main, pcs->spare);
>> +			goto do_free;
>> +		}
>> +
>> +		empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
>> +
>> +		if (!IS_ERR(empty)) {
>> +			stat(s, BARN_PUT);
>> +			pcs->main = empty;
>> +			goto do_free;
>> +		}
> 
> nit: stat(s, BARN_PUT_FAIL); should probably be here instead?

Hm, the intention was that no, because when PTR_ERR(empty) == -ENOMEM, we
try alloc_empty_sheaf(), and that will likely succeed, and then
__pcs_install_empty_sheaf() will just force the put full sheaf (and record a
BARN_PUT), because we already saw that we're not over capacity. But now I
see I didn't describe it as a scenario for the function's comment, so I will
add that.

But technically we should also record stat(s, BARN_PUT_FAIL) when that
alloc_empty_sheaf() fails, but not when we "goto alloc_empty" from the "no
spare" above. Bit icky but I'll add that too.

>> +
>> +		if (PTR_ERR(empty) == -E2BIG) {
>> +			/* Since we got here, spare exists and is full */
>> +			struct slab_sheaf *to_flush = pcs->spare;
>> +
>> +			stat(s, BARN_PUT_FAIL);
>> +
>> +			pcs->spare = NULL;
>> +			local_unlock(&s->cpu_sheaves->lock);
>> +
>> +			sheaf_flush_unused(s, to_flush);
>> +			empty = to_flush;
>> +			goto got_empty;
>> +		}
> 




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
  2025-04-25 17:31   ` Christoph Lameter (Ampere)
  2025-04-29  1:08   ` Harry Yoo
@ 2025-05-06 23:14   ` Suren Baghdasaryan
  2025-05-14 13:06     ` Vlastimil Babka
  2 siblings, 1 reply; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-06 23:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Specifying a non-zero value for a new struct kmem_cache_args field
> sheaf_capacity will setup a caching layer of percpu arrays called
> sheaves of given capacity for the created cache.
>
> Allocations from the cache will allocate via the percpu sheaves (main or
> spare) as long as they have no NUMA node preference. Frees will also
> put the object back into one of the sheaves.
>
> When both percpu sheaves are found empty during an allocation, an empty
> sheaf may be replaced with a full one from the per-node barn. If none
> are available and the allocation is allowed to block, an empty sheaf is
> refilled from slab(s) by an internal bulk alloc operation. When both
> percpu sheaves are full during freeing, the barn can replace a full one
> with an empty one, unless over a full sheaves limit. In that case a
> sheaf is flushed to slab(s) by an internal bulk free operation. Flushing
> sheaves and barns is also wired to the existing cpu flushing and cache
> shrinking operations.
>
> The sheaves do not distinguish NUMA locality of the cached objects. If
> an allocation is requested with kmem_cache_alloc_node() (or a mempolicy
> with strict_numa mode enabled) with a specific node (not NUMA_NO_NODE),
> the sheaves are bypassed.
>
> The bulk operations exposed to slab users also try to utilize the
> sheaves as long as the necessary (full or empty) sheaves are available
> on the cpu or in the barn. Once depleted, they will fallback to bulk
> alloc/free to slabs directly to avoid double copying.
>
> The sheaf_capacity value is exported in sysfs for observability.
>
> Sysfs CONFIG_SLUB_STATS counters alloc_cpu_sheaf and free_cpu_sheaf
> count objects allocated or freed using the sheaves (and thus not
> counting towards the other alloc/free path counters). Counters
> sheaf_refill and sheaf_flush count objects filled or flushed from or to
> slab pages, and can be used to assess how effective the caching is. The
> refill and flush operations will also count towards the usual
> alloc_fastpath/slowpath, free_fastpath/slowpath and other counters for
> the backing slabs.  For barn operations, barn_get and barn_put count how
> many full sheaves were get from or put to the barn, the _fail variants
> count how many such requests could not be satisfied mainly  because the
> barn was either empty or full. While the barn also holds empty sheaves
> to make some operations easier, these are not as critical to mandate own
> counters.  Finally, there are sheaf_alloc/sheaf_free counters.
>
> Access to the percpu sheaves is protected by local_trylock() when
> potential callers include irq context, and local_lock() otherwise (such
> as when we already know the gfp flags allow blocking). The trylock
> failures should be rare and we can easily fallback. Each per-NUMA-node
> barn has a spin_lock.
>
> When slub_debug is enabled for a cache with sheaf_capacity also
> specified, the latter is ignored so that allocations and frees reach the
> slow path where debugging hooks are processed.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

One nit which is barely worth mentioning.

> ---
>  include/linux/slab.h |   31 ++
>  mm/slab.h            |    2 +
>  mm/slab_common.c     |    5 +-
>  mm/slub.c            | 1053 +++++++++++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 1044 insertions(+), 47 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035cf3e3d9043e3b038e1bebeff05b52..4cb495d55fc58c70a992ee4782d7990ce1c55dc6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -335,6 +335,37 @@ struct kmem_cache_args {
>          * %NULL means no constructor.
>          */
>         void (*ctor)(void *);
> +       /**
> +        * @sheaf_capacity: Enable sheaves of given capacity for the cache.
> +        *
> +        * With a non-zero value, allocations from the cache go through caching
> +        * arrays called sheaves. Each cpu has a main sheaf that's always
> +        * present, and a spare sheaf thay may be not present. When both become
> +        * empty, there's an attempt to replace an empty sheaf with a full sheaf
> +        * from the per-node barn.
> +        *
> +        * When no full sheaf is available, and gfp flags allow blocking, a
> +        * sheaf is allocated and filled from slab(s) using bulk allocation.
> +        * Otherwise the allocation falls back to the normal operation
> +        * allocating a single object from a slab.
> +        *
> +        * Analogically when freeing and both percpu sheaves are full, the barn
> +        * may replace it with an empty sheaf, unless it's over capacity. In
> +        * that case a sheaf is bulk freed to slab pages.
> +        *
> +        * The sheaves do not enforce NUMA placement of objects, so allocations
> +        * via kmem_cache_alloc_node() with a node specified other than
> +        * NUMA_NO_NODE will bypass them.
> +        *
> +        * Bulk allocation and free operations also try to use the cpu sheaves
> +        * and barn, but fallback to using slab pages directly.
> +        *
> +        * When slub_debug is enabled for the cache, the sheaf_capacity argument
> +        * is ignored.
> +        *
> +        * %0 means no sheaves will be created
> +        */
> +       unsigned int sheaf_capacity;
>  };
>
>  struct kmem_cache *__kmem_cache_create_args(const char *name,
> diff --git a/mm/slab.h b/mm/slab.h
> index 05a21dc796e095e8db934564d559494cd81746ec..1980330c2fcb4a4613a7e4f7efc78b349993fd89 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -259,6 +259,7 @@ struct kmem_cache {
>  #ifndef CONFIG_SLUB_TINY
>         struct kmem_cache_cpu __percpu *cpu_slab;
>  #endif
> +       struct slub_percpu_sheaves __percpu *cpu_sheaves;
>         /* Used for retrieving partial slabs, etc. */
>         slab_flags_t flags;
>         unsigned long min_partial;
> @@ -272,6 +273,7 @@ struct kmem_cache {
>         /* Number of per cpu partial slabs to keep around */
>         unsigned int cpu_partial_slabs;
>  #endif
> +       unsigned int sheaf_capacity;
>         struct kmem_cache_order_objects oo;
>
>         /* Allocation and freeing of slabs */
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 5be257e03c7c930b5ca16dd92f790604cc5767ac..4f295bdd2d42355af6311a799955301005f8a532 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -163,6 +163,9 @@ int slab_unmergeable(struct kmem_cache *s)
>                 return 1;
>  #endif
>
> +       if (s->cpu_sheaves)
> +               return 1;
> +
>         /*
>          * We may have set a slab to be unmergeable during bootstrap.
>          */
> @@ -321,7 +324,7 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
>                     object_size - args->usersize < args->useroffset))
>                 args->usersize = args->useroffset = 0;
>
> -       if (!args->usersize)
> +       if (!args->usersize && !args->sheaf_capacity)
>                 s = __kmem_cache_alias(name, object_size, args->align, flags,
>                                        args->ctor);
>         if (s)
> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d269b5d362cb5bc44f824640ffd00f3..ae3e80ad9926ca15601eef2f2aa016ca059498f8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -346,8 +346,10 @@ static inline void debugfs_slab_add(struct kmem_cache *s) { }
>  #endif
>
>  enum stat_item {
> +       ALLOC_PCS,              /* Allocation from percpu sheaf */
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
> +       FREE_PCS,               /* Free to percpu sheaf */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -372,6 +374,14 @@ enum stat_item {
>         CPU_PARTIAL_FREE,       /* Refill cpu partial on free */
>         CPU_PARTIAL_NODE,       /* Refill cpu partial from node partial */
>         CPU_PARTIAL_DRAIN,      /* Drain cpu partial to node partial */
> +       SHEAF_FLUSH,            /* Objects flushed from a sheaf */
> +       SHEAF_REFILL,           /* Objects refilled to a sheaf */
> +       SHEAF_ALLOC,            /* Allocation of an empty sheaf */
> +       SHEAF_FREE,             /* Freeing of an empty sheaf */
> +       BARN_GET,               /* Got full sheaf from barn */
> +       BARN_GET_FAIL,          /* Failed to get full sheaf from barn */
> +       BARN_PUT,               /* Put full sheaf to barn */
> +       BARN_PUT_FAIL,          /* Failed to put full sheaf to barn */
>         NR_SLUB_STAT_ITEMS
>  };
>
> @@ -418,6 +428,33 @@ void stat_add(const struct kmem_cache *s, enum stat_item si, int v)
>  #endif
>  }
>
> +#define MAX_FULL_SHEAVES       10
> +#define MAX_EMPTY_SHEAVES      10
> +
> +struct node_barn {
> +       spinlock_t lock;
> +       struct list_head sheaves_full;
> +       struct list_head sheaves_empty;
> +       unsigned int nr_full;
> +       unsigned int nr_empty;
> +};
> +
> +struct slab_sheaf {
> +       union {
> +               struct rcu_head rcu_head;
> +               struct list_head barn_list;
> +       };
> +       unsigned int size;
> +       void *objects[];
> +};
> +
> +struct slub_percpu_sheaves {
> +       local_trylock_t lock;
> +       struct slab_sheaf *main; /* never NULL when unlocked */
> +       struct slab_sheaf *spare; /* empty or full, may be NULL */
> +       struct node_barn *barn;
> +};
> +
>  /*
>   * The slab lists for all objects.
>   */
> @@ -430,6 +467,7 @@ struct kmem_cache_node {
>         atomic_long_t total_objects;
>         struct list_head full;
>  #endif
> +       struct node_barn *barn;
>  };
>
>  static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> @@ -453,12 +491,19 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
>   */
>  static nodemask_t slab_nodes;
>
> -#ifndef CONFIG_SLUB_TINY
>  /*
>   * Workqueue used for flush_cpu_slab().
>   */
>  static struct workqueue_struct *flushwq;
> -#endif
> +
> +struct slub_flush_work {
> +       struct work_struct work;
> +       struct kmem_cache *s;
> +       bool skip;
> +};
> +
> +static DEFINE_MUTEX(flush_lock);
> +static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
>
>  /********************************************************************
>   *                     Core slab cache functions
> @@ -2454,6 +2499,359 @@ static void *setup_object(struct kmem_cache *s, void *object)
>         return object;
>  }
>
> +static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slab_sheaf *sheaf = kzalloc(struct_size(sheaf, objects,
> +                                       s->sheaf_capacity), gfp);
> +
> +       if (unlikely(!sheaf))
> +               return NULL;
> +
> +       stat(s, SHEAF_ALLOC);
> +
> +       return sheaf;
> +}
> +
> +static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +       kfree(sheaf);
> +
> +       stat(s, SHEAF_FREE);
> +}
> +
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> +                                  size_t size, void **p);
> +
> +
> +static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
> +                        gfp_t gfp)
> +{
> +       int to_fill = s->sheaf_capacity - sheaf->size;
> +       int filled;
> +
> +       if (!to_fill)
> +               return 0;
> +
> +       filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
> +                                        &sheaf->objects[sheaf->size]);
> +
> +       sheaf->size += filled;
> +
> +       stat_add(s, SHEAF_REFILL, filled);
> +
> +       if (filled < to_fill)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +
> +static struct slab_sheaf *alloc_full_sheaf(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slab_sheaf *sheaf = alloc_empty_sheaf(s, gfp);
> +
> +       if (!sheaf)
> +               return NULL;
> +
> +       if (refill_sheaf(s, sheaf, gfp)) {
> +               free_empty_sheaf(s, sheaf);
> +               return NULL;
> +       }
> +
> +       return sheaf;
> +}
> +
> +/*
> + * Maximum number of objects freed during a single flush of main pcs sheaf.
> + * Translates directly to an on-stack array size.
> + */
> +#define PCS_BATCH_MAX  32U
> +
> +static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
> +
> +/*
> + * Free all objects from the main sheaf. In order to perform
> + * __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
> + * object pointers are moved to a on-stack array under the lock. To bound the
> + * stack usage, limit each batch to PCS_BATCH_MAX.
> + *
> + * returns true if at least partially flushed
> + */
> +static bool sheaf_flush_main(struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       unsigned int batch, remaining;
> +       void *objects[PCS_BATCH_MAX];
> +       struct slab_sheaf *sheaf;
> +       bool ret = false;
> +
> +next_batch:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return ret;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +       sheaf = pcs->main;
> +
> +       batch = min(PCS_BATCH_MAX, sheaf->size);
> +
> +       sheaf->size -= batch;
> +       memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
> +
> +       remaining = sheaf->size;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       __kmem_cache_free_bulk(s, batch, &objects[0]);
> +
> +       stat_add(s, SHEAF_FLUSH, batch);
> +
> +       ret = true;
> +
> +       if (remaining)
> +               goto next_batch;
> +
> +       return ret;
> +}
> +
> +/*
> + * Free all objects from a sheaf that's unused, i.e. not linked to any
> + * cpu_sheaves, so we need no locking and batching. The locking is also not
> + * necessary when flushing cpu's sheaves (both spare and main) during cpu
> + * hotremove as the cpu is not executing anymore.
> + */
> +static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +       if (!sheaf->size)
> +               return;
> +
> +       stat_add(s, SHEAF_FLUSH, sheaf->size);
> +
> +       __kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
> +
> +       sheaf->size = 0;
> +}
> +
> +/*
> + * Caller needs to make sure migration is disabled in order to fully flush
> + * single cpu's sheaves
> + *
> + * must not be called from an irq
> + *
> + * flushing operations are rare so let's keep it simple and flush to slabs
> + * directly, skipping the barn
> + */
> +static void pcs_flush_all(struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *spare;
> +
> +       local_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       spare = pcs->spare;
> +       pcs->spare = NULL;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       if (spare) {
> +               sheaf_flush_unused(s, spare);
> +               free_empty_sheaf(s, spare);
> +       }
> +
> +       sheaf_flush_main(s);
> +}
> +
> +static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +       /* The cpu is not executing anymore so we don't need pcs->lock */
> +       sheaf_flush_unused(s, pcs->main);
> +       if (pcs->spare) {
> +               sheaf_flush_unused(s, pcs->spare);
> +               free_empty_sheaf(s, pcs->spare);
> +               pcs->spare = NULL;
> +       }
> +}
> +
> +static void pcs_destroy(struct kmem_cache *s)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_sheaves *pcs;
> +
> +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +               /* can happen when unwinding failed create */
> +               if (!pcs->main)
> +                       continue;
> +
> +               /*
> +                * We have already passed __kmem_cache_shutdown() so everything
> +                * was flushed and there should be no objects allocated from
> +                * slabs, otherwise kmem_cache_destroy() would have aborted.
> +                * Therefore something would have to be really wrong if the
> +                * warnings here trigger, and we should rather leave bojects and
> +                * sheaves to leak in that case.
> +                */
> +
> +               WARN_ON(pcs->spare);
> +
> +               if (!WARN_ON(pcs->main->size)) {
> +                       free_empty_sheaf(s, pcs->main);
> +                       pcs->main = NULL;
> +               }
> +       }
> +
> +       free_percpu(s->cpu_sheaves);
> +       s->cpu_sheaves = NULL;
> +}
> +
> +static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
> +{
> +       struct slab_sheaf *empty = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_empty) {
> +               empty = list_first_entry(&barn->sheaves_empty,
> +                                        struct slab_sheaf, barn_list);
> +               list_del(&empty->barn_list);
> +               barn->nr_empty--;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return empty;
> +}
> +
> +/*
> + * The following two functions are used mainly in cases where we have to undo an
> + * intended action due to a race or cpu migration. Thus they do not check the
> + * empty or full sheaf limits for simplicity.
> + */
> +
> +static void barn_put_empty_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_add(&sheaf->barn_list, &barn->sheaves_empty);
> +       barn->nr_empty++;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +}
> +
> +static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
> +{
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_add(&sheaf->barn_list, &barn->sheaves_full);
> +       barn->nr_full++;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +}
> +
> +/*
> + * If a full sheaf is available, return it and put the supplied empty one to
> + * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> + * change.
> + */
> +static struct slab_sheaf *
> +barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
> +{
> +       struct slab_sheaf *full = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full) {
> +               full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> +                                       barn_list);
> +               list_del(&full->barn_list);
> +               list_add(&empty->barn_list, &barn->sheaves_empty);
> +               barn->nr_full--;
> +               barn->nr_empty++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return full;
> +}
> +/*
> + * If a empty sheaf is available, return it and put the supplied full one to
> + * barn. But if there are too many full sheaves, reject this with -E2BIG.
> + */
> +static struct slab_sheaf *
> +barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
> +{
> +       struct slab_sheaf *empty;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full >= MAX_FULL_SHEAVES) {
> +               empty = ERR_PTR(-E2BIG);
> +       } else if (!barn->nr_empty) {
> +               empty = ERR_PTR(-ENOMEM);
> +       } else {
> +               empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
> +                                        barn_list);
> +               list_del(&empty->barn_list);
> +               list_add(&full->barn_list, &barn->sheaves_full);
> +               barn->nr_empty--;
> +               barn->nr_full++;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return empty;
> +}
> +
> +static void barn_init(struct node_barn *barn)
> +{
> +       spin_lock_init(&barn->lock);
> +       INIT_LIST_HEAD(&barn->sheaves_full);
> +       INIT_LIST_HEAD(&barn->sheaves_empty);
> +       barn->nr_full = 0;
> +       barn->nr_empty = 0;
> +}
> +
> +static void barn_shrink(struct kmem_cache *s, struct node_barn *barn)
> +{
> +       struct list_head empty_list;
> +       struct list_head full_list;
> +       struct slab_sheaf *sheaf, *sheaf2;
> +       unsigned long flags;
> +
> +       INIT_LIST_HEAD(&empty_list);
> +       INIT_LIST_HEAD(&full_list);
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       list_splice_init(&barn->sheaves_full, &full_list);
> +       barn->nr_full = 0;
> +       list_splice_init(&barn->sheaves_empty, &empty_list);
> +       barn->nr_empty = 0;
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       list_for_each_entry_safe(sheaf, sheaf2, &full_list, barn_list) {
> +               sheaf_flush_unused(s, sheaf);
> +               free_empty_sheaf(s, sheaf);
> +       }
> +
> +       list_for_each_entry_safe(sheaf, sheaf2, &empty_list, barn_list)
> +               free_empty_sheaf(s, sheaf);
> +}
> +
>  /*
>   * Slab allocation and freeing
>   */
> @@ -3325,11 +3723,42 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>         put_partials_cpu(s, c);
>  }
>
> -struct slub_flush_work {
> -       struct work_struct work;
> -       struct kmem_cache *s;
> -       bool skip;
> -};
> +static inline void flush_this_cpu_slab(struct kmem_cache *s)
> +{
> +       struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
> +
> +       if (c->slab)
> +               flush_slab(s, c);
> +
> +       put_partials(s);
> +}
> +
> +static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> +{
> +       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> +
> +       return c->slab || slub_percpu_partial(c);
> +}
> +
> +#else /* CONFIG_SLUB_TINY */
> +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> +static inline bool has_cpu_slab(int cpu, struct kmem_cache *s) { return false; }
> +static inline void flush_this_cpu_slab(struct kmem_cache *s) { }
> +#endif /* CONFIG_SLUB_TINY */
> +
> +static bool has_pcs_used(int cpu, struct kmem_cache *s)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +       if (!s->cpu_sheaves)
> +               return false;
> +
> +       pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +       return (pcs->spare || pcs->main->size);
> +}
> +
> +static void pcs_flush_all(struct kmem_cache *s);
>
>  /*
>   * Flush cpu slab.
> @@ -3339,30 +3768,18 @@ struct slub_flush_work {
>  static void flush_cpu_slab(struct work_struct *w)
>  {
>         struct kmem_cache *s;
> -       struct kmem_cache_cpu *c;
>         struct slub_flush_work *sfw;
>
>         sfw = container_of(w, struct slub_flush_work, work);
>
>         s = sfw->s;
> -       c = this_cpu_ptr(s->cpu_slab);
>
> -       if (c->slab)
> -               flush_slab(s, c);
> +       if (s->cpu_sheaves)
> +               pcs_flush_all(s);
>
> -       put_partials(s);
> -}
> -
> -static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> -{
> -       struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
> -
> -       return c->slab || slub_percpu_partial(c);
> +       flush_this_cpu_slab(s);
>  }
>
> -static DEFINE_MUTEX(flush_lock);
> -static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
> -
>  static void flush_all_cpus_locked(struct kmem_cache *s)
>  {
>         struct slub_flush_work *sfw;
> @@ -3373,7 +3790,7 @@ static void flush_all_cpus_locked(struct kmem_cache *s)
>
>         for_each_online_cpu(cpu) {
>                 sfw = &per_cpu(slub_flush, cpu);
> -               if (!has_cpu_slab(cpu, s)) {
> +               if (!has_cpu_slab(cpu, s) && !has_pcs_used(cpu, s)) {
>                         sfw->skip = true;
>                         continue;
>                 }
> @@ -3409,19 +3826,15 @@ static int slub_cpu_dead(unsigned int cpu)
>         struct kmem_cache *s;
>
>         mutex_lock(&slab_mutex);
> -       list_for_each_entry(s, &slab_caches, list)
> +       list_for_each_entry(s, &slab_caches, list) {
>                 __flush_cpu_slab(s, cpu);
> +               if (s->cpu_sheaves)
> +                       __pcs_flush_all_cpu(s, cpu);
> +       }
>         mutex_unlock(&slab_mutex);
>         return 0;
>  }
>
> -#else /* CONFIG_SLUB_TINY */
> -static inline void flush_all_cpus_locked(struct kmem_cache *s) { }
> -static inline void flush_all(struct kmem_cache *s) { }
> -static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { }
> -static inline int slub_cpu_dead(unsigned int cpu) { return 0; }
> -#endif /* CONFIG_SLUB_TINY */
> -
>  /*
>   * Check if the objects in a per cpu structure fit numa
>   * locality expectations.
> @@ -4171,6 +4584,191 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>         return memcg_slab_post_alloc_hook(s, lru, flags, size, p);
>  }
>
> +static __fastpath_inline
> +void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       void *object;
> +
> +#ifdef CONFIG_NUMA
> +       if (static_branch_unlikely(&strict_numa)) {
> +               if (current->mempolicy)
> +                       return NULL;
> +       }
> +#endif
> +
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return NULL;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == 0)) {
> +
> +               struct slab_sheaf *empty = NULL;
> +               struct slab_sheaf *full;
> +               bool can_alloc;
> +
> +               if (pcs->spare && pcs->spare->size > 0) {
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_alloc;
> +               }
> +
> +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> +
> +               if (full) {
> +                       stat(s, BARN_GET);
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               stat(s, BARN_GET_FAIL);
> +
> +               can_alloc = gfpflags_allow_blocking(gfp);
> +
> +               if (can_alloc) {
> +                       if (pcs->spare) {
> +                               empty = pcs->spare;
> +                               pcs->spare = NULL;
> +                       } else {
> +                               empty = barn_get_empty_sheaf(pcs->barn);
> +                       }
> +               }
> +
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               if (!can_alloc)
> +                       return NULL;
> +
> +               if (empty) {
> +                       if (!refill_sheaf(s, empty, gfp)) {
> +                               full = empty;
> +                       } else {
> +                               /*
> +                                * we must be very low on memory so don't bother
> +                                * with the barn
> +                                */
> +                               free_empty_sheaf(s, empty);
> +                       }
> +               } else {
> +                       full = alloc_full_sheaf(s, gfp);
> +               }
> +
> +               if (!full)
> +                       return NULL;
> +
> +               /*
> +                * we can reach here only when gfpflags_allow_blocking
> +                * so this must not be an irq
> +                */
> +               local_lock(&s->cpu_sheaves->lock);
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +               /*
> +                * If we are returning empty sheaf, we either got it from the
> +                * barn or had to allocate one. If we are returning a full
> +                * sheaf, it's due to racing or being migrated to a different
> +                * cpu. Breaching the barn's sheaf limits should be thus rare
> +                * enough so just ignore them to simplify the recovery.
> +                */
> +
> +               if (pcs->main->size == 0) {
> +                       barn_put_empty_sheaf(pcs->barn, pcs->main);
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               if (!pcs->spare) {
> +                       pcs->spare = full;
> +                       goto do_alloc;
> +               }
> +
> +               if (pcs->spare->size == 0) {
> +                       barn_put_empty_sheaf(pcs->barn, pcs->spare);
> +                       pcs->spare = full;
> +                       goto do_alloc;
> +               }
> +
> +               barn_put_full_sheaf(pcs->barn, full);
> +               stat(s, BARN_PUT);
> +       }
> +
> +do_alloc:
> +       object = pcs->main->objects[--pcs->main->size];
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat(s, ALLOC_PCS);
> +
> +       return object;
> +}
> +
> +static __fastpath_inline
> +unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *main;
> +       unsigned int allocated = 0;
> +       unsigned int batch;
> +
> +next_batch:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return allocated;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == 0)) {
> +
> +               struct slab_sheaf *full;
> +
> +               if (pcs->spare && pcs->spare->size > 0) {
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_alloc;
> +               }
> +
> +               full = barn_replace_empty_sheaf(pcs->barn, pcs->main);
> +
> +               if (full) {
> +                       stat(s, BARN_GET);
> +                       pcs->main = full;
> +                       goto do_alloc;
> +               }
> +
> +               stat(s, BARN_GET_FAIL);
> +
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               /*
> +                * Once full sheaves in barn are depleted, let the bulk
> +                * allocation continue from slab pages, otherwise we would just
> +                * be copying arrays of pointers twice.
> +                */
> +               return allocated;
> +       }
> +
> +do_alloc:
> +
> +       main = pcs->main;
> +       batch = min(size, main->size);
> +
> +       main->size -= batch;
> +       memcpy(p, main->objects + main->size, batch * sizeof(void *));
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat_add(s, ALLOC_PCS, batch);
> +
> +       allocated += batch;
> +
> +       if (batch < size) {
> +               p += batch;
> +               size -= batch;
> +               goto next_batch;
> +       }
> +
> +       return allocated;
> +}
> +
> +
>  /*
>   * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
>   * have the fastpath folded into their functions. So no function call
> @@ -4195,7 +4793,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
>         if (unlikely(object))
>                 goto out;
>
> -       object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
> +       if (s->cpu_sheaves && node == NUMA_NO_NODE)
> +               object = alloc_from_pcs(s, gfpflags);
> +
> +       if (!object)
> +               object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
>
>         maybe_wipe_obj_freeptr(s, object);
>         init = slab_want_init_on_alloc(gfpflags, s);
> @@ -4567,6 +5169,234 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>         discard_slab(s, slab);
>  }
>
> +/*
> + * pcs is locked. We should have get rid of the spare sheaf and obtained an
> + * empty sheaf, while the main sheaf is full. We want to install the empty sheaf
> + * as a main sheaf, and make the current main sheaf a spare sheaf.
> + *
> + * However due to having relinquished the cpu_sheaves lock when obtaining
> + * the empty sheaf, we need to handle some unlikely but possible cases.
> + *
> + * If we put any sheaf to barn here, it's because we were interrupted or have
> + * been migrated to a different cpu, which should be rare enough so just ignore
> + * the barn's limits to simplify the handling.
> + */
> +static void __pcs_install_empty_sheaf(struct kmem_cache *s,
> +               struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty)
> +{
> +       /* this is what we expect to find if nobody interrupted us */
> +       if (likely(!pcs->spare)) {
> +               pcs->spare = pcs->main;
> +               pcs->main = empty;
> +               return;
> +       }
> +
> +       /*
> +        * Unlikely because if the main sheaf had space, we would have just
> +        * freed to it. Get rid of our empty sheaf.
> +        */
> +       if (pcs->main->size < s->sheaf_capacity) {
> +               barn_put_empty_sheaf(pcs->barn, empty);
> +               return;
> +       }
> +
> +       /* Also unlikely for the same reason */
> +       if (pcs->spare->size < s->sheaf_capacity) {
> +               swap(pcs->main, pcs->spare);
> +               barn_put_empty_sheaf(pcs->barn, empty);
> +               return;
> +       }
> +
> +       barn_put_full_sheaf(pcs->barn, pcs->main);
> +       stat(s, BARN_PUT);
> +       pcs->main = empty;
> +}
> +
> +/*
> + * Free an object to the percpu sheaves.
> + * The object is expected to have passed slab_free_hook() already.
> + */
> +static __fastpath_inline
> +bool free_to_pcs(struct kmem_cache *s, void *object)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +
> +restart:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               return false;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(pcs->main->size == s->sheaf_capacity)) {
> +
> +               struct slab_sheaf *empty;
> +
> +               if (!pcs->spare) {
> +                       empty = barn_get_empty_sheaf(pcs->barn);
> +                       if (empty) {
> +                               pcs->spare = pcs->main;
> +                               pcs->main = empty;
> +                               goto do_free;
> +                       }
> +                       goto alloc_empty;
> +               }
> +
> +               if (pcs->spare->size < s->sheaf_capacity) {
> +                       swap(pcs->main, pcs->spare);
> +                       goto do_free;
> +               }
> +
> +               empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +
> +               if (!IS_ERR(empty)) {
> +                       stat(s, BARN_PUT);
> +                       pcs->main = empty;
> +                       goto do_free;
> +               }
> +
> +               if (PTR_ERR(empty) == -E2BIG) {
> +                       /* Since we got here, spare exists and is full */
> +                       struct slab_sheaf *to_flush = pcs->spare;
> +
> +                       stat(s, BARN_PUT_FAIL);
> +
> +                       pcs->spare = NULL;
> +                       local_unlock(&s->cpu_sheaves->lock);
> +
> +                       sheaf_flush_unused(s, to_flush);
> +                       empty = to_flush;
> +                       goto got_empty;
> +               }
> +
> +alloc_empty:
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +               if (!empty) {
> +                       if (sheaf_flush_main(s))
> +                               goto restart;
> +                       else
> +                               return false;
> +               }
> +
> +got_empty:
> +               if (!local_trylock(&s->cpu_sheaves->lock)) {
> +                       struct node_barn *barn;
> +
> +                       barn = get_node(s, numa_mem_id())->barn;
> +
> +                       barn_put_empty_sheaf(barn, empty);
> +                       return false;
> +               }
> +
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +               __pcs_install_empty_sheaf(s, pcs, empty);
> +       }
> +
> +do_free:
> +       pcs->main->objects[pcs->main->size++] = object;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat(s, FREE_PCS);
> +
> +       return true;
> +}
> +
> +/*
> + * Bulk free objects to the percpu sheaves.
> + * Unlike free_to_pcs() this includes the calls to all necessary hooks
> + * and the fallback to freeing to slab pages.
> + */
> +static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *main, *empty;
> +       unsigned int batch, i = 0;
> +       bool init;
> +
> +       init = slab_want_init_on_free(s);
> +
> +       while (i < size) {
> +               struct slab *slab = virt_to_slab(p[i]);
> +
> +               memcg_slab_free_hook(s, slab, p + i, 1);
> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> +
> +               if (unlikely(!slab_free_hook(s, p[i], init, false))) {
> +                       p[i] = p[--size];
> +                       if (!size)
> +                               return;
> +                       continue;
> +               }
> +
> +               i++;
> +       }
> +
> +next_batch:
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               goto fallback;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (likely(pcs->main->size < s->sheaf_capacity))
> +               goto do_free;
> +
> +       if (!pcs->spare) {
> +               empty = barn_get_empty_sheaf(pcs->barn);
> +               if (!empty)
> +                       goto no_empty;
> +
> +               pcs->spare = pcs->main;
> +               pcs->main = empty;
> +               goto do_free;
> +       }
> +
> +       if (pcs->spare->size < s->sheaf_capacity) {
> +               swap(pcs->main, pcs->spare);
> +               goto do_free;
> +       }
> +
> +       empty = barn_replace_full_sheaf(pcs->barn, pcs->main);
> +       if (IS_ERR(empty)) {
> +               stat(s, BARN_PUT_FAIL);
> +               goto no_empty;
> +       }
> +
> +       stat(s, BARN_PUT);
> +       pcs->main = empty;
> +
> +do_free:
> +       main = pcs->main;
> +       batch = min(size, s->sheaf_capacity - main->size);
> +
> +       memcpy(main->objects + main->size, p, batch * sizeof(void *));
> +       main->size += batch;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       stat_add(s, FREE_PCS, batch);
> +
> +       if (batch < size) {
> +               p += batch;
> +               size -= batch;
> +               goto next_batch;
> +       }
> +
> +       return;
> +
> +no_empty:
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       /*
> +        * if we depleted all empty sheaves in the barn or there are too
> +        * many full sheaves, free the rest to slab pages
> +        */
> +fallback:
> +       __kmem_cache_free_bulk(s, size, p);
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -4653,7 +5483,10 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>         memcg_slab_free_hook(s, slab, &object, 1);
>         alloc_tagging_slab_free_hook(s, slab, &object, 1);
>
> -       if (likely(slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +       if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
> +               return;
> +
> +       if (!s->cpu_sheaves || !free_to_pcs(s, object))
>                 do_slab_free(s, slab, object, object, 1, addr);
>  }
>
> @@ -5247,6 +6080,15 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>         if (!size)
>                 return;
>
> +       /*
> +        * freeing to sheaves is so incompatible with the detached freelist so
> +        * once we go that way, we have to do everything differently
> +        */
> +       if (s && s->cpu_sheaves) {
> +               free_to_pcs_bulk(s, size, p);
> +               return;
> +       }
> +
>         do {
>                 struct detached_freelist df;
>
> @@ -5365,7 +6207,7 @@ static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>  int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>                                  void **p)
>  {
> -       int i;
> +       unsigned int i = 0;
>
>         if (!size)
>                 return 0;
> @@ -5374,9 +6216,21 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>         if (unlikely(!s))
>                 return 0;
>
> -       i = __kmem_cache_alloc_bulk(s, flags, size, p);
> -       if (unlikely(i == 0))
> -               return 0;
> +       if (s->cpu_sheaves)
> +               i = alloc_from_pcs_bulk(s, size, p);
> +
> +       if (i < size) {
> +               unsigned int j = __kmem_cache_alloc_bulk(s, flags, size - i, p + i);

nit: this nondescript `j` variable can be eliminated:

if (unlikely(__kmem_cache_alloc_bulk(s, flags, size - i, p + i) == 0))


> +               /*
> +                * If we ran out of memory, don't bother with freeing back to
> +                * the percpu sheaves, we have bigger problems.
> +                */
> +               if (unlikely(j == 0)) {
> +                       if (i > 0)
> +                               __kmem_cache_free_bulk(s, i, p);
> +                       return 0;
> +               }
> +       }
>
>         /*
>          * memcg and kmem_cache debug support and memory initialization.
> @@ -5386,11 +6240,11 @@ int kmem_cache_alloc_bulk_noprof(struct kmem_cache *s, gfp_t flags, size_t size,
>                     slab_want_init_on_alloc(flags, s), s->object_size))) {
>                 return 0;
>         }
> -       return i;
> +
> +       return size;
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_bulk_noprof);
>
> -
>  /*
>   * Object placement in a slab is made very easy because we always start at
>   * offset 0. If we tune the size of the object to the alignment then we can
> @@ -5524,7 +6378,7 @@ static inline int calculate_order(unsigned int size)
>  }
>
>  static void
> -init_kmem_cache_node(struct kmem_cache_node *n)
> +init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
>  {
>         n->nr_partial = 0;
>         spin_lock_init(&n->list_lock);
> @@ -5534,6 +6388,9 @@ init_kmem_cache_node(struct kmem_cache_node *n)
>         atomic_long_set(&n->total_objects, 0);
>         INIT_LIST_HEAD(&n->full);
>  #endif
> +       n->barn = barn;
> +       if (barn)
> +               barn_init(barn);
>  }
>
>  #ifndef CONFIG_SLUB_TINY
> @@ -5564,6 +6421,30 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
>  }
>  #endif /* CONFIG_SLUB_TINY */
>
> +static int init_percpu_sheaves(struct kmem_cache *s)
> +{
> +       int cpu;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct slub_percpu_sheaves *pcs;
> +               int nid;
> +
> +               pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
> +
> +               local_trylock_init(&pcs->lock);
> +
> +               nid = cpu_to_mem(cpu);
> +
> +               pcs->barn = get_node(s, nid)->barn;
> +               pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
> +
> +               if (!pcs->main)
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
>  static struct kmem_cache *kmem_cache_node;
>
>  /*
> @@ -5599,7 +6480,7 @@ static void early_kmem_cache_node_alloc(int node)
>         slab->freelist = get_freepointer(kmem_cache_node, n);
>         slab->inuse = 1;
>         kmem_cache_node->node[node] = n;
> -       init_kmem_cache_node(n);
> +       init_kmem_cache_node(n, NULL);
>         inc_slabs_node(kmem_cache_node, node, slab->objects);
>
>         /*
> @@ -5615,6 +6496,13 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
>         struct kmem_cache_node *n;
>
>         for_each_kmem_cache_node(s, node, n) {
> +               if (n->barn) {
> +                       WARN_ON(n->barn->nr_full);
> +                       WARN_ON(n->barn->nr_empty);
> +                       kfree(n->barn);
> +                       n->barn = NULL;
> +               }
> +
>                 s->node[node] = NULL;
>                 kmem_cache_free(kmem_cache_node, n);
>         }
> @@ -5623,6 +6511,8 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
>  void __kmem_cache_release(struct kmem_cache *s)
>  {
>         cache_random_seq_destroy(s);
> +       if (s->cpu_sheaves)
> +               pcs_destroy(s);
>  #ifndef CONFIG_SLUB_TINY
>         free_percpu(s->cpu_slab);
>  #endif
> @@ -5635,20 +6525,29 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>
>         for_each_node_mask(node, slab_nodes) {
>                 struct kmem_cache_node *n;
> +               struct node_barn *barn = NULL;
>
>                 if (slab_state == DOWN) {
>                         early_kmem_cache_node_alloc(node);
>                         continue;
>                 }
> +
> +               if (s->cpu_sheaves) {
> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> +
> +                       if (!barn)
> +                               return 0;
> +               }
> +
>                 n = kmem_cache_alloc_node(kmem_cache_node,
>                                                 GFP_KERNEL, node);
> -
>                 if (!n) {
> -                       free_kmem_cache_nodes(s);
> +                       kfree(barn);
>                         return 0;
>                 }
>
> -               init_kmem_cache_node(n);
> +               init_kmem_cache_node(n, barn);
> +
>                 s->node[node] = n;
>         }
>         return 1;
> @@ -5905,6 +6804,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>         flush_all_cpus_locked(s);
>         /* Attempt to free all objects */
>         for_each_kmem_cache_node(s, node, n) {
> +               if (n->barn)
> +                       barn_shrink(s, n->barn);
>                 free_partial(s, n);
>                 if (n->nr_partial || node_nr_slabs(n))
>                         return 1;
> @@ -6108,6 +7009,9 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
>                 for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
>                         INIT_LIST_HEAD(promote + i);
>
> +               if (n->barn)
> +                       barn_shrink(s, n->barn);
> +
>                 spin_lock_irqsave(&n->list_lock, flags);
>
>                 /*
> @@ -6220,12 +7124,24 @@ static int slab_mem_going_online_callback(void *arg)
>          */
>         mutex_lock(&slab_mutex);
>         list_for_each_entry(s, &slab_caches, list) {
> +               struct node_barn *barn = NULL;
> +
>                 /*
>                  * The structure may already exist if the node was previously
>                  * onlined and offlined.
>                  */
>                 if (get_node(s, nid))
>                         continue;
> +
> +               if (s->cpu_sheaves) {
> +                       barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
> +
> +                       if (!barn) {
> +                               ret = -ENOMEM;
> +                               goto out;
> +                       }
> +               }
> +
>                 /*
>                  * XXX: kmem_cache_alloc_node will fallback to other nodes
>                  *      since memory is not yet available from the node that
> @@ -6233,10 +7149,13 @@ static int slab_mem_going_online_callback(void *arg)
>                  */
>                 n = kmem_cache_alloc(kmem_cache_node, GFP_KERNEL);
>                 if (!n) {
> +                       kfree(barn);
>                         ret = -ENOMEM;
>                         goto out;
>                 }
> -               init_kmem_cache_node(n);
> +
> +               init_kmem_cache_node(n, barn);
> +
>                 s->node[nid] = n;
>         }
>         /*
> @@ -6455,6 +7374,16 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>
>         set_cpu_partial(s);
>
> +       if (args->sheaf_capacity && !(s->flags & SLAB_DEBUG_FLAGS)) {
> +               s->cpu_sheaves = alloc_percpu(struct slub_percpu_sheaves);
> +               if (!s->cpu_sheaves) {
> +                       err = -ENOMEM;
> +                       goto out;
> +               }
> +               // TODO: increase capacity to grow slab_sheaf up to next kmalloc size?
> +               s->sheaf_capacity = args->sheaf_capacity;
> +       }
> +
>  #ifdef CONFIG_NUMA
>         s->remote_node_defrag_ratio = 1000;
>  #endif
> @@ -6471,6 +7400,12 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>         if (!alloc_kmem_cache_cpus(s))
>                 goto out;
>
> +       if (s->cpu_sheaves) {
> +               err = init_percpu_sheaves(s);
> +               if (err)
> +                       goto out;
> +       }
> +
>         err = 0;
>
>         /* Mutex is not taken during early boot */
> @@ -6492,7 +7427,6 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
>                 __kmem_cache_release(s);
>         return err;
>  }
> -
>  #ifdef SLAB_SUPPORTS_SYSFS
>  static int count_inuse(struct slab *slab)
>  {
> @@ -6923,6 +7857,12 @@ static ssize_t order_show(struct kmem_cache *s, char *buf)
>  }
>  SLAB_ATTR_RO(order);
>
> +static ssize_t sheaf_capacity_show(struct kmem_cache *s, char *buf)
> +{
> +       return sysfs_emit(buf, "%u\n", s->sheaf_capacity);
> +}
> +SLAB_ATTR_RO(sheaf_capacity);
> +
>  static ssize_t min_partial_show(struct kmem_cache *s, char *buf)
>  {
>         return sysfs_emit(buf, "%lu\n", s->min_partial);
> @@ -7270,8 +8210,10 @@ static ssize_t text##_store(struct kmem_cache *s,                \
>  }                                                              \
>  SLAB_ATTR(text);                                               \
>
> +STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
> +STAT_ATTR(FREE_PCS, free_cpu_sheaf);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -7296,6 +8238,14 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
>  STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
>  STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
>  STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
> +STAT_ATTR(SHEAF_FLUSH, sheaf_flush);
> +STAT_ATTR(SHEAF_REFILL, sheaf_refill);
> +STAT_ATTR(SHEAF_ALLOC, sheaf_alloc);
> +STAT_ATTR(SHEAF_FREE, sheaf_free);
> +STAT_ATTR(BARN_GET, barn_get);
> +STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
> +STAT_ATTR(BARN_PUT, barn_put);
> +STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
>  #endif /* CONFIG_SLUB_STATS */
>
>  #ifdef CONFIG_KFENCE
> @@ -7326,6 +8276,7 @@ static struct attribute *slab_attrs[] = {
>         &object_size_attr.attr,
>         &objs_per_slab_attr.attr,
>         &order_attr.attr,
> +       &sheaf_capacity_attr.attr,
>         &min_partial_attr.attr,
>         &cpu_partial_attr.attr,
>         &objects_partial_attr.attr,
> @@ -7357,8 +8308,10 @@ static struct attribute *slab_attrs[] = {
>         &remote_node_defrag_ratio_attr.attr,
>  #endif
>  #ifdef CONFIG_SLUB_STATS
> +       &alloc_cpu_sheaf_attr.attr,
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
> +       &free_cpu_sheaf_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
> @@ -7383,6 +8336,14 @@ static struct attribute *slab_attrs[] = {
>         &cpu_partial_free_attr.attr,
>         &cpu_partial_node_attr.attr,
>         &cpu_partial_drain_attr.attr,
> +       &sheaf_flush_attr.attr,
> +       &sheaf_refill_attr.attr,
> +       &sheaf_alloc_attr.attr,
> +       &sheaf_free_attr.attr,
> +       &barn_get_attr.attr,
> +       &barn_get_fail_attr.attr,
> +       &barn_put_attr.attr,
> +       &barn_put_fail_attr.attr,
>  #endif
>  #ifdef CONFIG_FAILSLAB
>         &failslab_attr.attr,
>
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] slab: add opt-in caching layer of percpu sheaves
  2025-05-06 23:14   ` Suren Baghdasaryan
@ 2025-05-14 13:06     ` Vlastimil Babka
  0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-14 13:06 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 5/7/25 01:14, Suren Baghdasaryan wrote:
> On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>

Thanks!

> One nit which is barely worth mentioning.

OK, made the change.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-29  7:36   ` Harry Yoo
  2025-05-06 21:34   ` Suren Baghdasaryan
  2025-04-25  8:27 ` [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
For caches with sheaves, on each cpu maintain a rcu_free sheaf in
addition to main and spare sheaves.

kfree_rcu() operations will try to put objects on this sheaf. Once full,
the sheaf is detached and submitted to call_rcu() with a handler that
will try to put it in the barn, or flush to slab pages using bulk free,
when the barn is full. Then a new empty sheaf must be obtained to put
more objects there.

It's possible that no free sheaves are available to use for a new
rcu_free sheaf, and the allocation in kfree_rcu() context can only use
GFP_NOWAIT and thus may fail. In that case, fall back to the existing
kfree_rcu() implementation.

Expected advantages:
- batching the kfree_rcu() operations, that could eventually replace the
  existing batching
- sheaves can be reused for allocations via barn instead of being
  flushed to slabs, which is more efficient
  - this includes cases where only some cpus are allowed to process rcu
    callbacks (Android)

Possible disadvantage:
- objects might be waiting for more than their grace period (it is
  determined by the last object freed into the sheaf), increasing memory
  usage - but the existing batching does that too.

Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
implementation favors smaller memory footprint over performance.

Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
count how many kfree_rcu() used the rcu_free sheaf successfully and how
many had to fall back to the existing implementation.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h        |   3 +
 mm/slab_common.c |  24 ++++++++
 mm/slub.c        | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 208 insertions(+), 2 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 1980330c2fcb4a4613a7e4f7efc78b349993fd89..ddf1e4bcba734dccbf67e83bdbab3ca7272f540e 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -459,6 +459,9 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+
+/* Legal flag mask for kmem_cache_create(), for various configurations */
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
 			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 4f295bdd2d42355af6311a799955301005f8a532..6c3b90f03cb79b57f426824450f576a977d85c53 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
 		kvfree_rcu_list(head);
 }
 
+static bool kfree_rcu_sheaf(void *obj)
+{
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+
+	if (is_vmalloc_addr(obj))
+		return false;
+
+	folio = virt_to_folio(obj);
+	if (unlikely(!folio_test_slab(folio)))
+		return false;
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+	if (s->cpu_sheaves)
+		return __kfree_rcu_sheaf(s, obj);
+
+	return false;
+}
+
 static bool
 need_offload_krc(struct kfree_rcu_cpu *krcp)
 {
@@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
+	if (kfree_rcu_sheaf(ptr))
+		return;
+
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
diff --git a/mm/slub.c b/mm/slub.c
index ae3e80ad9926ca15601eef2f2aa016ca059498f8..6f31a27b5d47fa6621fa8af6d6842564077d4b60 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -350,6 +350,8 @@ enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
 	FREE_PCS,		/* Free to percpu sheaf */
+	FREE_RCU_SHEAF,		/* Free to rcu_free sheaf */
+	FREE_RCU_SHEAF_FAIL,	/* Failed to free to a rcu_free sheaf */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -444,6 +446,7 @@ struct slab_sheaf {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
 	};
+	struct kmem_cache *cache;
 	unsigned int size;
 	void *objects[];
 };
@@ -452,6 +455,7 @@ struct slub_percpu_sheaves {
 	local_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
+	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
 	struct node_barn *barn;
 };
 
@@ -2507,6 +2511,8 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
 	if (unlikely(!sheaf))
 		return NULL;
 
+	sheaf->cache = s;
+
 	stat(s, SHEAF_ALLOC);
 
 	return sheaf;
@@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 	sheaf->size = 0;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf);
+
+static void rcu_free_sheaf_nobarn(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+	s = sheaf->cache;
+
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	sheaf_flush_unused(s, sheaf);
+
+	free_empty_sheaf(s, sheaf);
+}
+
 /*
  * Caller needs to make sure migration is disabled in order to fully flush
  * single cpu's sheaves
@@ -2643,7 +2667,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
 static void pcs_flush_all(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
-	struct slab_sheaf *spare;
+	struct slab_sheaf *spare, *rcu_free;
 
 	local_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -2651,6 +2675,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 	spare = pcs->spare;
 	pcs->spare = NULL;
 
+	rcu_free = pcs->rcu_free;
+	pcs->rcu_free = NULL;
+
 	local_unlock(&s->cpu_sheaves->lock);
 
 	if (spare) {
@@ -2658,6 +2685,9 @@ static void pcs_flush_all(struct kmem_cache *s)
 		free_empty_sheaf(s, spare);
 	}
 
+	if (rcu_free)
+		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+
 	sheaf_flush_main(s);
 }
 
@@ -2674,6 +2704,11 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
 		free_empty_sheaf(s, pcs->spare);
 		pcs->spare = NULL;
 	}
+
+	if (pcs->rcu_free) {
+		call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+		pcs->rcu_free = NULL;
+	}
 }
 
 static void pcs_destroy(struct kmem_cache *s)
@@ -2699,6 +2734,7 @@ static void pcs_destroy(struct kmem_cache *s)
 		 */
 
 		WARN_ON(pcs->spare);
+		WARN_ON(pcs->rcu_free);
 
 		if (!WARN_ON(pcs->main->size)) {
 			free_empty_sheaf(s, pcs->main);
@@ -3755,7 +3791,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
 
 	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-	return (pcs->spare || pcs->main->size);
+	return (pcs->spare || pcs->rcu_free || pcs->main->size);
 }
 
 static void pcs_flush_all(struct kmem_cache *s);
@@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
 	return true;
 }
 
+static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
+				     struct slab_sheaf *sheaf)
+{
+	bool init = slab_want_init_on_free(s);
+	void **p = &sheaf->objects[0];
+	unsigned int i = 0;
+
+	while (i < sheaf->size) {
+		struct slab *slab = virt_to_slab(p[i]);
+
+		memcg_slab_free_hook(s, slab, p + i, 1);
+		alloc_tagging_slab_free_hook(s, slab, p + i, 1);
+
+		if (unlikely(!slab_free_hook(s, p[i], init, true))) {
+			p[i] = p[--sheaf->size];
+			continue;
+		}
+
+		i++;
+	}
+}
+
+static void rcu_free_sheaf(struct rcu_head *head)
+{
+	struct slab_sheaf *sheaf;
+	struct node_barn *barn;
+	struct kmem_cache *s;
+
+	sheaf = container_of(head, struct slab_sheaf, rcu_head);
+
+	s = sheaf->cache;
+
+	/*
+	 * This may reduce the number of objects that the sheaf is no longer
+	 * technically full, but it's easier to treat it that way (unless it's
+	 * competely empty), as the code handles it fine, there's just slightly
+	 * worse batching benefit. It only happens due to debugging, which
+	 * is a performance hit anyway.
+	 */
+	__rcu_free_sheaf_prepare(s, sheaf);
+
+	barn = get_node(s, numa_mem_id())->barn;
+
+	/* due to slab_free_hook() */
+	if (unlikely(sheaf->size == 0))
+		goto empty;
+
+	/*
+	 * Checking nr_full/nr_empty outside lock avoids contention in case the
+	 * barn is at the respective limit. Due to the race we might go over the
+	 * limit but that should be rare and harmless.
+	 */
+
+	if (data_race(barn->nr_full) < MAX_FULL_SHEAVES) {
+		stat(s, BARN_PUT);
+		barn_put_full_sheaf(barn, sheaf);
+		return;
+	}
+
+	stat(s, BARN_PUT_FAIL);
+	sheaf_flush_unused(s, sheaf);
+
+empty:
+	if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
+		barn_put_empty_sheaf(barn, sheaf);
+		return;
+	}
+
+	free_empty_sheaf(s, sheaf);
+}
+
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *rcu_sheaf;
+
+	if (!local_trylock(&s->cpu_sheaves->lock))
+		goto fail;
+
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (unlikely(!pcs->rcu_free)) {
+
+		struct slab_sheaf *empty;
+
+		empty = barn_get_empty_sheaf(pcs->barn);
+
+		if (empty) {
+			pcs->rcu_free = empty;
+			goto do_free;
+		}
+
+		local_unlock(&s->cpu_sheaves->lock);
+
+		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
+
+		if (!empty)
+			goto fail;
+
+		if (!local_trylock(&s->cpu_sheaves->lock))
+			goto fail;
+
+		pcs = this_cpu_ptr(s->cpu_sheaves);
+
+		if (unlikely(pcs->rcu_free))
+			barn_put_empty_sheaf(pcs->barn, empty);
+		else
+			pcs->rcu_free = empty;
+	}
+
+do_free:
+
+	rcu_sheaf = pcs->rcu_free;
+
+	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
+
+	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+		rcu_sheaf = NULL;
+	else
+		pcs->rcu_free = NULL;
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (rcu_sheaf)
+		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+
+	stat(s, FREE_RCU_SHEAF);
+	return true;
+
+fail:
+	stat(s, FREE_RCU_SHEAF_FAIL);
+	return false;
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -6802,6 +6972,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 
 	flush_all_cpus_locked(s);
+
+	/* we might have rcu sheaves in flight */
+	if (s->cpu_sheaves)
+		rcu_barrier();
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		if (n->barn)
@@ -8214,6 +8389,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
 STAT_ATTR(FREE_PCS, free_cpu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
+STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -8312,6 +8489,8 @@ static struct attribute *slab_attrs[] = {
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
 	&free_cpu_sheaf_attr.attr,
+	&free_rcu_sheaf_attr.attr,
+	&free_rcu_sheaf_fail_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-04-25  8:27 ` [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-04-29  7:36   ` Harry Yoo
  2025-05-14 13:07     ` Vlastimil Babka
  2025-05-06 21:34   ` Suren Baghdasaryan
  1 sibling, 1 reply; 35+ messages in thread
From: Harry Yoo @ 2025-04-29  7:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 10:27:22AM +0200, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

with a few nits:

>  mm/slab.h        |   3 +
>  mm/slab_common.c |  24 ++++++++
>  mm/slub.c        | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 208 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 1980330c2fcb4a4613a7e4f7efc78b349993fd89..ddf1e4bcba734dccbf67e83bdbab3ca7272f540e 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -459,6 +459,9 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>  
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);

> +/* Legal flag mask for kmem_cache_create(), for various configurations */

nit: I think now this line should be removed?

>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
>  			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4f295bdd2d42355af6311a799955301005f8a532..6c3b90f03cb79b57f426824450f576a977d85c53 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> diff --git a/mm/slub.c b/mm/slub.c
> index ae3e80ad9926ca15601eef2f2aa016ca059498f8..6f31a27b5d47fa6621fa8af6d6842564077d4b60 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>  	return true;
>  }
>  
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	struct slab_sheaf *rcu_sheaf;
> +
> +	if (!local_trylock(&s->cpu_sheaves->lock))
> +		goto fail;
> +
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (unlikely(!pcs->rcu_free)) {
> +
> +		struct slab_sheaf *empty;

nit: should we grab the spare sheaf here if it's empty?

> +
> +		empty = barn_get_empty_sheaf(pcs->barn);
> +
> +		if (empty) {
> +			pcs->rcu_free = empty;
> +			goto do_free;
> +		}
> +
> +		local_unlock(&s->cpu_sheaves->lock);
> +
> +		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +		if (!empty)
> +			goto fail;
> +
>  /*
>   * Bulk free objects to the percpu sheaves.
>   * Unlike free_to_pcs() this includes the calls to all necessary hooks

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-04-29  7:36   ` Harry Yoo
@ 2025-05-14 13:07     ` Vlastimil Babka
  0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-14 13:07 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 4/29/25 09:36, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:22AM +0200, Vlastimil Babka wrote:
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>> 
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put it in the barn, or flush to slab pages using bulk free,
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>> 
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() implementation.
>> 
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>   existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>   flushed to slabs, which is more efficient
>>   - this includes cases where only some cpus are allowed to process rcu
>>     callbacks (Android)
>> 
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>>   determined by the last object freed into the sheaf), increasing memory
>>   usage - but the existing batching does that too.
>> 
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>> 
>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> many had to fall back to the existing implementation.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
> 
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Thanks!

> 
>> +/* Legal flag mask for kmem_cache_create(), for various configurations */
> 
> nit: I think now this line should be removed?

Yeah looks like rebasing mistake. Removed.

>>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
>>  			 SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index 4f295bdd2d42355af6311a799955301005f8a532..6c3b90f03cb79b57f426824450f576a977d85c53 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> diff --git a/mm/slub.c b/mm/slub.c
>> index ae3e80ad9926ca15601eef2f2aa016ca059498f8..6f31a27b5d47fa6621fa8af6d6842564077d4b60 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>>  	return true;
>>  }
>>  
>> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>> +{
>> +	struct slub_percpu_sheaves *pcs;
>> +	struct slab_sheaf *rcu_sheaf;
>> +
>> +	if (!local_trylock(&s->cpu_sheaves->lock))
>> +		goto fail;
>> +
>> +	pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +	if (unlikely(!pcs->rcu_free)) {
>> +
>> +		struct slab_sheaf *empty;
> 
> nit: should we grab the spare sheaf here if it's empty?

Hmm yeah why not. But only completely empty. Done, thanks!

>> +
>> +		empty = barn_get_empty_sheaf(pcs->barn);
>> +
>> +		if (empty) {
>> +			pcs->rcu_free = empty;
>> +			goto do_free;
>> +		}
>> +
>> +		local_unlock(&s->cpu_sheaves->lock);
>> +
>> +		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
>> +
>> +		if (!empty)
>> +			goto fail;
>> +
>>  /*
>>   * Bulk free objects to the percpu sheaves.
>>   * Unlike free_to_pcs() this includes the calls to all necessary hooks
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-04-25  8:27 ` [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
  2025-04-29  7:36   ` Harry Yoo
@ 2025-05-06 21:34   ` Suren Baghdasaryan
  2025-05-14 14:01     ` Vlastimil Babka
  1 sibling, 1 reply; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-06 21:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
>
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
>
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
>
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
>
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
>
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
>
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab.h        |   3 +
>  mm/slab_common.c |  24 ++++++++
>  mm/slub.c        | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 208 insertions(+), 2 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 1980330c2fcb4a4613a7e4f7efc78b349993fd89..ddf1e4bcba734dccbf67e83bdbab3ca7272f540e 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -459,6 +459,9 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>         return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
>  }
>
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
> +
> +/* Legal flag mask for kmem_cache_create(), for various configurations */
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>                          SLAB_CACHE_DMA32 | SLAB_PANIC | \
>                          SLAB_TYPESAFE_BY_RCU | SLAB_DEBUG_OBJECTS | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4f295bdd2d42355af6311a799955301005f8a532..6c3b90f03cb79b57f426824450f576a977d85c53 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1608,6 +1608,27 @@ static void kfree_rcu_work(struct work_struct *work)
>                 kvfree_rcu_list(head);
>  }
>
> +static bool kfree_rcu_sheaf(void *obj)
> +{
> +       struct kmem_cache *s;
> +       struct folio *folio;
> +       struct slab *slab;
> +
> +       if (is_vmalloc_addr(obj))
> +               return false;
> +
> +       folio = virt_to_folio(obj);
> +       if (unlikely(!folio_test_slab(folio)))
> +               return false;
> +
> +       slab = folio_slab(folio);
> +       s = slab->slab_cache;
> +       if (s->cpu_sheaves)
> +               return __kfree_rcu_sheaf(s, obj);
> +
> +       return false;
> +}
> +
>  static bool
>  need_offload_krc(struct kfree_rcu_cpu *krcp)
>  {
> @@ -1952,6 +1973,9 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>         if (!head)
>                 might_sleep();
>
> +       if (kfree_rcu_sheaf(ptr))
> +               return;
> +
>         // Queue the object but don't yet schedule the batch.
>         if (debug_rcu_head_queue(ptr)) {
>                 // Probable double kfree_rcu(), just leak.
> diff --git a/mm/slub.c b/mm/slub.c
> index ae3e80ad9926ca15601eef2f2aa016ca059498f8..6f31a27b5d47fa6621fa8af6d6842564077d4b60 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -350,6 +350,8 @@ enum stat_item {
>         ALLOC_FASTPATH,         /* Allocation from cpu slab */
>         ALLOC_SLOWPATH,         /* Allocation by getting a new cpu slab */
>         FREE_PCS,               /* Free to percpu sheaf */
> +       FREE_RCU_SHEAF,         /* Free to rcu_free sheaf */
> +       FREE_RCU_SHEAF_FAIL,    /* Failed to free to a rcu_free sheaf */
>         FREE_FASTPATH,          /* Free to cpu slab */
>         FREE_SLOWPATH,          /* Freeing not to cpu slab */
>         FREE_FROZEN,            /* Freeing to frozen slab */
> @@ -444,6 +446,7 @@ struct slab_sheaf {
>                 struct rcu_head rcu_head;
>                 struct list_head barn_list;
>         };
> +       struct kmem_cache *cache;
>         unsigned int size;
>         void *objects[];
>  };
> @@ -452,6 +455,7 @@ struct slub_percpu_sheaves {
>         local_trylock_t lock;
>         struct slab_sheaf *main; /* never NULL when unlocked */
>         struct slab_sheaf *spare; /* empty or full, may be NULL */
> +       struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
>         struct node_barn *barn;
>  };
>
> @@ -2507,6 +2511,8 @@ static struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp)
>         if (unlikely(!sheaf))
>                 return NULL;
>
> +       sheaf->cache = s;
> +
>         stat(s, SHEAF_ALLOC);
>
>         return sheaf;
> @@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
>         sheaf->size = 0;
>  }
>
> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> +                                    struct slab_sheaf *sheaf);

I think you could safely move __rcu_free_sheaf_prepare() here and
avoid the above forward declaration.

> +
> +static void rcu_free_sheaf_nobarn(struct rcu_head *head)
> +{
> +       struct slab_sheaf *sheaf;
> +       struct kmem_cache *s;
> +
> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
> +       s = sheaf->cache;
> +
> +       __rcu_free_sheaf_prepare(s, sheaf);
> +
> +       sheaf_flush_unused(s, sheaf);
> +
> +       free_empty_sheaf(s, sheaf);
> +}
> +
>  /*
>   * Caller needs to make sure migration is disabled in order to fully flush
>   * single cpu's sheaves
> @@ -2643,7 +2667,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
>  static void pcs_flush_all(struct kmem_cache *s)
>  {
>         struct slub_percpu_sheaves *pcs;
> -       struct slab_sheaf *spare;
> +       struct slab_sheaf *spare, *rcu_free;
>
>         local_lock(&s->cpu_sheaves->lock);
>         pcs = this_cpu_ptr(s->cpu_sheaves);
> @@ -2651,6 +2675,9 @@ static void pcs_flush_all(struct kmem_cache *s)
>         spare = pcs->spare;
>         pcs->spare = NULL;
>
> +       rcu_free = pcs->rcu_free;
> +       pcs->rcu_free = NULL;
> +
>         local_unlock(&s->cpu_sheaves->lock);
>
>         if (spare) {
> @@ -2658,6 +2685,9 @@ static void pcs_flush_all(struct kmem_cache *s)
>                 free_empty_sheaf(s, spare);
>         }
>
> +       if (rcu_free)
> +               call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +
>         sheaf_flush_main(s);
>  }
>
> @@ -2674,6 +2704,11 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
>                 free_empty_sheaf(s, pcs->spare);
>                 pcs->spare = NULL;
>         }
> +
> +       if (pcs->rcu_free) {
> +               call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
> +               pcs->rcu_free = NULL;
> +       }
>  }
>
>  static void pcs_destroy(struct kmem_cache *s)
> @@ -2699,6 +2734,7 @@ static void pcs_destroy(struct kmem_cache *s)
>                  */
>
>                 WARN_ON(pcs->spare);
> +               WARN_ON(pcs->rcu_free);
>
>                 if (!WARN_ON(pcs->main->size)) {
>                         free_empty_sheaf(s, pcs->main);
> @@ -3755,7 +3791,7 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
>
>         pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
>
> -       return (pcs->spare || pcs->main->size);
> +       return (pcs->spare || pcs->rcu_free || pcs->main->size);
>  }
>
>  static void pcs_flush_all(struct kmem_cache *s);
> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>         return true;
>  }
>
> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> +                                    struct slab_sheaf *sheaf)

This function seems to be an almost exact copy of free_to_pcs_bulk()
from your previous patch. Maybe they can be consolidated?

> +{
> +       bool init = slab_want_init_on_free(s);
> +       void **p = &sheaf->objects[0];
> +       unsigned int i = 0;
> +
> +       while (i < sheaf->size) {
> +               struct slab *slab = virt_to_slab(p[i]);
> +
> +               memcg_slab_free_hook(s, slab, p + i, 1);
> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
> +
> +               if (unlikely(!slab_free_hook(s, p[i], init, true))) {
> +                       p[i] = p[--sheaf->size];
> +                       continue;
> +               }
> +
> +               i++;
> +       }
> +}
> +
> +static void rcu_free_sheaf(struct rcu_head *head)
> +{
> +       struct slab_sheaf *sheaf;
> +       struct node_barn *barn;
> +       struct kmem_cache *s;
> +
> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
> +
> +       s = sheaf->cache;
> +
> +       /*
> +        * This may reduce the number of objects that the sheaf is no longer
> +        * technically full, but it's easier to treat it that way (unless it's

I don't understand the sentence above. Could you please clarify and
maybe reword it?

> +        * competely empty), as the code handles it fine, there's just slightly

s/competely/completely

> +        * worse batching benefit. It only happens due to debugging, which
> +        * is a performance hit anyway.
> +        */
> +       __rcu_free_sheaf_prepare(s, sheaf);
> +
> +       barn = get_node(s, numa_mem_id())->barn;
> +
> +       /* due to slab_free_hook() */
> +       if (unlikely(sheaf->size == 0))
> +               goto empty;
> +
> +       /*
> +        * Checking nr_full/nr_empty outside lock avoids contention in case the
> +        * barn is at the respective limit. Due to the race we might go over the
> +        * limit but that should be rare and harmless.
> +        */
> +
> +       if (data_race(barn->nr_full) < MAX_FULL_SHEAVES) {
> +               stat(s, BARN_PUT);
> +               barn_put_full_sheaf(barn, sheaf);
> +               return;
> +       }
> +
> +       stat(s, BARN_PUT_FAIL);
> +       sheaf_flush_unused(s, sheaf);
> +
> +empty:
> +       if (data_race(barn->nr_empty) < MAX_EMPTY_SHEAVES) {
> +               barn_put_empty_sheaf(barn, sheaf);
> +               return;
> +       }
> +
> +       free_empty_sheaf(s, sheaf);
> +}
> +
> +bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *rcu_sheaf;
> +
> +       if (!local_trylock(&s->cpu_sheaves->lock))
> +               goto fail;
> +
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (unlikely(!pcs->rcu_free)) {
> +
> +               struct slab_sheaf *empty;
> +
> +               empty = barn_get_empty_sheaf(pcs->barn);
> +
> +               if (empty) {
> +                       pcs->rcu_free = empty;
> +                       goto do_free;
> +               }
> +
> +               local_unlock(&s->cpu_sheaves->lock);
> +
> +               empty = alloc_empty_sheaf(s, GFP_NOWAIT);
> +
> +               if (!empty)
> +                       goto fail;
> +
> +               if (!local_trylock(&s->cpu_sheaves->lock))

Aren't you leaking `empty` sheaf on this failure?

> +                       goto fail;
> +
> +               pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +               if (unlikely(pcs->rcu_free))
> +                       barn_put_empty_sheaf(pcs->barn, empty);
> +               else
> +                       pcs->rcu_free = empty;
> +       }
> +
> +do_free:
> +
> +       rcu_sheaf = pcs->rcu_free;
> +
> +       rcu_sheaf->objects[rcu_sheaf->size++] = obj;
> +
> +       if (likely(rcu_sheaf->size < s->sheaf_capacity))
> +               rcu_sheaf = NULL;
> +       else
> +               pcs->rcu_free = NULL;
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       if (rcu_sheaf)
> +               call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
> +
> +       stat(s, FREE_RCU_SHEAF);
> +       return true;
> +
> +fail:
> +       stat(s, FREE_RCU_SHEAF_FAIL);
> +       return false;
> +}
> +
>  /*
>   * Bulk free objects to the percpu sheaves.
>   * Unlike free_to_pcs() this includes the calls to all necessary hooks
> @@ -6802,6 +6972,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>         struct kmem_cache_node *n;
>
>         flush_all_cpus_locked(s);
> +
> +       /* we might have rcu sheaves in flight */
> +       if (s->cpu_sheaves)
> +               rcu_barrier();
> +
>         /* Attempt to free all objects */
>         for_each_kmem_cache_node(s, node, n) {
>                 if (n->barn)
> @@ -8214,6 +8389,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
>  STAT_ATTR(FREE_PCS, free_cpu_sheaf);
> +STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
> +STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>  STAT_ATTR(FREE_FROZEN, free_frozen);
> @@ -8312,6 +8489,8 @@ static struct attribute *slab_attrs[] = {
>         &alloc_fastpath_attr.attr,
>         &alloc_slowpath_attr.attr,
>         &free_cpu_sheaf_attr.attr,
> +       &free_rcu_sheaf_attr.attr,
> +       &free_rcu_sheaf_fail_attr.attr,
>         &free_fastpath_attr.attr,
>         &free_slowpath_attr.attr,
>         &free_frozen_attr.attr,
>
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-05-06 21:34   ` Suren Baghdasaryan
@ 2025-05-14 14:01     ` Vlastimil Babka
  2025-05-15  8:45       ` Vlastimil Babka
  0 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-14 14:01 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 5/6/25 23:34, Suren Baghdasaryan wrote:
> On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> @@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
>>         sheaf->size = 0;
>>  }
>>
>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>> +                                    struct slab_sheaf *sheaf);
> 
> I think you could safely move __rcu_free_sheaf_prepare() here and
> avoid the above forward declaration.

Right, done.

>> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>>         return true;
>>  }
>>
>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>> +                                    struct slab_sheaf *sheaf)
> 
> This function seems to be an almost exact copy of free_to_pcs_bulk()
> from your previous patch. Maybe they can be consolidated?

True, I've extracted it to __kmem_cache_free_bulk_prepare().

>> +{
>> +       bool init = slab_want_init_on_free(s);
>> +       void **p = &sheaf->objects[0];
>> +       unsigned int i = 0;
>> +
>> +       while (i < sheaf->size) {
>> +               struct slab *slab = virt_to_slab(p[i]);
>> +
>> +               memcg_slab_free_hook(s, slab, p + i, 1);
>> +               alloc_tagging_slab_free_hook(s, slab, p + i, 1);
>> +
>> +               if (unlikely(!slab_free_hook(s, p[i], init, true))) {
>> +                       p[i] = p[--sheaf->size];
>> +                       continue;
>> +               }
>> +
>> +               i++;
>> +       }
>> +}
>> +
>> +static void rcu_free_sheaf(struct rcu_head *head)
>> +{
>> +       struct slab_sheaf *sheaf;
>> +       struct node_barn *barn;
>> +       struct kmem_cache *s;
>> +
>> +       sheaf = container_of(head, struct slab_sheaf, rcu_head);
>> +
>> +       s = sheaf->cache;
>> +
>> +       /*
>> +        * This may reduce the number of objects that the sheaf is no longer
>> +        * technically full, but it's easier to treat it that way (unless it's
> 
> I don't understand the sentence above. Could you please clarify and
> maybe reword it?

Is this more clear?

/*
 * This may remove some objects due to slab_free_hook() returning false,
 * so that the sheaf might no longer be completely full. But it's easier
 * to handle it as full (unless it became completely empty), as the code
 * handles it fine. The only downside is that sheaf will serve fewer
 * allocations when reused. It only happens due to debugging, which is a
 * performance hit anyway.
 */

>> +
>> +               if (!local_trylock(&s->cpu_sheaves->lock))
> 
> Aren't you leaking `empty` sheaf on this failure?

Right! Fixed, thanks.

>> +                       goto fail;
>> +
>> +               pcs = this_cpu_ptr(s->cpu_sheaves);
>> +
>> +               if (unlikely(pcs->rcu_free))
>> +                       barn_put_empty_sheaf(pcs->barn, empty);
>> +               else
>> +                       pcs->rcu_free = empty;
>> +       }
>> +
>> +do_free:
>> +
>> +       rcu_sheaf = pcs->rcu_free;
>> +
>> +       rcu_sheaf->objects[rcu_sheaf->size++] = obj;
>> +
>> +       if (likely(rcu_sheaf->size < s->sheaf_capacity))
>> +               rcu_sheaf = NULL;
>> +       else
>> +               pcs->rcu_free = NULL;
>> +
>> +       local_unlock(&s->cpu_sheaves->lock);
>> +
>> +       if (rcu_sheaf)
>> +               call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
>> +
>> +       stat(s, FREE_RCU_SHEAF);
>> +       return true;
>> +
>> +fail:
>> +       stat(s, FREE_RCU_SHEAF_FAIL);
>> +       return false;
>> +}
>> +
>>  /*
>>   * Bulk free objects to the percpu sheaves.
>>   * Unlike free_to_pcs() this includes the calls to all necessary hooks
>> @@ -6802,6 +6972,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
>>         struct kmem_cache_node *n;
>>
>>         flush_all_cpus_locked(s);
>> +
>> +       /* we might have rcu sheaves in flight */
>> +       if (s->cpu_sheaves)
>> +               rcu_barrier();
>> +
>>         /* Attempt to free all objects */
>>         for_each_kmem_cache_node(s, node, n) {
>>                 if (n->barn)
>> @@ -8214,6 +8389,8 @@ STAT_ATTR(ALLOC_PCS, alloc_cpu_sheaf);
>>  STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
>>  STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
>>  STAT_ATTR(FREE_PCS, free_cpu_sheaf);
>> +STAT_ATTR(FREE_RCU_SHEAF, free_rcu_sheaf);
>> +STAT_ATTR(FREE_RCU_SHEAF_FAIL, free_rcu_sheaf_fail);
>>  STAT_ATTR(FREE_FASTPATH, free_fastpath);
>>  STAT_ATTR(FREE_SLOWPATH, free_slowpath);
>>  STAT_ATTR(FREE_FROZEN, free_frozen);
>> @@ -8312,6 +8489,8 @@ static struct attribute *slab_attrs[] = {
>>         &alloc_fastpath_attr.attr,
>>         &alloc_slowpath_attr.attr,
>>         &free_cpu_sheaf_attr.attr,
>> +       &free_rcu_sheaf_attr.attr,
>> +       &free_rcu_sheaf_fail_attr.attr,
>>         &free_fastpath_attr.attr,
>>         &free_slowpath_attr.attr,
>>         &free_frozen_attr.attr,
>>
>> --
>> 2.49.0
>>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-05-14 14:01     ` Vlastimil Babka
@ 2025-05-15  8:45       ` Vlastimil Babka
  2025-05-15 15:03         ` Suren Baghdasaryan
  0 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-15  8:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 5/14/25 16:01, Vlastimil Babka wrote:
> On 5/6/25 23:34, Suren Baghdasaryan wrote:
>> On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>> @@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
>>>         sheaf->size = 0;
>>>  }
>>>
>>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>>> +                                    struct slab_sheaf *sheaf);
>> 
>> I think you could safely move __rcu_free_sheaf_prepare() here and
>> avoid the above forward declaration.
> 
> Right, done.
> 
>>> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
>>>         return true;
>>>  }
>>>
>>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
>>> +                                    struct slab_sheaf *sheaf)
>> 
>> This function seems to be an almost exact copy of free_to_pcs_bulk()
>> from your previous patch. Maybe they can be consolidated?
> 
> True, I've extracted it to __kmem_cache_free_bulk_prepare().

... and that was a mistake as free_to_pcs_bulk() diverges in patch 9/9 in a
way that this makes it too infeasible


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations
  2025-05-15  8:45       ` Vlastimil Babka
@ 2025-05-15 15:03         ` Suren Baghdasaryan
  0 siblings, 0 replies; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-15 15:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Thu, May 15, 2025 at 1:45 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/14/25 16:01, Vlastimil Babka wrote:
> > On 5/6/25 23:34, Suren Baghdasaryan wrote:
> >> On Fri, Apr 25, 2025 at 1:27 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>> @@ -2631,6 +2637,24 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
> >>>         sheaf->size = 0;
> >>>  }
> >>>
> >>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> >>> +                                    struct slab_sheaf *sheaf);
> >>
> >> I think you could safely move __rcu_free_sheaf_prepare() here and
> >> avoid the above forward declaration.
> >
> > Right, done.
> >
> >>> @@ -5304,6 +5340,140 @@ bool free_to_pcs(struct kmem_cache *s, void *object)
> >>>         return true;
> >>>  }
> >>>
> >>> +static void __rcu_free_sheaf_prepare(struct kmem_cache *s,
> >>> +                                    struct slab_sheaf *sheaf)
> >>
> >> This function seems to be an almost exact copy of free_to_pcs_bulk()
> >> from your previous patch. Maybe they can be consolidated?
> >
> > True, I've extracted it to __kmem_cache_free_bulk_prepare().
>
> ... and that was a mistake as free_to_pcs_bulk() diverges in patch 9/9 in a
> way that this makes it too infeasible

Ah, I see. Makes sense. Sorry for the misleading suggestion.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-05-06 22:54   ` Suren Baghdasaryan
  2025-05-07  9:15   ` Harry Yoo
  2025-04-25  8:27 ` [PATCH v4 4/9] slab: determine barn status racily outside of lock Vlastimil Babka
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Add functions for efficient guaranteed allocations e.g. in a critical
section that cannot sleep, when the exact number of allocations is not
known beforehand, but an upper limit can be calculated.

kmem_cache_prefill_sheaf() returns a sheaf containing at least given
number of objects.

kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
and is guaranteed not to fail until depleted.

kmem_cache_return_sheaf() is for giving the sheaf back to the slab
allocator after the critical section. This will also attempt to refill
it to cache's sheaf capacity for better efficiency of sheaves handling,
but it's not stricly necessary to succeed.

kmem_cache_refill_sheaf() can be used to refill a previously obtained
sheaf to requested size. If the current size is sufficient, it does
nothing. If the requested size exceeds cache's sheaf_capacity and the
sheaf's current capacity, the sheaf will be replaced with a new one,
hence the indirect pointer parameter.

kmem_cache_sheaf_size() can be used to query the current size.

The implementation supports requesting sizes that exceed cache's
sheaf_capacity, but it is not efficient - such "oversize" sheaves are
allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
might be especially ineffective when replacing a sheaf with a new one of
a larger capacity. It is therefore better to size cache's
sheaf_capacity accordingly to make oversize sheaves exceptional.

CONFIG_SLUB_STATS counters are added for sheaf prefill and return
operations. A prefill or return is considered _fast when it is able to
grab or return a percpu spare sheaf (even if the sheaf needs a refill to
satisfy the request, as those should amortize over time), and _slow
otherwise (when the barn or even sheaf allocation/freeing has to be
involved). sheaf_prefill_oversize is provided to determine how many
prefills were oversize (counter for oversize returns is not necessary as
all oversize refills result in oversize returns).

When slub_debug is enabled for a cache with sheaves, no percpu sheaves
exist for it, but the prefill functionality is still provided simply by
all prefilled sheaves becoming oversize. If percpu sheaves are not
created for a cache due to not passing the sheaf_capacity argument on
cache creation, the prefills also work through oversize sheaves, but
there's a WARN_ON_ONCE() to indicate the omission.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/slab.h |  16 ++++
 mm/slub.c            | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 281 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 4cb495d55fc58c70a992ee4782d7990ce1c55dc6..b0a9ba33abae22bf38cbf1689e3c08bb0b05002f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -829,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
 				   int node) __assume_slab_alignment __malloc;
 #define kmem_cache_alloc_node(...)	alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+				       struct slab_sheaf *sheaf);
+
+void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
+			struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
+#define kmem_cache_alloc_from_sheaf(...)	\
+			alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
+
 /*
  * These macros allow declaring a kmem_buckets * parameter alongside size, which
  * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
diff --git a/mm/slub.c b/mm/slub.c
index 6f31a27b5d47fa6621fa8af6d6842564077d4b60..724266fdd996c091f1f0b34012c5179f17dfa422 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -384,6 +384,11 @@ enum stat_item {
 	BARN_GET_FAIL,		/* Failed to get full sheaf from barn */
 	BARN_PUT,		/* Put full sheaf to barn */
 	BARN_PUT_FAIL,		/* Failed to put full sheaf to barn */
+	SHEAF_PREFILL_FAST,	/* Sheaf prefill grabbed the spare sheaf */
+	SHEAF_PREFILL_SLOW,	/* Sheaf prefill found no spare sheaf */
+	SHEAF_PREFILL_OVERSIZE,	/* Allocation of oversize sheaf for prefill */
+	SHEAF_RETURN_FAST,	/* Sheaf return reattached spare sheaf */
+	SHEAF_RETURN_SLOW,	/* Sheaf return could not reattach spare */
 	NR_SLUB_STAT_ITEMS
 };
 
@@ -445,6 +450,8 @@ struct slab_sheaf {
 	union {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
+		/* only used for prefilled sheafs */
+		unsigned int capacity;
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
@@ -2795,6 +2802,30 @@ static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf
 	spin_unlock_irqrestore(&barn->lock, flags);
 }
 
+static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
+{
+	struct slab_sheaf *sheaf = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&barn->lock, flags);
+
+	if (barn->nr_full) {
+		sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
+					barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_full--;
+	} else if (barn->nr_empty) {
+		sheaf = list_first_entry(&barn->sheaves_empty,
+					 struct slab_sheaf, barn_list);
+		list_del(&sheaf->barn_list);
+		barn->nr_empty--;
+	}
+
+	spin_unlock_irqrestore(&barn->lock, flags);
+
+	return sheaf;
+}
+
 /*
  * If a full sheaf is available, return it and put the supplied empty one to
  * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
@@ -4905,6 +4936,230 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
 
+/*
+ * returns a sheaf that has least the requested size
+ * when prefilling is needed, do so with given gfp flags
+ *
+ * return NULL if sheaf allocation or prefilling failed
+ */
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slub_percpu_sheaves *pcs;
+	struct slab_sheaf *sheaf = NULL;
+
+	if (unlikely(size > s->sheaf_capacity)) {
+
+		/*
+		 * slab_debug disables cpu sheaves intentionally so all
+		 * prefilled sheaves become "oversize" and we give up on
+		 * performance for the debugging.
+		 * Creating a cache without sheaves and then requesting a
+		 * prefilled sheaf is however not expected, so warn.
+		 */
+		WARN_ON_ONCE(s->sheaf_capacity == 0 &&
+			     !(s->flags & SLAB_DEBUG_FLAGS));
+
+		sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
+		if (!sheaf)
+			return NULL;
+
+		stat(s, SHEAF_PREFILL_OVERSIZE);
+		sheaf->cache = s;
+		sheaf->capacity = size;
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, size,
+					     &sheaf->objects[0])) {
+			kfree(sheaf);
+			return NULL;
+		}
+
+		sheaf->size = size;
+
+		return sheaf;
+	}
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (pcs->spare) {
+		sheaf = pcs->spare;
+		pcs->spare = NULL;
+		stat(s, SHEAF_PREFILL_FAST);
+	} else {
+		stat(s, SHEAF_PREFILL_SLOW);
+		sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
+		if (sheaf && sheaf->size)
+			stat(s, BARN_GET);
+		else
+			stat(s, BARN_GET_FAIL);
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+
+	if (!sheaf)
+		sheaf = alloc_empty_sheaf(s, gfp);
+
+	if (sheaf && sheaf->size < size) {
+		if (refill_sheaf(s, sheaf, gfp)) {
+			sheaf_flush_unused(s, sheaf);
+			free_empty_sheaf(s, sheaf);
+			sheaf = NULL;
+		}
+	}
+
+	if (sheaf)
+		sheaf->capacity = s->sheaf_capacity;
+
+	return sheaf;
+}
+
+/*
+ * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * If the sheaf cannot simply become the percpu spare sheaf, but there's space
+ * for a full sheaf in the barn, we try to refill the sheaf back to the cache's
+ * sheaf_capacity to avoid handling partially full sheaves.
+ *
+ * If the refill fails because gfp is e.g. GFP_NOWAIT, or the barn is full, the
+ * sheaf is instead flushed and freed.
+ */
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+			     struct slab_sheaf *sheaf)
+{
+	struct slub_percpu_sheaves *pcs;
+	bool refill = false;
+	struct node_barn *barn;
+
+	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
+		sheaf_flush_unused(s, sheaf);
+		kfree(sheaf);
+		return;
+	}
+
+	local_lock(&s->cpu_sheaves->lock);
+	pcs = this_cpu_ptr(s->cpu_sheaves);
+
+	if (!pcs->spare) {
+		pcs->spare = sheaf;
+		sheaf = NULL;
+		stat(s, SHEAF_RETURN_FAST);
+	} else if (data_race(pcs->barn->nr_full) < MAX_FULL_SHEAVES) {
+		barn = pcs->barn;
+		refill = true;
+	}
+
+	local_unlock(&s->cpu_sheaves->lock);
+
+	if (!sheaf)
+		return;
+
+	stat(s, SHEAF_RETURN_SLOW);
+
+	/*
+	 * if the barn is full of full sheaves or we fail to refill the sheaf,
+	 * simply flush and free it
+	 */
+	if (!refill || refill_sheaf(s, sheaf, gfp)) {
+		sheaf_flush_unused(s, sheaf);
+		free_empty_sheaf(s, sheaf);
+		return;
+	}
+
+	/* we racily determined the sheaf would fit, so now force it */
+	barn_put_full_sheaf(barn, sheaf);
+	stat(s, BARN_PUT);
+}
+
+/*
+ * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
+ * the given size
+ *
+ * the sheaf might be replaced by a new one when requesting more than
+ * s->sheaf_capacity objects if such replacement is necessary, but the refill
+ * fails (returning -ENOMEM), the existing sheaf is left intact
+ *
+ * In practice we always refill to full sheaf's capacity.
+ */
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+			    struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+
+	/*
+	 * TODO: do we want to support *sheaf == NULL to be equivalent of
+	 * kmem_cache_prefill_sheaf() ?
+	 */
+	if (!sheafp || !(*sheafp))
+		return -EINVAL;
+
+	sheaf = *sheafp;
+	if (sheaf->size >= size)
+		return 0;
+
+	if (likely(sheaf->capacity >= size)) {
+		if (likely(sheaf->capacity == s->sheaf_capacity))
+			return refill_sheaf(s, sheaf, gfp);
+
+		if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
+					     &sheaf->objects[sheaf->size])) {
+			return -ENOMEM;
+		}
+		sheaf->size = sheaf->capacity;
+
+		return 0;
+	}
+
+	/*
+	 * We had a regular sized sheaf and need an oversize one, or we had an
+	 * oversize one already but need a larger one now.
+	 * This should be a very rare path so let's not complicate it.
+	 */
+	sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+	if (!sheaf)
+		return -ENOMEM;
+
+	kmem_cache_return_sheaf(s, gfp, *sheafp);
+	*sheafp = sheaf;
+	return 0;
+}
+
+/*
+ * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
+ *
+ * Guaranteed not to fail as many allocations as was the requested size.
+ * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
+ *
+ * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
+ * memcg charging is forced over limit if necessary, to avoid failure.
+ */
+void *
+kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
+				   struct slab_sheaf *sheaf)
+{
+	void *ret = NULL;
+	bool init;
+
+	if (sheaf->size == 0)
+		goto out;
+
+	ret = sheaf->objects[--sheaf->size];
+
+	init = slab_want_init_on_alloc(gfp, s);
+
+	/* add __GFP_NOFAIL to force successful memcg charging */
+	slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
+out:
+	trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
+
+	return ret;
+}
+
+unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
 /*
  * To avoid unnecessary overhead, we pass through large allocation requests
  * directly to the page allocator. We use __GFP_COMP, because we will need to
@@ -8423,6 +8678,11 @@ STAT_ATTR(BARN_GET, barn_get);
 STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
 STAT_ATTR(BARN_PUT, barn_put);
 STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
+STAT_ATTR(SHEAF_PREFILL_FAST, sheaf_prefill_fast);
+STAT_ATTR(SHEAF_PREFILL_SLOW, sheaf_prefill_slow);
+STAT_ATTR(SHEAF_PREFILL_OVERSIZE, sheaf_prefill_oversize);
+STAT_ATTR(SHEAF_RETURN_FAST, sheaf_return_fast);
+STAT_ATTR(SHEAF_RETURN_SLOW, sheaf_return_slow);
 #endif	/* CONFIG_SLUB_STATS */
 
 #ifdef CONFIG_KFENCE
@@ -8523,6 +8783,11 @@ static struct attribute *slab_attrs[] = {
 	&barn_get_fail_attr.attr,
 	&barn_put_attr.attr,
 	&barn_put_fail_attr.attr,
+	&sheaf_prefill_fast_attr.attr,
+	&sheaf_prefill_slow_attr.attr,
+	&sheaf_prefill_oversize_attr.attr,
+	&sheaf_return_fast_attr.attr,
+	&sheaf_return_slow_attr.attr,
 #endif
 #ifdef CONFIG_FAILSLAB
 	&failslab_attr.attr,

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations
  2025-04-25  8:27 ` [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2025-05-06 22:54   ` Suren Baghdasaryan
  2025-05-07  9:15   ` Harry Yoo
  1 sibling, 0 replies; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-06 22:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 1:28 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Add functions for efficient guaranteed allocations e.g. in a critical
> section that cannot sleep, when the exact number of allocations is not
> known beforehand, but an upper limit can be calculated.
>
> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> number of objects.
>
> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> and is guaranteed not to fail until depleted.
>
> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> allocator after the critical section. This will also attempt to refill
> it to cache's sheaf capacity for better efficiency of sheaves handling,
> but it's not stricly necessary to succeed.
>
> kmem_cache_refill_sheaf() can be used to refill a previously obtained
> sheaf to requested size. If the current size is sufficient, it does
> nothing. If the requested size exceeds cache's sheaf_capacity and the
> sheaf's current capacity, the sheaf will be replaced with a new one,
> hence the indirect pointer parameter.
>
> kmem_cache_sheaf_size() can be used to query the current size.
>
> The implementation supports requesting sizes that exceed cache's
> sheaf_capacity, but it is not efficient - such "oversize" sheaves are
> allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
> immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
> might be especially ineffective when replacing a sheaf with a new one of
> a larger capacity. It is therefore better to size cache's
> sheaf_capacity accordingly to make oversize sheaves exceptional.
>
> CONFIG_SLUB_STATS counters are added for sheaf prefill and return
> operations. A prefill or return is considered _fast when it is able to
> grab or return a percpu spare sheaf (even if the sheaf needs a refill to
> satisfy the request, as those should amortize over time), and _slow
> otherwise (when the barn or even sheaf allocation/freeing has to be
> involved). sheaf_prefill_oversize is provided to determine how many
> prefills were oversize (counter for oversize returns is not necessary as
> all oversize refills result in oversize returns).
>
> When slub_debug is enabled for a cache with sheaves, no percpu sheaves
> exist for it, but the prefill functionality is still provided simply by
> all prefilled sheaves becoming oversize. If percpu sheaves are not
> created for a cache due to not passing the sheaf_capacity argument on
> cache creation, the prefills also work through oversize sheaves, but
> there's a WARN_ON_ONCE() to indicate the omission.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/slab.h |  16 ++++
>  mm/slub.c            | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 281 insertions(+)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 4cb495d55fc58c70a992ee4782d7990ce1c55dc6..b0a9ba33abae22bf38cbf1689e3c08bb0b05002f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -829,6 +829,22 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t flags,
>                                    int node) __assume_slab_alignment __malloc;
>  #define kmem_cache_alloc_node(...)     alloc_hooks(kmem_cache_alloc_node_noprof(__VA_ARGS__))
>
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
> +
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +               struct slab_sheaf **sheafp, unsigned int size);
> +
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                                      struct slab_sheaf *sheaf);
> +
> +void *kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *cachep, gfp_t gfp,
> +                       struct slab_sheaf *sheaf) __assume_slab_alignment __malloc;
> +#define kmem_cache_alloc_from_sheaf(...)       \
> +                       alloc_hooks(kmem_cache_alloc_from_sheaf_noprof(__VA_ARGS__))
> +
> +unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf);
> +
>  /*
>   * These macros allow declaring a kmem_buckets * parameter alongside size, which
>   * can be compiled out with CONFIG_SLAB_BUCKETS=n so that a large number of call
> diff --git a/mm/slub.c b/mm/slub.c
> index 6f31a27b5d47fa6621fa8af6d6842564077d4b60..724266fdd996c091f1f0b34012c5179f17dfa422 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -384,6 +384,11 @@ enum stat_item {
>         BARN_GET_FAIL,          /* Failed to get full sheaf from barn */
>         BARN_PUT,               /* Put full sheaf to barn */
>         BARN_PUT_FAIL,          /* Failed to put full sheaf to barn */
> +       SHEAF_PREFILL_FAST,     /* Sheaf prefill grabbed the spare sheaf */
> +       SHEAF_PREFILL_SLOW,     /* Sheaf prefill found no spare sheaf */
> +       SHEAF_PREFILL_OVERSIZE, /* Allocation of oversize sheaf for prefill */
> +       SHEAF_RETURN_FAST,      /* Sheaf return reattached spare sheaf */
> +       SHEAF_RETURN_SLOW,      /* Sheaf return could not reattach spare */
>         NR_SLUB_STAT_ITEMS
>  };
>
> @@ -445,6 +450,8 @@ struct slab_sheaf {
>         union {
>                 struct rcu_head rcu_head;
>                 struct list_head barn_list;
> +               /* only used for prefilled sheafs */
> +               unsigned int capacity;
>         };
>         struct kmem_cache *cache;
>         unsigned int size;
> @@ -2795,6 +2802,30 @@ static void barn_put_full_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf
>         spin_unlock_irqrestore(&barn->lock, flags);
>  }
>
> +static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
> +{
> +       struct slab_sheaf *sheaf = NULL;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&barn->lock, flags);
> +
> +       if (barn->nr_full) {
> +               sheaf = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
> +                                       barn_list);
> +               list_del(&sheaf->barn_list);
> +               barn->nr_full--;
> +       } else if (barn->nr_empty) {
> +               sheaf = list_first_entry(&barn->sheaves_empty,
> +                                        struct slab_sheaf, barn_list);
> +               list_del(&sheaf->barn_list);
> +               barn->nr_empty--;
> +       }
> +
> +       spin_unlock_irqrestore(&barn->lock, flags);
> +
> +       return sheaf;
> +}
> +
>  /*
>   * If a full sheaf is available, return it and put the supplied empty one to
>   * barn. We ignore the limit on empty sheaves as the number of sheaves doesn't
> @@ -4905,6 +4936,230 @@ void *kmem_cache_alloc_node_noprof(struct kmem_cache *s, gfp_t gfpflags, int nod
>  }
>  EXPORT_SYMBOL(kmem_cache_alloc_node_noprof);
>
> +/*
> + * returns a sheaf that has least the requested size

s/least/at least ?

> + * when prefilling is needed, do so with given gfp flags
> + *
> + * return NULL if sheaf allocation or prefilling failed
> + */
> +struct slab_sheaf *
> +kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       struct slab_sheaf *sheaf = NULL;
> +
> +       if (unlikely(size > s->sheaf_capacity)) {
> +
> +               /*
> +                * slab_debug disables cpu sheaves intentionally so all
> +                * prefilled sheaves become "oversize" and we give up on
> +                * performance for the debugging.
> +                * Creating a cache without sheaves and then requesting a
> +                * prefilled sheaf is however not expected, so warn.
> +                */
> +               WARN_ON_ONCE(s->sheaf_capacity == 0 &&
> +                            !(s->flags & SLAB_DEBUG_FLAGS));
> +
> +               sheaf = kzalloc(struct_size(sheaf, objects, size), gfp);
> +               if (!sheaf)
> +                       return NULL;
> +
> +               stat(s, SHEAF_PREFILL_OVERSIZE);
> +               sheaf->cache = s;
> +               sheaf->capacity = size;
> +
> +               if (!__kmem_cache_alloc_bulk(s, gfp, size,
> +                                            &sheaf->objects[0])) {
> +                       kfree(sheaf);

Not sure if we should have SHEAF_PREFILL_OVERSIZE_FAIL accounting as well here.


> +                       return NULL;
> +               }
> +
> +               sheaf->size = size;
> +
> +               return sheaf;
> +       }
> +
> +       local_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (pcs->spare) {
> +               sheaf = pcs->spare;
> +               pcs->spare = NULL;
> +               stat(s, SHEAF_PREFILL_FAST);
> +       } else {
> +               stat(s, SHEAF_PREFILL_SLOW);
> +               sheaf = barn_get_full_or_empty_sheaf(pcs->barn);
> +               if (sheaf && sheaf->size)
> +                       stat(s, BARN_GET);
> +               else
> +                       stat(s, BARN_GET_FAIL);
> +       }
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +
> +       if (!sheaf)
> +               sheaf = alloc_empty_sheaf(s, gfp);
> +
> +       if (sheaf && sheaf->size < size) {
> +               if (refill_sheaf(s, sheaf, gfp)) {
> +                       sheaf_flush_unused(s, sheaf);
> +                       free_empty_sheaf(s, sheaf);
> +                       sheaf = NULL;
> +               }
> +       }
> +
> +       if (sheaf)
> +               sheaf->capacity = s->sheaf_capacity;
> +
> +       return sheaf;
> +}
> +
> +/*
> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
> + *
> + * If the sheaf cannot simply become the percpu spare sheaf, but there's space
> + * for a full sheaf in the barn, we try to refill the sheaf back to the cache's
> + * sheaf_capacity to avoid handling partially full sheaves.
> + *
> + * If the refill fails because gfp is e.g. GFP_NOWAIT, or the barn is full, the
> + * sheaf is instead flushed and freed.
> + */
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                            struct slab_sheaf *sheaf)
> +{
> +       struct slub_percpu_sheaves *pcs;
> +       bool refill = false;
> +       struct node_barn *barn;
> +
> +       if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
> +               sheaf_flush_unused(s, sheaf);
> +               kfree(sheaf);
> +               return;
> +       }
> +
> +       local_lock(&s->cpu_sheaves->lock);
> +       pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +       if (!pcs->spare) {
> +               pcs->spare = sheaf;
> +               sheaf = NULL;
> +               stat(s, SHEAF_RETURN_FAST);
> +       } else if (data_race(pcs->barn->nr_full) < MAX_FULL_SHEAVES) {
> +               barn = pcs->barn;
> +               refill = true;
> +       }
> +
> +       local_unlock(&s->cpu_sheaves->lock);
> +
> +       if (!sheaf)
> +               return;
> +
> +       stat(s, SHEAF_RETURN_SLOW);
> +
> +       /*
> +        * if the barn is full of full sheaves or we fail to refill the sheaf,
> +        * simply flush and free it
> +        */
> +       if (!refill || refill_sheaf(s, sheaf, gfp)) {
> +               sheaf_flush_unused(s, sheaf);
> +               free_empty_sheaf(s, sheaf);
> +               return;
> +       }
> +
> +       /* we racily determined the sheaf would fit, so now force it */
> +       barn_put_full_sheaf(barn, sheaf);
> +       stat(s, BARN_PUT);
> +}
> +
> +/*
> + * refill a sheaf previously returned by kmem_cache_prefill_sheaf to at least
> + * the given size
> + *
> + * the sheaf might be replaced by a new one when requesting more than
> + * s->sheaf_capacity objects if such replacement is necessary, but the refill
> + * fails (returning -ENOMEM), the existing sheaf is left intact
> + *
> + * In practice we always refill to full sheaf's capacity.
> + */
> +int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
> +                           struct slab_sheaf **sheafp, unsigned int size)
> +{
> +       struct slab_sheaf *sheaf;
> +
> +       /*
> +        * TODO: do we want to support *sheaf == NULL to be equivalent of
> +        * kmem_cache_prefill_sheaf() ?
> +        */
> +       if (!sheafp || !(*sheafp))
> +               return -EINVAL;
> +
> +       sheaf = *sheafp;
> +       if (sheaf->size >= size)
> +               return 0;
> +
> +       if (likely(sheaf->capacity >= size)) {
> +               if (likely(sheaf->capacity == s->sheaf_capacity))
> +                       return refill_sheaf(s, sheaf, gfp);
> +
> +               if (!__kmem_cache_alloc_bulk(s, gfp, sheaf->capacity - sheaf->size,
> +                                            &sheaf->objects[sheaf->size])) {
> +                       return -ENOMEM;
> +               }
> +               sheaf->size = sheaf->capacity;
> +
> +               return 0;
> +       }
> +
> +       /*
> +        * We had a regular sized sheaf and need an oversize one, or we had an
> +        * oversize one already but need a larger one now.
> +        * This should be a very rare path so let's not complicate it.
> +        */
> +       sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
> +       if (!sheaf)
> +               return -ENOMEM;
> +
> +       kmem_cache_return_sheaf(s, gfp, *sheafp);
> +       *sheafp = sheaf;
> +       return 0;
> +}
> +
> +/*
> + * Allocate from a sheaf obtained by kmem_cache_prefill_sheaf()
> + *
> + * Guaranteed not to fail as many allocations as was the requested size.
> + * After the sheaf is emptied, it fails - no fallback to the slab cache itself.
> + *
> + * The gfp parameter is meant only to specify __GFP_ZERO or __GFP_ACCOUNT
> + * memcg charging is forced over limit if necessary, to avoid failure.
> + */
> +void *
> +kmem_cache_alloc_from_sheaf_noprof(struct kmem_cache *s, gfp_t gfp,
> +                                  struct slab_sheaf *sheaf)
> +{
> +       void *ret = NULL;
> +       bool init;
> +
> +       if (sheaf->size == 0)
> +               goto out;
> +
> +       ret = sheaf->objects[--sheaf->size];
> +
> +       init = slab_want_init_on_alloc(gfp, s);
> +
> +       /* add __GFP_NOFAIL to force successful memcg charging */
> +       slab_post_alloc_hook(s, NULL, gfp | __GFP_NOFAIL, 1, &ret, init, s->object_size);
> +out:
> +       trace_kmem_cache_alloc(_RET_IP_, ret, s, gfp, NUMA_NO_NODE);
> +
> +       return ret;
> +}
> +
> +unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
> +{
> +       return sheaf->size;
> +}
>  /*
>   * To avoid unnecessary overhead, we pass through large allocation requests
>   * directly to the page allocator. We use __GFP_COMP, because we will need to
> @@ -8423,6 +8678,11 @@ STAT_ATTR(BARN_GET, barn_get);
>  STAT_ATTR(BARN_GET_FAIL, barn_get_fail);
>  STAT_ATTR(BARN_PUT, barn_put);
>  STAT_ATTR(BARN_PUT_FAIL, barn_put_fail);
> +STAT_ATTR(SHEAF_PREFILL_FAST, sheaf_prefill_fast);
> +STAT_ATTR(SHEAF_PREFILL_SLOW, sheaf_prefill_slow);
> +STAT_ATTR(SHEAF_PREFILL_OVERSIZE, sheaf_prefill_oversize);
> +STAT_ATTR(SHEAF_RETURN_FAST, sheaf_return_fast);
> +STAT_ATTR(SHEAF_RETURN_SLOW, sheaf_return_slow);
>  #endif /* CONFIG_SLUB_STATS */
>
>  #ifdef CONFIG_KFENCE
> @@ -8523,6 +8783,11 @@ static struct attribute *slab_attrs[] = {
>         &barn_get_fail_attr.attr,
>         &barn_put_attr.attr,
>         &barn_put_fail_attr.attr,
> +       &sheaf_prefill_fast_attr.attr,
> +       &sheaf_prefill_slow_attr.attr,
> +       &sheaf_prefill_oversize_attr.attr,
> +       &sheaf_return_fast_attr.attr,
> +       &sheaf_return_slow_attr.attr,
>  #endif
>  #ifdef CONFIG_FAILSLAB
>         &failslab_attr.attr,
>
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations
  2025-04-25  8:27 ` [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
  2025-05-06 22:54   ` Suren Baghdasaryan
@ 2025-05-07  9:15   ` Harry Yoo
  2025-05-07  9:20     ` Harry Yoo
  2025-05-15  8:41     ` Vlastimil Babka
  1 sibling, 2 replies; 35+ messages in thread
From: Harry Yoo @ 2025-05-07  9:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 10:27:23AM +0200, Vlastimil Babka wrote:
> Add functions for efficient guaranteed allocations e.g. in a critical
> section that cannot sleep, when the exact number of allocations is not
> known beforehand, but an upper limit can be calculated.
> 
> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> number of objects.
> 
> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> and is guaranteed not to fail until depleted.
> 
> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> allocator after the critical section. This will also attempt to refill
> it to cache's sheaf capacity for better efficiency of sheaves handling,
> but it's not stricly necessary to succeed.
> 
> kmem_cache_refill_sheaf() can be used to refill a previously obtained
> sheaf to requested size. If the current size is sufficient, it does
> nothing. If the requested size exceeds cache's sheaf_capacity and the
> sheaf's current capacity, the sheaf will be replaced with a new one,
> hence the indirect pointer parameter.
> 
> kmem_cache_sheaf_size() can be used to query the current size.
> 
> The implementation supports requesting sizes that exceed cache's
> sheaf_capacity, but it is not efficient - such "oversize" sheaves are
> allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
> immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
> might be especially ineffective when replacing a sheaf with a new one of
> a larger capacity. It is therefore better to size cache's
> sheaf_capacity accordingly to make oversize sheaves exceptional.
> 
> CONFIG_SLUB_STATS counters are added for sheaf prefill and return
> operations. A prefill or return is considered _fast when it is able to
> grab or return a percpu spare sheaf (even if the sheaf needs a refill to
> satisfy the request, as those should amortize over time), and _slow
> otherwise (when the barn or even sheaf allocation/freeing has to be
> involved). sheaf_prefill_oversize is provided to determine how many
> prefills were oversize (counter for oversize returns is not necessary as
> all oversize refills result in oversize returns).
> 
> When slub_debug is enabled for a cache with sheaves, no percpu sheaves
> exist for it, but the prefill functionality is still provided simply by
> all prefilled sheaves becoming oversize. If percpu sheaves are not
> created for a cache due to not passing the sheaf_capacity argument on
> cache creation, the prefills also work through oversize sheaves, but
> there's a WARN_ON_ONCE() to indicate the omission.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

with a nit below.

> +/*
> + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
> + *
> + * If the sheaf cannot simply become the percpu spare sheaf, but there's space
> + * for a full sheaf in the barn, we try to refill the sheaf back to the cache's
> + * sheaf_capacity to avoid handling partially full sheaves.
> + *
> + * If the refill fails because gfp is e.g. GFP_NOWAIT, or the barn is full, the
> + * sheaf is instead flushed and freed.
> + */
> +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> +			     struct slab_sheaf *sheaf)
> +{
> +	struct slub_percpu_sheaves *pcs;
> +	bool refill = false;
> +	struct node_barn *barn;
> +
> +	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
> +		sheaf_flush_unused(s, sheaf);
> +		kfree(sheaf);
> +		return;
> +	}
> +
> +	local_lock(&s->cpu_sheaves->lock);
> +	pcs = this_cpu_ptr(s->cpu_sheaves);
> +
> +	if (!pcs->spare) {
> +		pcs->spare = sheaf;
> +		sheaf = NULL;
> +		stat(s, SHEAF_RETURN_FAST);
> +	} else if (data_race(pcs->barn->nr_full) < MAX_FULL_SHEAVES) {
> +		barn = pcs->barn;
> +		refill = true;
> +	}
> +
> +	local_unlock(&s->cpu_sheaves->lock);
> +
> +	if (!sheaf)
> +		return;
> +
> +	stat(s, SHEAF_RETURN_SLOW);
> +
> +	/*
> +	 * if the barn is full of full sheaves or we fail to refill the sheaf,
> +	 * simply flush and free it
> +	 */
> +	if (!refill || refill_sheaf(s, sheaf, gfp)) {
> +		sheaf_flush_unused(s, sheaf);
> +		free_empty_sheaf(s, sheaf);
> +		return;
> +	}
> +
> +	/* we racily determined the sheaf would fit, so now force it */
> +	barn_put_full_sheaf(barn, sheaf);
> +	stat(s, BARN_PUT);
> +}

nit: as accessing pcs->barn outside local_lock is safe (it does not go
away until the cache is destroyed...), this could be simplified a little
bit:

diff --git a/mm/slub.c b/mm/slub.c
index 2bf83e2b85b2..4e1daba4d13e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5043,7 +5043,6 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
 			     struct slab_sheaf *sheaf)
 {
 	struct slub_percpu_sheaves *pcs;
-	bool refill = false;
 	struct node_barn *barn;

 	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
@@ -5059,9 +5058,6 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
 		pcs->spare = sheaf;
 		sheaf = NULL;
 		stat(s, SHEAF_RETURN_FAST);
-	} else if (data_race(pcs->barn->nr_full) < MAX_FULL_SHEAVES) {
-		barn = pcs->barn;
-		refill = true;
 	}

 	local_unlock(&s->cpu_sheaves->lock);
@@ -5071,17 +5067,19 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,

 	stat(s, SHEAF_RETURN_SLOW);

+	/* Accessing pcs->barn outside local_lock is safe */
+	barn = pcs->barn;
+
 	/*
 	 * if the barn is full of full sheaves or we fail to refill the sheaf,
 	 * simply flush and free it
 	 */
-	if (!refill || refill_sheaf(s, sheaf, gfp)) {
+	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
+			refill_sheaf(s, sheaf, gfp)) {
 		sheaf_flush_unused(s, sheaf);
 		free_empty_sheaf(s, sheaf);
-		return;
 	}

-	/* we racily determined the sheaf would fit, so now force it */
 	barn_put_full_sheaf(barn, sheaf);
 	stat(s, BARN_PUT);
 }

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations
  2025-05-07  9:15   ` Harry Yoo
@ 2025-05-07  9:20     ` Harry Yoo
  2025-05-15  8:41     ` Vlastimil Babka
  1 sibling, 0 replies; 35+ messages in thread
From: Harry Yoo @ 2025-05-07  9:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Wed, May 07, 2025 at 06:15:34PM +0900, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:23AM +0200, Vlastimil Babka wrote:
> > Add functions for efficient guaranteed allocations e.g. in a critical
> > section that cannot sleep, when the exact number of allocations is not
> > known beforehand, but an upper limit can be calculated.
> > 
> > kmem_cache_prefill_sheaf() returns a sheaf containing at least given
> > number of objects.
> > 
> > kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
> > and is guaranteed not to fail until depleted.
> > 
> > kmem_cache_return_sheaf() is for giving the sheaf back to the slab
> > allocator after the critical section. This will also attempt to refill
> > it to cache's sheaf capacity for better efficiency of sheaves handling,
> > but it's not stricly necessary to succeed.
> > 
> > kmem_cache_refill_sheaf() can be used to refill a previously obtained
> > sheaf to requested size. If the current size is sufficient, it does
> > nothing. If the requested size exceeds cache's sheaf_capacity and the
> > sheaf's current capacity, the sheaf will be replaced with a new one,
> > hence the indirect pointer parameter.
> > 
> > kmem_cache_sheaf_size() can be used to query the current size.
> > 
> > The implementation supports requesting sizes that exceed cache's
> > sheaf_capacity, but it is not efficient - such "oversize" sheaves are
> > allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
> > immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
> > might be especially ineffective when replacing a sheaf with a new one of
> > a larger capacity. It is therefore better to size cache's
> > sheaf_capacity accordingly to make oversize sheaves exceptional.
> > 
> > CONFIG_SLUB_STATS counters are added for sheaf prefill and return
> > operations. A prefill or return is considered _fast when it is able to
> > grab or return a percpu spare sheaf (even if the sheaf needs a refill to
> > satisfy the request, as those should amortize over time), and _slow
> > otherwise (when the barn or even sheaf allocation/freeing has to be
> > involved). sheaf_prefill_oversize is provided to determine how many
> > prefills were oversize (counter for oversize returns is not necessary as
> > all oversize refills result in oversize returns).
> > 
> > When slub_debug is enabled for a cache with sheaves, no percpu sheaves
> > exist for it, but the prefill functionality is still provided simply by
> > all prefilled sheaves becoming oversize. If percpu sheaves are not
> > created for a cache due to not passing the sheaf_capacity argument on
> > cache creation, the prefills also work through oversize sheaves, but
> > there's a WARN_ON_ONCE() to indicate the omission.
> > 
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> 
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> with a nit below.
> 
> > +/*
> > + * Use this to return a sheaf obtained by kmem_cache_prefill_sheaf()
> > + *
> > + * If the sheaf cannot simply become the percpu spare sheaf, but there's space
> > + * for a full sheaf in the barn, we try to refill the sheaf back to the cache's
> > + * sheaf_capacity to avoid handling partially full sheaves.
> > + *
> > + * If the refill fails because gfp is e.g. GFP_NOWAIT, or the barn is full, the
> > + * sheaf is instead flushed and freed.
> > + */
> > +void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> > +			     struct slab_sheaf *sheaf)
> > +{
> > +	struct slub_percpu_sheaves *pcs;
> > +	bool refill = false;
> > +	struct node_barn *barn;
> > +
> > +	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
> > +		sheaf_flush_unused(s, sheaf);
> > +		kfree(sheaf);
> > +		return;
> > +	}
> > +
> > +	local_lock(&s->cpu_sheaves->lock);
> > +	pcs = this_cpu_ptr(s->cpu_sheaves);
> > +
> > +	if (!pcs->spare) {
> > +		pcs->spare = sheaf;
> > +		sheaf = NULL;
> > +		stat(s, SHEAF_RETURN_FAST);
> > +	} else if (data_race(pcs->barn->nr_full) < MAX_FULL_SHEAVES) {
> > +		barn = pcs->barn;
> > +		refill = true;
> > +	}
> > +
> > +	local_unlock(&s->cpu_sheaves->lock);
> > +
> > +	if (!sheaf)
> > +		return;
> > +
> > +	stat(s, SHEAF_RETURN_SLOW);
> > +
> > +	/*
> > +	 * if the barn is full of full sheaves or we fail to refill the sheaf,
> > +	 * simply flush and free it
> > +	 */
> > +	if (!refill || refill_sheaf(s, sheaf, gfp)) {
> > +		sheaf_flush_unused(s, sheaf);
> > +		free_empty_sheaf(s, sheaf);
> > +		return;
> > +	}
> > +
> > +	/* we racily determined the sheaf would fit, so now force it */
> > +	barn_put_full_sheaf(barn, sheaf);
> > +	stat(s, BARN_PUT);
> > +}
> 
> nit: as accessing pcs->barn outside local_lock is safe (it does not go
> away until the cache is destroyed...), this could be simplified a little
> bit:
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 2bf83e2b85b2..4e1daba4d13e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5043,7 +5043,6 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
>  			     struct slab_sheaf *sheaf)
>  {
>  	struct slub_percpu_sheaves *pcs;
> -	bool refill = false;
>  	struct node_barn *barn;
> 
>  	if (unlikely(sheaf->capacity != s->sheaf_capacity)) {
> @@ -5059,9 +5058,6 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
>  		pcs->spare = sheaf;
>  		sheaf = NULL;
>  		stat(s, SHEAF_RETURN_FAST);
> -	} else if (data_race(pcs->barn->nr_full) < MAX_FULL_SHEAVES) {
> -		barn = pcs->barn;
> -		refill = true;
>  	}
> 
>  	local_unlock(&s->cpu_sheaves->lock);
> @@ -5071,17 +5067,19 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
> 
>  	stat(s, SHEAF_RETURN_SLOW);
> 
> +	/* Accessing pcs->barn outside local_lock is safe */
> +	barn = pcs->barn;
> +
>  	/*
>  	 * if the barn is full of full sheaves or we fail to refill the sheaf,
>  	 * simply flush and free it
>  	 */
> -	if (!refill || refill_sheaf(s, sheaf, gfp)) {
> +	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES ||
> +			refill_sheaf(s, sheaf, gfp)) {
>  		sheaf_flush_unused(s, sheaf);
>  		free_empty_sheaf(s, sheaf);
> -		return;

Uh, I shouldn't have deleted this return statement :)

>  	}
> 
> -	/* we racily determined the sheaf would fit, so now force it */
>  	barn_put_full_sheaf(barn, sheaf);
>  	stat(s, BARN_PUT);
>  }
> 
> -- 
> Cheers,
> Harry / Hyeonggon

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations
  2025-05-07  9:15   ` Harry Yoo
  2025-05-07  9:20     ` Harry Yoo
@ 2025-05-15  8:41     ` Vlastimil Babka
  1 sibling, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-15  8:41 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 5/7/25 11:15, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:23AM +0200, Vlastimil Babka wrote:
>> Add functions for efficient guaranteed allocations e.g. in a critical
>> section that cannot sleep, when the exact number of allocations is not
>> known beforehand, but an upper limit can be calculated.
>> 
>> kmem_cache_prefill_sheaf() returns a sheaf containing at least given
>> number of objects.
>> 
>> kmem_cache_alloc_from_sheaf() will allocate an object from the sheaf
>> and is guaranteed not to fail until depleted.
>> 
>> kmem_cache_return_sheaf() is for giving the sheaf back to the slab
>> allocator after the critical section. This will also attempt to refill
>> it to cache's sheaf capacity for better efficiency of sheaves handling,
>> but it's not stricly necessary to succeed.
>> 
>> kmem_cache_refill_sheaf() can be used to refill a previously obtained
>> sheaf to requested size. If the current size is sufficient, it does
>> nothing. If the requested size exceeds cache's sheaf_capacity and the
>> sheaf's current capacity, the sheaf will be replaced with a new one,
>> hence the indirect pointer parameter.
>> 
>> kmem_cache_sheaf_size() can be used to query the current size.
>> 
>> The implementation supports requesting sizes that exceed cache's
>> sheaf_capacity, but it is not efficient - such "oversize" sheaves are
>> allocated fresh in kmem_cache_prefill_sheaf() and flushed and freed
>> immediately by kmem_cache_return_sheaf(). kmem_cache_refill_sheaf()
>> might be especially ineffective when replacing a sheaf with a new one of
>> a larger capacity. It is therefore better to size cache's
>> sheaf_capacity accordingly to make oversize sheaves exceptional.
>> 
>> CONFIG_SLUB_STATS counters are added for sheaf prefill and return
>> operations. A prefill or return is considered _fast when it is able to
>> grab or return a percpu spare sheaf (even if the sheaf needs a refill to
>> satisfy the request, as those should amortize over time), and _slow
>> otherwise (when the barn or even sheaf allocation/freeing has to be
>> involved). sheaf_prefill_oversize is provided to determine how many
>> prefills were oversize (counter for oversize returns is not necessary as
>> all oversize refills result in oversize returns).
>> 
>> When slub_debug is enabled for a cache with sheaves, no percpu sheaves
>> exist for it, but the prefill functionality is still provided simply by
>> all prefilled sheaves becoming oversize. If percpu sheaves are not
>> created for a cache due to not passing the sheaf_capacity argument on
>> cache creation, the prefills also work through oversize sheaves, but
>> there's a WARN_ON_ONCE() to indicate the omission.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>> ---
> 
> Looks good to me,
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> 
> with a nit below.

Thanks, incorporated the suggestion!



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 4/9] slab: determine barn status racily outside of lock
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (2 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 5/9] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

The possibility of many barn operations is determined by the current
number of full or empty sheaves. Taking the barn->lock just to find out
that e.g. there are no empty sheaves results in unnecessary overhead and
lock contention. Thus perform these checks outside of the lock with a
data_race() annotated variable read and fail quickly without taking the
lock.

Checks for sheaf availability that racily succeed have to be obviously
repeated under the lock for correctness, but we can skip repeating
checks if there are too many sheaves on the given list as the limits
don't need to be strict.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/slub.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 724266fdd996c091f1f0b34012c5179f17dfa422..cc273cc45f632e16644355831132cdc391219cec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2758,9 +2758,12 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *empty = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_empty) {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty,
 					 struct slab_sheaf, barn_list);
 		list_del(&empty->barn_list);
@@ -2807,6 +2810,9 @@ static struct slab_sheaf *barn_get_full_or_empty_sheaf(struct node_barn *barn)
 	struct slab_sheaf *sheaf = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full) && !data_race(barn->nr_empty))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
 	if (barn->nr_full) {
@@ -2837,9 +2843,12 @@ barn_replace_empty_sheaf(struct node_barn *barn, struct slab_sheaf *empty)
 	struct slab_sheaf *full = NULL;
 	unsigned long flags;
 
+	if (!data_race(barn->nr_full))
+		return NULL;
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full) {
+	if (likely(barn->nr_full)) {
 		full = list_first_entry(&barn->sheaves_full, struct slab_sheaf,
 					barn_list);
 		list_del(&full->barn_list);
@@ -2862,19 +2871,23 @@ barn_replace_full_sheaf(struct node_barn *barn, struct slab_sheaf *full)
 	struct slab_sheaf *empty;
 	unsigned long flags;
 
+	/* we don't repeat this check under barn->lock as it's not critical */
+	if (data_race(barn->nr_full) >= MAX_FULL_SHEAVES)
+		return ERR_PTR(-E2BIG);
+	if (!data_race(barn->nr_empty))
+		return ERR_PTR(-ENOMEM);
+
 	spin_lock_irqsave(&barn->lock, flags);
 
-	if (barn->nr_full >= MAX_FULL_SHEAVES) {
-		empty = ERR_PTR(-E2BIG);
-	} else if (!barn->nr_empty) {
-		empty = ERR_PTR(-ENOMEM);
-	} else {
+	if (likely(barn->nr_empty)) {
 		empty = list_first_entry(&barn->sheaves_empty, struct slab_sheaf,
 					 barn_list);
 		list_del(&empty->barn_list);
 		list_add(&full->barn_list, &barn->sheaves_full);
 		barn->nr_empty--;
 		barn->nr_full++;
+	} else {
+		empty = ERR_PTR(-ENOMEM);
 	}
 
 	spin_unlock_irqrestore(&barn->lock, flags);

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 5/9] tools: Add testing support for changes to rcu and slab for sheaves
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (3 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 4/9] slab: determine barn status racily outside of lock Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 6/9] tools: Add sheaves support to testing infrastructure Vlastimil Babka
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Make testing work for the slab and rcu changes that have come in with
the sheaves work.

This only works with one kmem_cache, and only the first one used.
Subsequent setting of kmem_cache will not update the active kmem_cache
and will be silently dropped because there are other tests which happen
after the kmem_cache of interest is set.

The saved active kmem_cache is used in the rcu callback, which passes
the object to be freed.

The rcu call takes the rcu_head, which is passed in as the field in the
struct (in this case rcu in the maple tree node), which is calculated by
pointer math.  The offset of which is saved (in a global variable) for
restoring the node pointer on the callback after the rcu grace period
expires.

Don't use any of this outside of testing, please.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 tools/include/linux/slab.h            | 41 ++++++++++++++++++++++++++++++++---
 tools/testing/shared/linux.c          | 24 ++++++++++++++++----
 tools/testing/shared/linux/rcupdate.h | 22 +++++++++++++++++++
 3 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index c87051e2b26f5a7fee0362697fae067076b8e84d..d1444e79f2685edb828adbce8b3fbb500c0f8844 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -23,6 +23,12 @@ enum slab_state {
 	FULL
 };
 
+struct kmem_cache_args {
+	unsigned int align;
+	unsigned int sheaf_capacity;
+	void (*ctor)(void *);
+};
+
 static inline void *kzalloc(size_t size, gfp_t gfp)
 {
 	return kmalloc(size, gfp | __GFP_ZERO);
@@ -37,9 +43,38 @@ static inline void *kmem_cache_alloc(struct kmem_cache *cachep, int flags)
 }
 void kmem_cache_free(struct kmem_cache *cachep, void *objp);
 
-struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
-			unsigned int align, unsigned int flags,
-			void (*ctor)(void *));
+
+struct kmem_cache *
+__kmem_cache_create_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags);
+
+/* If NULL is passed for @args, use this variant with default arguments. */
+static inline struct kmem_cache *
+__kmem_cache_default_args(const char *name, unsigned int size,
+		struct kmem_cache_args *args, unsigned int flags)
+{
+	struct kmem_cache_args kmem_default_args = {};
+
+	return __kmem_cache_create_args(name, size, &kmem_default_args, flags);
+}
+
+static inline struct kmem_cache *
+__kmem_cache_create(const char *name, unsigned int size, unsigned int align,
+		unsigned int flags, void (*ctor)(void *))
+{
+	struct kmem_cache_args kmem_args = {
+		.align	= align,
+		.ctor	= ctor,
+	};
+
+	return __kmem_cache_create_args(name, size, &kmem_args, flags);
+}
+
+#define kmem_cache_create(__name, __object_size, __args, ...)           \
+	_Generic((__args),                                              \
+		struct kmem_cache_args *: __kmem_cache_create_args,	\
+		void *: __kmem_cache_default_args,			\
+		default: __kmem_cache_create)(__name, __object_size, __args, __VA_ARGS__)
 
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index 0f97fb0d19e19c327aa4843a35b45cc086f4f366..f998555a1b2af4a899a468a652b04622df459ed3 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -20,6 +20,7 @@ struct kmem_cache {
 	pthread_mutex_t lock;
 	unsigned int size;
 	unsigned int align;
+	unsigned int sheaf_capacity;
 	int nr_objs;
 	void *objs;
 	void (*ctor)(void *);
@@ -31,6 +32,8 @@ struct kmem_cache {
 	void *private;
 };
 
+static struct kmem_cache *kmem_active = NULL;
+
 void kmem_cache_set_callback(struct kmem_cache *cachep, void (*callback)(void *))
 {
 	cachep->callback = callback;
@@ -147,6 +150,14 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
 	pthread_mutex_unlock(&cachep->lock);
 }
 
+void kmem_cache_free_active(void *objp)
+{
+	if (!kmem_active)
+		printf("WARNING: No active kmem_cache\n");
+
+	kmem_cache_free(kmem_active, objp);
+}
+
 void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list)
 {
 	if (kmalloc_verbose)
@@ -234,23 +245,28 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 }
 
 struct kmem_cache *
-kmem_cache_create(const char *name, unsigned int size, unsigned int align,
-		unsigned int flags, void (*ctor)(void *))
+__kmem_cache_create_args(const char *name, unsigned int size,
+			  struct kmem_cache_args *args,
+			  unsigned int flags)
 {
 	struct kmem_cache *ret = malloc(sizeof(*ret));
 
 	pthread_mutex_init(&ret->lock, NULL);
 	ret->size = size;
-	ret->align = align;
+	ret->align = args->align;
+	ret->sheaf_capacity = args->sheaf_capacity;
 	ret->nr_objs = 0;
 	ret->nr_allocated = 0;
 	ret->nr_tallocated = 0;
 	ret->objs = NULL;
-	ret->ctor = ctor;
+	ret->ctor = args->ctor;
 	ret->non_kernel = 0;
 	ret->exec_callback = false;
 	ret->callback = NULL;
 	ret->private = NULL;
+	if (!kmem_active)
+		kmem_active = ret;
+
 	return ret;
 }
 
diff --git a/tools/testing/shared/linux/rcupdate.h b/tools/testing/shared/linux/rcupdate.h
index fed468fb0c78db6f33fb1900c7110ab5f3c19c65..c95e2f0bbd93798e544d7d34e0823ed68414f924 100644
--- a/tools/testing/shared/linux/rcupdate.h
+++ b/tools/testing/shared/linux/rcupdate.h
@@ -9,4 +9,26 @@
 #define rcu_dereference_check(p, cond) rcu_dereference(p)
 #define RCU_INIT_POINTER(p, v)	do { (p) = (v); } while (0)
 
+void kmem_cache_free_active(void *objp);
+static unsigned long kfree_cb_offset = 0;
+
+static inline void kfree_rcu_cb(struct rcu_head *head)
+{
+	void *objp = (void *) ((unsigned long)head - kfree_cb_offset);
+
+	kmem_cache_free_active(objp);
+}
+
+#ifndef offsetof
+#define offsetof(TYPE, MEMBER)	__builtin_offsetof(TYPE, MEMBER)
+#endif
+
+#define kfree_rcu(ptr, rhv)						\
+do {									\
+	if (!kfree_cb_offset)						\
+		kfree_cb_offset = offsetof(typeof(*(ptr)), rhv);	\
+									\
+	call_rcu(&ptr->rhv, kfree_rcu_cb);				\
+} while (0)
+
 #endif

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 6/9] tools: Add sheaves support to testing infrastructure
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (4 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 5/9] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 7/9] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka, Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Allocate a sheaf and fill it to the count amount.  Does not fill to the
sheaf limit to detect incorrect allocation requests.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 tools/include/linux/slab.h   | 24 +++++++++++++
 tools/testing/shared/linux.c | 84 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

diff --git a/tools/include/linux/slab.h b/tools/include/linux/slab.h
index d1444e79f2685edb828adbce8b3fbb500c0f8844..1962d7f1abee154e1cda5dba28aef213088dd198 100644
--- a/tools/include/linux/slab.h
+++ b/tools/include/linux/slab.h
@@ -23,6 +23,13 @@ enum slab_state {
 	FULL
 };
 
+struct slab_sheaf {
+	struct kmem_cache *cache;
+	unsigned int size;
+	unsigned int capacity;
+	void *objects[];
+};
+
 struct kmem_cache_args {
 	unsigned int align;
 	unsigned int sheaf_capacity;
@@ -80,4 +87,21 @@ void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list);
 int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 			  void **list);
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size);
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf);
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf **sheafp, unsigned int size);
+
+static inline unsigned int kmem_cache_sheaf_size(struct slab_sheaf *sheaf)
+{
+	return sheaf->size;
+}
+
 #endif		/* _TOOLS_SLAB_H */
diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c
index f998555a1b2af4a899a468a652b04622df459ed3..e0255f53159bd3a1325d49192283dd6790a5e3b8 100644
--- a/tools/testing/shared/linux.c
+++ b/tools/testing/shared/linux.c
@@ -181,6 +181,12 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size,
 	if (kmalloc_verbose)
 		pr_debug("Bulk alloc %zu\n", size);
 
+	if (cachep->exec_callback) {
+		if (cachep->callback)
+			cachep->callback(cachep->private);
+		cachep->exec_callback = false;
+	}
+
 	pthread_mutex_lock(&cachep->lock);
 	if (cachep->nr_objs >= size) {
 		struct radix_tree_node *node;
@@ -270,6 +276,84 @@ __kmem_cache_create_args(const char *name, unsigned int size,
 	return ret;
 }
 
+struct slab_sheaf *
+kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
+{
+	struct slab_sheaf *sheaf;
+	unsigned int capacity;
+
+	if (size > s->sheaf_capacity)
+		capacity = size;
+	else
+		capacity = s->sheaf_capacity;
+
+	sheaf = malloc(sizeof(*sheaf) + sizeof(void *) * s->sheaf_capacity * capacity);
+	if (!sheaf) {
+		return NULL;
+	}
+
+	memset(sheaf, 0, size);
+	sheaf->cache = s;
+	sheaf->capacity = capacity;
+	sheaf->size = kmem_cache_alloc_bulk(s, gfp, size, sheaf->objects);
+	if (!sheaf->size) {
+		free(sheaf);
+		return NULL;
+	}
+
+	return sheaf;
+}
+
+int kmem_cache_refill_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf **sheafp, unsigned int size)
+{
+	struct slab_sheaf *sheaf = *sheafp;
+	int refill;
+
+	if (sheaf->size >= size)
+		return 0;
+
+	if (size > sheaf->capacity) {
+		sheaf = kmem_cache_prefill_sheaf(s, gfp, size);
+		if (!sheaf)
+			return -ENOMEM;
+
+		kmem_cache_return_sheaf(s, gfp, *sheafp);
+		*sheafp = sheaf;
+		return 0;
+	}
+
+	refill = kmem_cache_alloc_bulk(s, gfp, size - sheaf->size,
+				       &sheaf->objects[sheaf->size]);
+	if (!refill)
+		return -ENOMEM;
+
+	sheaf->size += refill;
+	return 0;
+}
+
+void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
+		 struct slab_sheaf *sheaf)
+{
+	if (sheaf->size) {
+		//s->non_kernel += sheaf->size;
+		kmem_cache_free_bulk(s, sheaf->size, &sheaf->objects[0]);
+	}
+	free(sheaf);
+}
+
+void *
+kmem_cache_alloc_from_sheaf(struct kmem_cache *s, gfp_t gfp,
+		struct slab_sheaf *sheaf)
+{
+	if (sheaf->size == 0) {
+		printf("Nothing left in sheaf!\n");
+		return NULL;
+	}
+
+	return sheaf->objects[--sheaf->size];
+}
+
 /*
  * Test the test infrastructure for kem_cache_alloc/free and bulk counterparts.
  */

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 7/9] maple_tree: use percpu sheaves for maple_node_cache
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (5 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 6/9] tools: Add sheaves support to testing infrastructure Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-25  8:27 ` [PATCH v4 8/9] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Setup the maple_node_cache with percpu sheaves of size 32 to hopefully
improve its performance. Change the single node rcu freeing in
ma_free_rcu() to use kfree_rcu() instead of the custom callback, which
allows the rcu_free sheaf batching to be used. Note there are other
users of mt_free_rcu() where larger parts of maple tree are submitted to
call_rcu() as a whole, and that cannot use the rcu_free sheaf. But it's
still possible for maple nodes freed this way to be reused via the barn,
even if only some cpus are allowed to process rcu callbacks.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 lib/maple_tree.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index d0bea23fa4bc9fdd0ca4803a108d3c943f6a0c73..812ba155f3577d1b6ecc779ce9e4e7ded3085d8b 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -208,7 +208,7 @@ static void mt_free_rcu(struct rcu_head *head)
 static void ma_free_rcu(struct maple_node *node)
 {
 	WARN_ON(node->parent != ma_parent_ptr(node));
-	call_rcu(&node->rcu, mt_free_rcu);
+	kfree_rcu(node, rcu);
 }
 
 static void mas_set_height(struct ma_state *mas)
@@ -6254,9 +6254,14 @@ bool mas_nomem(struct ma_state *mas, gfp_t gfp)
 
 void __init maple_tree_init(void)
 {
+	struct kmem_cache_args args = {
+		.align  = sizeof(struct maple_node),
+		.sheaf_capacity = 32,
+	};
+
 	maple_node_cache = kmem_cache_create("maple_node",
-			sizeof(struct maple_node), sizeof(struct maple_node),
-			SLAB_PANIC, NULL);
+			sizeof(struct maple_node), &args,
+			SLAB_PANIC);
 }
 
 /**

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 8/9] mm, vma: use percpu sheaves for vm_area_struct cache
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (6 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 7/9] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-05-06 23:08   ` Suren Baghdasaryan
  2025-04-25  8:27 ` [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
  2025-05-15 12:46 ` [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
  9 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Create the vm_area_struct cache with percpu sheaves of size 32 to
improve its performance.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b8e7b2b516e0bb0b1d4676ff644dc..3bd711f0798c88aee04bc30ff21fc4ca2b66201a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3216,6 +3216,7 @@ void __init proc_caches_init(void)
 	struct kmem_cache_args args = {
 		.use_freeptr_offset = true,
 		.freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
+		.sheaf_capacity = 32,
 	};
 
 	sighand_cachep = kmem_cache_create("sighand_cache",

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 8/9] mm, vma: use percpu sheaves for vm_area_struct cache
  2025-04-25  8:27 ` [PATCH v4 8/9] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
@ 2025-05-06 23:08   ` Suren Baghdasaryan
  0 siblings, 0 replies; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-06 23:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 1:28 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Create the vm_area_struct cache with percpu sheaves of size 32 to
> improve its performance.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I think Lorenzo's refactoring moved this code out of fork.c, so it
will have to be adjusted.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b8e7b2b516e0bb0b1d4676ff644dc..3bd711f0798c88aee04bc30ff21fc4ca2b66201a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -3216,6 +3216,7 @@ void __init proc_caches_init(void)
>         struct kmem_cache_args args = {
>                 .use_freeptr_offset = true,
>                 .freeptr_offset = offsetof(struct vm_area_struct, vm_freeptr),
> +               .sheaf_capacity = 32,
>         };
>
>         sighand_cachep = kmem_cache_create("sighand_cache",
>
> --
> 2.49.0
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (7 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 8/9] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
@ 2025-04-25  8:27 ` Vlastimil Babka
  2025-04-25 17:35   ` Christoph Lameter (Ampere)
  2025-05-07 10:39   ` Harry Yoo
  2025-05-15 12:46 ` [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
  9 siblings, 2 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-25  8:27 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree, vbabka

Since we don't control the NUMA locality of objects in percpu sheaves,
allocations with node restrictions bypass them. Allocations without
restrictions may however still expect to get local objects with high
probability, and the introduction of sheaves can decrease it due to
freed object from a remote node ending up in percpu sheaves.

The fraction of such remote frees seems low (5% on an 8-node machine)
but it can be expected that some cache or workload specific corner cases
exist. We can either conclude that this is not a problem due to the low
fraction, or we can make remote frees bypass percpu sheaves and go
directly to their slabs. This will make the remote frees more expensive,
but if if's only a small fraction, most frees will still benefit from
the lower overhead of percpu sheaves.

This patch thus makes remote object freeing bypass percpu sheaves,
including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
it's not intended to be 100% guarantee that percpu sheaves will only
contain local objects. The refill from slabs does not provide that
guarantee in the first place, and there might be cpu migrations
happening when we need to unlock the local_lock. Avoiding all that could
be possible but complicated so we can leave it for later investigation
whether it would be worth it. It can be expected that the more selective
freeing will itself prevent accumulation of remote objects in percpu
sheaves so any such violations would have only short-term effects.

Another possible optimization to investigate is whether it would be
beneficial for node-restricted or strict_numa allocations to attempt to
obtain an object from percpu sheaves if the node or mempolicy (i.e.
MPOL_LOCAL) happens to want the local node of the allocating cpu. Right
now such allocations bypass sheaves, but they could probably look first
whether the first available object in percpu sheaves is local, and with
high probability succeed - and only bypass the sheaves in cases it's
not local.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab_common.c |  7 +++++--
 mm/slub.c        | 43 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6c3b90f03cb79b57f426824450f576a977d85c53..af4e225372fa2d1e7d0f55a90b5335a29a36d2ea 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1623,8 +1623,11 @@ static bool kfree_rcu_sheaf(void *obj)
 
 	slab = folio_slab(folio);
 	s = slab->slab_cache;
-	if (s->cpu_sheaves)
-		return __kfree_rcu_sheaf(s, obj);
+	if (s->cpu_sheaves) {
+		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
+			   slab_nid(slab) == numa_node_id()))
+			return __kfree_rcu_sheaf(s, obj);
+	}
 
 	return false;
 }
diff --git a/mm/slub.c b/mm/slub.c
index cc273cc45f632e16644355831132cdc391219cec..2bf83e2b85b23f4db2b311edaded4bef6b7d01de 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -455,6 +455,7 @@ struct slab_sheaf {
 	};
 	struct kmem_cache *cache;
 	unsigned int size;
+	int node; /* only used for rcu_sheaf */
 	void *objects[];
 };
 
@@ -5649,7 +5650,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
 	 */
 	__rcu_free_sheaf_prepare(s, sheaf);
 
-	barn = get_node(s, numa_mem_id())->barn;
+	barn = get_node(s, sheaf->node)->barn;
 
 	/* due to slab_free_hook() */
 	if (unlikely(sheaf->size == 0))
@@ -5724,10 +5725,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 	rcu_sheaf->objects[rcu_sheaf->size++] = obj;
 
-	if (likely(rcu_sheaf->size < s->sheaf_capacity))
+	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
-	else
+	} else {
 		pcs->rcu_free = NULL;
+		rcu_sheaf->node = numa_node_id();
+	}
 
 	local_unlock(&s->cpu_sheaves->lock);
 
@@ -5753,9 +5756,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	struct slab_sheaf *main, *empty;
 	unsigned int batch, i = 0;
 	bool init;
+	void *remote_objects[PCS_BATCH_MAX];
+	unsigned int remote_nr = 0;
+	int node = numa_node_id();
 
 	init = slab_want_init_on_free(s);
 
+next_remote_batch:
 	while (i < size) {
 		struct slab *slab = virt_to_slab(p[i]);
 
@@ -5765,7 +5772,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 		if (unlikely(!slab_free_hook(s, p[i], init, false))) {
 			p[i] = p[--size];
 			if (!size)
-				return;
+				goto flush_remote;
+			continue;
+		}
+
+		if (unlikely(IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)) {
+			remote_objects[remote_nr] = p[i];
+			p[i] = p[--size];
+			if (++remote_nr >= PCS_BATCH_MAX)
+				goto flush_remote;
 			continue;
 		}
 
@@ -5833,6 +5848,15 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	 */
 fallback:
 	__kmem_cache_free_bulk(s, size, p);
+
+flush_remote:
+	if (remote_nr) {
+		__kmem_cache_free_bulk(s, remote_nr, &remote_objects[0]);
+		if (i < size) {
+			remote_nr = 0;
+			goto next_remote_batch;
+		}
+	}
 }
 
 #ifndef CONFIG_SLUB_TINY
@@ -5924,8 +5948,15 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (!s->cpu_sheaves || !free_to_pcs(s, object))
-		do_slab_free(s, slab, object, object, 1, addr);
+	if (s->cpu_sheaves) {
+		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
+			   slab_nid(slab) == numa_node_id())) {
+			free_to_pcs(s, object);
+			return;
+		}
+	}
+
+	do_slab_free(s, slab, object, object, 1, addr);
 }
 
 #ifdef CONFIG_MEMCG

-- 
2.49.0



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing
  2025-04-25  8:27 ` [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
@ 2025-04-25 17:35   ` Christoph Lameter (Ampere)
  2025-04-28  7:08     ` Vlastimil Babka
  2025-05-07 10:39   ` Harry Yoo
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-04-25 17:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, 25 Apr 2025, Vlastimil Babka wrote:

> @@ -5924,8 +5948,15 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>  	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>  		return;
>
> -	if (!s->cpu_sheaves || !free_to_pcs(s, object))
> -		do_slab_free(s, slab, object, object, 1, addr);
> +	if (s->cpu_sheaves) {
> +		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
> +			   slab_nid(slab) == numa_node_id())) {

Ah. ok this removes remote object freeing to the pcs.

numa_mem_id() is needed to support memory less numa nodes.

> +			free_to_pcs(s, object);
> +			return;
> +		}
> +	}
> +
> +	do_slab_free(s, slab, object, object, 1, addr);
>  }
>
>  #ifdef CONFIG_MEMCG
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing
  2025-04-25 17:35   ` Christoph Lameter (Ampere)
@ 2025-04-28  7:08     ` Vlastimil Babka
  0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-04-28  7:08 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Suren Baghdasaryan, Liam R. Howlett, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 4/25/25 19:35, Christoph Lameter (Ampere) wrote:
> On Fri, 25 Apr 2025, Vlastimil Babka wrote:
> 
>> @@ -5924,8 +5948,15 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>>  	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>>  		return;
>>
>> -	if (!s->cpu_sheaves || !free_to_pcs(s, object))
>> -		do_slab_free(s, slab, object, object, 1, addr);
>> +	if (s->cpu_sheaves) {
>> +		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
>> +			   slab_nid(slab) == numa_node_id())) {
> 
> Ah. ok this removes remote object freeing to the pcs.
> 
> numa_mem_id() is needed to support memory less numa nodes.

Ah right those... will fix, thanks.

>> +			free_to_pcs(s, object);
>> +			return;
>> +		}
>> +	}
>> +
>> +	do_slab_free(s, slab, object, object, 1, addr);
>>  }
>>
>>  #ifdef CONFIG_MEMCG
>>
>>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing
  2025-04-25  8:27 ` [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
  2025-04-25 17:35   ` Christoph Lameter (Ampere)
@ 2025-05-07 10:39   ` Harry Yoo
  2025-05-15  8:59     ` Vlastimil Babka
  1 sibling, 1 reply; 35+ messages in thread
From: Harry Yoo @ 2025-05-07 10:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Fri, Apr 25, 2025 at 10:27:29AM +0200, Vlastimil Babka wrote:
> Since we don't control the NUMA locality of objects in percpu sheaves,
> allocations with node restrictions bypass them. Allocations without
> restrictions may however still expect to get local objects with high
> probability, and the introduction of sheaves can decrease it due to
> freed object from a remote node ending up in percpu sheaves.
> 
> The fraction of such remote frees seems low (5% on an 8-node machine)
> but it can be expected that some cache or workload specific corner cases
> exist. We can either conclude that this is not a problem due to the low
> fraction, or we can make remote frees bypass percpu sheaves and go
> directly to their slabs. This will make the remote frees more expensive,
> but if if's only a small fraction, most frees will still benefit from
> the lower overhead of percpu sheaves.
> 
> This patch thus makes remote object freeing bypass percpu sheaves,
> including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
> it's not intended to be 100% guarantee that percpu sheaves will only
> contain local objects. The refill from slabs does not provide that
> guarantee in the first place, and there might be cpu migrations
> happening when we need to unlock the local_lock. Avoiding all that could
> be possible but complicated so we can leave it for later investigation
> whether it would be worth it. It can be expected that the more selective
> freeing will itself prevent accumulation of remote objects in percpu
> sheaves so any such violations would have only short-term effects.
> 
> Another possible optimization to investigate is whether it would be
> beneficial for node-restricted or strict_numa allocations to attempt to
> obtain an object from percpu sheaves if the node or mempolicy (i.e.
> MPOL_LOCAL) happens to want the local node of the allocating cpu. Right
> now such allocations bypass sheaves, but they could probably look first
> whether the first available object in percpu sheaves is local, and with
> high probability succeed - and only bypass the sheaves in cases it's
> not local.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slab_common.c |  7 +++++--
>  mm/slub.c        | 43 +++++++++++++++++++++++++++++++++++++------
>  2 files changed, 42 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index cc273cc45f632e16644355831132cdc391219cec..2bf83e2b85b23f4db2b311edaded4bef6b7d01de 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5924,8 +5948,15 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>  	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>  		return;
>  
> -	if (!s->cpu_sheaves || !free_to_pcs(s, object))
> -		do_slab_free(s, slab, object, object, 1, addr);
> +	if (s->cpu_sheaves) {
> +		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
> +			   slab_nid(slab) == numa_node_id())) {
> +			free_to_pcs(s, object);

Shouldn't it call do_slab_free() when free_to_pcs() failed?

> +			return;
> +		}
> +	}
> +
> +	do_slab_free(s, slab, object, object, 1, addr);
>  }
>  
>  #ifdef CONFIG_MEMCG
> 
> -- 
> 2.49.0
> 
> 

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing
  2025-05-07 10:39   ` Harry Yoo
@ 2025-05-15  8:59     ` Vlastimil Babka
  0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-15  8:59 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 5/7/25 12:39, Harry Yoo wrote:
> On Fri, Apr 25, 2025 at 10:27:29AM +0200, Vlastimil Babka wrote:
>> Since we don't control the NUMA locality of objects in percpu sheaves,
>> allocations with node restrictions bypass them. Allocations without
>> restrictions may however still expect to get local objects with high
>> probability, and the introduction of sheaves can decrease it due to
>> freed object from a remote node ending up in percpu sheaves.
>> 
>> The fraction of such remote frees seems low (5% on an 8-node machine)
>> but it can be expected that some cache or workload specific corner cases
>> exist. We can either conclude that this is not a problem due to the low
>> fraction, or we can make remote frees bypass percpu sheaves and go
>> directly to their slabs. This will make the remote frees more expensive,
>> but if if's only a small fraction, most frees will still benefit from
>> the lower overhead of percpu sheaves.
>> 
>> This patch thus makes remote object freeing bypass percpu sheaves,
>> including bulk freeing, and kfree_rcu() via the rcu_free sheaf. However
>> it's not intended to be 100% guarantee that percpu sheaves will only
>> contain local objects. The refill from slabs does not provide that
>> guarantee in the first place, and there might be cpu migrations
>> happening when we need to unlock the local_lock. Avoiding all that could
>> be possible but complicated so we can leave it for later investigation
>> whether it would be worth it. It can be expected that the more selective
>> freeing will itself prevent accumulation of remote objects in percpu
>> sheaves so any such violations would have only short-term effects.
>> 
>> Another possible optimization to investigate is whether it would be
>> beneficial for node-restricted or strict_numa allocations to attempt to
>> obtain an object from percpu sheaves if the node or mempolicy (i.e.
>> MPOL_LOCAL) happens to want the local node of the allocating cpu. Right
>> now such allocations bypass sheaves, but they could probably look first
>> whether the first available object in percpu sheaves is local, and with
>> high probability succeed - and only bypass the sheaves in cases it's
>> not local.
>> 
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/slab_common.c |  7 +++++--
>>  mm/slub.c        | 43 +++++++++++++++++++++++++++++++++++++------
>>  2 files changed, 42 insertions(+), 8 deletions(-)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index cc273cc45f632e16644355831132cdc391219cec..2bf83e2b85b23f4db2b311edaded4bef6b7d01de 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5924,8 +5948,15 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
>>  	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
>>  		return;
>>  
>> -	if (!s->cpu_sheaves || !free_to_pcs(s, object))
>> -		do_slab_free(s, slab, object, object, 1, addr);
>> +	if (s->cpu_sheaves) {
>> +		if (likely(!IS_ENABLED(CONFIG_NUMA) ||
>> +			   slab_nid(slab) == numa_node_id())) {
>> +			free_to_pcs(s, object);
> 
> Shouldn't it call do_slab_free() when free_to_pcs() failed?

Oops yes, thanks!

> 
>> +			return;
>> +		}
>> +	}
>> +
>> +	do_slab_free(s, slab, object, object, 1, addr);
>>  }
>>  
>>  #ifdef CONFIG_MEMCG
>> 
>> -- 
>> 2.49.0
>> 
>> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 0/9] SLUB percpu sheaves
  2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
                   ` (8 preceding siblings ...)
  2025-04-25  8:27 ` [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
@ 2025-05-15 12:46 ` Vlastimil Babka
  2025-05-15 15:01   ` Suren Baghdasaryan
  9 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2025-05-15 12:46 UTC (permalink / raw)
  To: Suren Baghdasaryan, Liam R. Howlett, Christoph Lameter,
	David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On 4/25/25 10:27, Vlastimil Babka wrote:
> Hi,
> 
> This is the v4 and first non-RFC series to add an opt-in percpu
> array-based caching layer to SLUB, following the LSF/MM discussions.
> Since v3 I've also made changes to achieve full compatibility with
> slub_debug, and IRC discussions led to the last patch intended to
> improve NUMA locality (the patch remains separate for evaluation
> purposes).

I've pushed the changes based on the feedback here:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slub-percpu-sheaves

You can use that for testing/benchmarking. Thanks!


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 0/9] SLUB percpu sheaves
  2025-05-15 12:46 ` [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
@ 2025-05-15 15:01   ` Suren Baghdasaryan
  0 siblings, 0 replies; 35+ messages in thread
From: Suren Baghdasaryan @ 2025-05-15 15:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Uladzislau Rezki, linux-mm,
	linux-kernel, rcu, maple-tree

On Thu, May 15, 2025 at 5:46 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 4/25/25 10:27, Vlastimil Babka wrote:
> > Hi,
> >
> > This is the v4 and first non-RFC series to add an opt-in percpu
> > array-based caching layer to SLUB, following the LSF/MM discussions.
> > Since v3 I've also made changes to achieve full compatibility with
> > slub_debug, and IRC discussions led to the last patch intended to
> > improve NUMA locality (the patch remains separate for evaluation
> > purposes).
>
> I've pushed the changes based on the feedback here:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/slub-percpu-sheaves
>
> You can use that for testing/benchmarking. Thanks!

Thanks! I'll give it a spin this weekend.


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-05-15 15:04 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-25  8:27 [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 1/9] slab: add opt-in caching layer of " Vlastimil Babka
2025-04-25 17:31   ` Christoph Lameter (Ampere)
2025-04-28  7:01     ` Vlastimil Babka
2025-05-06 17:32       ` Suren Baghdasaryan
2025-05-06 23:11         ` Suren Baghdasaryan
2025-04-29  1:08   ` Harry Yoo
2025-05-13 16:08     ` Vlastimil Babka
2025-05-06 23:14   ` Suren Baghdasaryan
2025-05-14 13:06     ` Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 2/9] slab: add sheaf support for batching kfree_rcu() operations Vlastimil Babka
2025-04-29  7:36   ` Harry Yoo
2025-05-14 13:07     ` Vlastimil Babka
2025-05-06 21:34   ` Suren Baghdasaryan
2025-05-14 14:01     ` Vlastimil Babka
2025-05-15  8:45       ` Vlastimil Babka
2025-05-15 15:03         ` Suren Baghdasaryan
2025-04-25  8:27 ` [PATCH v4 3/9] slab: sheaf prefilling for guaranteed allocations Vlastimil Babka
2025-05-06 22:54   ` Suren Baghdasaryan
2025-05-07  9:15   ` Harry Yoo
2025-05-07  9:20     ` Harry Yoo
2025-05-15  8:41     ` Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 4/9] slab: determine barn status racily outside of lock Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 5/9] tools: Add testing support for changes to rcu and slab for sheaves Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 6/9] tools: Add sheaves support to testing infrastructure Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 7/9] maple_tree: use percpu sheaves for maple_node_cache Vlastimil Babka
2025-04-25  8:27 ` [PATCH v4 8/9] mm, vma: use percpu sheaves for vm_area_struct cache Vlastimil Babka
2025-05-06 23:08   ` Suren Baghdasaryan
2025-04-25  8:27 ` [PATCH v4 9/9] mm, slub: skip percpu sheaves for remote object freeing Vlastimil Babka
2025-04-25 17:35   ` Christoph Lameter (Ampere)
2025-04-28  7:08     ` Vlastimil Babka
2025-05-07 10:39   ` Harry Yoo
2025-05-15  8:59     ` Vlastimil Babka
2025-05-15 12:46 ` [PATCH v4 0/9] SLUB percpu sheaves Vlastimil Babka
2025-05-15 15:01   ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).