* [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab
@ 2026-04-06 13:50 hu.shengming
2026-04-07 4:19 ` Harry Yoo (Oracle)
0 siblings, 1 reply; 3+ messages in thread
From: hu.shengming @ 2026-04-06 13:50 UTC (permalink / raw)
To: vbabka, harry, akpm
Cc: hao.li, cl, rientjes, roman.gushchin, linux-mm, linux-kernel,
zhang.run, xu.xin16, yang.tao172, yang.yang29
From: Shengming Hu <hu.shengming@zte.com.cn>
refill_objects() can consume many objects from a fresh slab, and when it
takes all objects from the slab, the freelist built during slab allocation
is discarded immediately.
Instead of special-casing the whole-slab bulk refill case, defer freelist
construction until after objects are emitted from the new slab.
allocate_slab() now allocates and initializes slab metadata only.
new_slab() preserves the existing behaviour by building the full freelist
on top, while refill_objects() allocates a raw slab and lets
alloc_from_new_slab() emit objects directly and build a freelist only for
the remaining objects, if any.
To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
small iterator abstraction for walking free objects in allocation order.
The iterator is used both for filling the sheaf and for building the
freelist of the remaining objects.
This removes the need for a separate whole-slab special case and avoids
temporary freelist construction when the slab is consumed entirely.
Also mark setup_object() inline. After this optimization, the compiler no
longer consistently inlines this helper in the hot path, which can hurt
performance. Explicitly marking it inline restores the expected code
generation.
This reduces per-object overhead in bulk allocation paths and improves
allocation throughput significantly. In slub_bulk_bench, the time per
object drops by about 41% to 70% with CONFIG_SLAB_FREELIST_RANDOM=n, and
by about 59% to 71% with CONFIG_SLAB_FREELIST_RANDOM=y.
Benchmark results (slub_bulk_bench):
Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
Kernel: Linux 7.0.0-rc6-next-20260330
Config: x86_64_defconfig
Cpu: 0
Rounds: 20
Total: 256MB
- CONFIG_SLAB_FREELIST_RANDOM=n -
obj_size=16, batch=256:
before: 4.62 +- 0.01 ns/object
after: 2.72 +- 0.01 ns/object
delta: -41.1%
obj_size=32, batch=128:
before: 6.58 +- 0.02 ns/object
after: 3.30 +- 0.02 ns/object
delta: -49.8%
obj_size=64, batch=64:
before: 10.20 +- 0.03 ns/object
after: 4.22 +- 0.03 ns/object
delta: -58.7%
obj_size=128, batch=32:
before: 17.91 +- 0.04 ns/object
after: 5.73 +- 0.09 ns/object
delta: -68.0%
obj_size=256, batch=32:
before: 21.03 +- 0.12 ns/object
after: 6.22 +- 0.08 ns/object
delta: -70.4%
obj_size=512, batch=32:
before: 19.00 +- 0.21 ns/object
after: 6.45 +- 0.13 ns/object
delta: -66.0%
- CONFIG_SLAB_FREELIST_RANDOM=y -
obj_size=16, batch=256:
before: 8.37 +- 0.06 ns/object
after: 3.38 +- 0.05 ns/object
delta: -59.6%
obj_size=32, batch=128:
before: 11.00 +- 0.13 ns/object
after: 4.05 +- 0.01 ns/object
delta: -63.2%
obj_size=64, batch=64:
before: 15.30 +- 0.20 ns/object
after: 5.21 +- 0.03 ns/object
delta: -65.9%
obj_size=128, batch=32:
before: 21.55 +- 0.14 ns/object
after: 7.10 +- 0.02 ns/object
delta: -67.1%
obj_size=256, batch=32:
before: 26.27 +- 0.29 ns/object
after: 7.54 +- 0.05 ns/object
delta: -71.3%
obj_size=512, batch=32:
before: 26.69 +- 0.28 ns/object
after: 7.73 +- 0.09 ns/object
delta: -71.0%
Link: https://github.com/HSM6236/slub_bulk_test.git
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
---
Changes in v2:
- Handle CONFIG_SLAB_FREELIST_RANDOM=y and add benchmark results.
- Update the QEMU benchmark setup to use -enable-kvm -cpu host so benchmark results better reflect native CPU performance.
- Link to v1: https://lore.kernel.org/all/20260328125538341lvTGRpS62UNdRiAAz2gH3@zte.com.cn/
Changes in v3:
- refactor fresh-slab allocation to use a shared slab_obj_iter
- defer freelist construction until after bulk allocation from a new slab
- build a freelist only for leftover objects when the slab is left partial
- add build_slab_freelist(), prepare_slab_alloc_flags() and next_slab_obj() helpers
- remove obsolete freelist construction helpers now replaced by the iterator-based path, including next_freelist_entry() and shuffle_freelist()
- Link to v2: https://lore.kernel.org/all/202604011257259669oAdDsdnKx6twdafNZsF5@zte.com.cn/
---
mm/slab.h | 11 +++
mm/slub.c | 256 +++++++++++++++++++++++++++++-------------------------
2 files changed, 149 insertions(+), 118 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index bf2f87acf5e3..4f0c2fbc1fef 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -91,6 +91,17 @@ struct slab {
#endif
};
+struct slab_obj_iter {
+ unsigned long pos;
+ void *start;
+ void *cur;
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+ unsigned long freelist_count;
+ unsigned long page_limit;
+ bool random;
+#endif
+};
+
#define SLAB_MATCH(pg, sl) \
static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
SLAB_MATCH(flags, flags);
diff --git a/mm/slub.c b/mm/slub.c
index fb2c5c57bc4e..88537e577989 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2733,7 +2733,7 @@ bool slab_free_freelist_hook(struct kmem_cache *s, void **head, void **tail,
return *head != NULL;
}
-static void *setup_object(struct kmem_cache *s, void *object)
+static inline void *setup_object(struct kmem_cache *s, void *object)
{
setup_object_debug(s, object);
object = kasan_init_slab_obj(s, object);
@@ -3329,87 +3329,14 @@ static void __init init_freelist_randomization(void)
mutex_unlock(&slab_mutex);
}
-/* Get the next entry on the pre-computed freelist randomized */
-static void *next_freelist_entry(struct kmem_cache *s,
- unsigned long *pos, void *start,
- unsigned long page_limit,
- unsigned long freelist_count)
-{
- unsigned int idx;
-
- /*
- * If the target page allocation failed, the number of objects on the
- * page might be smaller than the usual size defined by the cache.
- */
- do {
- idx = s->random_seq[*pos];
- *pos += 1;
- if (*pos >= freelist_count)
- *pos = 0;
- } while (unlikely(idx >= page_limit));
-
- return (char *)start + idx;
-}
-
static DEFINE_PER_CPU(struct rnd_state, slab_rnd_state);
-/* Shuffle the single linked freelist based on a random pre-computed sequence */
-static bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
- bool allow_spin)
-{
- void *start;
- void *cur;
- void *next;
- unsigned long idx, pos, page_limit, freelist_count;
-
- if (slab->objects < 2 || !s->random_seq)
- return false;
-
- freelist_count = oo_objects(s->oo);
- if (allow_spin) {
- pos = get_random_u32_below(freelist_count);
- } else {
- struct rnd_state *state;
-
- /*
- * An interrupt or NMI handler might interrupt and change
- * the state in the middle, but that's safe.
- */
- state = &get_cpu_var(slab_rnd_state);
- pos = prandom_u32_state(state) % freelist_count;
- put_cpu_var(slab_rnd_state);
- }
-
- page_limit = slab->objects * s->size;
- start = fixup_red_left(s, slab_address(slab));
-
- /* First entry is used as the base of the freelist */
- cur = next_freelist_entry(s, &pos, start, page_limit, freelist_count);
- cur = setup_object(s, cur);
- slab->freelist = cur;
-
- for (idx = 1; idx < slab->objects; idx++) {
- next = next_freelist_entry(s, &pos, start, page_limit,
- freelist_count);
- next = setup_object(s, next);
- set_freepointer(s, cur, next);
- cur = next;
- }
- set_freepointer(s, cur, NULL);
-
- return true;
-}
#else
static inline int init_cache_random_seq(struct kmem_cache *s)
{
return 0;
}
static inline void init_freelist_randomization(void) { }
-static inline bool shuffle_freelist(struct kmem_cache *s, struct slab *slab,
- bool allow_spin)
-{
- return false;
-}
#endif /* CONFIG_SLAB_FREELIST_RANDOM */
static __always_inline void account_slab(struct slab *slab, int order,
@@ -3438,15 +3365,14 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
-(PAGE_SIZE << order));
}
-static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
+/* Allocate and initialize a slab without building its freelist. */
+static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node,
+ bool allow_spin)
{
- bool allow_spin = gfpflags_allow_spinning(flags);
struct slab *slab;
struct kmem_cache_order_objects oo = s->oo;
gfp_t alloc_gfp;
- void *start, *p, *next;
- int idx;
- bool shuffle;
+ void *start;
flags &= gfp_allowed_mask;
@@ -3483,6 +3409,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
slab->frozen = 0;
slab->slab_cache = s;
+ slab->freelist = NULL;
kasan_poison_slab(slab);
@@ -3497,35 +3424,9 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
alloc_slab_obj_exts_early(s, slab);
account_slab(slab, oo_order(oo), s, flags);
- shuffle = shuffle_freelist(s, slab, allow_spin);
-
- if (!shuffle) {
- start = fixup_red_left(s, start);
- start = setup_object(s, start);
- slab->freelist = start;
- for (idx = 0, p = start; idx < slab->objects - 1; idx++) {
- next = p + s->size;
- next = setup_object(s, next);
- set_freepointer(s, p, next);
- p = next;
- }
- set_freepointer(s, p, NULL);
- }
-
return slab;
}
-static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
-{
- if (unlikely(flags & GFP_SLAB_BUG_MASK))
- flags = kmalloc_fix_flags(flags);
-
- WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
-
- return allocate_slab(s,
- flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-}
-
static void __free_slab(struct kmem_cache *s, struct slab *slab, bool allow_spin)
{
struct page *page = slab_page(slab);
@@ -4344,14 +4245,130 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
0, sizeof(void *));
}
+/* Return the next free object in allocation order. */
+static inline void *next_slab_obj(struct kmem_cache *s,
+ struct slab_obj_iter *iter)
+{
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+ if (iter->random) {
+ unsigned long idx;
+
+ /*
+ * If the target page allocation failed, the number of objects on the
+ * page might be smaller than the usual size defined by the cache.
+ */
+ do {
+ idx = s->random_seq[iter->pos];
+ iter->pos++;
+ if (iter->pos >= iter->freelist_count)
+ iter->pos = 0;
+ } while (unlikely(idx >= iter->page_limit));
+
+ return setup_object(s, (char *)iter->start + idx);
+ }
+#endif
+ void *obj = iter->cur;
+
+ iter->cur = (char *)iter->cur + s->size;
+ return setup_object(s, obj);
+}
+
+/* Build a freelist from the objects not yet allocated from a fresh slab. */
+static inline void build_slab_freelist(struct kmem_cache *s, struct slab *slab,
+ struct slab_obj_iter *iter)
+{
+ unsigned int nr = slab->objects - slab->inuse;
+ unsigned int i;
+ void *cur, *next;
+
+ if (!nr) {
+ slab->freelist = NULL;
+ return;
+ }
+
+ cur = next_slab_obj(s, iter);
+ slab->freelist = cur;
+
+ for (i = 1; i < nr; i++) {
+ next = next_slab_obj(s, iter);
+ set_freepointer(s, cur, next);
+ cur = next;
+ }
+
+ set_freepointer(s, cur, NULL);
+}
+
+static inline gfp_t prepare_slab_alloc_flags(struct kmem_cache *s, gfp_t flags)
+{
+ if (unlikely(flags & GFP_SLAB_BUG_MASK))
+ flags = kmalloc_fix_flags(flags);
+
+ WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
+
+ return flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK);
+}
+
+/* Initialize an iterator over free objects in allocation order. */
+static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
+ struct slab_obj_iter *iter,
+ bool allow_spin)
+{
+ iter->pos = 0;
+ iter->start = fixup_red_left(s, slab_address(slab));
+ iter->cur = iter->start;
+
+#ifdef CONFIG_SLAB_FREELIST_RANDOM
+ iter->random = (slab->objects >= 2 && s->random_seq);
+ if (!iter->random)
+ return;
+
+ iter->freelist_count = oo_objects(s->oo);
+ iter->page_limit = slab->objects * s->size;
+
+ if (allow_spin) {
+ iter->pos = get_random_u32_below(iter->freelist_count);
+ } else {
+ struct rnd_state *state;
+
+ /*
+ * An interrupt or NMI handler might interrupt and change
+ * the state in the middle, but that's safe.
+ */
+ state = &get_cpu_var(slab_rnd_state);
+ iter->pos = prandom_u32_state(state) % iter->freelist_count;
+ put_cpu_var(slab_rnd_state);
+ }
+#endif
+}
+
+static struct slab *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+{
+ struct slab_obj_iter iter;
+ struct slab *slab;
+ bool allow_spin;
+
+ flags = prepare_slab_alloc_flags(s, flags);
+ allow_spin = gfpflags_allow_spinning(flags);
+
+ slab = allocate_slab(s, flags, node, allow_spin);
+ if (!slab)
+ return NULL;
+
+ init_slab_obj_iter(s, slab, &iter, allow_spin);
+ build_slab_freelist(s, slab, &iter);
+
+ return slab;
+}
+
static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
void **p, unsigned int count, bool allow_spin)
{
unsigned int allocated = 0;
struct kmem_cache_node *n;
+ struct slab_obj_iter iter;
bool needs_add_partial;
unsigned long flags;
- void *object;
+ unsigned int target_inuse;
/*
* Are we going to put the slab on the partial list?
@@ -4359,6 +4376,9 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
*/
needs_add_partial = (slab->objects > count);
+ /* Target inuse count after allocating from this new slab. */
+ target_inuse = needs_add_partial ? count : slab->objects;
+
if (!allow_spin && needs_add_partial) {
n = get_node(s, slab_nid(slab));
@@ -4370,19 +4390,18 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
}
}
- object = slab->freelist;
- while (object && allocated < count) {
- p[allocated] = object;
- object = get_freepointer(s, object);
+ init_slab_obj_iter(s, slab, &iter, allow_spin);
+
+ while (allocated < target_inuse) {
+ p[allocated] = next_slab_obj(s, &iter);
maybe_wipe_obj_freeptr(s, p[allocated]);
- slab->inuse++;
allocated++;
}
- slab->freelist = object;
+ slab->inuse = target_inuse;
if (needs_add_partial) {
-
+ build_slab_freelist(s, slab, &iter);
if (allow_spin) {
n = get_node(s, slab_nid(slab));
spin_lock_irqsave(&n->list_lock, flags);
@@ -7227,6 +7246,8 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
int local_node = numa_mem_id();
unsigned int refilled;
struct slab *slab;
+ gfp_t slab_gfp;
+ bool allow_spin;
if (WARN_ON_ONCE(!gfpflags_allow_spinning(gfp)))
return 0;
@@ -7244,16 +7265,15 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
new_slab:
- slab = new_slab(s, gfp, local_node);
+ slab_gfp = prepare_slab_alloc_flags(s, gfp);
+ allow_spin = gfpflags_allow_spinning(slab_gfp);
+
+ slab = allocate_slab(s, slab_gfp, local_node, allow_spin);
if (!slab)
goto out;
stat(s, ALLOC_SLAB);
- /*
- * TODO: possible optimization - if we know we will consume the whole
- * slab we might skip creating the freelist?
- */
refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
/* allow_spin = */ true);
--
2.25.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-04-06 13:50 [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab hu.shengming
@ 2026-04-07 4:19 ` Harry Yoo (Oracle)
2026-04-07 13:02 ` hu.shengming
0 siblings, 1 reply; 3+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-07 4:19 UTC (permalink / raw)
To: hu.shengming
Cc: vbabka, akpm, hao.li, cl, rientjes, roman.gushchin, linux-mm,
linux-kernel, zhang.run, xu.xin16, yang.tao172, yang.yang29
Hi Shengming, thanks for v3!
Good to see it's getting improved over the revisions.
Let me leave some comments inline.
On Mon, Apr 06, 2026 at 09:50:18PM +0800, hu.shengming@zte.com.cn wrote:
> From: Shengming Hu <hu.shengming@zte.com.cn>
>
> refill_objects() can consume many objects from a fresh slab, and when it
> takes all objects from the slab the freelist built during slab allocation
> is discarded immediately.
>
> Instead of special-casing the whole-slab bulk refill case, defer freelist
> construction until after objects are emitted from the new slab.
> allocate_slab() now allocates and initializes slab metadata only.
> new_slab() preserves the existing behaviour by building the full freelist
> on top, while refill_objects() allocates a raw slab and lets
> alloc_from_new_slab() emit objects directly and build a freelist only for
> the remaining objects, if any.
>
> To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> small iterator abstraction for walking free objects in allocation order.
> The iterator is used both for filling the sheaf and for building the
> freelist of the remaining objects.
>
> This removes the need for a separate whole-slab special case, avoids
> temporary freelist construction when the slab is consumed entirely.
>
> Also mark setup_object() inline. After this optimization, the compiler no
> longer consistently inlines this helper in the hot path, which can hurt
> performance. Explicitly marking it inline restores the expected code
> generation.
>
> This reduces per-object overhead in bulk allocation paths and improves
> allocation throughput significantly. In slub_bulk_bench, the time per
> object drops by about 41% to 70% with CONFIG_SLAB_FREELIST_RANDOM=n, and
> by about 59% to 71% with CONFIG_SLAB_FREELIST_RANDOM=y.
>
> Benchmark results (slub_bulk_bench):
> Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> Kernel: Linux 7.0.0-rc6-next-20260330
> Config: x86_64_defconfig
> Cpu: 0
> Rounds: 20
> Total: 256MB
>
> - CONFIG_SLAB_FREELIST_RANDOM=n -
>
> obj_size=16, batch=256:
> before: 4.62 +- 0.01 ns/object
> after: 2.72 +- 0.01 ns/object
> delta: -41.1%
>
> obj_size=32, batch=128:
> before: 6.58 +- 0.02 ns/object
> after: 3.30 +- 0.02 ns/object
> delta: -49.8%
>
> obj_size=64, batch=64:
> before: 10.20 +- 0.03 ns/object
> after: 4.22 +- 0.03 ns/object
> delta: -58.7%
>
> obj_size=128, batch=32:
> before: 17.91 +- 0.04 ns/object
> after: 5.73 +- 0.09 ns/object
> delta: -68.0%
>
> obj_size=256, batch=32:
> before: 21.03 +- 0.12 ns/object
> after: 6.22 +- 0.08 ns/object
> delta: -70.4%
>
> obj_size=512, batch=32:
> before: 19.00 +- 0.21 ns/object
> after: 6.45 +- 0.13 ns/object
> delta: -66.0%
>
> - CONFIG_SLAB_FREELIST_RANDOM=y -
>
> obj_size=16, batch=256:
> before: 8.37 +- 0.06 ns/object
> after: 3.38 +- 0.05 ns/object
> delta: -59.6%
>
> obj_size=32, batch=128:
> before: 11.00 +- 0.13 ns/object
> after: 4.05 +- 0.01 ns/object
> delta: -63.2%
>
> obj_size=64, batch=64:
> before: 15.30 +- 0.20 ns/object
> after: 5.21 +- 0.03 ns/object
> delta: -65.9%
>
> obj_size=128, batch=32:
> before: 21.55 +- 0.14 ns/object
> after: 7.10 +- 0.02 ns/object
> delta: -67.1%
>
> obj_size=256, batch=32:
> before: 26.27 +- 0.29 ns/object
> after: 7.54 +- 0.05 ns/object
> delta: -71.3%
>
> obj_size=512, batch=32:
> before: 26.69 +- 0.28 ns/object
> after: 7.73 +- 0.09 ns/object
> delta: -71.0%
>
> Link: https://github.com/HSM6236/slub_bulk_test.git
> Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
> ---
> Changes in v2:
> - Handle CONFIG_SLAB_FREELIST_RANDOM=y and add benchmark results.
> - Update the QEMU benchmark setup to use -enable-kvm -cpu host so benchmark results better reflect native CPU performance.
> - Link to v1: https://lore.kernel.org/all/20260328125538341lvTGRpS62UNdRiAAz2gH3@zte.com.cn/
>
> Changes in v3:
> - refactor fresh-slab allocation to use a shared slab_obj_iter
> - defer freelist construction until after bulk allocation from a new slab
> - build a freelist only for leftover objects when the slab is left partial
> - add build_slab_freelist(), prepare_slab_alloc_flags() and next_slab_obj() helpers
> - remove obsolete freelist construction helpers now replaced by the iterator-based path, including next_freelist_entry() and shuffle_freelist()
> - Link to v2: https://lore.kernel.org/all/202604011257259669oAdDsdnKx6twdafNZsF5@zte.com.cn/
>
> ---
> mm/slab.h | 11 +++
> mm/slub.c | 256 +++++++++++++++++++++++++++++-------------------------
> 2 files changed, 149 insertions(+), 118 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index fb2c5c57bc4e..88537e577989 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4344,14 +4245,130 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> 0, sizeof(void *));
> }
>
> +/* Return the next free object in allocation order. */
> +static inline void *next_slab_obj(struct kmem_cache *s,
> + struct slab_obj_iter *iter)
> +{
> +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> + if (iter->random) {
> + unsigned long idx;
> +
> + /*
> + * If the target page allocation failed, the number of objects on the
> + * page might be smaller than the usual size defined by the cache.
> + */
> + do {
> + idx = s->random_seq[iter->pos];
> + iter->pos++;
> + if (iter->pos >= iter->freelist_count)
> + iter->pos = 0;
> + } while (unlikely(idx >= iter->page_limit));
> +
> + return setup_object(s, (char *)iter->start + idx);
> + }
> +#endif
> + void *obj = iter->cur;
> +
> + iter->cur = (char *)iter->cur + s->size;
> + return setup_object(s, obj);
> +}
> +
> +/* Initialize an iterator over free objects in allocation order. */
> +static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
> + struct slab_obj_iter *iter,
> + bool allow_spin)
> +{
> + iter->pos = 0;
> + iter->start = fixup_red_left(s, slab_address(slab));
> + iter->cur = iter->start;
It's confusing that the iter->pos field is used only when randomization is
enabled and the iter->cur field only when randomization is disabled.
I think we could simply use iter->pos for both the random and non-random
cases (as I have shown in the skeleton before)?
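Outside the kernel tree, the single-index idea can be modeled in a few
lines. The toy_* names below are illustrative only, not the kernel's
actual structures; the point is just that one pos field can drive both
the randomized and the sequential walk, so the separate cur pointer
goes away:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model (illustrative names): one pos index serves both modes. */
struct toy_cache {
	size_t size;			/* object stride in bytes */
	const unsigned int *random_seq;	/* NULL => sequential walk */
	size_t freelist_count;		/* length of random_seq */
};

struct toy_iter {
	size_t pos;	/* next position, used in BOTH modes */
	char *start;	/* first object in the slab */
	size_t objects;	/* objects actually present in this slab */
};

static void *toy_next_obj(const struct toy_cache *s, struct toy_iter *it)
{
	size_t idx;

	if (s->random_seq) {
		/* Skip precomputed indices past this slab's object count,
		 * mirroring the page_limit check in next_slab_obj(). */
		do {
			idx = s->random_seq[it->pos];
			if (++it->pos >= s->freelist_count)
				it->pos = 0;
		} while (idx >= it->objects);
	} else {
		idx = it->pos++;	/* plain allocation order */
	}
	return it->start + idx * s->size;
}
```

(Note the kernel's random_seq holds byte offsets rather than object
indices; the toy uses indices to keep the sketch short.)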
> +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> + iter->random = (slab->objects >= 2 && s->random_seq);
> + if (!iter->random)
> + return;
> +
> + iter->freelist_count = oo_objects(s->oo);
> + iter->page_limit = slab->objects * s->size;
> +
> + if (allow_spin) {
> + iter->pos = get_random_u32_below(iter->freelist_count);
> + } else {
> + struct rnd_state *state;
> +
> + /*
> + * An interrupt or NMI handler might interrupt and change
> + * the state in the middle, but that's safe.
> + */
> + state = &get_cpu_var(slab_rnd_state);
> + iter->pos = prandom_u32_state(state) % iter->freelist_count;
> + put_cpu_var(slab_rnd_state);
> + }
> +#endif
> +}
> static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> void **p, unsigned int count, bool allow_spin)
There is one problem with this change: ___slab_alloc() builds the
freelist before calling alloc_from_new_slab(), while refill_objects()
does not. For consistency, let's allocate a new slab without building
the freelist in ___slab_alloc() and build the freelist in
alloc_single_from_new_slab() and alloc_from_new_slab()?
> {
> unsigned int allocated = 0;
> struct kmem_cache_node *n;
> + struct slab_obj_iter iter;
> bool needs_add_partial;
> unsigned long flags;
> - void *object;
> + unsigned int target_inuse;
>
> /*
> * Are we going to put the slab on the partial list?
> @@ -4359,6 +4376,9 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> */
> needs_add_partial = (slab->objects > count);
>
> + /* Target inuse count after allocating from this new slab. */
> + target_inuse = needs_add_partial ? count : slab->objects;
> +
> if (!allow_spin && needs_add_partial) {
>
> n = get_node(s, slab_nid(slab));
Now new slabs without a freelist can be freed in this path,
which is confusing but should be _technically_ fine, I think...
> @@ -4370,19 +4390,18 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> }
> }
>
> - object = slab->freelist;
> - while (object && allocated < count) {
> - p[allocated] = object;
> - object = get_freepointer(s, object);
> + init_slab_obj_iter(s, slab, &iter, allow_spin);
> +
> + while (allocated < target_inuse) {
> + p[allocated] = next_slab_obj(s, &iter);
> maybe_wipe_obj_freeptr(s, p[allocated]);
We don't have to wipe the free pointer as we didn't build the freelist?
> - slab->inuse++;
> allocated++;
> }
> - slab->freelist = object;
> + slab->inuse = target_inuse;
>
> if (needs_add_partial) {
> -
> + build_slab_freelist(s, slab, &iter);
When allow_spin is false, it's building the freelist while holding the
spinlock (taken via trylock earlier), and that's not great.
Hmm, can we do better?
Perhaps just allocate object(s) from the slab and build the freelist
with the objects left (if exists), but free the slab if allow_spin
is false AND trylock fails, and accept the fact that the slab may not be
fully free when it's freed due to trylock failure?
something like:
alloc_from_new_slab() {
needs_add_partial = (slab->objects > count);
target_inuse = needs_add_partial ? count : slab->objects;
init_slab_obj_iter(s, slab, &iter, allow_spin);
while (allocated < target_inuse) {
p[allocated] = next_slab_obj(s, &iter);
allocated++;
}
slab->inuse = target_inuse;
if (needs_add_partial) {
build_slab_freelist(s, slab, &iter);
n = get_node(s, slab_nid(slab))
if (allow_spin) {
spin_lock_irqsave(&n->list_lock, flags);
} else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
/*
* Unlucky, discard newly allocated slab.
* The slab is not fully free, but it's fine as
* objects are not allocated to users.
*/
free_new_slab_nolock(s, slab);
return 0;
}
add_partial(n, slab, ADD_TO_HEAD);
spin_unlock_irqrestore(&n->list_lock, flags);
}
[...]
}
And do something similar in alloc_single_from_new_slab() as well.
> if (allow_spin) {
> n = get_node(s, slab_nid(slab));
> spin_lock_irqsave(&n->list_lock, flags);
> @@ -7244,16 +7265,15 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
>
> new_slab:
>
> - slab = new_slab(s, gfp, local_node);
> + slab_gfp = prepare_slab_alloc_flags(s, gfp);
Could we do `flags = prepare_slab_alloc_flags(s, flags);`
within allocate_slab()? Having both gfp and slab_gfp flags is distracting.
The value of allow_spin should not change after
prepare_slab_alloc_flags() anyway.
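A minimal sketch of what that folding might look like (kernel-context
pseudocode, not compiled here; the body of allocate_slab() is elided):

```c
/* Sketch only: callers pass raw gfp flags and allocate_slab()
 * normalizes them itself, so refill_objects() and new_slab() no
 * longer need a separate slab_gfp local or allow_spin argument. */
static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
	bool allow_spin;

	flags = prepare_slab_alloc_flags(s, flags);
	allow_spin = gfpflags_allow_spinning(flags);

	/* ... existing allocation and metadata setup ... */
}
```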
> + allow_spin = gfpflags_allow_spinning(slab_gfp);
> +
> + slab = allocate_slab(s, slab_gfp, local_node, allow_spin);
> if (!slab)
> goto out;
>
> stat(s, ALLOC_SLAB);
>
> - /*
> - * TODO: possible optimization - if we know we will consume the whole
> - * slab we might skip creating the freelist?
> - */
> refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> /* allow_spin = */ true);
>
> --
> 2.25.1
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-04-07 4:19 ` Harry Yoo (Oracle)
@ 2026-04-07 13:02 ` hu.shengming
0 siblings, 0 replies; 3+ messages in thread
From: hu.shengming @ 2026-04-07 13:02 UTC (permalink / raw)
To: harry
Cc: vbabka, akpm, hao.li, cl, rientjes, roman.gushchin, linux-mm,
linux-kernel, zhang.run, xu.xin16, yang.tao172, yang.yang29
Harry wrote:
> Hi Shengming, thanks for v3!
>
> Good to see it's getting improved over the revisions.
> Let me leave some comments inline.
>
Hi Harry,
Thanks a lot for the detailed review.
> On Mon, Apr 06, 2026 at 09:50:18PM +0800, hu.shengming@zte.com.cn wrote:
> > From: Shengming Hu <hu.shengming@zte.com.cn>
> >
> > refill_objects() can consume many objects from a fresh slab, and when it
> > takes all objects from the slab the freelist built during slab allocation
> > is discarded immediately.
> >
> > Instead of special-casing the whole-slab bulk refill case, defer freelist
> > construction until after objects are emitted from the new slab.
> > allocate_slab() now allocates and initializes slab metadata only.
> > new_slab() preserves the existing behaviour by building the full freelist
> > on top, while refill_objects() allocates a raw slab and lets
> > alloc_from_new_slab() emit objects directly and build a freelist only for
> > the remaining objects, if any.
> >
> > To keep CONFIG_SLAB_FREELIST_RANDOM=y/n on the same path, introduce a
> > small iterator abstraction for walking free objects in allocation order.
> > The iterator is used both for filling the sheaf and for building the
> > freelist of the remaining objects.
> >
> > This removes the need for a separate whole-slab special case, avoids
> > temporary freelist construction when the slab is consumed entirely.
> >
> > Also mark setup_object() inline. After this optimization, the compiler no
> > longer consistently inlines this helper in the hot path, which can hurt
> > performance. Explicitly marking it inline restores the expected code
> > generation.
> >
> > This reduces per-object overhead in bulk allocation paths and improves
> > allocation throughput significantly. In slub_bulk_bench, the time per
> > object drops by about 41% to 70% with CONFIG_SLAB_FREELIST_RANDOM=n, and
> > by about 59% to 71% with CONFIG_SLAB_FREELIST_RANDOM=y.
> >
> > Benchmark results (slub_bulk_bench):
> > Machine: qemu-system-x86 -m 1024M -smp 8 -enable-kvm -cpu host
> > Kernel: Linux 7.0.0-rc6-next-20260330
> > Config: x86_64_defconfig
> > Cpu: 0
> > Rounds: 20
> > Total: 256MB
> >
> > - CONFIG_SLAB_FREELIST_RANDOM=n -
> >
> > obj_size=16, batch=256:
> > before: 4.62 +- 0.01 ns/object
> > after: 2.72 +- 0.01 ns/object
> > delta: -41.1%
> >
> > obj_size=32, batch=128:
> > before: 6.58 +- 0.02 ns/object
> > after: 3.30 +- 0.02 ns/object
> > delta: -49.8%
> >
> > obj_size=64, batch=64:
> > before: 10.20 +- 0.03 ns/object
> > after: 4.22 +- 0.03 ns/object
> > delta: -58.7%
> >
> > obj_size=128, batch=32:
> > before: 17.91 +- 0.04 ns/object
> > after: 5.73 +- 0.09 ns/object
> > delta: -68.0%
> >
> > obj_size=256, batch=32:
> > before: 21.03 +- 0.12 ns/object
> > after: 6.22 +- 0.08 ns/object
> > delta: -70.4%
> >
> > obj_size=512, batch=32:
> > before: 19.00 +- 0.21 ns/object
> > after: 6.45 +- 0.13 ns/object
> > delta: -66.0%
> >
> > - CONFIG_SLAB_FREELIST_RANDOM=y -
> >
> > obj_size=16, batch=256:
> > before: 8.37 +- 0.06 ns/object
> > after: 3.38 +- 0.05 ns/object
> > delta: -59.6%
> >
> > obj_size=32, batch=128:
> > before: 11.00 +- 0.13 ns/object
> > after: 4.05 +- 0.01 ns/object
> > delta: -63.2%
> >
> > obj_size=64, batch=64:
> > before: 15.30 +- 0.20 ns/object
> > after: 5.21 +- 0.03 ns/object
> > delta: -65.9%
> >
> > obj_size=128, batch=32:
> > before: 21.55 +- 0.14 ns/object
> > after: 7.10 +- 0.02 ns/object
> > delta: -67.1%
> >
> > obj_size=256, batch=32:
> > before: 26.27 +- 0.29 ns/object
> > after: 7.54 +- 0.05 ns/object
> > delta: -71.3%
> >
> > obj_size=512, batch=32:
> > before: 26.69 +- 0.28 ns/object
> > after: 7.73 +- 0.09 ns/object
> > delta: -71.0%
> >
> > Link: https://github.com/HSM6236/slub_bulk_test.git
> > Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
> > ---
> > Changes in v2:
> > - Handle CONFIG_SLAB_FREELIST_RANDOM=y and add benchmark results.
> > - Update the QEMU benchmark setup to use -enable-kvm -cpu host so benchmark results better reflect native CPU performance.
> > - Link to v1: https://lore.kernel.org/all/20260328125538341lvTGRpS62UNdRiAAz2gH3@zte.com.cn/
> >
> > Changes in v3:
> > - refactor fresh-slab allocation to use a shared slab_obj_iter
> > - defer freelist construction until after bulk allocation from a new slab
> > - build a freelist only for leftover objects when the slab is left partial
> > - add build_slab_freelist(), prepare_slab_alloc_flags() and next_slab_obj() helpers
> > - remove obsolete freelist construction helpers now replaced by the iterator-based path, including next_freelist_entry() and shuffle_freelist()
> > - Link to v2: https://lore.kernel.org/all/202604011257259669oAdDsdnKx6twdafNZsF5@zte.com.cn/
> >
> > ---
> > mm/slab.h | 11 +++
> > mm/slub.c | 256 +++++++++++++++++++++++++++++-------------------------
> > 2 files changed, 149 insertions(+), 118 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fb2c5c57bc4e..88537e577989 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4344,14 +4245,130 @@ static __always_inline void maybe_wipe_obj_freeptr(struct kmem_cache *s,
> > 0, sizeof(void *));
> > }
> >
> > +/* Return the next free object in allocation order. */
> > +static inline void *next_slab_obj(struct kmem_cache *s,
> > + struct slab_obj_iter *iter)
> > +{
> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > + if (iter->random) {
> > + unsigned long idx;
> > +
> > + /*
> > + * If the target page allocation failed, the number of objects on the
> > + * page might be smaller than the usual size defined by the cache.
> > + */
> > + do {
> > + idx = s->random_seq[iter->pos];
> > + iter->pos++;
> > + if (iter->pos >= iter->freelist_count)
> > + iter->pos = 0;
> > + } while (unlikely(idx >= iter->page_limit));
> > +
> > + return setup_object(s, (char *)iter->start + idx);
> > + }
> > +#endif
> > + void *obj = iter->cur;
> > +
> > + iter->cur = (char *)iter->cur + s->size;
> > + return setup_object(s, obj);
> > +}
> > +
> > +/* Initialize an iterator over free objects in allocation order. */
> > +static inline void init_slab_obj_iter(struct kmem_cache *s, struct slab *slab,
> > + struct slab_obj_iter *iter,
> > + bool allow_spin)
> > +{
> > + iter->pos = 0;
> > + iter->start = fixup_red_left(s, slab_address(slab));
> > + iter->cur = iter->start;
>
> It's confusing that iter->pos field is used only when randomization is
> enabled and iter->cur field is used only when randomization is disabled.
>
> I think we could simply use iter->pos for both random and non-random cases
> (as I have shown in the skeleton before)?
>
Right, I introduced cur only to keep the non-random iteration close to
the original form, but I agree that using pos for both cases is cleaner.
> > +#ifdef CONFIG_SLAB_FREELIST_RANDOM
> > + iter->random = (slab->objects >= 2 && s->random_seq);
> > + if (!iter->random)
> > + return;
> > +
> > + iter->freelist_count = oo_objects(s->oo);
> > + iter->page_limit = slab->objects * s->size;
> > +
> > + if (allow_spin) {
> > + iter->pos = get_random_u32_below(iter->freelist_count);
> > + } else {
> > + struct rnd_state *state;
> > +
> > + /*
> > + * An interrupt or NMI handler might interrupt and change
> > + * the state in the middle, but that's safe.
> > + */
> > + state = &get_cpu_var(slab_rnd_state);
> > + iter->pos = prandom_u32_state(state) % iter->freelist_count;
> > + put_cpu_var(slab_rnd_state);
> > + }
> > +#endif
> > +}
> > static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > void **p, unsigned int count, bool allow_spin)
>
> There is one problem with this change; ___slab_alloc() builds the
> freelist before calling alloc_from_new_slab(), while refill_objects()
> does not. For consistency, let's allocate a new slab without building
> freelist in ___slab_alloc() and build the freelist in
> alloc_single_from_new_slab() and alloc_from_new_slab()?
>
Agreed. Also, since new_slab() is currently used by both early_kmem_cache_node_alloc()
and ___slab_alloc(), I'll rework the early allocation path as well to keep the
new-slab flow consistent.
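To make that concrete, the call sites would end up shaped roughly like this (non-compilable sketch in C syntax; details elided, helper names taken from the discussion above):

```
/* ___slab_alloc(): allocate raw, let the consumer build the freelist. */
slab = allocate_slab(s, gfpflags, node, allow_spin);	/* no freelist yet */
if (kmem_cache_debug(s))
	return alloc_single_from_new_slab(s, slab, ...);/* builds freelist inside */
...

/* early_kmem_cache_node_alloc() keeps the old behaviour via new_slab(),
 * which builds the full freelist on top of allocate_slab(). */
```

That way refill_objects(), ___slab_alloc() and the early boot path all go through the same raw allocate_slab() entry point.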
> > {
> > unsigned int allocated = 0;
> > struct kmem_cache_node *n;
> > + struct slab_obj_iter iter;
> > bool needs_add_partial;
> > unsigned long flags;
> > - void *object;
> > + unsigned int target_inuse;
> >
> > /*
> > * Are we going to put the slab on the partial list?
> > @@ -4359,6 +4376,9 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > */
> > needs_add_partial = (slab->objects > count);
> >
> > + /* Target inuse count after allocating from this new slab. */
> > + target_inuse = needs_add_partial ? count : slab->objects;
> > +
> > if (!allow_spin && needs_add_partial) {
> >
> > n = get_node(s, slab_nid(slab));
>
> Now new slabs without freelist can be freed in this path.
> which is confusing but should be _technically_ fine, I think...
>
> > @@ -4370,19 +4390,18 @@ static unsigned int alloc_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > }
> > }
> >
> > - object = slab->freelist;
> > - while (object && allocated < count) {
> > - p[allocated] = object;
> > - object = get_freepointer(s, object);
> > + init_slab_obj_iter(s, slab, &iter, allow_spin);
> > +
> > + while (allocated < target_inuse) {
> > + p[allocated] = next_slab_obj(s, &iter);
> > maybe_wipe_obj_freeptr(s, p[allocated]);
>
> We don't have to wipe the free pointer as we didn't build the freelist?
>
Right, maybe_wipe_obj_freeptr() is not needed for objects emitted directly from a
fresh slab; I'll remove it.
> > - slab->inuse++;
> > allocated++;
> > }
> > - slab->freelist = object;
> > + slab->inuse = target_inuse;
> >
> > if (needs_add_partial) {
> > -
> > + build_slab_freelist(s, slab, &iter);
>
> When allow_spin is true, it's building the freelist while holding the
> spinlock, and that's not great.
>
> Hmm, can we do better?
>
> Perhaps just allocate object(s) from the slab and build the freelist
> with the objects left (if exists), but free the slab if allow_spin
> is false AND trylock fails, and accept the fact that the slab may not be
> fully free when it's freed due to trylock failure?
>
> something like:
>
> alloc_from_new_slab() {
> needs_add_partial = (slab->objects > count);
> target_inuse = needs_add_partial ? count : slab->objects;
>
> init_slab_obj_iter(s, slab, &iter, allow_spin);
> while (allocated < target_inuse) {
> p[allocated] = next_slab_obj(s, &iter);
> allocated++;
> }
> slab->inuse = target_inuse;
>
> if (needs_add_partial) {
> build_slab_freelist(s, slab, &iter);
> n = get_node(s, slab_nid(slab))
> if (allow_spin) {
> spin_lock_irqsave(&n->list_lock, flags);
> } else if (!spin_trylock_irqsave(&n->list_lock, flags)) {
> /*
> * Unlucky, discard newly allocated slab.
> * The slab is not fully free, but it's fine as
> * objects are not allocated to users.
> */
> free_new_slab_nolock(s, slab);
> return 0;
> }
> add_partial(n, slab, ADD_TO_HEAD);
> spin_unlock_irqrestore(&n->list_lock, flags);
> }
> [...]
> }
>
> And do something similar in alloc_single_from_new_slab() as well.
>
Good point. I'll restructure the path so objects are emitted first, the leftover
freelist is built only if needed, and the slab is added to partial afterwards. For
the !allow_spin trylock failure case, I'll discard the new slab and return 0. I'll
do the same for the single-object path as well.
> > if (allow_spin) {
> > n = get_node(s, slab_nid(slab));
> > spin_lock_irqsave(&n->list_lock, flags);
> > @@ -7244,16 +7265,15 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> >
> > new_slab:
> >
> > - slab = new_slab(s, gfp, local_node);
> > + slab_gfp = prepare_slab_alloc_flags(s, gfp);
>
> Could we do `flags = prepare_slab_alloc_flags(s, flags);`
> within allocate_slab()? Having gfp and slab_gfp flags is distractive.
> The value of allow_spin should not change after
> prepare_slab_alloc_flags() anyway.
>
Agreed. I'll move the prepare_slab_alloc_flags() handling into allocate_slab()
so the call sites stay simpler, and keep the iterator/freelist construction
local to alloc_single_from_new_slab() and alloc_from_new_slab().
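Folding the flag preparation in would leave a single gfp variable at each call site, roughly (sketch in C syntax, not compilable; exact signature may differ in the respin):

```
static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags,
				  int node, bool allow_spin)
{
	flags = prepare_slab_alloc_flags(s, flags);	/* done once, internally */
	...
}

/* refill_objects() then becomes simply: */
slab = allocate_slab(s, gfp, local_node, gfpflags_allow_spinning(gfp));
```

As you note, allow_spin cannot change across prepare_slab_alloc_flags(), so computing it from the caller's gfp stays correct.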
I also have a question about the allow_spin semantics on the refill_objects() path.
Now that init_slab_obj_iter() is called from alloc_from_new_slab(), which
refill_objects() invokes with allow_spin = true, allow_spin on this path is
effectively unconditional. Previously, shuffle_freelist() received
allow_spin = gfpflags_allow_spinning(gfp), so I wanted to check whether this move
changes the intended semantics.
My current understanding is that this is still fine, because refill_objects()
is guaranteed to run only when spinning is allowed. Is that correct?
> > + allow_spin = gfpflags_allow_spinning(slab_gfp);
> > +
> > + slab = allocate_slab(s, slab_gfp, local_node, allow_spin);
> > if (!slab)
> > goto out;
> >
> > stat(s, ALLOC_SLAB);
> >
> > - /*
> > - * TODO: possible optimization - if we know we will consume the whole
> > - * slab we might skip creating the freelist?
> > - */
> > refilled += alloc_from_new_slab(s, slab, p + refilled, max - refilled,
> > /* allow_spin = */ true);
> >
> > --
> > 2.25.1
>
> --
> Cheers,
> Harry / Hyeonggon
Thanks again. I will fold these changes into the next revision.
--
With Best Regards,
Shengming
Thread overview: 3+ messages
2026-04-06 13:50 [PATCH v3] mm/slub: defer freelist construction until after bulk allocation from a new slab hu.shengming
2026-04-07 4:19 ` Harry Yoo (Oracle)
2026-04-07 13:02 ` hu.shengming