public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH 0/3] slab: support memoryless nodes with sheaves
@ 2026-03-11  8:25 Vlastimil Babka (SUSE)
  2026-03-11  8:25 ` [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node Vlastimil Babka (SUSE)
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-11  8:25 UTC (permalink / raw)
  To: Ming Lei, Harry Yoo
  Cc: Hao Li, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, linux-mm, linux-kernel, Vlastimil Babka (SUSE)

This is the draft patch from [1] turned into a proper series with
incremental changes. It's based on v7.0-rc3. It's too intrusive for a
7.0 hotfix, so we'll only be able to fix/reduce the regression in 7.1. I
hope that's acceptable given that it's a non-standard configuration, 7.0
is not an LTS, and it's a performance regression, not a functional one.

Ming, can you please retest this on top of v7.0-rc3, which already has
fb1091febd66 ("mm/slab: allow sheaf refill if blocking is not
allowed")? A separate data point for plain v7.0-rc3 could also be useful.

[1] https://lore.kernel.org/all/c6a01f7e-c6eb-454b-9b9e-734526dd659d@kernel.org/

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
Vlastimil Babka (SUSE) (3):
      slab: decouple pointer to barn from kmem_cache_node
      slab: create barns for online memoryless nodes
      slab: free remote objects to sheaves on memoryless nodes

 mm/slab.h |   7 +-
 mm/slub.c | 256 +++++++++++++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 191 insertions(+), 72 deletions(-)
---
base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
change-id: 20260311-b4-slab-memoryless-barns-fad64172ba05

Best regards,
-- 
Vlastimil Babka (SUSE) <vbabka@kernel.org>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node
  2026-03-11  8:25 [PATCH 0/3] slab: support memoryless nodes with sheaves Vlastimil Babka (SUSE)
@ 2026-03-11  8:25 ` Vlastimil Babka (SUSE)
  2026-03-13  9:27   ` Harry Yoo
  2026-03-11  8:25 ` [PATCH 2/3] slab: create barns for online memoryless nodes Vlastimil Babka (SUSE)
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-11  8:25 UTC (permalink / raw)
  To: Ming Lei, Harry Yoo
  Cc: Hao Li, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, linux-mm, linux-kernel, Vlastimil Babka (SUSE)

The pointer to barn currently exists in struct kmem_cache_node. That
struct is instantiated for every NUMA node with memory, but we want to
have a barn for every online node (including memoryless).

Thus decouple the two structures. In struct kmem_cache we have an array
of kmem_cache_node pointers that is declared with size MAX_NUMNODES, but
the actual size calculation in kmem_cache_init() uses nr_node_ids.
Therefore we can't just add another array of barn pointers. Instead,
change the array to the newly introduced struct kmem_cache_per_node_ptrs
holding both the kmem_cache_node and barn pointers.

Adjust the barn accessor and allocation/initialization code accordingly.
No functional change is intended for now; barns are still created 1:1
together with kmem_cache_nodes.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slab.h |   7 +++-
 mm/slub.c | 128 +++++++++++++++++++++++++++++++++++---------------------------
 2 files changed, 78 insertions(+), 57 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index e9ab292acd22..c735e6b4dddb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -191,6 +191,11 @@ struct kmem_cache_order_objects {
 	unsigned int x;
 };
 
+struct kmem_cache_per_node_ptrs {
+	struct node_barn *barn;
+	struct kmem_cache_node *node;
+};
+
 /*
  * Slab cache management.
  */
@@ -247,7 +252,7 @@ struct kmem_cache {
 	struct kmem_cache_stats __percpu *cpu_stats;
 #endif
 
-	struct kmem_cache_node *node[MAX_NUMNODES];
+	struct kmem_cache_per_node_ptrs per_node[MAX_NUMNODES];
 };
 
 /*
diff --git a/mm/slub.c b/mm/slub.c
index 20cb4f3b636d..609a183f8533 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -59,7 +59,7 @@
  *   0.  cpu_hotplug_lock
  *   1.  slab_mutex (Global Mutex)
  *   2a. kmem_cache->cpu_sheaves->lock (Local trylock)
- *   2b. node->barn->lock (Spinlock)
+ *   2b. barn->lock (Spinlock)
  *   2c. node->list_lock (Spinlock)
  *   3.  slab_lock(slab) (Only on some arches)
  *   4.  object_map_lock (Only for debugging)
@@ -136,7 +136,7 @@
  *   or spare sheaf can handle the allocation or free, there is no other
  *   overhead.
  *
- *   node->barn->lock (spinlock)
+ *   barn->lock (spinlock)
  *
  *   This lock protects the operations on per-NUMA-node barn. It can quickly
  *   serve an empty or full sheaf if available, and avoid more expensive refill
@@ -436,26 +436,24 @@ struct kmem_cache_node {
 	atomic_long_t total_objects;
 	struct list_head full;
 #endif
-	struct node_barn *barn;
 };
 
 static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 {
-	return s->node[node];
+	return s->per_node[node].node;
+}
+
+static inline struct node_barn *get_barn_node(struct kmem_cache *s, int node)
+{
+	return s->per_node[node].barn;
 }
 
 /*
- * Get the barn of the current cpu's closest memory node. It may not exist on
- * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
+ * Get the barn of the current cpu's node, which may be a memoryless node.
  */
 static inline struct node_barn *get_barn(struct kmem_cache *s)
 {
-	struct kmem_cache_node *n = get_node(s, numa_mem_id());
-
-	if (!n)
-		return NULL;
-
-	return n->barn;
+	return get_barn_node(s, numa_node_id());
 }
 
 /*
@@ -5791,7 +5789,6 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
 
 static void rcu_free_sheaf(struct rcu_head *head)
 {
-	struct kmem_cache_node *n;
 	struct slab_sheaf *sheaf;
 	struct node_barn *barn = NULL;
 	struct kmem_cache *s;
@@ -5814,12 +5811,10 @@ static void rcu_free_sheaf(struct rcu_head *head)
 	if (__rcu_free_sheaf_prepare(s, sheaf))
 		goto flush;
 
-	n = get_node(s, sheaf->node);
-	if (!n)
+	barn = get_barn_node(s, sheaf->node);
+	if (!barn)
 		goto flush;
 
-	barn = n->barn;
-
 	/* due to slab_free_hook() */
 	if (unlikely(sheaf->size == 0))
 		goto empty;
@@ -7430,7 +7425,7 @@ static inline int calculate_order(unsigned int size)
 }
 
 static void
-init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
+init_kmem_cache_node(struct kmem_cache_node *n)
 {
 	n->nr_partial = 0;
 	spin_lock_init(&n->list_lock);
@@ -7440,9 +7435,6 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct node_barn *barn)
 	atomic_long_set(&n->total_objects, 0);
 	INIT_LIST_HEAD(&n->full);
 #endif
-	n->barn = barn;
-	if (barn)
-		barn_init(barn);
 }
 
 #ifdef CONFIG_SLUB_STATS
@@ -7537,8 +7529,8 @@ static void early_kmem_cache_node_alloc(int node)
 	n = kasan_slab_alloc(kmem_cache_node, n, GFP_KERNEL, false);
 	slab->freelist = get_freepointer(kmem_cache_node, n);
 	slab->inuse = 1;
-	kmem_cache_node->node[node] = n;
-	init_kmem_cache_node(n, NULL);
+	kmem_cache_node->per_node[node].node = n;
+	init_kmem_cache_node(n);
 	inc_slabs_node(kmem_cache_node, node, slab->objects);
 
 	/*
@@ -7553,15 +7545,20 @@ static void free_kmem_cache_nodes(struct kmem_cache *s)
 	int node;
 	struct kmem_cache_node *n;
 
-	for_each_kmem_cache_node(s, node, n) {
-		if (n->barn) {
-			WARN_ON(n->barn->nr_full);
-			WARN_ON(n->barn->nr_empty);
-			kfree(n->barn);
-			n->barn = NULL;
-		}
+	for_each_node(node) {
+		struct node_barn *barn = get_barn_node(s, node);
 
-		s->node[node] = NULL;
+		if (!barn)
+			continue;
+
+		WARN_ON(barn->nr_full);
+		WARN_ON(barn->nr_empty);
+		kfree(barn);
+		s->per_node[node].barn = NULL;
+	}
+
+	for_each_kmem_cache_node(s, node, n) {
+		s->per_node[node].node = NULL;
 		kmem_cache_free(kmem_cache_node, n);
 	}
 }
@@ -7582,31 +7579,36 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 
 	for_each_node_mask(node, slab_nodes) {
 		struct kmem_cache_node *n;
-		struct node_barn *barn = NULL;
 
 		if (slab_state == DOWN) {
 			early_kmem_cache_node_alloc(node);
 			continue;
 		}
 
-		if (cache_has_sheaves(s)) {
-			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
-
-			if (!barn)
-				return 0;
-		}
-
 		n = kmem_cache_alloc_node(kmem_cache_node,
 						GFP_KERNEL, node);
-		if (!n) {
-			kfree(barn);
+		if (!n)
 			return 0;
-		}
 
-		init_kmem_cache_node(n, barn);
+		init_kmem_cache_node(n);
+		s->per_node[node].node = n;
+	}
+
+	if (slab_state == DOWN || !cache_has_sheaves(s))
+		return 1;
+
+	for_each_node_mask(node, slab_nodes) {
+		struct node_barn *barn;
+
+		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
+
+		if (!barn)
+			return 0;
 
-		s->node[node] = n;
+		barn_init(barn);
+		s->per_node[node].barn = barn;
 	}
+
 	return 1;
 }
 
@@ -7895,10 +7897,15 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	if (cache_has_sheaves(s))
 		rcu_barrier();
 
+	for_each_node(node) {
+		struct node_barn *barn = get_barn_node(s, node);
+
+		if (barn)
+			barn_shrink(s, barn);
+	}
+
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
-		if (n->barn)
-			barn_shrink(s, n->barn);
 		free_partial(s, n);
 		if (n->nr_partial || node_nr_slabs(n))
 			return 1;
@@ -8108,14 +8115,18 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 	unsigned long flags;
 	int ret = 0;
 
+	for_each_node(node) {
+		struct node_barn *barn = get_barn_node(s, node);
+
+		if (barn)
+			barn_shrink(s, barn);
+	}
+
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 			INIT_LIST_HEAD(promote + i);
 
-		if (n->barn)
-			barn_shrink(s, n->barn);
-
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -8204,7 +8215,8 @@ static int slab_mem_going_online_callback(int nid)
 		if (get_node(s, nid))
 			continue;
 
-		if (cache_has_sheaves(s)) {
+		if (cache_has_sheaves(s) && !get_barn_node(s, nid)) {
+
 			barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
 
 			if (!barn) {
@@ -8225,13 +8237,17 @@ static int slab_mem_going_online_callback(int nid)
 			goto out;
 		}
 
-		init_kmem_cache_node(n, barn);
+		init_kmem_cache_node(n);
+		s->per_node[nid].node = n;
 
-		s->node[nid] = n;
+		if (barn) {
+			barn_init(barn);
+			s->per_node[nid].barn = barn;
+		}
 	}
 	/*
 	 * Any cache created after this point will also have kmem_cache_node
-	 * initialized for the new node.
+	 * and barn initialized for the new node.
 	 */
 	node_set(nid, slab_nodes);
 out:
@@ -8323,7 +8339,7 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
 		}
 
 		barn_init(barn);
-		get_node(s, node)->barn = barn;
+		s->per_node[node].barn = barn;
 	}
 
 	for_each_possible_cpu(cpu) {
@@ -8394,8 +8410,8 @@ void __init kmem_cache_init(void)
 	slab_state = PARTIAL;
 
 	create_boot_cache(kmem_cache, "kmem_cache",
-			offsetof(struct kmem_cache, node) +
-				nr_node_ids * sizeof(struct kmem_cache_node *),
+			offsetof(struct kmem_cache, per_node) +
+				nr_node_ids * sizeof(struct kmem_cache_per_node_ptrs),
 			SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
 
 	kmem_cache = bootstrap(&boot_kmem_cache);

-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-11  8:25 [PATCH 0/3] slab: support memoryless nodes with sheaves Vlastimil Babka (SUSE)
  2026-03-11  8:25 ` [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node Vlastimil Babka (SUSE)
@ 2026-03-11  8:25 ` Vlastimil Babka (SUSE)
  2026-03-16  3:25   ` Harry Yoo
  2026-03-18  9:27   ` Hao Li
  2026-03-11  8:25 ` [PATCH 3/3] slab: free remote objects to sheaves on " Vlastimil Babka (SUSE)
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-11  8:25 UTC (permalink / raw)
  To: Ming Lei, Harry Yoo
  Cc: Hao Li, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, linux-mm, linux-kernel, Vlastimil Babka (SUSE)

Ming Lei has reported [1] a performance regression due to replacing cpu
(partial) slabs with sheaves. With slub stats enabled, a large amount of
slowpath allocations were observed. The affected system has 8 online
NUMA nodes but only 2 have memory.

For sheaves to work effectively on a given cpu, its NUMA node has to
have a struct node_barn allocated. Those are currently only allocated on
nodes with memory (N_MEMORY), where kmem_cache_node also exists, as the
goal is to cache only node-local objects. But in order to have good
performance on a memoryless node, we need its barn to exist and use
sheaves to cache non-local objects (as no local objects can exist
anyway).

Therefore change the implementation to allocate barns on all online
nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug
callback as that's when a memoryless node can become online.

Change the rcu_sheaf->node assignment to numa_node_id() so the sheaf is
returned to the barn of the local cpu's (potentially memoryless) node,
and no longer to the nearest node with memory.

Reported-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1]
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 609a183f8533..d8496b37e364 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -472,6 +472,12 @@ static inline struct node_barn *get_barn(struct kmem_cache *s)
  */
 static nodemask_t slab_nodes;
 
+/*
+ * Similar to slab_nodes but for where we have node_barn allocated.
+ * Corresponds to N_ONLINE nodes.
+ */
+static nodemask_t slab_barn_nodes;
+
 /*
  * Workqueue used for flushing cpu and kfree_rcu sheaves.
  */
@@ -4084,6 +4090,51 @@ void flush_all_rcu_sheaves(void)
 	rcu_barrier();
 }
 
+static int slub_cpu_setup(unsigned int cpu)
+{
+	int nid = cpu_to_node(cpu);
+	struct kmem_cache *s;
+	int ret = 0;
+
+	/*
+	 * we never clear a nid so it's safe to do a quick check before taking
+	 * the mutex, and then recheck to handle parallel cpu hotplug safely
+	 */
+	if (node_isset(nid, slab_barn_nodes))
+		return 0;
+
+	mutex_lock(&slab_mutex);
+
+	if (node_isset(nid, slab_barn_nodes))
+		goto out;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		struct node_barn *barn;
+
+		/*
+		 * barn might already exist if a previous callback failed midway
+		 */
+		if (!cache_has_sheaves(s) || get_barn_node(s, nid))
+			continue;
+
+		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, nid);
+
+		if (!barn) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		barn_init(barn);
+		s->per_node[nid].barn = barn;
+	}
+	node_set(nid, slab_barn_nodes);
+
+out:
+	mutex_unlock(&slab_mutex);
+
+	return ret;
+}
+
 /*
  * Use the cpu notifier to insure that the cpu slabs are flushed when
  * necessary.
@@ -5936,7 +5987,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 		rcu_sheaf = NULL;
 	} else {
 		pcs->rcu_free = NULL;
-		rcu_sheaf->node = numa_mem_id();
+		rcu_sheaf->node = numa_node_id();
 	}
 
 	/*
@@ -7597,7 +7648,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
 	if (slab_state == DOWN || !cache_has_sheaves(s))
 		return 1;
 
-	for_each_node_mask(node, slab_nodes) {
+	for_each_node_mask(node, slab_barn_nodes) {
 		struct node_barn *barn;
 
 		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
@@ -8250,6 +8301,7 @@ static int slab_mem_going_online_callback(int nid)
 	 * and barn initialized for the new node.
 	 */
 	node_set(nid, slab_nodes);
+	node_set(nid, slab_barn_nodes);
 out:
 	mutex_unlock(&slab_mutex);
 	return ret;
@@ -8328,7 +8380,7 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
 	if (!capacity)
 		return;
 
-	for_each_node_mask(node, slab_nodes) {
+	for_each_node_mask(node, slab_barn_nodes) {
 		struct node_barn *barn;
 
 		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
@@ -8400,6 +8452,9 @@ void __init kmem_cache_init(void)
 	for_each_node_state(node, N_MEMORY)
 		node_set(node, slab_nodes);
 
+	for_each_online_node(node)
+		node_set(node, slab_barn_nodes);
+
 	create_boot_cache(kmem_cache_node, "kmem_cache_node",
 			sizeof(struct kmem_cache_node),
 			SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
@@ -8426,7 +8481,7 @@ void __init kmem_cache_init(void)
 	/* Setup random freelists for each cache */
 	init_freelist_randomization();
 
-	cpuhp_setup_state_nocalls(CPUHP_SLUB_DEAD, "slub:dead", NULL,
+	cpuhp_setup_state_nocalls(CPUHP_SLUB_DEAD, "slub:dead", slub_cpu_setup,
 				  slub_cpu_dead);
 
 	pr_info("SLUB: HWalign=%d, Order=%u-%u, MinObjects=%u, CPUs=%u, Nodes=%u\n",

-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] slab: free remote objects to sheaves on memoryless nodes
  2026-03-11  8:25 [PATCH 0/3] slab: support memoryless nodes with sheaves Vlastimil Babka (SUSE)
  2026-03-11  8:25 ` [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node Vlastimil Babka (SUSE)
  2026-03-11  8:25 ` [PATCH 2/3] slab: create barns for online memoryless nodes Vlastimil Babka (SUSE)
@ 2026-03-11  8:25 ` Vlastimil Babka (SUSE)
  2026-03-16  3:48   ` Harry Yoo
  2026-03-11  9:49 ` [PATCH 0/3] slab: support memoryless nodes with sheaves Ming Lei
  2026-03-16 13:33 ` Vlastimil Babka (SUSE)
  4 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-11  8:25 UTC (permalink / raw)
  To: Ming Lei, Harry Yoo
  Cc: Hao Li, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, linux-mm, linux-kernel, Vlastimil Babka (SUSE)

On memoryless nodes we can now allocate from cpu sheaves and refill them
normally. But when a node is memoryless on a system without actual
CONFIG_HAVE_MEMORYLESS_NODES support, freeing always uses the slowpath
because all objects appear as remote. We could instead benefit from the
freeing fastpath, because the allocations can't obtain local objects
anyway if the node is memoryless.

Thus adapt the locality checks when freeing, and move them into an
inline function can_free_to_pcs() for a single shared implementation.

On configurations with CONFIG_HAVE_MEMORYLESS_NODES=y continue using
numa_mem_id() so the percpu sheaves and barn on a memoryless node will
contain mostly objects from the closest memory node (returned by
numa_mem_id()). No change is thus intended for such configuration.

On systems with CONFIG_HAVE_MEMORYLESS_NODES=n use numa_node_id() (the
cpu's node) since numa_mem_id() just aliases it anyway. But if we are
freeing on a memoryless node, allow the freeing to use percpu sheaves
for objects from any node, since they are all remote anyway.

This way we avoid the slowpath and get more performant freeing. The
potential downside is that allocations will obtain objects with a larger
average distance. If we kept bypassing the sheaves on freeing, a refill
of sheaves from slabs would tend to get closer objects thanks to the
ordering of the zonelist. Architectures that allow de-facto memoryless
nodes without proper CONFIG_HAVE_MEMORYLESS_NODES support should perhaps
consider adding such support.

Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
---
 mm/slub.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d8496b37e364..2e095ce76dd0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6009,6 +6009,56 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	return false;
 }
 
+static __always_inline bool can_free_to_pcs(struct slab *slab)
+{
+	int slab_node;
+	int numa_node;
+
+	if (!IS_ENABLED(CONFIG_NUMA))
+		goto check_pfmemalloc;
+
+	slab_node = slab_nid(slab);
+
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+	/*
+	 * numa_mem_id() points to the closest node with memory so only allow
+	 * objects from that node to the percpu sheaves
+	 */
+	numa_node = numa_mem_id();
+
+	if (likely(slab_node == numa_node))
+		goto check_pfmemalloc;
+#else
+
+	/*
+	 * numa_mem_id() is only a wrapper to numa_node_id() which is where this
+	 * cpu belongs to, but it might be a memoryless node anyway. We don't
+	 * know what the closest node is.
+	 */
+	numa_node = numa_node_id();
+
+	/* freed object is from this cpu's node, proceed */
+	if (likely(slab_node == numa_node))
+		goto check_pfmemalloc;
+
+	/*
+	 * Freed object isn't from this cpu's node, but that node is memoryless.
+	 * Proceed as it's better to cache remote objects than falling back to
+	 * the slowpath for everything. The allocation side can never obtain
+	 * a local object anyway, if none exist. We don't have numa_mem_id() to
+	 * point to the closest node as we would on a proper memoryless node
+	 * setup.
+	 */
+	if (unlikely(!node_isset(numa_node, slab_nodes)))
+		goto check_pfmemalloc;
+#endif
+
+	return false;
+
+check_pfmemalloc:
+	return likely(!slab_test_pfmemalloc(slab));
+}
+
 /*
  * Bulk free objects to the percpu sheaves.
  * Unlike free_to_pcs() this includes the calls to all necessary hooks
@@ -6023,7 +6073,6 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	struct node_barn *barn;
 	void *remote_objects[PCS_BATCH_MAX];
 	unsigned int remote_nr = 0;
-	int node = numa_mem_id();
 
 next_remote_batch:
 	while (i < size) {
@@ -6037,8 +6086,7 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 			continue;
 		}
 
-		if (unlikely((IS_ENABLED(CONFIG_NUMA) && slab_nid(slab) != node)
-			     || slab_test_pfmemalloc(slab))) {
+		if (unlikely(!can_free_to_pcs(slab))) {
 			remote_objects[remote_nr] = p[i];
 			p[i] = p[--size];
 			if (++remote_nr >= PCS_BATCH_MAX)
@@ -6214,11 +6262,8 @@ void slab_free(struct kmem_cache *s, struct slab *slab, void *object,
 	if (unlikely(!slab_free_hook(s, object, slab_want_init_on_free(s), false)))
 		return;
 
-	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())
-	    && likely(!slab_test_pfmemalloc(slab))) {
-		if (likely(free_to_pcs(s, object, true)))
-			return;
-	}
+	if (likely(can_free_to_pcs(slab)) && likely(free_to_pcs(s, object, true)))
+		return;
 
 	__slab_free(s, slab, object, object, 1, addr);
 	stat(s, FREE_SLOWPATH);
@@ -6589,10 +6634,8 @@ void kfree_nolock(const void *object)
 	 */
 	kasan_slab_free(s, x, false, false, /* skip quarantine */true);
 
-	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id())) {
-		if (likely(free_to_pcs(s, x, false)))
-			return;
-	}
+	if (likely(can_free_to_pcs(slab)) && likely(free_to_pcs(s, x, false)))
+		return;
 
 	/*
 	 * __slab_free() can locklessly cmpxchg16 into a slab, but then it might

-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] slab: support memoryless nodes with sheaves
  2026-03-11  8:25 [PATCH 0/3] slab: support memoryless nodes with sheaves Vlastimil Babka (SUSE)
                   ` (2 preceding siblings ...)
  2026-03-11  8:25 ` [PATCH 3/3] slab: free remote objects to sheaves on " Vlastimil Babka (SUSE)
@ 2026-03-11  9:49 ` Ming Lei
  2026-03-11 17:22   ` Vlastimil Babka (SUSE)
  2026-03-16 13:33 ` Vlastimil Babka (SUSE)
  4 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2026-03-11  9:49 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Harry Yoo, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Wed, Mar 11, 2026 at 09:25:54AM +0100, Vlastimil Babka (SUSE) wrote:
> This is the draft patch from [1] turned into a proper series with
> incremental changes. It's based on v7.0-rc3. It's too intrusive for a
> 7.0 hotfix, so we'll only be able to fix/reduce the regression in 7.1. I
> hope that's acceptable given that it's a non-standard configuration, 7.0
> is not an LTS, and it's a performance regression, not a functional one.
> 
> Ming, can you please retest this on top of v7.0-rc3, which already has
> fb1091febd66 ("mm/slab: allow sheaf refill if blocking is not
> allowed")? A separate data point for plain v7.0-rc3 could also be useful.
> 
> [1] https://lore.kernel.org/all/c6a01f7e-c6eb-454b-9b9e-734526dd659d@kernel.org/
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> Vlastimil Babka (SUSE) (3):
>       slab: decouple pointer to barn from kmem_cache_node
>       slab: create barns for online memoryless nodes
>       slab: free remote objects to sheaves on memoryless nodes

Hi Vlastimil and Guys,

I re-ran the test case used in https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/

- v6.19-rc5: 34M

- 815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next: 13M

- v7.0-rc3: 13M

- v7.0-rc3 + the three patches: 24M

# Test Machines

- AMD Zen4, dual socket, 64 cores, 8 NUMA nodes (BIOS configured for per-CCD NUMA; only 2 nodes have memory)

- numactl -H:

https://lore.kernel.org/all/aZ7p9uF8H8u6RxrK@fedora/

# slab stat log

root@tomsrv:~/temp/mm/7.0-rc3/patched# (cd /sys/kernel/slab/bio-256/ && find . -type f -exec grep -aH . {} \;)
./remote_node_defrag_ratio:100
./total_objects:7344 N1=3417 N5=3927
./alloc_fastpath:476106437 C0=128 C1=26852005 C2=128 C3=27291181 C4=65 C5=35617011 C6=97 C7=34258221 C8=96 C9=28158690 C11=26433128 C12=128 C13=31715794 C15=28819773 C16=97 C17=26168947 C19=30768051 C20=128 C21=32964376 C23=34696825 C25=26471644 C26=130 C27=27844688 C28=97 C29=28480054 C31=29564950 C40=1 C42=2 C63=2
./cpu_slabs:0
./objects:7265 N1=3374 N5=3891
./sheaf_return_slow:0
./objects_partial:533 N1=212 N5=321
./sheaf_return_fast:0
./cpu_partial:0
./free_slowpath:295 C4=158 C6=136 C20=1
./barn_get_fail:270 C0=5 C1=16 C2=5 C3=6 C4=3 C5=21 C6=4 C7=14 C8=2 C9=7 C11=23 C12=3 C13=10 C15=19 C16=3 C17=4 C19=25 C20=5 C21=22 C23=6 C25=21 C26=5 C27=6 C28=1 C29=4 C31=27 C40=1 C42=1 C63=1
./sheaf_prefill_oversize:0
./skip_kfence:0
./min_partial:5
./order_fallback:0
./sheaf_capacity:28
./sheaf_flush:0
./free_rcu_sheaf:0
./sheaf_alloc:179 C0=9 C1=1 C2=4 C4=8 C5=1 C6=4 C7=65 C8=3 C10=10 C11=1 C12=2 C14=11 C15=1 C16=5 C18=8 C19=1 C20=8 C21=1 C22=5 C24=8 C25=1 C26=5 C28=5 C30=8 C31=1 C40=1 C42=1 C63=1
./sheaf_free:0
./sheaf_prefill_slow:0
./sheaf_prefill_fast:0
./poison:0
./red_zone:0
./free_slab:0
./slabs:144 N1=67 N5=77
./barn_get:17003547 C1=958985 C3=974680 C5=1272016 C7=1223494 C8=2 C9=1005661 C11=944018 C12=2 C13=1132697 C15=1029259 C16=1 C17=934602 C19=1098834 C21=1177278 C23=1239167 C25=945395 C27=994448 C28=3 C29=1017141 C31=1055864
./alloc_slowpath:0
./destroy_by_rcu:1
./free_rcu_sheaf_fail:0
./barn_put:17003623 C0=958995 C2=974679 C4=1272023 C6=1223496 C8=1005661 C10=944030 C12=1132701 C14=1029267 C16=934598 C18=1098848 C20=1177293 C22=1239162 C24=945405 C26=994447 C28=1017138 C30=1055880
./usersize:0
./sanity_checks:0
./barn_put_fail:0
./align:64
./alloc_node_mismatch:0
./alloc_slab:144 C0=2 C1=8 C2=3 C3=2 C4=1 C5=5 C6=1 C7=3 C8=2 C9=4 C11=14 C12=2 C13=7 C15=11 C16=2 C17=3 C19=20 C20=1 C21=5 C23=1 C25=13 C26=4 C27=5 C29=1 C31=21 C40=1 C42=1 C63=1
./free_remove_partial:0
./aliases:0
./store_user:0
./trace:0
./reclaim_account:0
./order:2
./sheaf_refill:7560 C0=140 C1=448 C2=140 C3=168 C4=84 C5=588 C6=112 C7=392 C8=56 C9=196 C11=644 C12=84 C13=280 C15=532 C16=84 C17=112 C19=700 C20=140 C21=616 C23=168 C25=588 C26=140 C27=168 C28=28 C29=112 C31=756 C40=28 C42=28 C63=28
./object_size:256
./free_fastpath:476102026 C0=26851883 C2=27291053 C4=35616664 C6=34257923 C8=28158529 C9=1 C10=26432875 C11=2 C12=31715665 C14=28819520 C16=26168783 C18=30767788 C20=32964224 C21=2 C22=34696578 C24=26471388 C26=27844558 C27=2 C28=28479894 C30=29564692 C31=2
./hwcache_align:1
./cmpxchg_double_fail:0
./objs_per_slab:51
./partial:12 N1=5 N5=7
./slabs_cpu_partial:0(0)
./free_add_partial:143 C0=3 C1=8 C2=2 C3=4 C4=11 C5=16 C6=13 C7=9 C9=3 C11=8 C12=1 C13=3 C15=8 C16=1 C17=1 C19=5 C20=5 C21=17 C23=5 C25=8 C26=1 C27=1 C28=1 C29=3 C31=6
./slab_size:320
./cache_dma:0


Thanks,
Ming



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] slab: support memoryless nodes with sheaves
  2026-03-11  9:49 ` [PATCH 0/3] slab: support memoryless nodes with sheaves Ming Lei
@ 2026-03-11 17:22   ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-11 17:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Harry Yoo, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On 3/11/26 10:49, Ming Lei wrote:
> On Wed, Mar 11, 2026 at 09:25:54AM +0100, Vlastimil Babka (SUSE) wrote:
>> This is the draft patch from [1] turned into a proper series with
>> incremental changes. It's based on v7.0-rc3. It's too intrusive for a
>> 7.0 hotfix, so we'll only be able to fix/reduce the regression in 7.1. I
>> hope that's acceptable given that it's a non-standard configuration, 7.0
>> is not an LTS, and it's a performance regression, not a functional one.
>> 
>> Ming, can you please retest this on top of v7.0-rc3, which already has
>> fb1091febd66 ("mm/slab: allow sheaf refill if blocking is not
>> allowed")? A separate data point for plain v7.0-rc3 could also be useful.
>> 
>> [1] https://lore.kernel.org/all/c6a01f7e-c6eb-454b-9b9e-734526dd659d@kernel.org/
>> 
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
>> Vlastimil Babka (SUSE) (3):
>>       slab: decouple pointer to barn from kmem_cache_node
>>       slab: create barns for online memoryless nodes
>>       slab: free remote objects to sheaves on memoryless nodes
> 
> Hi Vlastimil and Guys,
> 
> I re-ran the test case used in https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/
> 
> - v6.19-rc5: 34M
> 
> - 815c8e35511d Merge branch 'slab/for-7.0/sheaves' into slab/for-next: 13M
> 
> - v7.0-rc3: 13M

Thanks, that's in line with your previous finding that "mm/slab: allow sheaf
refill if blocking is not allowed" makes no difference here. At least we've
now learned it helps other benchmarks :)

> - v7.0-rc3 + the three patches: 24M

OK. So now it might really come down to the difference in total per-cpu
caching capacity.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node
  2026-03-11  8:25 ` [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node Vlastimil Babka (SUSE)
@ 2026-03-13  9:27   ` Harry Yoo
  2026-03-13  9:46     ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 19+ messages in thread
From: Harry Yoo @ 2026-03-13  9:27 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Wed, Mar 11, 2026 at 09:25:55AM +0100, Vlastimil Babka (SUSE) wrote:
> The pointer to barn currently exists in struct kmem_cache_node. That
> struct is instantiated for every NUMA node with memory, but we want to
> have a barn for every online node (including memoryless).
> 
> Thus decouple the two structures. In struct kmem_cache we have an array
> for kmem_cache_node pointers that appears to be sized MAX_NUMNODES but
> the actual size calculation in kmem_cache_init() uses nr_node_ids.
> Therefore we can't just add another array of barn pointers. Instead
> change the array to newly introduced struct kmem_cache_per_node_ptrs
> holding both kmem_cache_node and barn pointer.
> 
> Adjust barn accessor and allocation/initialization code accordingly. For
> now no functional change intended, barns are created 1:1 together with
> kmem_cache_nodes.
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slab.h |   7 +++-
>  mm/slub.c | 128 +++++++++++++++++++++++++++++++++++---------------------------
>  2 files changed, 78 insertions(+), 57 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index e9ab292acd22..c735e6b4dddb 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -247,7 +252,7 @@ struct kmem_cache {
>  	struct kmem_cache_stats __percpu *cpu_stats;
>  #endif
>  
> -	struct kmem_cache_node *node[MAX_NUMNODES];
> +	struct kmem_cache_per_node_ptrs per_node[MAX_NUMNODES];
>  };

We should probably turn this into a true flexible array at some point,
but that's out of scope for this patchset.

> diff --git a/mm/slub.c b/mm/slub.c
> index 20cb4f3b636d..609a183f8533 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -436,26 +436,24 @@ struct kmem_cache_node {
>  /*
> - * Get the barn of the current cpu's closest memory node. It may not exist on
> - * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
> + * Get the barn of the current cpu's memory node. It may be a memoryless node.
>   */
>  static inline struct node_barn *get_barn(struct kmem_cache *s)
>  {
> -	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> -
> -	if (!n)
> -		return NULL;
> -
> -	return n->barn;
> +	return get_barn_node(s, numa_node_id());
>  }

Previously, memoryless nodes on architectures w/ CONFIG_HAVE_MEMORYLESS_NODES
shared the barn of the nearest NUMA node with memory.

But now memoryless nodes will have their own barns (after patch 2)
regardless of CONFIG_HAVE_MEMORYLESS_NODES, and that's intentional, right?

Otherwise LGTM!

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node
  2026-03-13  9:27   ` Harry Yoo
@ 2026-03-13  9:46     ` Vlastimil Babka (SUSE)
  2026-03-13 11:48       ` Harry Yoo
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-13  9:46 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Ming Lei, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On 3/13/26 10:27, Harry Yoo wrote:
> On Wed, Mar 11, 2026 at 09:25:55AM +0100, Vlastimil Babka (SUSE) wrote:
>> The pointer to barn currently exists in struct kmem_cache_node. That
>> struct is instantiated for every NUMA node with memory, but we want to
>> have a barn for every online node (including memoryless).
>> 
>> Thus decouple the two structures. In struct kmem_cache we have an array
>> for kmem_cache_node pointers that appears to be sized MAX_NUMNODES but
>> the actual size calculation in kmem_cache_init() uses nr_node_ids.
>> Therefore we can't just add another array of barn pointers. Instead
>> change the array to newly introduced struct kmem_cache_per_node_ptrs
>> holding both kmem_cache_node and barn pointer.
>> 
>> Adjust barn accessor and allocation/initialization code accordingly. For
>> now no functional change intended, barns are created 1:1 together with
>> kmem_cache_nodes.
>> 
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
>>  mm/slab.h |   7 +++-
>>  mm/slub.c | 128 +++++++++++++++++++++++++++++++++++---------------------------
>>  2 files changed, 78 insertions(+), 57 deletions(-)
>> 
>> diff --git a/mm/slab.h b/mm/slab.h
>> index e9ab292acd22..c735e6b4dddb 100644
>> --- a/mm/slab.h
>> +++ b/mm/slab.h
>> @@ -247,7 +252,7 @@ struct kmem_cache {
>>  	struct kmem_cache_stats __percpu *cpu_stats;
>>  #endif
>>  
>> -	struct kmem_cache_node *node[MAX_NUMNODES];
>> +	struct kmem_cache_per_node_ptrs per_node[MAX_NUMNODES];
>>  };
> 
> We should probably turn this into a true flexible array at some point,
> but that's out of scope for this patchset.

Right.

>> diff --git a/mm/slub.c b/mm/slub.c
>> index 20cb4f3b636d..609a183f8533 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -436,26 +436,24 @@ struct kmem_cache_node {
>>  /*
>> - * Get the barn of the current cpu's closest memory node. It may not exist on
>> - * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
>> + * Get the barn of the current cpu's memory node. It may be a memoryless node.
>>   */
>>  static inline struct node_barn *get_barn(struct kmem_cache *s)
>>  {
>> -	struct kmem_cache_node *n = get_node(s, numa_mem_id());
>> -
>> -	if (!n)
>> -		return NULL;
>> -
>> -	return n->barn;
>> +	return get_barn_node(s, numa_node_id());
>>  }
> 
> Previously, memoryless nodes on architectures w/ CONFIG_HAVE_MEMORYLESS_NODES
> shared the barn of the nearest NUMA node with memory.
> 
> But now memoryless nodes will have their own barns (after patch 2)
> regardless of CONFIG_HAVE_MEMORYLESS_NODES, and that's intentional, right?

Yeah it improves their caching capacity, but good point, will mention it in
the changelog.

> Otherwise LGTM!
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node
  2026-03-13  9:46     ` Vlastimil Babka (SUSE)
@ 2026-03-13 11:48       ` Harry Yoo
  2026-03-16 13:19         ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 19+ messages in thread
From: Harry Yoo @ 2026-03-13 11:48 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Fri, Mar 13, 2026 at 10:46:15AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/13/26 10:27, Harry Yoo wrote:
> > On Wed, Mar 11, 2026 at 09:25:55AM +0100, Vlastimil Babka (SUSE) wrote:
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 20cb4f3b636d..609a183f8533 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -436,26 +436,24 @@ struct kmem_cache_node {
> >>  /*
> >> - * Get the barn of the current cpu's closest memory node. It may not exist on
> >> - * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
> >> + * Get the barn of the current cpu's memory node. It may be a memoryless node.
> >>   */
> >>  static inline struct node_barn *get_barn(struct kmem_cache *s)
> >>  {
> >> -	struct kmem_cache_node *n = get_node(s, numa_mem_id());
> >> -
> >> -	if (!n)
> >> -		return NULL;
> >> -
> >> -	return n->barn;
> >> +	return get_barn_node(s, numa_node_id());
> >>  }
> > 
> > Previously, memoryless nodes on architectures w/ CONFIG_HAVE_MEMORYLESS_NODES
> > shared the barn of the nearest NUMA node with memory.
> > 
> > But now memoryless nodes will have their own barns (after patch 2)
> > regardless of CONFIG_HAVE_MEMORYLESS_NODES, and that's intentional, right?
> 
> Yeah it improves their caching capacity, but good point, will mention it in
> the changelog.

Thanks! just wanted to check that it was intentional.

with that, please feel free to add:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-11  8:25 ` [PATCH 2/3] slab: create barns for online memoryless nodes Vlastimil Babka (SUSE)
@ 2026-03-16  3:25   ` Harry Yoo
  2026-03-18  9:27   ` Hao Li
  1 sibling, 0 replies; 19+ messages in thread
From: Harry Yoo @ 2026-03-16  3:25 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Wed, Mar 11, 2026 at 09:25:56AM +0100, Vlastimil Babka (SUSE) wrote:
> Ming Lei has reported [1] a performance regression due to replacing cpu
> (partial) slabs with sheaves. With slub stats enabled, a large amount of
> slowpath allocations were observed. The affected system has 8 online
> NUMA nodes but only 2 have memory.
> 
> For sheaves to work effectively on given cpu, its NUMA node has to have
> struct node_barn allocated. Those are currently only allocated on nodes
> with memory (N_MEMORY) where kmem_cache_node also exist as the goal is
> to cache only node-local objects. But in order to have good performance
> on a memoryless node, we need its barn to exist and use sheaves to cache
> non-local objects (as no local objects can exist anyway).
> 
> Therefore change the implementation to allocate barns on all online
> nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug
> callback as that's when a memoryless node can become online.
> 
> Change rcu_sheaf->node assignment to numa_node_id() so it's returned to
> the barn of the local cpu's (potentially memoryless) node, and not to
> the nearest node with memory anymore.
> 
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] slab: free remote objects to sheaves on memoryless nodes
  2026-03-11  8:25 ` [PATCH 3/3] slab: free remote objects to sheaves on " Vlastimil Babka (SUSE)
@ 2026-03-16  3:48   ` Harry Yoo
  0 siblings, 0 replies; 19+ messages in thread
From: Harry Yoo @ 2026-03-16  3:48 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Wed, Mar 11, 2026 at 09:25:57AM +0100, Vlastimil Babka (SUSE) wrote:
> On memoryless nodes we can now allocate from cpu sheaves and refill them
> normally. But when a node is memoryless on a system without actual
> CONFIG_HAVE_MEMORYLESS_NODES support, freeing always uses the slowpath
> because all objects appear as remote. We could instead benefit from the
> freeing fastpath, because the allocations can't obtain local objects
> anyway if the node is memoryless.
> 
> Thus adapt the locality check when freeing, and move them to an inline
> function can_free_to_pcs() for a single shared implementation.
> 
> On configurations with CONFIG_HAVE_MEMORYLESS_NODES=y continue using
> numa_mem_id() so the percpu sheaves and barn on a memoryless node will
> contain mostly objects from the closest memory node (returned by
> numa_mem_id()). No change is thus intended for such configuration.
> 
> On systems with CONFIG_HAVE_MEMORYLESS_NODES=n use numa_node_id() (the
> cpu's node) since numa_mem_id() just aliases it anyway. But if we are
> freeing on a memoryless node, allow the freeing to use percpu sheaves
> for objects from any node, since they are all remote anyway.
> 
> This way we avoid the slowpath and get more performant freeing.

> The potential downside is that allocations will obtain objects with a larger
> average distance. If we kept bypassing the sheaves on freeing, a refill
> of sheaves from slabs would tend to get closer objects thanks to the
> ordering of the zonelist.

When I think about ways to avoid this, the right solution is to
implement HAVE_MEMORYLESS_NODES :)

> Architectures that allow de-facto memoryless
> nodes without proper CONFIG_HAVE_MEMORYLESS_NODES support should perhaps
> consider adding such support.

Exactly!

> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node
  2026-03-13 11:48       ` Harry Yoo
@ 2026-03-16 13:19         ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-16 13:19 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Ming Lei, Hao Li, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On 3/13/26 12:48, Harry Yoo wrote:
> On Fri, Mar 13, 2026 at 10:46:15AM +0100, Vlastimil Babka (SUSE) wrote:
>> On 3/13/26 10:27, Harry Yoo wrote:
>> > On Wed, Mar 11, 2026 at 09:25:55AM +0100, Vlastimil Babka (SUSE) wrote:
>> >> diff --git a/mm/slub.c b/mm/slub.c
>> >> index 20cb4f3b636d..609a183f8533 100644
>> >> --- a/mm/slub.c
>> >> +++ b/mm/slub.c
>> >> @@ -436,26 +436,24 @@ struct kmem_cache_node {
>> >>  /*
>> >> - * Get the barn of the current cpu's closest memory node. It may not exist on
>> >> - * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
>> >> + * Get the barn of the current cpu's memory node. It may be a memoryless node.
>> >>   */
>> >>  static inline struct node_barn *get_barn(struct kmem_cache *s)
>> >>  {
>> >> -	struct kmem_cache_node *n = get_node(s, numa_mem_id());
>> >> -
>> >> -	if (!n)
>> >> -		return NULL;
>> >> -
>> >> -	return n->barn;
>> >> +	return get_barn_node(s, numa_node_id());
>> >>  }
>> > 
>> > Previously, memoryless nodes on architectures w/ CONFIG_HAVE_MEMORYLESS_NODES
>> > shared the barn of the nearest NUMA node with memory.
>> > 
>> > But now memoryless nodes will have their own barns (after patch 2)
>> > regardless of CONFIG_HAVE_MEMORYLESS_NODES, and that's intentional, right?
>> 
>> Yeah it improves their caching capacity, but good point, will mention it in
>> the changelog.
> 
> Thanks! just wanted to check that it was intentional.

I wanted to update the changelog as promised. But I realized that the change
from numa_mem_id() to numa_node_id() in get_barn() should actually be done
only in patch 2, so I will move it there. In patch 1 that would mean no
barns with CONFIG_HAVE_MEMORYLESS_NODES and thus a performance bisection hazard.

> with that, please feel free to add:
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

Thanks!



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] slab: support memoryless nodes with sheaves
  2026-03-11  8:25 [PATCH 0/3] slab: support memoryless nodes with sheaves Vlastimil Babka (SUSE)
                   ` (3 preceding siblings ...)
  2026-03-11  9:49 ` [PATCH 0/3] slab: support memoryless nodes with sheaves Ming Lei
@ 2026-03-16 13:33 ` Vlastimil Babka (SUSE)
  4 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-16 13:33 UTC (permalink / raw)
  To: Ming Lei, Harry Yoo
  Cc: Hao Li, Andrew Morton, Christoph Lameter, David Rientjes,
	Roman Gushchin, linux-mm, linux-kernel

On 3/11/26 09:25, Vlastimil Babka (SUSE) wrote:
> This is the draft patch from [1] turned into a proper series with
> incremental changes. It's based on v7.0-rc3. It's too intrusive for a
> 7.0 hotfix, so we'll only be able to fix/reduce the regression in 7.1. I
> hope it's acceptable given it's a non-standard configuration, 7.0 is not
> a LTS, and it's a perf regression, not functionality.
> 
> Ming can you please retest this on top of v7.0-rc3, which already has
> fb1091febd66 ("mm/slab: allow sheaf refill if blocking is not
> allowed"). Separate data point for v7.0-rc3 could be also useful.
> 
> [1] https://lore.kernel.org/all/c6a01f7e-c6eb-454b-9b9e-734526dd659d@kernel.org/
> 
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
> Vlastimil Babka (SUSE) (3):
>       slab: decouple pointer to barn from kmem_cache_node
>       slab: create barns for online memoryless nodes
>       slab: free remote objects to sheaves on memoryless nodes
> 
>  mm/slab.h |   7 +-
>  mm/slub.c | 256 +++++++++++++++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 191 insertions(+), 72 deletions(-)
> ---
> base-commit: 1f318b96cc84d7c2ab792fcc0bfd42a7ca890681
> change-id: 20260311-b4-slab-memoryless-barns-fad64172ba05
> 
> Best regards,

Range-diff in slab/for-7.1/sheaves after applying Harry's feedback:

  2:  cc67056e94f1 ! 472:  b002755da434 slab: decouple pointer to barn from kmem_cache_node
    @@ Commit message
     
         Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-1-70ab850be4ce@kernel.org
         Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
    +    Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
     
      ## mm/slab.h ##
     @@ mm/slab.h: struct kmem_cache_order_objects {
    @@ mm/slub.c: struct kmem_cache_node {
      /*
     - * Get the barn of the current cpu's closest memory node. It may not exist on
     - * systems with memoryless nodes but without CONFIG_HAVE_MEMORYLESS_NODES
    -+ * Get the barn of the current cpu's memory node. It may be a memoryless node.
    ++ * Get the barn of the current cpu's NUMA node. It may be a memoryless node.
       */
      static inline struct node_barn *get_barn(struct kmem_cache *s)
      {
    @@ mm/slub.c: struct kmem_cache_node {
     -          return NULL;
     -
     -  return n->barn;
    -+  return get_barn_node(s, numa_node_id());
    ++  return get_barn_node(s, numa_mem_id());
      }
      
      /*
  3:  285bca63cf15 ! 473:  f811cc3d9f6e slab: create barns for online memoryless nodes
    @@ Commit message
         nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug
         callback as that's when a memoryless node can become online.
     
    -    Change rcu_sheaf->node assignment to numa_node_id() so it's returned to
    -    the barn of the local cpu's (potentially memoryless) node, and not to
    -    the nearest node with memory anymore.
    +    Change both get_barn() and rcu_sheaf->node assignment to numa_node_id()
    +    so it's returned to the barn of the local cpu's (potentially memoryless)
    +    node, and not to the nearest node with memory anymore.
    +
    +    On systems with CONFIG_HAVE_MEMORYLESS_NODES=y (which are not the main
    +    target of this change) barns did not exist on memoryless nodes, but
    +    get_barn() using numa_mem_id() meant a barn was returned from the
    +    nearest node with memory. This works, but the barn lock contention
    +    increases with every such memoryless node. With this change, barn will
    +    be allocated also on the memoryless node, reducing this contention in
    +    exchange for increased memory consumption.
     
         Reported-by: Ming Lei <ming.lei@redhat.com>
         Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1]
         Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-2-70ab850be4ce@kernel.org
         Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
    +    Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
     
      ## mm/slub.c ##
    +@@ mm/slub.c: static inline struct node_barn *get_barn_node(struct kmem_cache *s, int node)
    +  */
    + static inline struct node_barn *get_barn(struct kmem_cache *s)
    + {
    +-  return get_barn_node(s, numa_mem_id());
    ++  return get_barn_node(s, numa_node_id());
    + }
    + 
    + /*
     @@ mm/slub.c: static inline struct node_barn *get_barn(struct kmem_cache *s)
       */
      static nodemask_t slab_nodes;
  4:  1fe49af3aa46 ! 474:  86e18f36844f slab: free remote objects to sheaves on memoryless nodes
    @@ Commit message
     
         Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-3-70ab850be4ce@kernel.org
         Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
    +    Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
     
      ## mm/slub.c ##
     @@ mm/slub.c: bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-11  8:25 ` [PATCH 2/3] slab: create barns for online memoryless nodes Vlastimil Babka (SUSE)
  2026-03-16  3:25   ` Harry Yoo
@ 2026-03-18  9:27   ` Hao Li
  2026-03-18 12:11     ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 19+ messages in thread
From: Hao Li @ 2026-03-18  9:27 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Harry Yoo, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Wed, Mar 11, 2026 at 09:25:56AM +0100, Vlastimil Babka (SUSE) wrote:
> Ming Lei has reported [1] a performance regression due to replacing cpu
> (partial) slabs with sheaves. With slub stats enabled, a large amount of
> slowpath allocations were observed. The affected system has 8 online
> NUMA nodes but only 2 have memory.
> 
> For sheaves to work effectively on given cpu, its NUMA node has to have
> struct node_barn allocated. Those are currently only allocated on nodes
> with memory (N_MEMORY) where kmem_cache_node also exist as the goal is
> to cache only node-local objects. But in order to have good performance
> on a memoryless node, we need its barn to exist and use sheaves to cache
> non-local objects (as no local objects can exist anyway).
> 
> Therefore change the implementation to allocate barns on all online
> nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug
> callback as that's when a memoryless node can become online.
> 
> Change rcu_sheaf->node assignment to numa_node_id() so it's returned to
> the barn of the local cpu's (potentially memoryless) node, and not to
> the nearest node with memory anymore.
> 
> Reported-by: Ming Lei <ming.lei@redhat.com>
> Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1]
> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> ---
>  mm/slub.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 59 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 609a183f8533..d8496b37e364 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
[...]
>  
>  	/*
> @@ -7597,7 +7648,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>  	if (slab_state == DOWN || !cache_has_sheaves(s))
>  		return 1;
>  
> -	for_each_node_mask(node, slab_nodes) {
> +	for_each_node_mask(node, slab_barn_nodes) {
>  		struct node_barn *barn;
>  
>  		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> @@ -8250,6 +8301,7 @@ static int slab_mem_going_online_callback(int nid)
>  	 * and barn initialized for the new node.
>  	 */
>  	node_set(nid, slab_nodes);
> +	node_set(nid, slab_barn_nodes);

I had a somewhat related question here.

During memory hotplug, we call node_set() on slab_nodes when memory is brought
online, but we do not seem to call node_clear() when memory is taken offline. I
was wondering what the reasoning behind this is.

That also made me wonder about a related case. If I am understanding this
correctly, even if all memory of a node has been offlined, slab_nodes would
still make it appear that the node has memory, even though in reality it no
longer does. If so, then in patch 3, the condition
"if (unlikely(!node_isset(numa_node, slab_nodes)))" in can_free_to_pcs() seems
like it would cause the object free path to skip sheaves.

-- 
Thanks,
Hao


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-18  9:27   ` Hao Li
@ 2026-03-18 12:11     ` Vlastimil Babka (SUSE)
  2026-03-19  7:01       ` Hao Li
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-18 12:11 UTC (permalink / raw)
  To: Hao Li
  Cc: Ming Lei, Harry Yoo, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On 3/18/26 10:27, Hao Li wrote:
> On Wed, Mar 11, 2026 at 09:25:56AM +0100, Vlastimil Babka (SUSE) wrote:
>> Ming Lei has reported [1] a performance regression due to replacing cpu
>> (partial) slabs with sheaves. With slub stats enabled, a large amount of
>> slowpath allocations were observed. The affected system has 8 online
>> NUMA nodes but only 2 have memory.
>> 
>> For sheaves to work effectively on given cpu, its NUMA node has to have
>> struct node_barn allocated. Those are currently only allocated on nodes
>> with memory (N_MEMORY) where kmem_cache_node also exist as the goal is
>> to cache only node-local objects. But in order to have good performance
>> on a memoryless node, we need its barn to exist and use sheaves to cache
>> non-local objects (as no local objects can exist anyway).
>> 
>> Therefore change the implementation to allocate barns on all online
>> nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug
>> callback as that's when a memoryless node can become online.
>> 
>> Change rcu_sheaf->node assignment to numa_node_id() so it's returned to
>> the barn of the local cpu's (potentially memoryless) node, and not to
>> the nearest node with memory anymore.
>> 
>> Reported-by: Ming Lei <ming.lei@redhat.com>
>> Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1]
>> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
>> ---
>>  mm/slub.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 59 insertions(+), 4 deletions(-)
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 609a183f8533..d8496b37e364 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
> [...]
>>  
>>  	/*
>> @@ -7597,7 +7648,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
>>  	if (slab_state == DOWN || !cache_has_sheaves(s))
>>  		return 1;
>>  
>> -	for_each_node_mask(node, slab_nodes) {
>> +	for_each_node_mask(node, slab_barn_nodes) {
>>  		struct node_barn *barn;
>>  
>>  		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
>> @@ -8250,6 +8301,7 @@ static int slab_mem_going_online_callback(int nid)
>>  	 * and barn initialized for the new node.
>>  	 */
>>  	node_set(nid, slab_nodes);
>> +	node_set(nid, slab_barn_nodes);
> 
> I had a somewhat related question here.
> 
> During memory hotplug, we call node_set() on slab_nodes when memory is brought
> online, but we do not seem to call node_clear() when memory is taken offline. I
> was wondering what the reasoning behind this is.

Probably nobody took up the task of implementing the necessary teardown.

> That also made me wonder about a related case. If I am understanding this
> correctly, even if all memory of a node has been offlined, slab_nodes would
> still make it appear that the node has memory, even though in reality it no
> longer does. If so, then in patch 3, the condition
> "if (unlikely(!node_isset(numa_node, slab_nodes)))" in can_free_to_pcs() seems
> like it would cause the object free path to skip sheaves.

Maybe the condition should be looking at N_MEMORY then?

Also ideally we should be using N_NORMAL_MEMORY everywhere for slab_nodes.
Oh we actually did, but gave that up in commit 1bf47d4195e45.

Note in practice full memory offline of a node can only be achieved if it
was all ZONE_MOVABLE and thus no slab allocations ever happened on it. But
if it has only movable memory, it's practically memoryless for slab
purposes. Maybe the condition should be looking at N_NORMAL_MEMORY then.
That would cover the case when it became offline and also the case when it's
online but with only movable memory?

I don't know if with CONFIG_HAVE_MEMORYLESS_NODES it's possible that
numa_mem_id() (the closest node with memory) would be ZONE_MOVABLE only.
Maybe let's hope not, and not adjust that part?



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-18 12:11     ` Vlastimil Babka (SUSE)
@ 2026-03-19  7:01       ` Hao Li
  2026-03-19  9:56         ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Li @ 2026-03-19  7:01 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Harry Yoo, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Wed, Mar 18, 2026 at 01:11:58PM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/18/26 10:27, Hao Li wrote:
> > On Wed, Mar 11, 2026 at 09:25:56AM +0100, Vlastimil Babka (SUSE) wrote:
> >> Ming Lei has reported [1] a performance regression due to replacing cpu
> >> (partial) slabs with sheaves. With slub stats enabled, a large amount of
> >> slowpath allocations were observed. The affected system has 8 online
> >> NUMA nodes but only 2 have memory.
> >> 
> >> For sheaves to work effectively on given cpu, its NUMA node has to have
> >> struct node_barn allocated. Those are currently only allocated on nodes
> >> with memory (N_MEMORY) where kmem_cache_node also exist as the goal is
> >> to cache only node-local objects. But in order to have good performance
> >> on a memoryless node, we need its barn to exist and use sheaves to cache
> >> non-local objects (as no local objects can exist anyway).
> >> 
> >> Therefore change the implementation to allocate barns on all online
> >> nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug
> >> callback as that's when a memoryless node can become online.
> >> 
> >> Change rcu_sheaf->node assignment to numa_node_id() so it's returned to
> >> the barn of the local cpu's (potentially memoryless) node, and not to
> >> the nearest node with memory anymore.
> >> 
> >> Reported-by: Ming Lei <ming.lei@redhat.com>
> >> Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1]
> >> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
> >> ---
> >>  mm/slub.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >>  1 file changed, 59 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 609a183f8533..d8496b37e364 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> > [...]
> >>  
> >>  	/*
> >> @@ -7597,7 +7648,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> >>  	if (slab_state == DOWN || !cache_has_sheaves(s))
> >>  		return 1;
> >>  
> >> -	for_each_node_mask(node, slab_nodes) {
> >> +	for_each_node_mask(node, slab_barn_nodes) {
> >>  		struct node_barn *barn;
> >>  
> >>  		barn = kmalloc_node(sizeof(*barn), GFP_KERNEL, node);
> >> @@ -8250,6 +8301,7 @@ static int slab_mem_going_online_callback(int nid)
> >>  	 * and barn initialized for the new node.
> >>  	 */
> >>  	node_set(nid, slab_nodes);
> >> +	node_set(nid, slab_barn_nodes);
> > 
> > I had a somewhat related question here.
> > 
> > During memory hotplug, we call node_set() on slab_nodes when memory is brought
> > online, but we do not seem to call node_clear() when memory is taken offline. I
> > was wondering what the reasoning behind this is.
> 
> Probably nobody took on the task of implementing the necessary teardown.
> 
> > That also made me wonder about a related case. If I am understanding this
> > correctly, even if all memory of a node has been offlined, slab_nodes would
> > still make it appear that the node has memory, even though in reality it no
> > longer does. If so, then in patch 3, the condition
> > "if (unlikely(!node_isset(numa_node, slab_nodes)))" in can_free_to_pcs() seems
> > like it would cause the object free path to skip sheaves.
> 
> Maybe the condition should be looking at N_MEMORY then?

Yes, that's what I was thinking too.
I feel that, at least for the current patchset, this is probably a reasonable
approach.

> 
> Also ideally we should be using N_NORMAL_MEMORY everywhere for slab_nodes.
> Oh we actually did, but gave that up in commit 1bf47d4195e45.

Thanks, I hadn't realized that node_clear had actually existed before.

> 
> Note in practice full memory offline of a node can only be achieved if it
> was all ZONE_MOVABLE and thus no slab allocations ever happened on it. But
> if it has only movable memory, it's practically memoryless for slab
> purposes.

That's a good point! I just realized that too.

> Maybe the condition should be looking at N_NORMAL_MEMORY then.
> That would cover the case when it became offline and also the case when it's
> online but with only movable memory?

Exactly, conceptually, N_NORMAL_MEMORY seems more precise than N_MEMORY. I took
a quick look through the code, though, and it seems that N_NORMAL_MEMORY hasn't
been fully handled in the hotplug code.

Given that, I think it makes sense to use N_MEMORY for now, and then switch to
N_NORMAL_MEMORY later once the handling there is improved.

> 
> I don't know if with CONFIG_HAVE_MEMORYLESS_NODES it's possible that
> numa_mem_id() (the closest node with memory) would be ZONE_MOVABLE only.
> Maybe let's hope not, and not adjust that part?
> 

I think that, in the CONFIG_HAVE_MEMORYLESS_NODES=y case, numa_mem_id() ends up
calling local_memory_node(), and the NUMA node it returns should be one that
can allocate slab memory. So the slab_node == numa_node check seems reasonable
to me.

So it seems that the issue being discussed here may only be specific to the
CONFIG_HAVE_MEMORYLESS_NODES=n case.

-- 
Thanks,
Hao


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-19  7:01       ` Hao Li
@ 2026-03-19  9:56         ` Vlastimil Babka (SUSE)
  2026-03-19 11:27           ` Hao Li
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-19  9:56 UTC (permalink / raw)
  To: Hao Li
  Cc: Ming Lei, Harry Yoo, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On 3/19/26 08:01, Hao Li wrote:
> On Wed, Mar 18, 2026 at 01:11:58PM +0100, Vlastimil Babka (SUSE) wrote:
>> On 3/18/26 10:27, Hao Li wrote:
>> > On Wed, Mar 11, 2026 at 09:25:56AM +0100, Vlastimil Babka (SUSE) wrote:
>> > 
>> > I had a somewhat related question here.
>> > 
>> > During memory hotplug, we call node_set() on slab_nodes when memory is brought
>> > online, but we do not seem to call node_clear() when memory is taken offline. I
>> > was wondering what the reasoning behind this is.
>> 
>> Probably nobody took on the task of implementing the necessary teardown.
>> 
>> > That also made me wonder about a related case. If I am understanding this
>> > correctly, even if all memory of a node has been offlined, slab_nodes would
>> > still make it appear that the node has memory, even though in reality it no
>> > longer does. If so, then in patch 3, the condition
>> > "if (unlikely(!node_isset(numa_node, slab_nodes)))" in can_free_to_pcs() seems
>> > like it would cause the object free path to skip sheaves.
>> 
>> Maybe the condition should be looking at N_MEMORY then?
> 
> Yes, that's what I was thinking too.
> I feel that, at least for the current patchset, this is probably a reasonable
> approach.

Ack.

>> 
>> Also ideally we should be using N_NORMAL_MEMORY everywhere for slab_nodes.
>> Oh we actually did, but gave that up in commit 1bf47d4195e45.
> 
> Thanks, I hadn't realized that node_clear had actually existed before.
> 
>> 
>> Note in practice full memory offline of a node can only be achieved if it
>> was all ZONE_MOVABLE and thus no slab allocations ever happened on it. But
>> if it has only movable memory, it's practically memoryless for slab
>> purposes.
> 
> That's a good point! I just realized that too.
> 
>> Maybe the condition should be looking at N_NORMAL_MEMORY then.
>> That would cover the case when it became offline and also the case when it's
>> online but with only movable memory?
> 
> Exactly, conceptually, N_NORMAL_MEMORY seems more precise than N_MEMORY. I took
> a quick look through the code, though, and it seems that N_NORMAL_MEMORY hasn't
> been fully handled in the hotplug code.

Huh, you're right, the hotplug code doesn't seem to set it. How much of the
code we have is broken by that?
It seems hotplug doesn't handle it since 2007 in commit 37b07e4163f7
("memoryless nodes: fixup uses of node_online_map in generic code"),
although the initial support in 7ea1530ab3fd ("Memoryless nodes: introduce
mask of nodes with memory") did set it from hotplug.

> Given that, I think it makes sense to use N_MEMORY for now, and then switch to
> N_NORMAL_MEMORY later once the handling there is improved.

So I'll do this:

diff --git a/mm/slub.c b/mm/slub.c
index 01ab90bb4622..fb2c5c57bc4e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6029,7 +6029,7 @@ static __always_inline bool can_free_to_pcs(struct slab *slab)
         * point to the closest node as we would on a proper memoryless node
         * setup.
         */
-       if (unlikely(!node_isset(numa_node, slab_nodes)))
+       if (unlikely(!node_state(numa_node, N_MEMORY)))
                goto check_pfmemalloc;
 #endif


>> 
>> I don't know if with CONFIG_HAVE_MEMORYLESS_NODES it's possible that
>> numa_mem_id() (the closest node with memory) would be ZONE_MOVABLE only.
>> Maybe let's hope not, and not adjust that part?
>> 
> 
> I think that, in the CONFIG_HAVE_MEMORYLESS_NODES=y case, numa_mem_id() ends up
> calling local_memory_node(), and the NUMA node it returns should be one that
> can allocate slab memory. So the slab_node == numa_node check seems reasonable
> to me.
> 
> So it seems that the issue being discussed here may only be specific to the
> CONFIG_HAVE_MEMORYLESS_NODES=n case.

Great. Thanks!



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-19  9:56         ` Vlastimil Babka (SUSE)
@ 2026-03-19 11:27           ` Hao Li
  2026-03-19 12:25             ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Li @ 2026-03-19 11:27 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Ming Lei, Harry Yoo, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On Thu, Mar 19, 2026 at 10:56:09AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/19/26 08:01, Hao Li wrote:
> > On Wed, Mar 18, 2026 at 01:11:58PM +0100, Vlastimil Babka (SUSE) wrote:
> >> On 3/18/26 10:27, Hao Li wrote:
> >> > On Wed, Mar 11, 2026 at 09:25:56AM +0100, Vlastimil Babka (SUSE) wrote:
> >> > 
> >> > I had a somewhat related question here.
> >> > 
> >> > During memory hotplug, we call node_set() on slab_nodes when memory is brought
> >> > online, but we do not seem to call node_clear() when memory is taken offline. I
> >> > was wondering what the reasoning behind this is.
> >> 
> >> Probably nobody took on the task of implementing the necessary teardown.
> >> 
> >> > That also made me wonder about a related case. If I am understanding this
> >> > correctly, even if all memory of a node has been offlined, slab_nodes would
> >> > still make it appear that the node has memory, even though in reality it no
> >> > longer does. If so, then in patch 3, the condition
> >> > "if (unlikely(!node_isset(numa_node, slab_nodes)))" in can_free_to_pcs() seems
> >> > like it would cause the object free path to skip sheaves.
> >> 
> >> Maybe the condition should be looking at N_MEMORY then?
> > 
> > Yes, that's what I was thinking too.
> > I feel that, at least for the current patchset, this is probably a reasonable
> > approach.
> 
> Ack.
> 
> >> 
> >> Also ideally we should be using N_NORMAL_MEMORY everywhere for slab_nodes.
> >> Oh we actually did, but gave that up in commit 1bf47d4195e45.
> > 
> > Thanks, I hadn't realized that node_clear had actually existed before.
> > 
> >> 
> >> Note in practice full memory offline of a node can only be achieved if it
> >> was all ZONE_MOVABLE and thus no slab allocations ever happened on it. But
> >> if it has only movable memory, it's practically memoryless for slab
> >> purposes.
> > 
> > That's a good point! I just realized that too.
> > 
> >> Maybe the condition should be looking at N_NORMAL_MEMORY then.
> >> That would cover the case when it became offline and also the case when it's
> >> online but with only movable memory?
> > 
> > Exactly, conceptually, N_NORMAL_MEMORY seems more precise than N_MEMORY. I took
> > a quick look through the code, though, and it seems that N_NORMAL_MEMORY hasn't
> > been fully handled in the hotplug code.
> 
> Huh, you're right, the hotplug code doesn't seem to set it. How much of the
> code we have is broken by that?

This probably needs a bit more digging.

> It seems hotplug doesn't handle it since 2007 in commit 37b07e4163f7
> ("memoryless nodes: fixup uses of node_online_map in generic code"),
> although the initial support in 7ea1530ab3fd ("Memoryless nodes: introduce
> mask of nodes with memory") did set it from hotplug.

Yes, this really is quite an old issue. It looks like we may need to dig
through the git history a bit more carefully.

I'd be happy to dig into it further.

> 
> > Given that, I think it makes sense to use N_MEMORY for now, and then switch to
> > N_NORMAL_MEMORY later once the handling there is improved.
> 
> So I'll do this:
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 01ab90bb4622..fb2c5c57bc4e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -6029,7 +6029,7 @@ static __always_inline bool can_free_to_pcs(struct slab *slab)
>          * point to the closest node as we would on a proper memoryless node
>          * setup.
>          */
> -       if (unlikely(!node_isset(numa_node, slab_nodes)))
> +       if (unlikely(!node_state(numa_node, N_MEMORY)))

Looks good to me.

I've gone through the full series, including the range-diff updates, and the
rest looks good to me.
Feel free to add my rb-tag to the three updated patches. Thanks!

Reviewed-by: Hao Li <hao.li@linux.dev>

>                 goto check_pfmemalloc;
>  #endif
> 
> 
> >> 
> >> I don't know if with CONFIG_HAVE_MEMORYLESS_NODES it's possible that
> >> numa_mem_id() (the closest node with memory) would be ZONE_MOVABLE only.
> >> Maybe let's hope not, and not adjust that part?
> >> 
> > 
> > I think that, in the CONFIG_HAVE_MEMORYLESS_NODES=y case, numa_mem_id() ends up
> > calling local_memory_node(), and the NUMA node it returns should be one that
> > can allocate slab memory. So the slab_node == numa_node check seems reasonable
> > to me.
> > 
> > So it seems that the issue being discussed here may only be specific to the
> > CONFIG_HAVE_MEMORYLESS_NODES=n case.
> 
> Great. Thanks!
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] slab: create barns for online memoryless nodes
  2026-03-19 11:27           ` Hao Li
@ 2026-03-19 12:25             ` Vlastimil Babka (SUSE)
  0 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-19 12:25 UTC (permalink / raw)
  To: Hao Li
  Cc: Ming Lei, Harry Yoo, Andrew Morton, Christoph Lameter,
	David Rientjes, Roman Gushchin, linux-mm, linux-kernel

On 3/19/26 12:27, Hao Li wrote:
> On Thu, Mar 19, 2026 at 10:56:09AM +0100, Vlastimil Babka (SUSE) wrote:
>> > 
>> > Exactly, conceptually, N_NORMAL_MEMORY seems more precise than N_MEMORY. I took
>> > a quick look through the code, though, and it seems that N_NORMAL_MEMORY hasn't
>> > been fully handled in the hotplug code.
>> 
>> Huh, you're right, the hotplug code doesn't seem to set it. How much of the
>> code we have is broken by that?
> 
> This probably needs a bit more digging.
> 
>> It seems hotplug doesn't handle it since 2007 in commit 37b07e4163f7
>> ("memoryless nodes: fixup uses of node_online_map in generic code"),
>> although the initial support in 7ea1530ab3fd ("Memoryless nodes: introduce
>> mask of nodes with memory") did set it from hotplug.
> 
> Yes, this really is quite an old issue. It looks like we may need to dig
> through the git history a bit more carefully.
> 
> I'd be happy to dig into it further.

Great!

> 
>> 
>> > Given that, I think it makes sense to use N_MEMORY for now, and then switch to
>> > N_NORMAL_MEMORY later once the handling there is improved.
>> 
>> So I'll do this:
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 01ab90bb4622..fb2c5c57bc4e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -6029,7 +6029,7 @@ static __always_inline bool can_free_to_pcs(struct slab *slab)
>>          * point to the closest node as we would on a proper memoryless node
>>          * setup.
>>          */
>> -       if (unlikely(!node_isset(numa_node, slab_nodes)))
>> +       if (unlikely(!node_state(numa_node, N_MEMORY)))
> 
> Looks good to me.
> 
> I've gone through the full series, including the range-diff updates, and the
> rest looks good to me.
> Feel free to add my rb-tag to the three updated patches. Thanks!
> 
> Reviewed-by: Hao Li <hao.li@linux.dev>

Thanks, updated in slab/for-next

> 
>>                 goto check_pfmemalloc;
>>  #endif
>> 
>> 
>> >> 
>> >> I don't know if with CONFIG_HAVE_MEMORYLESS_NODES it's possible that
>> >> numa_mem_id() (the closest node with memory) would be ZONE_MOVABLE only.
>> >> Maybe let's hope not, and not adjust that part?
>> >> 
>> > 
>> > I think that, in the CONFIG_HAVE_MEMORYLESS_NODES=y case, numa_mem_id() ends up
>> > calling local_memory_node(), and the NUMA node it returns should be one that
>> > can allocate slab memory. So the slab_node == numa_node check seems reasonable
>> > to me.
>> > 
>> > So it seems that the issue being discussed here may only be specific to the
>> > CONFIG_HAVE_MEMORYLESS_NODES=n case.
>> 
>> Great. Thanks!
>> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-03-19 12:25 UTC | newest]

Thread overview: 19+ messages
2026-03-11  8:25 [PATCH 0/3] slab: support memoryless nodes with sheaves Vlastimil Babka (SUSE)
2026-03-11  8:25 ` [PATCH 1/3] slab: decouple pointer to barn from kmem_cache_node Vlastimil Babka (SUSE)
2026-03-13  9:27   ` Harry Yoo
2026-03-13  9:46     ` Vlastimil Babka (SUSE)
2026-03-13 11:48       ` Harry Yoo
2026-03-16 13:19         ` Vlastimil Babka (SUSE)
2026-03-11  8:25 ` [PATCH 2/3] slab: create barns for online memoryless nodes Vlastimil Babka (SUSE)
2026-03-16  3:25   ` Harry Yoo
2026-03-18  9:27   ` Hao Li
2026-03-18 12:11     ` Vlastimil Babka (SUSE)
2026-03-19  7:01       ` Hao Li
2026-03-19  9:56         ` Vlastimil Babka (SUSE)
2026-03-19 11:27           ` Hao Li
2026-03-19 12:25             ` Vlastimil Babka (SUSE)
2026-03-11  8:25 ` [PATCH 3/3] slab: free remote objects to sheaves on " Vlastimil Babka (SUSE)
2026-03-16  3:48   ` Harry Yoo
2026-03-11  9:49 ` [PATCH 0/3] slab: support memoryless nodes with sheaves Ming Lei
2026-03-11 17:22   ` Vlastimil Babka (SUSE)
2026-03-16 13:33 ` Vlastimil Babka (SUSE)
