Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree


* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
       [not found]       ` <1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz>
@ 2025-11-27 14:00         ` Daniel Gomez
  2025-11-27 19:29           ` Suren Baghdasaryan
  0 siblings, 1 reply; 7+ messages in thread
From: Daniel Gomez @ 2025-11-27 14:00 UTC (permalink / raw)
  To: Vlastimil Babka, Harry Yoo, Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, bpf,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi



On 05/11/2025 12.25, Vlastimil Babka wrote:
> On 11/3/25 04:17, Harry Yoo wrote:
>> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
>>>
>>>
>>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>>>> addition to main and spare sheaves.
>>>>
>>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>>>> the sheaf is detached and submitted to call_rcu() with a handler that
>>>> will try to put it in the barn, or flush to slab pages using bulk free,
>>>> when the barn is full. Then a new empty sheaf must be obtained to put
>>>> more objects there.
>>>>
>>>> It's possible that no free sheaves are available to use for a new
>>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>>>> kfree_rcu() implementation.
>>>>
>>>> Expected advantages:
>>>> - batching the kfree_rcu() operations, that could eventually replace the
>>>>   existing batching
>>>> - sheaves can be reused for allocations via barn instead of being
>>>>   flushed to slabs, which is more efficient
>>>>   - this includes cases where only some cpus are allowed to process rcu
>>>>     callbacks (Android)
>>>>
>>>> Possible disadvantage:
>>>> - objects might be waiting for more than their grace period (it is
>>>>   determined by the last object freed into the sheaf), increasing memory
>>>>   usage - but the existing batching does that too.
>>>>
>>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>>>> implementation favors smaller memory footprint over performance.
>>>>
>>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>>>> contexts where kfree_rcu() is called might not be compatible with taking
>>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>>>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>>>
>>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>>>> that have them. This is not a cheap operation, but the barrier usage is
>>>> rare - currently kmem_cache_destroy() or on module unload.
>>>>
>>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>>>> many had to fall back to the existing implementation.
>>>>
>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>
>>> Hi Vlastimil,
>>>
>>> This patch increases kmod selftest (stress module loader) runtime by about
>>> 50-60%, from ~200s to ~300s total execution time. My tested kernel has
>>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>>> causing this, or how to address it?
>>
>> This is likely due to increased kvfree_rcu_barrier() during module unload.
> 
> Hm so there are actually two possible sources of this. One is that the
> module creates some kmem_cache and calls kmem_cache_destroy() on it before
> unloading. That does kvfree_rcu_barrier() which iterates all caches via
> flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> that single cache.

Thanks for the feedback. And thanks to Jon who has revived this again.

> 
> The other source is codetag_unload_module(), and I'm afraid it's this one as
> it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?

Yes, we do have that enabled.

> Disabling it should help in this case, if you don't need memory allocation
> profiling for that stress test. I think there's some space for improvement -
> when compiled in but memalloc profiling never enabled during the uptime,
> this could probably be skipped? Suren?
> 
>> It currently iterates over every (CPU, slab cache) pair, for caches that have
>> sheaves enabled (there should be only a few now), to make sure the rcu sheaf is
>> flushed by the time kvfree_rcu_barrier() returns.
> 
> Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> multiple modules in parallel? That would make things worse, although I'd
> expect there's a lot of serialization in this area already.

AFAIK, the kmod stress test does not unload modules in parallel. Module unload
happens one at a time before each test iteration. However, tests 0008 and 0009
run a total of 300 sequential module unloads.

ALL_TESTS="$ALL_TESTS 0008:150:1"
ALL_TESTS="$ALL_TESTS 0009:150:1"

> 
> Unfortunately it will get worse with sheaves extended to all caches. We
> could probably mark caches once they allocate their first rcu_free sheaf
> (should not add visible overhead) and keep skipping those that never did.
>> Just being curious, do you have any serious workload that depends on
>> the performance of module unload?

Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
Happy to test this again if you have a patch or something in mind.
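
For illustration only, the "tracking" half of that combination could be modelled in userspace roughly as below. This is a sketch under stated assumptions, not kernel code: struct model_cache, the rcu_sheaf_used flag and both helper functions are hypothetical names, and the model only counts how many per-CPU flush work items a full barrier would queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, not kernel code. */
struct model_cache {
	bool has_sheaves;     /* models s->cpu_sheaves being non-NULL */
	bool rcu_sheaf_used;  /* assumed flag: set on first rcu_free sheaf */
};

/* Models the first kfree_rcu() that obtains an rcu_free sheaf. */
static void model_mark_rcu_sheaf_used(struct model_cache *c)
{
	assert(c->has_sheaves);
	c->rcu_sheaf_used = true;
}

/*
 * Count the per-CPU work items a full barrier would queue.
 * With tracking, caches that never used an rcu_free sheaf are skipped.
 */
static size_t model_flush_all(const struct model_cache *caches, size_t n,
			      unsigned int online_cpus, bool tracking)
{
	size_t queued = 0;

	for (size_t i = 0; i < n; i++) {
		if (!caches[i].has_sheaves)
			continue;
		if (tracking && !caches[i].rcu_sheaf_used)
			continue;
		queued += online_cpus;
	}
	return queued;
}
```

In this model, two sheaf-enabled caches on 8 CPUs cost 16 work items without tracking, and none with tracking until one of them is marked as having used an rcu_free sheaf.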

In addition, AFAIK, module unloading is similar to eBPF programs. CCing the bpf
folks in case they have a workload.

But I don't have a particular workload in mind.


* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 14:00         ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Daniel Gomez
@ 2025-11-27 19:29           ` Suren Baghdasaryan
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  0 siblings, 1 reply; 7+ messages in thread
From: Suren Baghdasaryan @ 2025-11-27 19:29 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Vlastimil Babka, Harry Yoo, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules, bpf,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi

On Thu, Nov 27, 2025 at 6:01 AM Daniel Gomez <da.gomez@kernel.org> wrote:
>
>
>
> On 05/11/2025 12.25, Vlastimil Babka wrote:
> > On 11/3/25 04:17, Harry Yoo wrote:
> >> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> >>>
> >>>
> >>> On 10/09/2025 10.01, Vlastimil Babka wrote:
> >>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> >>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> >>>> addition to main and spare sheaves.
> >>>>
> >>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> >>>> the sheaf is detached and submitted to call_rcu() with a handler that
> >>>> will try to put it in the barn, or flush to slab pages using bulk free,
> >>>> when the barn is full. Then a new empty sheaf must be obtained to put
> >>>> more objects there.
> >>>>
> >>>> It's possible that no free sheaves are available to use for a new
> >>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> >>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> >>>> kfree_rcu() implementation.
> >>>>
> >>>> Expected advantages:
> >>>> - batching the kfree_rcu() operations, that could eventually replace the
> >>>>   existing batching
> >>>> - sheaves can be reused for allocations via barn instead of being
> >>>>   flushed to slabs, which is more efficient
> >>>>   - this includes cases where only some cpus are allowed to process rcu
> >>>>     callbacks (Android)
> >>>>
> >>>> Possible disadvantage:
> >>>> - objects might be waiting for more than their grace period (it is
> >>>>   determined by the last object freed into the sheaf), increasing memory
> >>>>   usage - but the existing batching does that too.
> >>>>
> >>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> >>>> implementation favors smaller memory footprint over performance.
> >>>>
> >>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> >>>> contexts where kfree_rcu() is called might not be compatible with taking
> >>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> >>>> spinlock - the current kfree_rcu() implementation avoids doing that.
> >>>>
> >>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> >>>> that have them. This is not a cheap operation, but the barrier usage is
> >>>> rare - currently kmem_cache_destroy() or on module unload.
> >>>>
> >>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> >>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> >>>> many had to fall back to the existing implementation.
> >>>>
> >>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>>
> >>> Hi Vlastimil,
> >>>
> >>> This patch increases kmod selftest (stress module loader) runtime by about
> >>> 50-60%, from ~200s to ~300s total execution time. My tested kernel has
> >>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> >>> causing this, or how to address it?
> >>
> >> This is likely due to increased kvfree_rcu_barrier() during module unload.
> >
> > Hm so there are actually two possible sources of this. One is that the
> > module creates some kmem_cache and calls kmem_cache_destroy() on it before
> > unloading. That does kvfree_rcu_barrier() which iterates all caches via
> > flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> > have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> > that single cache.
>
> Thanks for the feedback. And thanks to Jon who has revived this again.
>
> >
> > The other source is codetag_unload_module(), and I'm afraid it's this one as
> > it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?
>
> Yes, we do have that enabled.

Sorry I missed this discussion before.
IIUC, performance is impacted because kvfree_rcu_barrier() now has to call
flush_all_rcu_sheaves() and is therefore more costly than before.

>
> > Disabling it should help in this case, if you don't need memory allocation
> > profiling for that stress test. I think there's some space for improvement -
> > when compiled in but memalloc profiling never enabled during the uptime,
> > this could probably be skipped? Suren?

I think yes, we should be able to skip kvfree_rcu_barrier() inside
codetag_unload_module() if profiling was not enabled.
kvfree_rcu_barrier() is there to ensure all potential kfree_rcu()'s
for module allocations are finished before destroying the tags. I'll
need to add an additional "sticky" flag to record that profiling was
used so that we detect a case when it was enabled, then disabled
before module unloading. I can work on it next week.
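
A minimal userspace sketch of the "sticky" flag idea, for illustration only. The struct and function names here are hypothetical and do not correspond to anything in the codetag or memory allocation profiling code; the point is just that the flag is set when profiling is turned on and never cleared, so an enable-then-disable sequence is still detected at unload time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code. */
struct model_profiling {
	bool enabled;       /* current on/off state */
	bool ever_enabled;  /* "sticky": set once, never cleared */
};

/* Toggle profiling; turning it on also latches the sticky flag. */
static void model_set_profiling(struct model_profiling *p, bool on)
{
	assert(p);
	p->enabled = on;
	if (on)
		p->ever_enabled = true;
}

/* True if module unload must still run the expensive barrier. */
static bool model_unload_needs_barrier(const struct model_profiling *p)
{
	assert(p);
	return p->ever_enabled;
}
```

Checking only the current state would wrongly skip the barrier after profiling was enabled and then disabled; checking the sticky flag does not.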

> >
> >> It currently iterates over every (CPU, slab cache) pair, for caches that have
> >> sheaves enabled (there should be only a few now), to make sure the rcu sheaf is
> >> flushed by the time kvfree_rcu_barrier() returns.
> >
> > Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> > multiple modules in parallel? That would make things worse, although I'd
> > expect there's a lot of serialization in this area already.
>
> AFAIK, the kmod stress test does not unload modules in parallel. Module unload
> happens one at a time before each test iteration. However, tests 0008 and 0009
> run a total of 300 sequential module unloads.
>
> ALL_TESTS="$ALL_TESTS 0008:150:1"
> ALL_TESTS="$ALL_TESTS 0009:150:1"
>
> >
> > Unfortunately it will get worse with sheaves extended to all caches. We
> > could probably mark caches once they allocate their first rcu_free sheaf
> > (should not add visible overhead) and keep skipping those that never did.
> >> Just being curious, do you have any serious workload that depends on
> >> the performance of module unload?
>
> Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
> Happy to test this again if you have a patch or something in mind.
>
> In addition, AFAIK, module unloading is similar to eBPF programs. CCing the bpf
> folks in case they have a workload.
>
> But I don't have a particular workload in mind.


* [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-27 19:29           ` Suren Baghdasaryan
@ 2025-11-28 11:37             ` Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
                                 ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Harry Yoo @ 2025-11-28 11:37 UTC (permalink / raw)
  To: surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, harry.yoo, linux-kernel,
	linux-mm, linux-modules, lucas.demarchi, maple-tree, mcgrof,
	petr.pavlu, rcu, rientjes, roman.gushchin, samitolvanen,
	sidhartha.kumar, urezki, vbabka, jonathanh

Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary when destroying
a slab cache; only the RCU sheaves belonging to the cache being destroyed
need to be flushed.

As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache and call it
on cache destruction.

The performance benefit was evaluated on a 12-core, 24-thread AMD Ryzen
5900X machine (1 socket) by loading the slub_kunit module.

Before:
  Total calls: 19
  Average latency (us): 8529
  Total time (us): 162069

After:
  Total calls: 19
  Average latency (us): 3804
  Total time (us): 72287
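
As a sanity check on the figures above (which a follow-up in this thread notes were taken with only two cores enabled), the reported averages are the totals divided by the 19 calls, truncated to whole microseconds, and the total time drops by roughly 55%. A small sketch of that arithmetic; the macro and function names are made up for this example:

```c
#include <assert.h>

/* Figures copied from the Before/After numbers above (microseconds). */
#define CALLS		19
#define BEFORE_TOTAL_US	162069
#define AFTER_TOTAL_US	72287

/* Integer average, truncating like the reported values. */
static long avg_latency_us(long total_us, long calls)
{
	assert(calls > 0);
	return total_us / calls;
}

/* Percentage of total time saved, rounded down to a whole percent. */
static long time_saved_percent(long before_us, long after_us)
{
	assert(before_us > 0 && after_us <= before_us);
	return (before_us - after_us) * 100 / before_us;
}
```

avg_latency_us() reproduces 8529 and 3804 for the two totals, and time_saved_percent() gives 55 for this pair.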

Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---

Not sure if the regression is worse on the reporters' machines due to
higher core count (or because some cores were busy doing other things,
dunno).

Hopefully this will reduce the time to complete tests,
and Suren could add his patch on top of this ;)

 include/linux/slab.h |  5 ++++
 mm/slab.h            |  1 +
 mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
 mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
 4 files changed, 73 insertions(+), 40 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index cf443f064a66..937c93d44e8c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1149,6 +1149,10 @@ static inline void kvfree_rcu_barrier(void)
 {
 	rcu_barrier();
 }
+static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+	rcu_barrier();
+}
 
 static inline void kfree_rcu_scheduler_running(void) { }
 #else
@@ -1156,6 +1160,7 @@ void kvfree_rcu_barrier(void);
 
 void kfree_rcu_scheduler_running(void);
 #endif
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
 
 /**
  * kmalloc_size_roundup - Report allocation bucket size for the given size
diff --git a/mm/slab.h b/mm/slab.h
index f730e012553c..e767aa7e91b0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -422,6 +422,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 
 bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
 void flush_all_rcu_sheaves(void);
+void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
 
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 84dfff4f7b1f..dd8a49d6f9cc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -492,7 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 		return;
 
 	/* in-flight kfree_rcu()'s may include objects from our cache */
-	kvfree_rcu_barrier();
+	kvfree_rcu_barrier_on_cache(s);
 
 	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
 	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
@@ -2038,25 +2038,13 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 }
 EXPORT_SYMBOL_GPL(kvfree_call_rcu);
 
-/**
- * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
- *
- * Note that a single argument of kvfree_rcu() call has a slow path that
- * triggers synchronize_rcu() following by freeing a pointer. It is done
- * before the return from the function. Therefore for any single-argument
- * call that will result in a kfree() to a cache that is to be destroyed
- * during module exit, it is developer's responsibility to ensure that all
- * such calls have returned before the call to kmem_cache_destroy().
- */
-void kvfree_rcu_barrier(void)
+static inline void __kvfree_rcu_barrier(void)
 {
 	struct kfree_rcu_cpu_work *krwp;
 	struct kfree_rcu_cpu *krcp;
 	bool queued;
 	int i, cpu;
 
-	flush_all_rcu_sheaves();
-
 	/*
 	 * Firstly we detach objects and queue them over an RCU-batch
 	 * for all CPUs. Finally queued works are flushed for each CPU.
@@ -2118,8 +2106,43 @@ void kvfree_rcu_barrier(void)
 		}
 	}
 }
+
+/**
+ * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
+ *
+ * Note that a single argument of kvfree_rcu() call has a slow path that
+ * triggers synchronize_rcu() following by freeing a pointer. It is done
+ * before the return from the function. Therefore for any single-argument
+ * call that will result in a kfree() to a cache that is to be destroyed
+ * during module exit, it is developer's responsibility to ensure that all
+ * such calls have returned before the call to kmem_cache_destroy().
+ */
+void kvfree_rcu_barrier(void)
+{
+	flush_all_rcu_sheaves();
+	__kvfree_rcu_barrier();
+}
 EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
 
+/**
+ * kvfree_rcu_barrier_on_cache - Wait for in-flight kvfree_rcu() calls on a
+ *                               specific slab cache.
+ * @s: slab cache to wait for
+ *
+ * See the description of kvfree_rcu_barrier() for details.
+ */
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+	if (s->cpu_sheaves)
+		flush_rcu_sheaves_on_cache(s);
+	/*
+	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
+	 * on a specific slab cache.
+	 */
+	__kvfree_rcu_barrier();
+}
+EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
+
 static unsigned long
 kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
@@ -2215,4 +2238,3 @@ void __init kvfree_rcu_init(void)
 }
 
 #endif /* CONFIG_KVFREE_RCU_BATCHED */
-
diff --git a/mm/slub.c b/mm/slub.c
index 785e25a14999..7cec2220712b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4118,42 +4118,47 @@ static void flush_rcu_sheaf(struct work_struct *w)
 
 
 /* needed for kvfree_rcu_barrier() */
-void flush_all_rcu_sheaves(void)
+void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
-	struct kmem_cache *s;
 	unsigned int cpu;
 
-	cpus_read_lock();
-	mutex_lock(&slab_mutex);
+	mutex_lock(&flush_lock);
 
-	list_for_each_entry(s, &slab_caches, list) {
-		if (!s->cpu_sheaves)
-			continue;
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
 
-		mutex_lock(&flush_lock);
+		/*
+		 * we don't check if rcu_free sheaf exists - racing
+		 * __kfree_rcu_sheaf() might have just removed it.
+		 * by executing flush_rcu_sheaf() on the cpu we make
+		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
+		 */
 
-		for_each_online_cpu(cpu) {
-			sfw = &per_cpu(slub_flush, cpu);
+		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		sfw->s = s;
+		queue_work_on(cpu, flushwq, &sfw->work);
+	}
 
-			/*
-			 * we don't check if rcu_free sheaf exists - racing
-			 * __kfree_rcu_sheaf() might have just removed it.
-			 * by executing flush_rcu_sheaf() on the cpu we make
-			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
-			 */
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		flush_work(&sfw->work);
+	}
 
-			INIT_WORK(&sfw->work, flush_rcu_sheaf);
-			sfw->s = s;
-			queue_work_on(cpu, flushwq, &sfw->work);
-		}
+	mutex_unlock(&flush_lock);
+}
 
-		for_each_online_cpu(cpu) {
-			sfw = &per_cpu(slub_flush, cpu);
-			flush_work(&sfw->work);
-		}
+void flush_all_rcu_sheaves(void)
+{
+	struct kmem_cache *s;
+
+	cpus_read_lock();
+	mutex_lock(&slab_mutex);
 
-		mutex_unlock(&flush_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		if (!s->cpu_sheaves)
+			continue;
+		flush_rcu_sheaves_on_cache(s);
 	}
 
 	mutex_unlock(&slab_mutex);
-- 
2.43.0



* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
@ 2025-11-28 12:22               ` Harry Yoo
  2025-11-28 12:38               ` Daniel Gomez
  2025-12-02  9:29               ` Jon Hunter
  2 siblings, 0 replies; 7+ messages in thread
From: Harry Yoo @ 2025-11-28 12:22 UTC (permalink / raw)
  To: surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, jonathanh

On Fri, Nov 28, 2025 at 08:37:40PM +0900, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit was evaluated on a 12-core, 24-thread AMD Ryzen
> 5900X machine (1 socket) by loading the slub_kunit module.
> 
> Before:
>   Total calls: 19
>   Average latency (us): 8529
>   Total time (us): 162069
> 
> After:
>   Total calls: 19
>   Average latency (us): 3804
>   Total time (us): 72287

Ooh, I just realized that I messed up the config and
have only two cores enabled. Will update the numbers after enabling 22 more :)

> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> 
> Not sure if the regression is worse on the reporters' machines due to
> higher core count (or because some cores were busy doing other things,
> dunno).
> 
> Hopefully this will reduce the time to complete tests,
> and Suren could add his patch on top of this ;)
> 
>  include/linux/slab.h |  5 ++++
>  mm/slab.h            |  1 +
>  mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
>  mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
>  4 files changed, 73 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index cf443f064a66..937c93d44e8c 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -1149,6 +1149,10 @@ static inline void kvfree_rcu_barrier(void)
>  {
>  	rcu_barrier();
>  }
> +static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> +{
> +	rcu_barrier();
> +}
>  
>  static inline void kfree_rcu_scheduler_running(void) { }
>  #else
> @@ -1156,6 +1160,7 @@ void kvfree_rcu_barrier(void);
>  
>  void kfree_rcu_scheduler_running(void);
>  #endif
> +void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
>  
>  /**
>   * kmalloc_size_roundup - Report allocation bucket size for the given size
> diff --git a/mm/slab.h b/mm/slab.h
> index f730e012553c..e767aa7e91b0 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -422,6 +422,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  
>  bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
>  void flush_all_rcu_sheaves(void);
> +void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
>  
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 84dfff4f7b1f..dd8a49d6f9cc 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -492,7 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  		return;
>  
>  	/* in-flight kfree_rcu()'s may include objects from our cache */
> -	kvfree_rcu_barrier();
> +	kvfree_rcu_barrier_on_cache(s);
>  
>  	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
>  	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
> @@ -2038,25 +2038,13 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
>  
> -/**
> - * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
> - *
> - * Note that a single argument of kvfree_rcu() call has a slow path that
> - * triggers synchronize_rcu() following by freeing a pointer. It is done
> - * before the return from the function. Therefore for any single-argument
> - * call that will result in a kfree() to a cache that is to be destroyed
> - * during module exit, it is developer's responsibility to ensure that all
> - * such calls have returned before the call to kmem_cache_destroy().
> - */
> -void kvfree_rcu_barrier(void)
> +static inline void __kvfree_rcu_barrier(void)
>  {
>  	struct kfree_rcu_cpu_work *krwp;
>  	struct kfree_rcu_cpu *krcp;
>  	bool queued;
>  	int i, cpu;
>  
> -	flush_all_rcu_sheaves();
> -
>  	/*
>  	 * Firstly we detach objects and queue them over an RCU-batch
>  	 * for all CPUs. Finally queued works are flushed for each CPU.
> @@ -2118,8 +2106,43 @@ void kvfree_rcu_barrier(void)
>  		}
>  	}
>  }
> +
> +/**
> + * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
> + *
> + * Note that a single argument of kvfree_rcu() call has a slow path that
> + * triggers synchronize_rcu() following by freeing a pointer. It is done
> + * before the return from the function. Therefore for any single-argument
> + * call that will result in a kfree() to a cache that is to be destroyed
> + * during module exit, it is developer's responsibility to ensure that all
> + * such calls have returned before the call to kmem_cache_destroy().
> + */
> +void kvfree_rcu_barrier(void)
> +{
> +	flush_all_rcu_sheaves();
> +	__kvfree_rcu_barrier();
> +}
>  EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
>  
> +/**
> + * kvfree_rcu_barrier_on_cache - Wait for in-flight kvfree_rcu() calls on a
> + *                               specific slab cache.
> + * @s: slab cache to wait for
> + *
> + * See the description of kvfree_rcu_barrier() for details.
> + */
> +void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> +{
> +	if (s->cpu_sheaves)
> +		flush_rcu_sheaves_on_cache(s);
> +	/*
> +	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
> +	 * on a specific slab cache.
> +	 */
> +	__kvfree_rcu_barrier();
> +}
> +EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
> +
>  static unsigned long
>  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
> @@ -2215,4 +2238,3 @@ void __init kvfree_rcu_init(void)
>  }
>  
>  #endif /* CONFIG_KVFREE_RCU_BATCHED */
> -
> diff --git a/mm/slub.c b/mm/slub.c
> index 785e25a14999..7cec2220712b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4118,42 +4118,47 @@ static void flush_rcu_sheaf(struct work_struct *w)
>  
>  
>  /* needed for kvfree_rcu_barrier() */
> -void flush_all_rcu_sheaves(void)
> +void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
>  {
>  	struct slub_flush_work *sfw;
> -	struct kmem_cache *s;
>  	unsigned int cpu;
>  
> -	cpus_read_lock();
> -	mutex_lock(&slab_mutex);
> +	mutex_lock(&flush_lock);
>  
> -	list_for_each_entry(s, &slab_caches, list) {
> -		if (!s->cpu_sheaves)
> -			continue;
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
>  
> -		mutex_lock(&flush_lock);
> +		/*
> +		 * we don't check if rcu_free sheaf exists - racing
> +		 * __kfree_rcu_sheaf() might have just removed it.
> +		 * by executing flush_rcu_sheaf() on the cpu we make
> +		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
> +		 */
>  
> -		for_each_online_cpu(cpu) {
> -			sfw = &per_cpu(slub_flush, cpu);
> +		INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +		sfw->s = s;
> +		queue_work_on(cpu, flushwq, &sfw->work);
> +	}
>  
> -			/*
> -			 * we don't check if rcu_free sheaf exists - racing
> -			 * __kfree_rcu_sheaf() might have just removed it.
> -			 * by executing flush_rcu_sheaf() on the cpu we make
> -			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
> -			 */
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
> +		flush_work(&sfw->work);
> +	}
>  
> -			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> -			sfw->s = s;
> -			queue_work_on(cpu, flushwq, &sfw->work);
> -		}
> +	mutex_unlock(&flush_lock);
> +}
>  
> -		for_each_online_cpu(cpu) {
> -			sfw = &per_cpu(slub_flush, cpu);
> -			flush_work(&sfw->work);
> -		}
> +void flush_all_rcu_sheaves(void)
> +{
> +	struct kmem_cache *s;
> +
> +	cpus_read_lock();
> +	mutex_lock(&slab_mutex);
>  
> -		mutex_unlock(&flush_lock);
> +	list_for_each_entry(s, &slab_caches, list) {
> +		if (!s->cpu_sheaves)
> +			continue;
> +		flush_rcu_sheaves_on_cache(s);
>  	}
>  
>  	mutex_unlock(&slab_mutex);
> -- 
> 2.43.0
> 

-- 
Cheers,
Harry / Hyeonggon


* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
@ 2025-11-28 12:38               ` Daniel Gomez
  2025-12-02  9:29               ` Jon Hunter
  2 siblings, 0 replies; 7+ messages in thread
From: Daniel Gomez @ 2025-11-28 12:38 UTC (permalink / raw)
  To: Harry Yoo, surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, jonathanh



On 28/11/2025 12.37, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit was evaluated on a 12-core, 24-thread AMD Ryzen
> 5900X machine (1 socket) by loading the slub_kunit module.
> 
> Before:
>   Total calls: 19
>   Average latency (us): 8529
>   Total time (us): 162069
> 
> After:
>   Total calls: 19
>   Average latency (us): 3804
>   Total time (us): 72287
> 
> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---

Thanks Harry for the patch,

A quick test on a different machine from the one I originally used to report
this shows a decrease from 214s to 100s.
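
To put the two sets of figures side by side, here is a small arithmetic
sketch. The helper names are made up for illustration; the inputs are only
the numbers quoted in this thread. With them, the slub_kunit totals
(162069 us before, 72287 us after) give roughly a 2.24x speedup, and the
214s to 100s result gives 2.14x. The reported averages also match integer
division of total time by the 19 calls.

```c
/* Hypothetical helpers; inputs are the figures quoted in this thread. */

/* Ratio of the old duration to the new one (e.g. 214 / 100 = 2.14). */
double speedup(double before, double after)
{
	return before / after;
}

/* Percentage of the original time that was saved. */
double reduction_pct(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* Average latency as truncating integer division, total / calls. */
unsigned long avg_latency_us(unsigned long total_us, unsigned long calls)
{
	return total_us / calls;
}
```

For example, speedup(162069, 72287) and speedup(214, 100) can be compared
directly, since both are unitless ratios even though one measurement is in
microseconds and the other in seconds.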

LGTM,

Tested-by: Daniel Gomez <da.gomez@samsung.com>

> 
> Not sure if the regression is worse on the reporters' machines due to
> higher core count (or because some cores were busy doing other things,
> dunno).

FWIW, the modules CI runs on an 8-core VM. Depending on the host CPU, the
absolute numbers differed, but the relative performance degradation was
equivalent.

> 
> Hopefully this will reduce the time to complete tests,
> and Suren could add his patch on top of this ;)
> 
>  include/linux/slab.h |  5 ++++
>  mm/slab.h            |  1 +
>  mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
>  mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
>  4 files changed, 73 insertions(+), 40 deletions(-)


* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
  2025-11-28 12:38               ` Daniel Gomez
@ 2025-12-02  9:29               ` Jon Hunter
  2025-12-02 10:18                 ` Harry Yoo
  2 siblings, 1 reply; 7+ messages in thread
From: Jon Hunter @ 2025-12-02  9:29 UTC (permalink / raw)
  To: Harry Yoo, surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, linux-tegra@vger.kernel.org


On 28/11/2025 11:37, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>    Total calls: 19
>    Average latency (us): 8529
>    Total time (us): 162069
> 
> After:
>    Total calls: 19
>    Average latency (us): 3804
>    Total time (us): 72287
> 
> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---

Thanks for the rapid fix. I have been testing this and can confirm that 
this does fix the performance regression I was seeing.

BTW shouldn't we add a 'Fixes:' tag above? I would like to ensure that 
this gets picked up for v6.18 stable.

Otherwise ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon

-- 
nvpublic



* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-12-02  9:29               ` Jon Hunter
@ 2025-12-02 10:18                 ` Harry Yoo
  0 siblings, 0 replies; 7+ messages in thread
From: Harry Yoo @ 2025-12-02 10:18 UTC (permalink / raw)
  To: Jon Hunter
  Cc: surenb, Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel,
	linux-mm, linux-modules, lucas.demarchi, maple-tree, mcgrof,
	petr.pavlu, rcu, rientjes, roman.gushchin, samitolvanen,
	sidhartha.kumar, urezki, vbabka, linux-tegra@vger.kernel.org

On Tue, Dec 02, 2025 at 09:29:17AM +0000, Jon Hunter wrote:
> 
> On 28/11/2025 11:37, Harry Yoo wrote:
> > Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> > caches when a cache is destroyed. This is unnecessary when destroying
> > a slab cache; only the RCU sheaves belonging to the cache being destroyed
> > need to be flushed.
> > 
> > As suggested by Vlastimil Babka, introduce a weaker form of
> > kvfree_rcu_barrier() that operates on a specific slab cache and call it
> > on cache destruction.
> > 
> > The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> > 5900X machine (1 socket), by loading slub_kunit module.
> > 
> > Before:
> >    Total calls: 19
> >    Average latency (us): 8529
> >    Total time (us): 162069
> > 
> > After:
> >    Total calls: 19
> >    Average latency (us): 3804
> >    Total time (us): 72287
> > 
> > Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> > Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> > Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> 
> Thanks for the rapid fix. I have been testing this and can confirm that this
> does fix the performance regression I was seeing.

Great!

> BTW shouldn't we add a 'Fixes:' tag above? I would like to ensure that this
> gets picked up for v6.18 stable.

Good point. I added Cc: stable and Fixes: tags
(and Reported-and-tested-by: tags for you and Daniel).

> Otherwise ...
> 
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks a lot, Jon and Daniel, for reporting the regression and testing the fix!

-- 
Cheers,
Harry / Hyeonggon


Thread overview: 7+ messages
     [not found] <20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz>
     [not found] ` <20250910-slub-percpu-caches-v8-4-ca3099d8352c@suse.cz>
     [not found]   ` <0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org>
     [not found]     ` <aQge2rmgRvd1JKxc@harry>
     [not found]       ` <1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz>
2025-11-27 14:00         ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Daniel Gomez
2025-11-27 19:29           ` Suren Baghdasaryan
2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
2025-11-28 12:22               ` Harry Yoo
2025-11-28 12:38               ` Daniel Gomez
2025-12-02  9:29               ` Jon Hunter
2025-12-02 10:18                 ` Harry Yoo
