linux-modules.vger.kernel.org archive mirror
* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
       [not found] ` <20250910-slub-percpu-caches-v8-4-ca3099d8352c@suse.cz>
@ 2025-10-31 21:32   ` Daniel Gomez
  2025-11-03  3:17     ` Harry Yoo
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  0 siblings, 2 replies; 18+ messages in thread
From: Daniel Gomez @ 2025-10-31 21:32 UTC (permalink / raw)
  To: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi



On 10/09/2025 10.01, Vlastimil Babka wrote:
> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> addition to main and spare sheaves.
> 
> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> the sheaf is detached and submitted to call_rcu() with a handler that
> will try to put it in the barn, or flush to slab pages using bulk free,
> when the barn is full. Then a new empty sheaf must be obtained to put
> more objects there.
> 
> It's possible that no free sheaves are available to use for a new
> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> kfree_rcu() implementation.
> 
> Expected advantages:
> - batching the kfree_rcu() operations, that could eventually replace the
>   existing batching
> - sheaves can be reused for allocations via barn instead of being
>   flushed to slabs, which is more efficient
>   - this includes cases where only some cpus are allowed to process rcu
>     callbacks (Android)
> 
> Possible disadvantage:
> - objects might be waiting for more than their grace period (it is
>   determined by the last object freed into the sheaf), increasing memory
>   usage - but the existing batching does that too.
> 
> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> implementation favors smaller memory footprint over performance.
> 
> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> contexts where kfree_rcu() is called might not be compatible with taking
> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> spinlock - the current kfree_rcu() implementation avoids doing that.
> 
> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> that have them. This is not a cheap operation, but the barrier usage is
> rare - currently kmem_cache_destroy() or on module unload.
> 
> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> many had to fall back to the existing implementation.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Hi Vlastimil,

This patch increases kmod selftest (stress module loader) runtime by about
~50-60%, from ~200s to ~300s total execution time. My tested kernel has
CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
causing this, or how to address it?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-10-31 21:32   ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Daniel Gomez
@ 2025-11-03  3:17     ` Harry Yoo
  2025-11-05 11:25       ` Vlastimil Babka
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  1 sibling, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2025-11-03  3:17 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Vlastimil Babka, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Uladzislau Rezki, Sidhartha Kumar, linux-mm, linux-kernel, rcu,
	maple-tree, linux-modules, Luis Chamberlain, Petr Pavlu,
	Sami Tolvanen, Aaron Tomlin, Lucas De Marchi

On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> 
> 
> On 10/09/2025 10.01, Vlastimil Babka wrote:
> > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > addition to main and spare sheaves.
> > 
> > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > the sheaf is detached and submitted to call_rcu() with a handler that
> > will try to put it in the barn, or flush to slab pages using bulk free,
> > when the barn is full. Then a new empty sheaf must be obtained to put
> > more objects there.
> > 
> > It's possible that no free sheaves are available to use for a new
> > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > kfree_rcu() implementation.
> > 
> > Expected advantages:
> > - batching the kfree_rcu() operations, that could eventually replace the
> >   existing batching
> > - sheaves can be reused for allocations via barn instead of being
> >   flushed to slabs, which is more efficient
> >   - this includes cases where only some cpus are allowed to process rcu
> >     callbacks (Android)
> > 
> > Possible disadvantage:
> > - objects might be waiting for more than their grace period (it is
> >   determined by the last object freed into the sheaf), increasing memory
> >   usage - but the existing batching does that too.
> > 
> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > implementation favors smaller memory footprint over performance.
> > 
> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > contexts where kfree_rcu() is called might not be compatible with taking
> > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > spinlock - the current kfree_rcu() implementation avoids doing that.
> > 
> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > that have them. This is not a cheap operation, but the barrier usage is
> > rare - currently kmem_cache_destroy() or on module unload.
> > 
> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > many had to fall back to the existing implementation.
> > 
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Hi Vlastimil,
> 
> This patch increases kmod selftest (stress module loader) runtime by about
> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> causing this, or how to address it?

This is likely due to increased kvfree_rcu_barrier() during module unload.

It currently iterates over all CPU x slab cache pairs (only caches that have
sheaves enabled, and there should be only a few of those now) to make sure the
rcu sheaf is flushed by the time kvfree_rcu_barrier() returns.
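
For reference, the loop currently looks roughly like this (condensed from
flush_all_rcu_sheaves() in mm/slub.c; declarations and comments are trimmed
and the final unlocks are assumed):

        cpus_read_lock();
        mutex_lock(&slab_mutex);

        list_for_each_entry(s, &slab_caches, list) {
                if (!s->cpu_sheaves)
                        continue;

                mutex_lock(&flush_lock);
                /* queue a flush_rcu_sheaf work item on every online cpu */
                for_each_online_cpu(cpu) {
                        sfw = &per_cpu(slub_flush, cpu);
                        INIT_WORK(&sfw->work, flush_rcu_sheaf);
                        sfw->s = s;
                        queue_work_on(cpu, flushwq, &sfw->work);
                }
                /* ... and wait for each of them to finish */
                for_each_online_cpu(cpu) {
                        sfw = &per_cpu(slub_flush, cpu);
                        flush_work(&sfw->work);
                }
                mutex_unlock(&flush_lock);
        }

        mutex_unlock(&slab_mutex);
        cpus_read_unlock();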

Just being curious, do you have any serious workload that depends on
the performance of module unload?

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-03  3:17     ` Harry Yoo
@ 2025-11-05 11:25       ` Vlastimil Babka
  2025-11-27 14:00         ` Daniel Gomez
  0 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-11-05 11:25 UTC (permalink / raw)
  To: Harry Yoo, Daniel Gomez, Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi

On 11/3/25 04:17, Harry Yoo wrote:
> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
>> 
>> 
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> > addition to main and spare sheaves.
>> > 
>> > kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> > the sheaf is detached and submitted to call_rcu() with a handler that
>> > will try to put it in the barn, or flush to slab pages using bulk free,
>> > when the barn is full. Then a new empty sheaf must be obtained to put
>> > more objects there.
>> > 
>> > It's possible that no free sheaves are available to use for a new
>> > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> > kfree_rcu() implementation.
>> > 
>> > Expected advantages:
>> > - batching the kfree_rcu() operations, that could eventually replace the
>> >   existing batching
>> > - sheaves can be reused for allocations via barn instead of being
>> >   flushed to slabs, which is more efficient
>> >   - this includes cases where only some cpus are allowed to process rcu
>> >     callbacks (Android)
>> > 
>> > Possible disadvantage:
>> > - objects might be waiting for more than their grace period (it is
>> >   determined by the last object freed into the sheaf), increasing memory
>> >   usage - but the existing batching does that too.
>> > 
>> > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> > implementation favors smaller memory footprint over performance.
>> > 
>> > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> > contexts where kfree_rcu() is called might not be compatible with taking
>> > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> > spinlock - the current kfree_rcu() implementation avoids doing that.
>> > 
>> > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> > that have them. This is not a cheap operation, but the barrier usage is
>> > rare - currently kmem_cache_destroy() or on module unload.
>> > 
>> > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> > count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> > many had to fall back to the existing implementation.
>> > 
>> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> 
>> Hi Vlastimil,
>> 
>> This patch increases kmod selftest (stress module loader) runtime by about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>> causing this, or how to address it?
> 
> This is likely due to increased kvfree_rcu_barrier() during module unload.

Hm so there are actually two possible sources of this. One is that the
module creates some kmem_cache and calls kmem_cache_destroy() on it before
unloading. That does kvfree_rcu_barrier() which iterates all caches via
flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
that single cache.

The other source is codetag_unload_module(), and I'm afraid it's this one as
it's hooked to every module unload. Do you have CONFIG_CODE_TAGGING enabled?
Disabling it should help in this case, if you don't need memory allocation
profiling for that stress test. I think there's some space for improvement -
when compiled in but memalloc profiling never enabled during the uptime,
this could probably be skipped? Suren?
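
To illustrate, the two call paths are roughly (simplified):

        kmem_cache_destroy(s)
            -> kvfree_rcu_barrier()
                -> flush_all_rcu_sheaves()      /* all sheaf-enabled caches x all cpus */
                -> drain the kfree_rcu() batching machinery

        module unload (with CONFIG_CODE_TAGGING built in)
            -> codetag_unload_module()
                -> kvfree_rcu_barrier()         /* same as above */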

> It currently iterates over all CPUs x slab caches (that enabled sheaves,
> there should be only a few now) pair to make sure rcu sheaf is flushed
> by the time kvfree_rcu_barrier() returns.

Yeah, also it's done under slab_mutex. Is the stress test trying to unload
multiple modules in parallel? That would make things worse, although I'd
expect there's a lot of serialization in this area already.

Unfortunately it will get worse with sheaves extended to all caches. We
could probably mark caches once they allocate their first rcu_free sheaf
(should not add visible overhead) and keep skipping those that never did.
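
Roughly something like this (just a sketch; the flag name is made up and it
would need a proper home in struct kmem_cache):

        /* in __kfree_rcu_sheaf(), when a cache installs its first rcu_free sheaf */
        if (!READ_ONCE(s->rcu_sheaf_used))
                WRITE_ONCE(s->rcu_sheaf_used, true);

        /* in flush_all_rcu_sheaves() */
        list_for_each_entry(s, &slab_caches, list) {
                if (!s->cpu_sheaves || !READ_ONCE(s->rcu_sheaf_used))
                        continue;       /* never had an rcu_free sheaf, nothing to flush */
                /* queue and wait for the per-cpu flush works as before */
        }
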
> Just being curious, do you have any serious workload that depends on
> the performance of module unload?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-10-31 21:32   ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Daniel Gomez
  2025-11-03  3:17     ` Harry Yoo
@ 2025-11-27 11:38     ` Jon Hunter
  2025-11-27 11:50       ` Jon Hunter
                         ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Jon Hunter @ 2025-11-27 11:38 UTC (permalink / raw)
  To: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra@vger.kernel.org



On 31/10/2025 21:32, Daniel Gomez wrote:
> 
> 
> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>> addition to main and spare sheaves.
>>
>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>> the sheaf is detached and submitted to call_rcu() with a handler that
>> will try to put it in the barn, or flush to slab pages using bulk free,
>> when the barn is full. Then a new empty sheaf must be obtained to put
>> more objects there.
>>
>> It's possible that no free sheaves are available to use for a new
>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>> kfree_rcu() implementation.
>>
>> Expected advantages:
>> - batching the kfree_rcu() operations, that could eventually replace the
>>    existing batching
>> - sheaves can be reused for allocations via barn instead of being
>>    flushed to slabs, which is more efficient
>>    - this includes cases where only some cpus are allowed to process rcu
>>      callbacks (Android)
>>
>> Possible disadvantage:
>> - objects might be waiting for more than their grace period (it is
>>    determined by the last object freed into the sheaf), increasing memory
>>    usage - but the existing batching does that too.
>>
>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>> implementation favors smaller memory footprint over performance.
>>
>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>> contexts where kfree_rcu() is called might not be compatible with taking
>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>
>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>> that have them. This is not a cheap operation, but the barrier usage is
>> rare - currently kmem_cache_destroy() or on module unload.
>>
>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>> many had to fall back to the existing implementation.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Hi Vlastimil,
> 
> This patch increases kmod selftest (stress module loader) runtime by about
> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> causing this, or how to address it?
> 

I have been looking into a regression for Linux v6.18-rc where time 
taken to run some internal graphics tests on our Tegra234 device has 
increased from around 35% causing the tests to timeout. Bisect is 
pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.

I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am 
not sure if there are any downsides to disabling this?

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
@ 2025-11-27 11:50       ` Jon Hunter
  2025-11-27 12:33       ` Harry Yoo
  2025-11-27 13:18       ` Vlastimil Babka
  2 siblings, 0 replies; 18+ messages in thread
From: Jon Hunter @ 2025-11-27 11:50 UTC (permalink / raw)
  To: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra@vger.kernel.org


On 27/11/2025 11:38, Jon Hunter wrote:
> 
> 
> On 31/10/2025 21:32, Daniel Gomez wrote:
>>
>>
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>>> addition to main and spare sheaves.
>>>
>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>>> the sheaf is detached and submitted to call_rcu() with a handler that
>>> will try to put it in the barn, or flush to slab pages using bulk free,
>>> when the barn is full. Then a new empty sheaf must be obtained to put
>>> more objects there.
>>>
>>> It's possible that no free sheaves are available to use for a new
>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>>> kfree_rcu() implementation.
>>>
>>> Expected advantages:
>>> - batching the kfree_rcu() operations, that could eventually replace the
>>>    existing batching
>>> - sheaves can be reused for allocations via barn instead of being
>>>    flushed to slabs, which is more efficient
>>>    - this includes cases where only some cpus are allowed to process rcu
>>>      callbacks (Android)
>>>
>>> Possible disadvantage:
>>> - objects might be waiting for more than their grace period (it is
>>>    determined by the last object freed into the sheaf), increasing 
>>> memory
>>>    usage - but the existing batching does that too.
>>>
>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>>> implementation favors smaller memory footprint over performance.
>>>
>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>>> contexts where kfree_rcu() is called might not be compatible with taking
>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>>
>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>>> that have them. This is not a cheap operation, but the barrier usage is
>>> rare - currently kmem_cache_destroy() or on module unload.
>>>
>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>>> many had to fall back to the existing implementation.
>>>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> Hi Vlastimil,
>>
>> This patch increases kmod selftest (stress module loader) runtime by 
>> about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what 
>> might be
>> causing this, or how to address it?
>>
> 
> I have been looking into a regression for Linux v6.18-rc where time 
> taken to run some internal graphics tests on our Tegra234 device has 
> increased from around 35% causing the tests to timeout. Bisect is 

I meant 'increased by around 35%'.

> pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> 
> I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am 
> not sure if there are any downsides to disabling this?
> 
> Thanks
> Jon
> 

-- 
nvpublic


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  2025-11-27 11:50       ` Jon Hunter
@ 2025-11-27 12:33       ` Harry Yoo
  2025-11-27 12:48         ` Harry Yoo
  2025-11-27 13:18       ` Vlastimil Babka
  2 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2025-11-27 12:33 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra@vger.kernel.org

On Thu, Nov 27, 2025 at 11:38:49AM +0000, Jon Hunter wrote:
> 
> 
> On 31/10/2025 21:32, Daniel Gomez wrote:
> > 
> > 
> > On 10/09/2025 10.01, Vlastimil Babka wrote:
> > > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > > addition to main and spare sheaves.
> > > 
> > > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > > the sheaf is detached and submitted to call_rcu() with a handler that
> > > will try to put it in the barn, or flush to slab pages using bulk free,
> > > when the barn is full. Then a new empty sheaf must be obtained to put
> > > more objects there.
> > > 
> > > It's possible that no free sheaves are available to use for a new
> > > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > > kfree_rcu() implementation.
> > > 
> > > Expected advantages:
> > > - batching the kfree_rcu() operations, that could eventually replace the
> > >    existing batching
> > > - sheaves can be reused for allocations via barn instead of being
> > >    flushed to slabs, which is more efficient
> > >    - this includes cases where only some cpus are allowed to process rcu
> > >      callbacks (Android)
> > > 
> > > Possible disadvantage:
> > > - objects might be waiting for more than their grace period (it is
> > >    determined by the last object freed into the sheaf), increasing memory
> > >    usage - but the existing batching does that too.
> > > 
> > > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > > implementation favors smaller memory footprint over performance.
> > > 
> > > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > > contexts where kfree_rcu() is called might not be compatible with taking
> > > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > > spinlock - the current kfree_rcu() implementation avoids doing that.
> > > 
> > > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > > that have them. This is not a cheap operation, but the barrier usage is
> > > rare - currently kmem_cache_destroy() or on module unload.
> > > 
> > > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > > many had to fall back to the existing implementation.
> > > 
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > 
> > Hi Vlastimil,
> > 
> > This patch increases kmod selftest (stress module loader) runtime by about
> > ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> > CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> > causing this, or how to address it?
> > 
> 
> I have been looking into a regression for Linux v6.18-rc where time taken to
> run some internal graphics tests on our Tegra234 device has increased from
> around 35% causing the tests to timeout. Bisect is pointing to this commit
> and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.

Thanks for reporting! Uh, this has been put aside while I was busy working
on other stuff... but now that we have two people complaining about this,
I'll allocate some time to investigate and improve it.

It'll take some time though :)

> I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am not
> sure if there are any downsides to disabling this?

I would not recommend doing that, unless you want to sacrifice overall
performance just for the test. Disabling it could create too many RCU
grace periods in the system.

> 
> Thanks
> Jon
> 
> -- 
> nvpublic

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 12:33       ` Harry Yoo
@ 2025-11-27 12:48         ` Harry Yoo
  2025-11-28  8:57           ` Jon Hunter
  0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2025-11-27 12:48 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra@vger.kernel.org

On Thu, Nov 27, 2025 at 09:33:46PM +0900, Harry Yoo wrote:
> On Thu, Nov 27, 2025 at 11:38:49AM +0000, Jon Hunter wrote:
> > 
> > 
> > On 31/10/2025 21:32, Daniel Gomez wrote:
> > > 
> > > 
> > > On 10/09/2025 10.01, Vlastimil Babka wrote:
> > > > Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> > > > For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> > > > addition to main and spare sheaves.
> > > > 
> > > > kfree_rcu() operations will try to put objects on this sheaf. Once full,
> > > > the sheaf is detached and submitted to call_rcu() with a handler that
> > > > will try to put it in the barn, or flush to slab pages using bulk free,
> > > > when the barn is full. Then a new empty sheaf must be obtained to put
> > > > more objects there.
> > > > 
> > > > It's possible that no free sheaves are available to use for a new
> > > > rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> > > > GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> > > > kfree_rcu() implementation.
> > > > 
> > > > Expected advantages:
> > > > - batching the kfree_rcu() operations, that could eventually replace the
> > > >    existing batching
> > > > - sheaves can be reused for allocations via barn instead of being
> > > >    flushed to slabs, which is more efficient
> > > >    - this includes cases where only some cpus are allowed to process rcu
> > > >      callbacks (Android)
> > > > 
> > > > Possible disadvantage:
> > > > - objects might be waiting for more than their grace period (it is
> > > >    determined by the last object freed into the sheaf), increasing memory
> > > >    usage - but the existing batching does that too.
> > > > 
> > > > Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> > > > implementation favors smaller memory footprint over performance.
> > > > 
> > > > Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> > > > contexts where kfree_rcu() is called might not be compatible with taking
> > > > a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> > > > spinlock - the current kfree_rcu() implementation avoids doing that.
> > > > 
> > > > Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> > > > that have them. This is not a cheap operation, but the barrier usage is
> > > > rare - currently kmem_cache_destroy() or on module unload.
> > > > 
> > > > Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> > > > count how many kfree_rcu() used the rcu_free sheaf successfully and how
> > > > many had to fall back to the existing implementation.
> > > > 
> > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > 
> > > Hi Vlastimil,
> > > 
> > > This patch increases kmod selftest (stress module loader) runtime by about
> > > ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> > > CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> > > causing this, or how to address it?
> > > 
> > 
> > I have been looking into a regression for Linux v6.18-rc where time taken to
> > run some internal graphics tests on our Tegra234 device has increased from
> > around 35% causing the tests to timeout. Bisect is pointing to this commit
> > and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> 
> Thanks for reporting! Uh, this has been put aside while I was busy working
> on other stuff... but now that we have two people complaining about this,
> I'll allocate some time to investigate and improve it.
> 
> It'll take some time though :)

By the way, how many CPUs do you have on your system, and does your
kernel have CONFIG_CODE_TAGGING enabled?

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
  2025-11-27 11:50       ` Jon Hunter
  2025-11-27 12:33       ` Harry Yoo
@ 2025-11-27 13:18       ` Vlastimil Babka
  2025-11-28  8:59         ` Jon Hunter
  2 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-11-27 13:18 UTC (permalink / raw)
  To: Jon Hunter, Daniel Gomez, Suren Baghdasaryan, Liam R. Howlett,
	Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra@vger.kernel.org

On 11/27/25 12:38, Jon Hunter wrote:
> 
> 
> On 31/10/2025 21:32, Daniel Gomez wrote:
>> 
>> 
>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>> 
>> Hi Vlastimil,
>> 
>> This patch increases kmod selftest (stress module loader) runtime by about
>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>> causing this, or how to address it?
>> 
> 
> I have been looking into a regression for Linux v6.18-rc where time 
> taken to run some internal graphics tests on our Tegra234 device has 
> increased from around 35% causing the tests to timeout. Bisect is 
> pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.

Do the tegra tests involve (frequent) module unloads too, then? Or calling
kmem_cache_destroy() somewhere?

Thanks,
Vlastimil

> I have not tried disabling CONFIG_KVFREE_RCU_BATCHED=y but I can. I am 
> not sure if there are any downsides to disabling this?
> 
> Thanks
> Jon
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-05 11:25       ` Vlastimil Babka
@ 2025-11-27 14:00         ` Daniel Gomez
  2025-11-27 19:29           ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Gomez @ 2025-11-27 14:00 UTC (permalink / raw)
  To: Vlastimil Babka, Harry Yoo, Suren Baghdasaryan
  Cc: Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, bpf,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi



On 05/11/2025 12.25, Vlastimil Babka wrote:
> On 11/3/25 04:17, Harry Yoo wrote:
>> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
>>>
>>>
>>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
>>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
>>>> addition to main and spare sheaves.
>>>>
>>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
>>>> the sheaf is detached and submitted to call_rcu() with a handler that
>>>> will try to put it in the barn, or flush to slab pages using bulk free,
>>>> when the barn is full. Then a new empty sheaf must be obtained to put
>>>> more objects there.
>>>>
>>>> It's possible that no free sheaves are available to use for a new
>>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
>>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
>>>> kfree_rcu() implementation.
>>>>
>>>> Expected advantages:
>>>> - batching the kfree_rcu() operations, that could eventually replace the
>>>>   existing batching
>>>> - sheaves can be reused for allocations via barn instead of being
>>>>   flushed to slabs, which is more efficient
>>>>   - this includes cases where only some cpus are allowed to process rcu
>>>>     callbacks (Android)
>>>>
>>>> Possible disadvantage:
>>>> - objects might be waiting for more than their grace period (it is
>>>>   determined by the last object freed into the sheaf), increasing memory
>>>>   usage - but the existing batching does that too.
>>>>
>>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
>>>> implementation favors smaller memory footprint over performance.
>>>>
>>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
>>>> contexts where kfree_rcu() is called might not be compatible with taking
>>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
>>>> spinlock - the current kfree_rcu() implementation avoids doing that.
>>>>
>>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
>>>> that have them. This is not a cheap operation, but the barrier usage is
>>>> rare - currently kmem_cache_destroy() or on module unload.
>>>>
>>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
>>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
>>>> many had to fall back to the existing implementation.
>>>>
>>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>>
>>> Hi Vlastimil,
>>>
>>> This patch increases kmod selftest (stress module loader) runtime by about
>>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>>> causing this, or how to address it?
>>
>> This is likely due to increased kvfree_rcu_barrier() during module unload.
> 
> Hm so there are actually two possible sources of this. One is that the
> module creates some kmem_cache and calls kmem_cache_destroy() on it before
> unloading. That does kvfree_rcu_barrier() which iterates all caches via
> flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> that single cache.

Thanks for the feedback. And thanks to Jon who has revived this again.

> 
> The other source is codetag_unload_module(), and I'm afraid it's this one as
> it's hooked to evey module unload. Do you have CONFIG_CODE_TAGGING enabled?

Yes, we do have that enabled.

> Disabling it should help in this case, if you don't need memory allocation
> profiling for that stress test. I think there's some space for improvement -
> when compiled in but memalloc profiling never enabled during the uptime,
> this could probably be skipped? Suren?
> 
>> It currently iterates over all CPUs x slab caches (that enabled sheaves,
>> there should be only a few now) pair to make sure rcu sheaf is flushed
>> by the time kvfree_rcu_barrier() returns.
> 
> Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> multiple modules in parallel? That would make things worse, although I'd
> expect there's a lot serialization in this area already.

AFAIK, the kmod stress test does not unload modules in parallel. Module unload
happens one at a time before each test iteration. However, tests 0008 and 0009
run 300 sequential module unloads in total.

ALL_TESTS="$ALL_TESTS 0008:150:1"
ALL_TESTS="$ALL_TESTS 0009:150:1"

> 
> Unfortunately it will get worse with sheaves extended to all caches. We
> could probably mark caches once they allocate their first rcu_free sheaf
> (should not add visible overhead) and keep skipping those that never did.
>> Just being curious, do you have any serious workload that depends on
>> the performance of module unload?

Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
Happy to test this again if you have a patch or something in mind.

In addition, AFAIK module unloading is similar to unloading ebpf programs.
Ccing bpf folks in case they have a workload.

But I don't have a particular workload in mind.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 14:00         ` Daniel Gomez
@ 2025-11-27 19:29           ` Suren Baghdasaryan
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2025-11-27 19:29 UTC (permalink / raw)
  To: Daniel Gomez
  Cc: Vlastimil Babka, Harry Yoo, Liam R. Howlett, Christoph Lameter,
	David Rientjes, Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules, bpf,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi

On Thu, Nov 27, 2025 at 6:01 AM Daniel Gomez <da.gomez@kernel.org> wrote:
>
>
>
> On 05/11/2025 12.25, Vlastimil Babka wrote:
> > On 11/3/25 04:17, Harry Yoo wrote:
> >> On Fri, Oct 31, 2025 at 10:32:54PM +0100, Daniel Gomez wrote:
> >>>
> >>>
> >>> On 10/09/2025 10.01, Vlastimil Babka wrote:
> >>>> Extend the sheaf infrastructure for more efficient kfree_rcu() handling.
> >>>> For caches with sheaves, on each cpu maintain a rcu_free sheaf in
> >>>> addition to main and spare sheaves.
> >>>>
> >>>> kfree_rcu() operations will try to put objects on this sheaf. Once full,
> >>>> the sheaf is detached and submitted to call_rcu() with a handler that
> >>>> will try to put it in the barn, or flush to slab pages using bulk free,
> >>>> when the barn is full. Then a new empty sheaf must be obtained to put
> >>>> more objects there.
> >>>>
> >>>> It's possible that no free sheaves are available to use for a new
> >>>> rcu_free sheaf, and the allocation in kfree_rcu() context can only use
> >>>> GFP_NOWAIT and thus may fail. In that case, fall back to the existing
> >>>> kfree_rcu() implementation.
> >>>>
> >>>> Expected advantages:
> >>>> - batching the kfree_rcu() operations, that could eventually replace the
> >>>>   existing batching
> >>>> - sheaves can be reused for allocations via barn instead of being
> >>>>   flushed to slabs, which is more efficient
> >>>>   - this includes cases where only some cpus are allowed to process rcu
> >>>>     callbacks (Android)
> >>>>
> >>>> Possible disadvantage:
> >>>> - objects might be waiting for more than their grace period (it is
> >>>>   determined by the last object freed into the sheaf), increasing memory
> >>>>   usage - but the existing batching does that too.
> >>>>
> >>>> Only implement this for CONFIG_KVFREE_RCU_BATCHED as the tiny
> >>>> implementation favors smaller memory footprint over performance.
> >>>>
> >>>> Also for now skip the usage of rcu sheaf for CONFIG_PREEMPT_RT as the
> >>>> contexts where kfree_rcu() is called might not be compatible with taking
> >>>> a barn spinlock or a GFP_NOWAIT allocation of a new sheaf taking a
> >>>> spinlock - the current kfree_rcu() implementation avoids doing that.
> >>>>
> >>>> Teach kvfree_rcu_barrier() to flush all rcu_free sheaves from all caches
> >>>> that have them. This is not a cheap operation, but the barrier usage is
> >>>> rare - currently kmem_cache_destroy() or on module unload.
> >>>>
> >>>> Add CONFIG_SLUB_STATS counters free_rcu_sheaf and free_rcu_sheaf_fail to
> >>>> count how many kfree_rcu() used the rcu_free sheaf successfully and how
> >>>> many had to fall back to the existing implementation.
> >>>>
> >>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>>
> >>> Hi Vlastimil,
> >>>
> >>> This patch increases kmod selftest (stress module loader) runtime by about
> >>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
> >>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
> >>> causing this, or how to address it?
> >>
> >> This is likely due to increased kvfree_rcu_barrier() during module unload.
> >
> > Hm so there are actually two possible sources of this. One is that the
> > module creates some kmem_cache and calls kmem_cache_destroy() on it before
> > unloading. That does kvfree_rcu_barrier() which iterates all caches via
> > flush_all_rcu_sheaves(), but in this case it shouldn't need to - we could
> > have a weaker form of kvfree_rcu_barrier() that only guarantees flushing of
> > that single cache.
>
> Thanks for the feedback. And thanks to Jon who has revived this again.
>
> >
> > The other source is codetag_unload_module(), and I'm afraid it's this one as
> > it's hooked to evey module unload. Do you have CONFIG_CODE_TAGGING enabled?
>
> Yes, we do have that enabled.

Sorry I missed this discussion before.
IIUC, the performance is impacted because kvfree_rcu_barrier() has to call
flush_all_rcu_sheaves() and is therefore more costly than before.

>
> > Disabling it should help in this case, if you don't need memory allocation
> > profiling for that stress test. I think there's some space for improvement -
> > when compiled in but memalloc profiling never enabled during the uptime,
> > this could probably be skipped? Suren?

I think yes, we should be able to skip kvfree_rcu_barrier() inside
codetag_unload_module() if profiling was not enabled.
kvfree_rcu_barrier() is there to ensure all potential kfree_rcu()'s
for module allocations are finished before destroying the tags. I'll
need to add an additional "sticky" flag to record that profiling was
used so that we can detect the case where it was enabled, then disabled
before module unloading. I can work on it next week.
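
Something along these lines (just a sketch; the flag name is made up and where
it lives is to be decided):

        /* set once when profiling is first enabled, never cleared */
        static bool mem_profiling_was_enabled;

        void codetag_unload_module(struct module *mod)
        {
                /*
                 * If profiling was never enabled during this boot, no
                 * kfree_rcu() can still reference this module's tags,
                 * so the expensive barrier can be skipped.
                 */
                if (READ_ONCE(mem_profiling_was_enabled))
                        kvfree_rcu_barrier();
                /* release the module's tag sections as before */
        }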

> >
> >> It currently iterates over all CPUs x slab caches (that enabled sheaves,
> >> there should be only a few now) pair to make sure rcu sheaf is flushed
> >> by the time kvfree_rcu_barrier() returns.
> >
> > Yeah, also it's done under slab_mutex. Is the stress test trying to unload
> > multiple modules in parallel? That would make things worse, although I'd
> > expect there's a lot serialization in this area already.
>
> AFAIK, the kmod stress test does not unload modules in parallel. Module unload
> happens one at a time before each test iteration. However, test 0008 and 0009
> run 300 total sequential module unloads.
>
> ALL_TESTS="$ALL_TESTS 0008:150:1"
> ALL_TESTS="$ALL_TESTS 0009:150:1"
>
> >
> > Unfortunately it will get worse with sheaves extended to all caches. We
> > could probably mark caches once they allocate their first rcu_free sheaf
> > (should not add visible overhead) and keep skipping those that never did.
> >> Just being curious, do you have any serious workload that depends on
> >> the performance of module unload?
>
> Can we have a combination of a weaker form of kvfree_rcu_barrier() + tracking?
> Happy to test this again if you have a patch or something in mind.
>
> In addition and AFAIK, module unloading is similar to ebpf programs. Ccing bpf
> folks in case they have a workload.
>
> But I don't have a particular workload in mind.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 12:48         ` Harry Yoo
@ 2025-11-28  8:57           ` Jon Hunter
  2025-12-01  6:55             ` Harry Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Jon Hunter @ 2025-11-28  8:57 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra@vger.kernel.org


On 27/11/2025 12:48, Harry Yoo wrote:

...

>>> I have been looking into a regression for Linux v6.18-rc where time taken to
>>> run some internal graphics tests on our Tegra234 device has increased from
>>> around 35% causing the tests to timeout. Bisect is pointing to this commit
>>> and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
>>
>> Thanks for reporting! Uh, this has been put aside while I was busy working
>> on other stuff... but now that we have two people complaining about this,
>> I'll allocate some time to investigate and improve it.
>>
>> It'll take some time though :)
> 
> By the way, how many CPUs do you have on your system, and does your
> kernel have CONFIG_CODE_TAGGING enabled?

For this device there are 12 CPUs. I don't see CONFIG_CODE_TAGGING enabled.

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-27 13:18       ` Vlastimil Babka
@ 2025-11-28  8:59         ` Jon Hunter
  0 siblings, 0 replies; 18+ messages in thread
From: Jon Hunter @ 2025-11-28  8:59 UTC (permalink / raw)
  To: Vlastimil Babka, Daniel Gomez, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes
  Cc: Roman Gushchin, Harry Yoo, Uladzislau Rezki, Sidhartha Kumar,
	linux-mm, linux-kernel, rcu, maple-tree, linux-modules,
	Luis Chamberlain, Petr Pavlu, Sami Tolvanen, Aaron Tomlin,
	Lucas De Marchi, linux-tegra@vger.kernel.org


On 27/11/2025 13:18, Vlastimil Babka wrote:
> On 11/27/25 12:38, Jon Hunter wrote:
>>
>>
>> On 31/10/2025 21:32, Daniel Gomez wrote:
>>>
>>>
>>> On 10/09/2025 10.01, Vlastimil Babka wrote:
>>>
>>> Hi Vlastimil,
>>>
>>> This patch increases kmod selftest (stress module loader) runtime by about
>>> ~50-60%, from ~200s to ~300s total execution time. My tested kernel has
>>> CONFIG_KVFREE_RCU_BATCHED enabled. Any idea or suggestions on what might be
>>> causing this, or how to address it?
>>>
>>
>> I have been looking into a regression for Linux v6.18-rc where time
>> taken to run some internal graphics tests on our Tegra234 device has
>> increased from around 35% causing the tests to timeout. Bisect is
>> pointing to this commit and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> 
> Do the tegra tests involve (frequent) module unloads too, then? Or calling
> kmem_cache_destroy() somewhere?

In this specific case I am not running the tegra-tests but we have an 
internal testsuite of GPU-related tests. I don't believe this is unloading 
any modules. I can take a look next week to see if 
kmem_cache_destroy() is getting called somewhere when these tests run.

Thanks
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-27 19:29           ` Suren Baghdasaryan
@ 2025-11-28 11:37             ` Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
                                 ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Harry Yoo @ 2025-11-28 11:37 UTC (permalink / raw)
  To: surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, harry.yoo, linux-kernel,
	linux-mm, linux-modules, lucas.demarchi, maple-tree, mcgrof,
	petr.pavlu, rcu, rientjes, roman.gushchin, samitolvanen,
	sidhartha.kumar, urezki, vbabka, jonathanh

Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary when destroying
a slab cache; only the RCU sheaves belonging to the cache being destroyed
need to be flushed.

As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache and call it
on cache destruction.

The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
5900X machine (1 socket), by loading slub_kunit module.

Before:
  Total calls: 19
  Average latency (us): 8529
  Total time (us): 162069

After:
  Total calls: 19
  Average latency (us): 3804
  Total time (us): 72287

Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---

Not sure if the regression is worse on the reporters' machines due to
higher core count (or because some cores were busy doing other things,
dunno).

Hopefully this will reduce the time to complete tests,
and Suren could add his patch on top of this ;)

 include/linux/slab.h |  5 ++++
 mm/slab.h            |  1 +
 mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
 mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
 4 files changed, 73 insertions(+), 40 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index cf443f064a66..937c93d44e8c 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1149,6 +1149,10 @@ static inline void kvfree_rcu_barrier(void)
 {
 	rcu_barrier();
 }
+static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+	rcu_barrier();
+}
 
 static inline void kfree_rcu_scheduler_running(void) { }
 #else
@@ -1156,6 +1160,7 @@ void kvfree_rcu_barrier(void);
 
 void kfree_rcu_scheduler_running(void);
 #endif
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
 
 /**
  * kmalloc_size_roundup - Report allocation bucket size for the given size
diff --git a/mm/slab.h b/mm/slab.h
index f730e012553c..e767aa7e91b0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -422,6 +422,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 
 bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
 void flush_all_rcu_sheaves(void);
+void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
 
 #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
 			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 84dfff4f7b1f..dd8a49d6f9cc 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -492,7 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 		return;
 
 	/* in-flight kfree_rcu()'s may include objects from our cache */
-	kvfree_rcu_barrier();
+	kvfree_rcu_barrier_on_cache(s);
 
 	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
 	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
@@ -2038,25 +2038,13 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 }
 EXPORT_SYMBOL_GPL(kvfree_call_rcu);
 
-/**
- * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
- *
- * Note that a single argument of kvfree_rcu() call has a slow path that
- * triggers synchronize_rcu() following by freeing a pointer. It is done
- * before the return from the function. Therefore for any single-argument
- * call that will result in a kfree() to a cache that is to be destroyed
- * during module exit, it is developer's responsibility to ensure that all
- * such calls have returned before the call to kmem_cache_destroy().
- */
-void kvfree_rcu_barrier(void)
+static inline void __kvfree_rcu_barrier(void)
 {
 	struct kfree_rcu_cpu_work *krwp;
 	struct kfree_rcu_cpu *krcp;
 	bool queued;
 	int i, cpu;
 
-	flush_all_rcu_sheaves();
-
 	/*
 	 * Firstly we detach objects and queue them over an RCU-batch
 	 * for all CPUs. Finally queued works are flushed for each CPU.
@@ -2118,8 +2106,43 @@ void kvfree_rcu_barrier(void)
 		}
 	}
 }
+
+/**
+ * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
+ *
+ * Note that a single argument of kvfree_rcu() call has a slow path that
+ * triggers synchronize_rcu() following by freeing a pointer. It is done
+ * before the return from the function. Therefore for any single-argument
+ * call that will result in a kfree() to a cache that is to be destroyed
+ * during module exit, it is developer's responsibility to ensure that all
+ * such calls have returned before the call to kmem_cache_destroy().
+ */
+void kvfree_rcu_barrier(void)
+{
+	flush_all_rcu_sheaves();
+	__kvfree_rcu_barrier();
+}
 EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
 
+/**
+ * kvfree_rcu_barrier_on_cache - Wait for in-flight kvfree_rcu() calls on a
+ *                               specific slab cache.
+ * @s: slab cache to wait for
+ *
+ * See the description of kvfree_rcu_barrier() for details.
+ */
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+	if (s->cpu_sheaves)
+		flush_rcu_sheaves_on_cache(s);
+	/*
+	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
+	 * on a specific slab cache.
+	 */
+	__kvfree_rcu_barrier();
+}
+EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
+
 static unsigned long
 kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
@@ -2215,4 +2238,3 @@ void __init kvfree_rcu_init(void)
 }
 
 #endif /* CONFIG_KVFREE_RCU_BATCHED */
-
diff --git a/mm/slub.c b/mm/slub.c
index 785e25a14999..7cec2220712b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4118,42 +4118,47 @@ static void flush_rcu_sheaf(struct work_struct *w)
 
 
 /* needed for kvfree_rcu_barrier() */
-void flush_all_rcu_sheaves(void)
+void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
 {
 	struct slub_flush_work *sfw;
-	struct kmem_cache *s;
 	unsigned int cpu;
 
-	cpus_read_lock();
-	mutex_lock(&slab_mutex);
+	mutex_lock(&flush_lock);
 
-	list_for_each_entry(s, &slab_caches, list) {
-		if (!s->cpu_sheaves)
-			continue;
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
 
-		mutex_lock(&flush_lock);
+		/*
+		 * we don't check if rcu_free sheaf exists - racing
+		 * __kfree_rcu_sheaf() might have just removed it.
+		 * by executing flush_rcu_sheaf() on the cpu we make
+		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
+		 */
 
-		for_each_online_cpu(cpu) {
-			sfw = &per_cpu(slub_flush, cpu);
+		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		sfw->s = s;
+		queue_work_on(cpu, flushwq, &sfw->work);
+	}
 
-			/*
-			 * we don't check if rcu_free sheaf exists - racing
-			 * __kfree_rcu_sheaf() might have just removed it.
-			 * by executing flush_rcu_sheaf() on the cpu we make
-			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
-			 */
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		flush_work(&sfw->work);
+	}
 
-			INIT_WORK(&sfw->work, flush_rcu_sheaf);
-			sfw->s = s;
-			queue_work_on(cpu, flushwq, &sfw->work);
-		}
+	mutex_unlock(&flush_lock);
+}
 
-		for_each_online_cpu(cpu) {
-			sfw = &per_cpu(slub_flush, cpu);
-			flush_work(&sfw->work);
-		}
+void flush_all_rcu_sheaves(void)
+{
+	struct kmem_cache *s;
+
+	cpus_read_lock();
+	mutex_lock(&slab_mutex);
 
-		mutex_unlock(&flush_lock);
+	list_for_each_entry(s, &slab_caches, list) {
+		if (!s->cpu_sheaves)
+			continue;
+		flush_rcu_sheaves_on_cache(s);
 	}
 
 	mutex_unlock(&slab_mutex);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
@ 2025-11-28 12:22               ` Harry Yoo
  2025-11-28 12:38               ` Daniel Gomez
  2025-12-02  9:29               ` Jon Hunter
  2 siblings, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2025-11-28 12:22 UTC (permalink / raw)
  To: surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, jonathanh

On Fri, Nov 28, 2025 at 08:37:40PM +0900, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>   Total calls: 19
>   Average latency (us): 8529
>   Total time (us): 162069
> 
> After:
>   Total calls: 19
>   Average latency (us): 3804
>   Total time (us): 72287

Ooh, I just realized that I messed up the config and
have only two cores enabled. Will update the numbers after enabling 22 more :)
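
(As a rough sketch of one way such per-call latency could be gathered, a
hypothetical test module that times kmem_cache_destroy() directly is shown
below. It is for illustration only, with made-up names, and is not
necessarily the harness behind the numbers above.)

/* destroy_lat.c: hypothetical sketch; times one kmem_cache_destroy() with
 * kfree_rcu() frees still in flight, which is the path this patch shortens.
 */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

struct demo_obj {
	unsigned long payload;
	struct rcu_head rcu;
};

static int __init destroy_lat_init(void)
{
	struct kmem_cache *s;
	struct demo_obj *obj;
	ktime_t t0;
	int i;

	s = kmem_cache_create("destroy_lat_demo", sizeof(*obj), 0, 0, NULL);
	if (!s)
		return -ENOMEM;

	/* leave some objects pending in kfree_rcu() when the cache goes away */
	for (i = 0; i < 128; i++) {
		obj = kmem_cache_alloc(s, GFP_KERNEL);
		if (obj)
			kfree_rcu(obj, rcu);
	}

	t0 = ktime_get();
	kmem_cache_destroy(s);		/* waits for the in-flight frees */
	pr_info("kmem_cache_destroy() took %lld us\n",
		ktime_us_delta(ktime_get(), t0));

	return 0;
}

static void __exit destroy_lat_exit(void) { }

module_init(destroy_lat_init);
module_exit(destroy_lat_exit);
MODULE_LICENSE("GPL");

(Loading and unloading such a module a number of times and averaging the
printed values would give per-call figures comparable to the table above.)
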

> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> 
> Not sure if the regression is worse on the reporters' machines due to
> higher core count (or because some cores were busy doing other things,
> dunno).
> 
> Hopefully this will reduce the time to complete tests,
> and Suren could add his patch on top of this ;)
> 
>  include/linux/slab.h |  5 ++++
>  mm/slab.h            |  1 +
>  mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
>  mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
>  4 files changed, 73 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index cf443f064a66..937c93d44e8c 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -1149,6 +1149,10 @@ static inline void kvfree_rcu_barrier(void)
>  {
>  	rcu_barrier();
>  }
> +static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> +{
> +	rcu_barrier();
> +}
>  
>  static inline void kfree_rcu_scheduler_running(void) { }
>  #else
> @@ -1156,6 +1160,7 @@ void kvfree_rcu_barrier(void);
>  
>  void kfree_rcu_scheduler_running(void);
>  #endif
> +void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
>  
>  /**
>   * kmalloc_size_roundup - Report allocation bucket size for the given size
> diff --git a/mm/slab.h b/mm/slab.h
> index f730e012553c..e767aa7e91b0 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -422,6 +422,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
>  
>  bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
>  void flush_all_rcu_sheaves(void);
> +void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
>  
>  #define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
>  			 SLAB_CACHE_DMA32 | SLAB_PANIC | \
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 84dfff4f7b1f..dd8a49d6f9cc 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -492,7 +492,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  		return;
>  
>  	/* in-flight kfree_rcu()'s may include objects from our cache */
> -	kvfree_rcu_barrier();
> +	kvfree_rcu_barrier_on_cache(s);
>  
>  	if (IS_ENABLED(CONFIG_SLUB_RCU_DEBUG) &&
>  	    (s->flags & SLAB_TYPESAFE_BY_RCU)) {
> @@ -2038,25 +2038,13 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  }
>  EXPORT_SYMBOL_GPL(kvfree_call_rcu);
>  
> -/**
> - * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
> - *
> - * Note that a single argument of kvfree_rcu() call has a slow path that
> - * triggers synchronize_rcu() following by freeing a pointer. It is done
> - * before the return from the function. Therefore for any single-argument
> - * call that will result in a kfree() to a cache that is to be destroyed
> - * during module exit, it is developer's responsibility to ensure that all
> - * such calls have returned before the call to kmem_cache_destroy().
> - */
> -void kvfree_rcu_barrier(void)
> +static inline void __kvfree_rcu_barrier(void)
>  {
>  	struct kfree_rcu_cpu_work *krwp;
>  	struct kfree_rcu_cpu *krcp;
>  	bool queued;
>  	int i, cpu;
>  
> -	flush_all_rcu_sheaves();
> -
>  	/*
>  	 * Firstly we detach objects and queue them over an RCU-batch
>  	 * for all CPUs. Finally queued works are flushed for each CPU.
> @@ -2118,8 +2106,43 @@ void kvfree_rcu_barrier(void)
>  		}
>  	}
>  }
> +
> +/**
> + * kvfree_rcu_barrier - Wait until all in-flight kvfree_rcu() complete.
> + *
> + * Note that a single argument of kvfree_rcu() call has a slow path that
> + * triggers synchronize_rcu() following by freeing a pointer. It is done
> + * before the return from the function. Therefore for any single-argument
> + * call that will result in a kfree() to a cache that is to be destroyed
> + * during module exit, it is developer's responsibility to ensure that all
> + * such calls have returned before the call to kmem_cache_destroy().
> + */
> +void kvfree_rcu_barrier(void)
> +{
> +	flush_all_rcu_sheaves();
> +	__kvfree_rcu_barrier();
> +}
>  EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
>  
> +/**
> + * kvfree_rcu_barrier_on_cache - Wait for in-flight kvfree_rcu() calls on a
> + *                               specific slab cache.
> + * @s: slab cache to wait for
> + *
> + * See the description of kvfree_rcu_barrier() for details.
> + */
> +void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
> +{
> +	if (s->cpu_sheaves)
> +		flush_rcu_sheaves_on_cache(s);
> +	/*
> +	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
> +	 * on a specific slab cache.
> +	 */
> +	__kvfree_rcu_barrier();
> +}
> +EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
> +
>  static unsigned long
>  kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
> @@ -2215,4 +2238,3 @@ void __init kvfree_rcu_init(void)
>  }
>  
>  #endif /* CONFIG_KVFREE_RCU_BATCHED */
> -
> diff --git a/mm/slub.c b/mm/slub.c
> index 785e25a14999..7cec2220712b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4118,42 +4118,47 @@ static void flush_rcu_sheaf(struct work_struct *w)
>  
>  
>  /* needed for kvfree_rcu_barrier() */
> -void flush_all_rcu_sheaves(void)
> +void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
>  {
>  	struct slub_flush_work *sfw;
> -	struct kmem_cache *s;
>  	unsigned int cpu;
>  
> -	cpus_read_lock();
> -	mutex_lock(&slab_mutex);
> +	mutex_lock(&flush_lock);
>  
> -	list_for_each_entry(s, &slab_caches, list) {
> -		if (!s->cpu_sheaves)
> -			continue;
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
>  
> -		mutex_lock(&flush_lock);
> +		/*
> +		 * we don't check if rcu_free sheaf exists - racing
> +		 * __kfree_rcu_sheaf() might have just removed it.
> +		 * by executing flush_rcu_sheaf() on the cpu we make
> +		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
> +		 */
>  
> -		for_each_online_cpu(cpu) {
> -			sfw = &per_cpu(slub_flush, cpu);
> +		INIT_WORK(&sfw->work, flush_rcu_sheaf);
> +		sfw->s = s;
> +		queue_work_on(cpu, flushwq, &sfw->work);
> +	}
>  
> -			/*
> -			 * we don't check if rcu_free sheaf exists - racing
> -			 * __kfree_rcu_sheaf() might have just removed it.
> -			 * by executing flush_rcu_sheaf() on the cpu we make
> -			 * sure the __kfree_rcu_sheaf() finished its call_rcu()
> -			 */
> +	for_each_online_cpu(cpu) {
> +		sfw = &per_cpu(slub_flush, cpu);
> +		flush_work(&sfw->work);
> +	}
>  
> -			INIT_WORK(&sfw->work, flush_rcu_sheaf);
> -			sfw->s = s;
> -			queue_work_on(cpu, flushwq, &sfw->work);
> -		}
> +	mutex_unlock(&flush_lock);
> +}
>  
> -		for_each_online_cpu(cpu) {
> -			sfw = &per_cpu(slub_flush, cpu);
> -			flush_work(&sfw->work);
> -		}
> +void flush_all_rcu_sheaves(void)
> +{
> +	struct kmem_cache *s;
> +
> +	cpus_read_lock();
> +	mutex_lock(&slab_mutex);
>  
> -		mutex_unlock(&flush_lock);
> +	list_for_each_entry(s, &slab_caches, list) {
> +		if (!s->cpu_sheaves)
> +			continue;
> +		flush_rcu_sheaves_on_cache(s);
>  	}
>  
>  	mutex_unlock(&slab_mutex);
> -- 
> 2.43.0
> 

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
@ 2025-11-28 12:38               ` Daniel Gomez
  2025-12-02  9:29               ` Jon Hunter
  2 siblings, 0 replies; 18+ messages in thread
From: Daniel Gomez @ 2025-11-28 12:38 UTC (permalink / raw)
  To: Harry Yoo, surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, jonathanh



On 28/11/2025 12.37, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>   Total calls: 19
>   Average latency (us): 8529
>   Total time (us): 162069
> 
> After:
>   Total calls: 19
>   Average latency (us): 3804
>   Total time (us): 72287
> 
> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---

Thanks Harry for the patch,

A quick test on a different machine from the one I originally used to report
this shows a decrease from 214s to 100s.

LGTM,

Tested-by: Daniel Gomez <da.gomez@samsung.com>

> 
> Not sure if the regression is worse on the reporters' machines due to
> higher core count (or because some cores were busy doing other things,
> dunno).

FWIW, the CI modules tests run on an 8-core VM. Depending on the host CPU, the
absolute numbers differ, but the relative performance degradation is equivalent.

> 
> Hopefully this will reduce the time to complete tests,
> and Suren could add his patch on top of this ;)
> 
>  include/linux/slab.h |  5 ++++
>  mm/slab.h            |  1 +
>  mm/slab_common.c     | 52 +++++++++++++++++++++++++++++------------
>  mm/slub.c            | 55 ++++++++++++++++++++++++--------------------
>  4 files changed, 73 insertions(+), 40 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
  2025-11-28  8:57           ` Jon Hunter
@ 2025-12-01  6:55             ` Harry Yoo
  0 siblings, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2025-12-01  6:55 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Daniel Gomez, Vlastimil Babka, Suren Baghdasaryan,
	Liam R. Howlett, Christoph Lameter, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, Sidhartha Kumar, linux-mm,
	linux-kernel, rcu, maple-tree, linux-modules, Luis Chamberlain,
	Petr Pavlu, Sami Tolvanen, Aaron Tomlin, Lucas De Marchi,
	linux-tegra@vger.kernel.org

On Fri, Nov 28, 2025 at 08:57:28AM +0000, Jon Hunter wrote:
> 
> On 27/11/2025 12:48, Harry Yoo wrote:
> 
> ...
> 
> > > > I have been looking into a regression for Linux v6.18-rc where the time taken to
> > > > run some internal graphics tests on our Tegra234 device has increased by
> > > > around 35%, causing the tests to time out. Bisect is pointing to this commit,
> > > > and I also see we have CONFIG_KVFREE_RCU_BATCHED=y.
> > > 
> > > Thanks for reporting! Uh, this has been put aside while I was busy working
> > > on other stuff... but now that we have two people complaining about this,
> > > I'll allocate some time to investigate and improve it.
> > > 
> > > It'll take some time though :)
> > 
> > By the way, how many CPUs do you have on your system, and does your
> > kernel have CONFIG_CODE_TAGGING enabled?
> 
> For this device there are 12 CPUs. I don't see CONFIG_CODE_TAGGING enabled.

Thanks! Then it's probably due to kmem_cache_destroy().
Please let me know whether this patch improves your test execution time.

https://lore.kernel.org/linux-mm/20251128113740.90129-1-harry.yoo@oracle.com/
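
(To illustrate why a module-load/unload-heavy test is sensitive to this
path, here is a minimal, hypothetical module sketch, not taken from any
real driver: a module that frees cache objects with kfree_rcu() and
destroys its cache on unload ends up waiting in kmem_cache_destroy() for
the in-flight frees. Today that means a full kvfree_rcu_barrier() across
all caches with sheaves; with the patch above, only the destroyed cache's
sheaves are flushed.)

#include <linux/module.h>
#include <linux/slab.h>

struct demo_obj {
	int data;
	struct rcu_head rcu;
};

static struct kmem_cache *demo_cache;
static struct demo_obj *pending_obj;

static int __init demo_init(void)
{
	demo_cache = kmem_cache_create("demo_cache", sizeof(struct demo_obj),
				       0, 0, NULL);
	if (!demo_cache)
		return -ENOMEM;

	pending_obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);
	return 0;
}

static void __exit demo_exit(void)
{
	if (pending_obj)
		kfree_rcu(pending_obj, rcu);	/* may still be in flight below */

	/* must not return until pending kfree_rcu() objects are freed */
	kmem_cache_destroy(demo_cache);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");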

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
  2025-11-28 12:22               ` Harry Yoo
  2025-11-28 12:38               ` Daniel Gomez
@ 2025-12-02  9:29               ` Jon Hunter
  2025-12-02 10:18                 ` Harry Yoo
  2 siblings, 1 reply; 18+ messages in thread
From: Jon Hunter @ 2025-12-02  9:29 UTC (permalink / raw)
  To: Harry Yoo, surenb
  Cc: Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel, linux-mm,
	linux-modules, lucas.demarchi, maple-tree, mcgrof, petr.pavlu,
	rcu, rientjes, roman.gushchin, samitolvanen, sidhartha.kumar,
	urezki, vbabka, linux-tegra@vger.kernel.org


On 28/11/2025 11:37, Harry Yoo wrote:
> Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> caches when a cache is destroyed. This is unnecessary when destroying
> a slab cache; only the RCU sheaves belonging to the cache being destroyed
> need to be flushed.
> 
> As suggested by Vlastimil Babka, introduce a weaker form of
> kvfree_rcu_barrier() that operates on a specific slab cache and call it
> on cache destruction.
> 
> The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> 5900X machine (1 socket), by loading slub_kunit module.
> 
> Before:
>    Total calls: 19
>    Average latency (us): 8529
>    Total time (us): 162069
> 
> After:
>    Total calls: 19
>    Average latency (us): 3804
>    Total time (us): 72287
> 
> Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---

Thanks for the rapid fix. I have been testing this and can confirm that 
this does fix the performance regression I was seeing.

BTW shouldn't we add a 'Fixes:' tag above? I would like to ensure that 
this gets picked up for v6.18 stable.

Otherwise ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!
Jon

-- 
nvpublic


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
  2025-12-02  9:29               ` Jon Hunter
@ 2025-12-02 10:18                 ` Harry Yoo
  0 siblings, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2025-12-02 10:18 UTC (permalink / raw)
  To: Jon Hunter
  Cc: surenb, Liam.Howlett, atomlin, bpf, cl, da.gomez, linux-kernel,
	linux-mm, linux-modules, lucas.demarchi, maple-tree, mcgrof,
	petr.pavlu, rcu, rientjes, roman.gushchin, samitolvanen,
	sidhartha.kumar, urezki, vbabka, linux-tegra@vger.kernel.org

On Tue, Dec 02, 2025 at 09:29:17AM +0000, Jon Hunter wrote:
> 
> On 28/11/2025 11:37, Harry Yoo wrote:
> > Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
> > caches when a cache is destroyed. This is unnecessary when destroying
> > a slab cache; only the RCU sheaves belonging to the cache being destroyed
> > need to be flushed.
> > 
> > As suggested by Vlastimil Babka, introduce a weaker form of
> > kvfree_rcu_barrier() that operates on a specific slab cache and call it
> > on cache destruction.
> > 
> > The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
> > 5900X machine (1 socket), by loading slub_kunit module.
> > 
> > Before:
> >    Total calls: 19
> >    Average latency (us): 8529
> >    Total time (us): 162069
> > 
> > After:
> >    Total calls: 19
> >    Average latency (us): 3804
> >    Total time (us): 72287
> > 
> > Link: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
> > Link: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
> > Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> > ---
> 
> Thanks for the rapid fix. I have been testing this and can confirm that this
> does fix the performance regression I was seeing.

Great!

> BTW shouldn't we add a 'Fixes:' tag above? I would like to ensure that this
> gets picked up for v6.18 stable.

Good point, I added Cc: stable and Fixes: tags.
(and your and Daniel's Reported-and-tested-by: tags)
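
(For reference, those trailers would take roughly the following form; the
commit id below is a placeholder, not the actual commit:)

Fixes: 123456789abc ("slab: add sheaf support for batching kfree_rcu() operations")
Cc: stable@vger.kernel.org
Reported-and-tested-by: Daniel Gomez <da.gomez@samsung.com>
Reported-and-tested-by: Jon Hunter <jonathanh@nvidia.com>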

> Otherwise ...
> 
> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thank you very much, Jon and Daniel, for reporting the regression and testing the fix!

-- 
Cheers,
Harry / Hyeonggon

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-12-02 10:19 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20250910-slub-percpu-caches-v8-0-ca3099d8352c@suse.cz>
     [not found] ` <20250910-slub-percpu-caches-v8-4-ca3099d8352c@suse.cz>
2025-10-31 21:32   ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Daniel Gomez
2025-11-03  3:17     ` Harry Yoo
2025-11-05 11:25       ` Vlastimil Babka
2025-11-27 14:00         ` Daniel Gomez
2025-11-27 19:29           ` Suren Baghdasaryan
2025-11-28 11:37             ` [PATCH V1] mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction Harry Yoo
2025-11-28 12:22               ` Harry Yoo
2025-11-28 12:38               ` Daniel Gomez
2025-12-02  9:29               ` Jon Hunter
2025-12-02 10:18                 ` Harry Yoo
2025-11-27 11:38     ` [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations Jon Hunter
2025-11-27 11:50       ` Jon Hunter
2025-11-27 12:33       ` Harry Yoo
2025-11-27 12:48         ` Harry Yoo
2025-11-28  8:57           ` Jon Hunter
2025-12-01  6:55             ` Harry Yoo
2025-11-27 13:18       ` Vlastimil Babka
2025-11-28  8:59         ` Jon Hunter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).