* [RFC PATCH v2 0/8] kvfree_rcu() improvements
@ 2026-04-16 9:10 Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 1/8] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo (Oracle)
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm, Alexander Viro,
Christian Brauner
This series contains a few improvements to the k[v]free_rcu() API,
suggested by Alexei Starovoitov. It aims to tackle two problems:
1) Allow an 8-byte field to be used as an alternative to
struct rcu_head (16-byte) for 2-argument kvfree_rcu()
to save memory.
2) Add a kfree_rcu_nolock() API for an unknown context.
"Unknown context" means the caller does not know whether spinning
on a lock is safe. For example, a BPF program attached to an
arbitrary kernel function may run while the CPU already holds
krcp->lock. However, in practice, it's not held most of the time.
# Discussion
Now that we have sheaves for kmalloc caches, most frees go through
the sheaves layer. However, when a sheaf becomes full with !allow_spin,
call_rcu() cannot be called because the context is unknown (e.g., it
might have preempted call_rcu()). There are two possible approaches:
a) Implement a general call_rcu_nolock() in the RCU subsystem that
defers call_rcu() when it's not safe.
b) Handle this as a special case only for rcu sheaf submission
in mm/slab_common.c, without touching the RCU core.
This series takes approach (b), because a general call_rcu_nolock()
would need to flush deferred callbacks before rcu_barrier() to preserve
its guarantee, increasing the cost of rcu_barrier() for all RCU users,
not just kfree_rcu(). By keeping the deferred call_rcu() logic in the
slab subsystem, only kvfree_rcu_barrier() pays the extra cost.
One downside of the current approach is that slab uses the condition
`!allow_spin && irqs_disabled()` to determine whether it's safe to
call call_rcu(), which creates a dependency on RCU's implementation
details. I'd like to hear thoughts on this.
# Part 1. Allow an 8-byte field to be used as an alternative to
struct rcu_head for 2-argument kvfree_rcu()
(patches 1-2)
Technically, objects that are freed with k[v]free_rcu() need
only one pointer to link objects, because we already know that
the callback function is always kvfree(). For this purpose,
struct rcu_head is unnecessarily large (16 bytes on 64-bit).
Allow a smaller, 8-byte field (of struct rcu_ptr type) to be used
with k[v]free_rcu(). Let's save one pointer per slab object.
I have to admit that my naming skill isn't great; hopefully
we'll come up with a better name than `struct rcu_ptr`.
With this feature, either a struct rcu_ptr or rcu_head field
can be used as the second argument of the k[v]free_rcu() API.
Users that only use k[v]free_rcu() may use struct rcu_ptr to save
memory (if there can be a lot of objects). However, some users,
such as maple tree, may use call_rcu() or k[v]free_rcu() for objects
of the same type. For such users, struct rcu_head remains the only
option.
Patch 1 implements the struct rcu_ptr feature (for
CONFIG_KVFREE_RCU_BATCHED), and patch 2 converts fs/dcache external_name
to use struct rcu_ptr as an example user, saving a pointer per
dynamically allocated external file name.
# Part 2. Add kfree_rcu_nolock() for unknown contexts
(patches 3-8)
Currently, kfree_rcu() cannot be called when the context is unknown,
i.e., when spinning on a lock might not be allowed. In such a context, even
calling call_rcu() is not legal, forcing users to implement some
sort of deferred freeing. Let's make users' lives easier with
a new kfree_rcu_nolock() variant.
Note that only the 2-argument variant is supported, since there is
not much we can do when both trylock and memory allocation fail.
When spinning on a lock is not allowed, try to acquire the spinlock
using spin_trylock(). When trylock succeeds, do one of the following:
1) Use the rcu sheaf to free the object. Note that call_rcu() cannot
be called in an unknown context, because the current context might
have preempted call_rcu(). When the rcu sheaf becomes full as a
result of freeing the object, defer the submission of the full sheaf
using irq_work (defer_call_rcu).
2) Use a bnode (struct kvfree_rcu_bulk_data) to store the pointer.
If trylock succeeded but no cached bnode is available, fall back
and queue the page cache worker just like the normal 2-argument
kvfree_rcu() path.
In rare cases where trylock fails, a non-lazy irq_work is used to
defer calling kvfree_call_rcu().
When certain debug features (kmemleak, debugobjects) are enabled,
freeing is always deferred because they use spinlocks.
Patch 3 moves code for preparation.
Patch 4 introduces kfree_rcu_nolock().
Patch 5 teaches the rcu sheaf to handle the !allow_spin case.
Patch 6 wraps rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef.
Patch 7 introduces deferred submission of rcu sheaves for the
!allow_spin case when IRQs are disabled.
Patch 8 adds a kunit test case for kfree_rcu_nolock().
Changes since RFC V1 [1]:
- Dropped the kmalloc_nolock() -> kfree[_rcu]() path support
and the objexts_flags cleanup, as they have already landed in mainline.
- Dropped rcu_ptr conversions in mm/ (previous patch 2) and instead
added struct external_name in fs/dcache.c as a user (new patch 2).
- (Fix) Handle kfence addresses correctly using is_kfence_address()
and kfence_object_start().
- Reworked kfree_rcu_nolock() (patch 4):
- When trylock succeeds, now attempts to use cached bnodes
(like normal kvfree_rcu 2-arg path) instead of only inserting
into krcp->head.
- Added allow_spin parameter to __schedule_delayed_monitor_work()
and run_page_cache_worker() to defer work submission via
irq_work when spinning is not allowed (Joel).
- (Fix) Introduced defer_kvfree_rcu_barrier() to flush deferred
objects before flushing rcu sheaves, preserving correctness of
kvfree_rcu_barrier().
- (Fix) Moved kvfree_rcu_barrier()/kvfree_rcu_barrier_on_cache()
to slab_common.c on CONFIG_KVFREE_RCU_BATCHED=n, and made them
wait for deferred irq_works even without kvfree_rcu batching.
- Introduced object_start_addr() helper to deduplicate the
start address calculation logic.
- Instead of falling back when the rcu sheaf becomes full,
implemented deferred submission of rcu sheaves using irq_work
(new patch 7) (Vlastimil, Alexei).
- Wrapped rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef
(new patch 6).
- Added a kunit test for kfree_rcu_nolock() (new patch 8).
[1] RFC V1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com
RFC V2 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v2r1
RFC V1 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v1r1
What hasn't changed since RFC v1:
- PREEMPT_RT support for kfree_rcu_sheaf() (Vlastimil): that is worth
addressing and I think it's doable, but it'll be a too big change to
be part of this series.
- Reducing struct rcu_ptr on !KVFREE_RCU_BATCHED (Vlastimil): I tried,
but I'm still not sure it's worth the complexity for
CONFIG_KVFREE_RCU_BATCHED=n users. Also, this inevitably introduces
some delay in freeing objects, which goes against the purpose of
RCU_STRICT_GRACE_PERIOD.
- While writing this cover letter, I realized that I should probably
try to reduce the number of irq_work structures (pointed out by Joel),
at least to 2 (lazy and non-lazy) instead of 4. Will explore this
in the next version.
Harry Yoo (Oracle) (8):
mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
fs/dcache: use rcu_ptr instead of rcu_head for external names
mm/slab: move kfree_rcu_cpu[_work] definitions
mm/slab: introduce kfree_rcu_nolock()
mm/slab: make kfree_rcu_nolock() work with sheaves
mm/slab: wrap rcu sheaf handling with ifdef
mm/slab: introduce deferred submission of rcu sheaves
lib/tests/slub_kunit: add a test case for kfree_rcu_nolock()
fs/dcache.c | 8 +-
include/linux/rcupdate.h | 64 ++++--
include/linux/slab.h | 16 +-
include/linux/types.h | 9 +
lib/tests/slub_kunit.c | 73 +++++++
mm/slab.h | 8 +-
mm/slab_common.c | 452 +++++++++++++++++++++++++++++----------
mm/slub.c | 47 +++-
8 files changed, 514 insertions(+), 163 deletions(-)
base-commit: 7e0445f673205fd045f3358cacb52b3557627317
--
2.43.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/8] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 2/8] fs/dcache: use rcu_ptr instead of rcu_head for external names Harry Yoo (Oracle)
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
k[v]free_rcu() repurposes two fields of struct rcu_head: 'func' to store
the start address of the object, and 'next' to link objects.
However, using 'func' to store the start address is unnecessary:
1. slab can get the start address from the address of struct rcu_head
field via nearest_obj(), and
2. vmalloc and large kmalloc can get the start address by aligning
down the address of the struct rcu_head field to the page boundary.
Therefore, allow an 8-byte (on 64-bit) field (of a new type called
struct rcu_ptr) to be used with k[v]free_rcu() with two arguments.
Some users use both call_rcu() and k[v]free_rcu() to process callbacks
(e.g., maple tree), so it makes sense for them to have a struct rcu_head
field to handle both cases. However, many users that simply free objects
via kvfree_rcu() can save one pointer by using struct rcu_ptr instead of
struct rcu_head.
Note that struct rcu_ptr is a single pointer only when
CONFIG_KVFREE_RCU_BATCHED=y. To keep the kvfree_rcu() implementation
minimal when CONFIG_KVFREE_RCU_BATCHED is disabled, struct rcu_ptr is
the same size as struct rcu_head, and the implementation of kvfree_rcu()
remains unchanged in that configuration.
Note that implementing kvfree_rcu batching on !KVFREE_RCU_BATCHED
would go against the purpose of RCU_STRICT_GRACE_PERIOD, which is often
used to catch use-after-free bugs.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
include/linux/rcupdate.h | 61 +++++++++++++++++++++++++++-------------
include/linux/types.h | 9 ++++++
mm/slab_common.c | 46 +++++++++++++++++++-----------
3 files changed, 81 insertions(+), 35 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 04f3f86a4145..3ca82500a19f 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1057,22 +1057,30 @@ static inline void rcu_read_unlock_migrate(void)
/**
* kfree_rcu() - kfree an object after a grace period.
* @ptr: pointer to kfree for double-argument invocations.
- * @rhf: the name of the struct rcu_head within the type of @ptr.
+ * @rf: the name of the struct rcu_head or struct rcu_ptr within the type of @ptr.
*
* Many rcu callbacks functions just call kfree() on the base structure.
* These functions are trivial, but their size adds up, and furthermore
* when they are used in a kernel module, that module must invoke the
* high-latency rcu_barrier() function at module-unload time.
+ * The kfree_rcu() function handles this issue by batching.
*
- * The kfree_rcu() function handles this issue. In order to have a universal
- * callback function handling different offsets of rcu_head, the callback needs
- * to determine the starting address of the freed object, which can be a large
- * kmalloc or vmalloc allocation. To allow simply aligning the pointer down to
- * page boundary for those, only offsets up to 4095 bytes can be accommodated.
- * If the offset is larger than 4095 bytes, a compile-time error will
- * be generated in kvfree_rcu_arg_2(). If this error is triggered, you can
- * either fall back to use of call_rcu() or rearrange the structure to
- * position the rcu_head structure into the first 4096 bytes.
+ * Typically, struct rcu_head is used to process RCU callbacks, but it requires
+ * two pointers. However, since kfree_rcu() uses kfree() as the callback
+ * function, it can process callbacks with struct rcu_ptr, which is only
+ * one pointer in size (unless !CONFIG_KVFREE_RCU_BATCHED).
+ *
+ * The type of @rf can be either struct rcu_head or struct rcu_ptr, and when
+ * possible, it is recommended to use struct rcu_ptr due to its smaller size.
+ *
+ * In order to have a universal callback function handling different offsets
+ * of @rf, the callback needs to determine the starting address of the freed
+ * object, which can be a large kmalloc or vmalloc allocation. To allow simply
+ * aligning the pointer down to page boundary for those, only offsets up to
+ * 4095 bytes can be accommodated. If the offset is larger than 4095 bytes,
+ * a compile-time error will be generated in kvfree_rcu_arg_2().
+ * If this error is triggered, you can either fall back to use of call_rcu()
+ * or rearrange the structure to position @rf into the first 4096 bytes.
*
* The object to be freed can be allocated either by kmalloc(),
* kmalloc_nolock(), or kmem_cache_alloc().
@@ -1082,8 +1090,8 @@ static inline void rcu_read_unlock_migrate(void)
* The BUILD_BUG_ON check must not involve any function calls, hence the
* checks are done in macros here.
*/
-#define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
-#define kvfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf)
+#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
+#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
/**
* kfree_rcu_mightsleep() - kfree an object after a grace period.
@@ -1105,22 +1113,37 @@ static inline void rcu_read_unlock_migrate(void)
#define kfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
#define kvfree_rcu_mightsleep(ptr) kvfree_rcu_arg_1(ptr)
-/*
- * In mm/slab_common.c, no suitable header to include here.
- */
-void kvfree_call_rcu(struct rcu_head *head, void *ptr);
+
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
+#define kvfree_call_rcu(head, ptr) \
+ _Generic((head), \
+ struct rcu_head *: kvfree_call_rcu_ptr, \
+ struct rcu_ptr *: kvfree_call_rcu_ptr, \
+ void *: kvfree_call_rcu_ptr \
+ )((struct rcu_ptr *)(head), (ptr))
+#else
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
+static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
+#define kvfree_call_rcu(head, ptr) \
+ _Generic((head), \
+ struct rcu_head *: kvfree_call_rcu_head, \
+ struct rcu_ptr *: kvfree_call_rcu_head, \
+ void *: kvfree_call_rcu_head \
+ )((struct rcu_head *)(head), (ptr))
+#endif
/*
* The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
* comment of kfree_rcu() for details.
*/
-#define kvfree_rcu_arg_2(ptr, rhf) \
+#define kvfree_rcu_arg_2(ptr, rf) \
do { \
typeof (ptr) ___p = (ptr); \
\
if (___p) { \
- BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \
- kvfree_call_rcu(&((___p)->rhf), (void *) (___p)); \
+ BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096); \
+ kvfree_call_rcu(&((___p)->rf), (void *) (___p)); \
} \
} while (0)
diff --git a/include/linux/types.h b/include/linux/types.h
index 7e71d260763c..46c3cfe08f50 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -249,6 +249,15 @@ struct callback_head {
} __attribute__((aligned(sizeof(void *))));
#define rcu_head callback_head
+
+struct rcu_ptr {
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+ struct rcu_ptr *next;
+#else
+ struct callback_head;
+#endif
+} __attribute__((aligned(sizeof(void *))));
+
typedef void (*rcu_callback_t)(struct rcu_head *head);
typedef void (*call_rcu_func_t)(struct rcu_head *head, rcu_callback_t func);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d5a70a831a2a..85c9c2d0620e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1265,7 +1265,7 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
#ifndef CONFIG_KVFREE_RCU_BATCHED
-void kvfree_call_rcu(struct rcu_head *head, void *ptr)
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
{
if (head) {
kasan_record_aux_stack(ptr);
@@ -1278,7 +1278,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
synchronize_rcu();
kvfree(ptr);
}
-EXPORT_SYMBOL_GPL(kvfree_call_rcu);
+EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
void __init kvfree_rcu_init(void)
{
@@ -1346,7 +1346,7 @@ struct kvfree_rcu_bulk_data {
struct kfree_rcu_cpu_work {
struct rcu_work rcu_work;
- struct rcu_head *head_free;
+ struct rcu_ptr *head_free;
struct rcu_gp_oldstate head_free_gp_snap;
struct list_head bulk_head_free[FREE_N_CHANNELS];
struct kfree_rcu_cpu *krcp;
@@ -1381,8 +1381,7 @@ struct kfree_rcu_cpu_work {
*/
struct kfree_rcu_cpu {
// Objects queued on a linked list
- // through their rcu_head structures.
- struct rcu_head *head;
+ struct rcu_ptr *head;
unsigned long head_gp_snap;
atomic_t head_count;
@@ -1523,18 +1522,34 @@ kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
}
static void
-kvfree_rcu_list(struct rcu_head *head)
+kvfree_rcu_list(struct rcu_ptr *head)
{
- struct rcu_head *next;
+ struct rcu_ptr *next;
for (; head; head = next) {
- void *ptr = (void *) head->func;
- unsigned long offset = (void *) head - ptr;
+ void *ptr;
+ unsigned long offset;
+ struct slab *slab;
+ if (is_vmalloc_addr(head)) {
+ ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
+ } else {
+ slab = virt_to_slab(head);
+ if (!slab)
+ ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
+ else if (is_kfence_address(head))
+ ptr = kfence_object_start(head);
+ else
+ ptr = nearest_obj(slab->slab_cache, slab, head);
+ }
+
+ offset = (void *)head - ptr;
next = head->next;
debug_rcu_head_unqueue((struct rcu_head *)ptr);
rcu_lock_acquire(&rcu_callback_map);
- trace_rcu_invoke_kvfree_callback("slab", head, offset);
+ trace_rcu_invoke_kvfree_callback("slab",
+ (struct rcu_head *)head,
+ offset);
kvfree(ptr);
@@ -1552,7 +1567,7 @@ static void kfree_rcu_work(struct work_struct *work)
unsigned long flags;
struct kvfree_rcu_bulk_data *bnode, *n;
struct list_head bulk_head[FREE_N_CHANNELS];
- struct rcu_head *head;
+ struct rcu_ptr *head;
struct kfree_rcu_cpu *krcp;
struct kfree_rcu_cpu_work *krwp;
struct rcu_gp_oldstate head_gp_snap;
@@ -1675,7 +1690,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
{
struct list_head bulk_ready[FREE_N_CHANNELS];
struct kvfree_rcu_bulk_data *bnode, *n;
- struct rcu_head *head_ready = NULL;
+ struct rcu_ptr *head_ready = NULL;
unsigned long flags;
int i;
@@ -1938,7 +1953,7 @@ void __init kfree_rcu_scheduler_running(void)
* be free'd in workqueue context. This allows us to: batch requests together to
* reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
*/
-void kvfree_call_rcu(struct rcu_head *head, void *ptr)
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
{
unsigned long flags;
struct kfree_rcu_cpu *krcp;
@@ -1960,7 +1975,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
// Queue the object but don't yet schedule the batch.
if (debug_rcu_head_queue(ptr)) {
// Probable double kfree_rcu(), just leak.
- WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
+ WARN_ONCE(1, "%s(): Double-freed call. rcu_ptr %p\n",
__func__, head);
// Mark as success and leave.
@@ -1976,7 +1991,6 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
// Inline if kvfree_rcu(one_arg) call.
goto unlock_return;
- head->func = ptr;
head->next = krcp->head;
WRITE_ONCE(krcp->head, head);
atomic_inc(&krcp->head_count);
@@ -2012,7 +2026,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
kvfree(ptr);
}
}
-EXPORT_SYMBOL_GPL(kvfree_call_rcu);
+EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
static inline void __kvfree_rcu_barrier(void)
{
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/8] fs/dcache: use rcu_ptr instead of rcu_head for external names
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 1/8] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 3/8] mm/slab: move kfree_rcu_cpu[_work] definitions Harry Yoo (Oracle)
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm, Alexander Viro,
Christian Brauner, Jan Kara
When a file name length exceeds 31 (DCACHE_INLINE_LEN - 1),
struct external_name is dynamically allocated. Because only kfree_rcu()
is used to free these objects, struct rcu_ptr is sufficient and saves a
pointer per object.
Under the author's home directory, there are 230k unique file names
longer than 31 characters, so some memory saving is expected.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
fs/dcache.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..fa37e3964b38 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -325,12 +325,12 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
/*
* long names are allocated separately from dentry and never modified.
* Refcounted, freeing is RCU-delayed. See take_dentry_name_snapshot()
- * for the reason why ->count and ->head can't be combined into a union.
+ * for the reason why ->count and ->rcu can't be combined into a union.
* dentry_string_cmp() relies upon ->name[] being word-aligned.
*/
struct external_name {
atomic_t count;
- struct rcu_head head;
+ struct rcu_ptr rcu;
unsigned char name[] __aligned(sizeof(unsigned long));
};
@@ -393,7 +393,7 @@ void release_dentry_name_snapshot(struct name_snapshot *name)
struct external_name *p;
p = container_of(name->name.name, struct external_name, name[0]);
if (unlikely(atomic_dec_and_test(&p->count)))
- kfree_rcu(p, head);
+ kfree_rcu(p, rcu);
}
}
EXPORT_SYMBOL(release_dentry_name_snapshot);
@@ -2863,7 +2863,7 @@ static void copy_name(struct dentry *dentry, struct dentry *target)
dentry->__d_name.hash_len = target->__d_name.hash_len;
}
if (old_name && likely(atomic_dec_and_test(&old_name->count)))
- kfree_rcu(old_name, head);
+ kfree_rcu(old_name, rcu);
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/8] mm/slab: move kfree_rcu_cpu[_work] definitions
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 1/8] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 2/8] fs/dcache: use rcu_ptr instead of rcu_head for external names Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
In preparation for defining kfree_rcu_cpu under
CONFIG_KVFREE_RCU_BATCHED=n and adding a new function common to both
configurations, move the existing kfree_rcu_cpu[_work] definitions to
just before the beginning of the kfree_rcu batching infrastructure.
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
mm/slab_common.c | 142 ++++++++++++++++++++++++-----------------------
1 file changed, 72 insertions(+), 70 deletions(-)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 85c9c2d0620e..cddbf3279c13 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1263,78 +1263,9 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
EXPORT_TRACEPOINT_SYMBOL(kfree);
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
-#ifndef CONFIG_KVFREE_RCU_BATCHED
-
-void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
-{
- if (head) {
- kasan_record_aux_stack(ptr);
- call_rcu(head, kvfree_rcu_cb);
- return;
- }
-
- // kvfree_rcu(one_arg) call.
- might_sleep();
- synchronize_rcu();
- kvfree(ptr);
-}
-EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
-
-void __init kvfree_rcu_init(void)
-{
-}
-
-#else /* CONFIG_KVFREE_RCU_BATCHED */
-
-/*
- * This rcu parameter is runtime-read-only. It reflects
- * a minimum allowed number of objects which can be cached
- * per-CPU. Object size is equal to one page. This value
- * can be changed at boot time.
- */
-static int rcu_min_cached_objs = 5;
-module_param(rcu_min_cached_objs, int, 0444);
-
-// A page shrinker can ask for pages to be freed to make them
-// available for other parts of the system. This usually happens
-// under low memory conditions, and in that case we should also
-// defer page-cache filling for a short time period.
-//
-// The default value is 5 seconds, which is long enough to reduce
-// interference with the shrinker while it asks other systems to
-// drain their caches.
-static int rcu_delay_page_cache_fill_msec = 5000;
-module_param(rcu_delay_page_cache_fill_msec, int, 0444);
-
-static struct workqueue_struct *rcu_reclaim_wq;
-
-/* Maximum number of jiffies to wait before draining a batch. */
-#define KFREE_DRAIN_JIFFIES (5 * HZ)
+#ifdef CONFIG_KVFREE_RCU_BATCHED
#define KFREE_N_BATCHES 2
#define FREE_N_CHANNELS 2
-
-/**
- * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
- * @list: List node. All blocks are linked between each other
- * @gp_snap: Snapshot of RCU state for objects placed to this bulk
- * @nr_records: Number of active pointers in the array
- * @records: Array of the kvfree_rcu() pointers
- */
-struct kvfree_rcu_bulk_data {
- struct list_head list;
- struct rcu_gp_oldstate gp_snap;
- unsigned long nr_records;
- void *records[] __counted_by(nr_records);
-};
-
-/*
- * This macro defines how many entries the "records" array
- * will contain. It is based on the fact that the size of
- * kvfree_rcu_bulk_data structure becomes exactly one page.
- */
-#define KVFREE_BULK_MAX_ENTR \
- ((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
-
/**
* struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
* @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
@@ -1402,6 +1333,77 @@ struct kfree_rcu_cpu {
struct llist_head bkvcache;
int nr_bkv_objs;
};
+#endif
+
+#ifndef CONFIG_KVFREE_RCU_BATCHED
+
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
+{
+ if (head) {
+ kasan_record_aux_stack(ptr);
+ call_rcu(head, kvfree_rcu_cb);
+ return;
+ }
+
+ // kvfree_rcu(one_arg) call.
+ might_sleep();
+ synchronize_rcu();
+ kvfree(ptr);
+}
+EXPORT_SYMBOL_GPL(kvfree_call_rcu_head);
+
+void __init kvfree_rcu_init(void)
+{
+}
+
+#else /* CONFIG_KVFREE_RCU_BATCHED */
+
+/*
+ * This rcu parameter is runtime-read-only. It reflects
+ * a minimum allowed number of objects which can be cached
+ * per-CPU. Object size is equal to one page. This value
+ * can be changed at boot time.
+ */
+static int rcu_min_cached_objs = 5;
+module_param(rcu_min_cached_objs, int, 0444);
+
+// A page shrinker can ask for pages to be freed to make them
+// available for other parts of the system. This usually happens
+// under low memory conditions, and in that case we should also
+// defer page-cache filling for a short time period.
+//
+// The default value is 5 seconds, which is long enough to reduce
+// interference with the shrinker while it asks other systems to
+// drain their caches.
+static int rcu_delay_page_cache_fill_msec = 5000;
+module_param(rcu_delay_page_cache_fill_msec, int, 0444);
+
+static struct workqueue_struct *rcu_reclaim_wq;
+
+/* Maximum number of jiffies to wait before draining a batch. */
+#define KFREE_DRAIN_JIFFIES (5 * HZ)
+
+/**
+ * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
+ * @list: List node. All blocks are linked between each other
+ * @gp_snap: Snapshot of RCU state for objects placed to this bulk
+ * @nr_records: Number of active pointers in the array
+ * @records: Array of the kvfree_rcu() pointers
+ */
+struct kvfree_rcu_bulk_data {
+ struct list_head list;
+ struct rcu_gp_oldstate gp_snap;
+ unsigned long nr_records;
+ void *records[] __counted_by(nr_records);
+};
+
+/*
+ * This macro defines how many entries the "records" array
+ * will contain. It is based on the fact that the size of
+ * kvfree_rcu_bulk_data structure becomes exactly one page.
+ */
+#define KVFREE_BULK_MAX_ENTR \
+ ((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock()
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
` (2 preceding siblings ...)
2026-04-16 9:10 ` [PATCH 3/8] mm/slab: move kfree_rcu_cpu[_work] definitions Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 5/8] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo (Oracle)
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
Currently, kfree_rcu() cannot be called when the context is unknown,
i.e., when it might not be safe to spin on a lock. In such an unknown
context, even calling call_rcu() is not legal, forcing users to
implement some sort of deferred freeing.
Make users' lives easier by introducing a kfree_rcu_nolock() variant.
It passes allow_spin = false to kvfree_call_rcu(), which means spinning
on a lock is not allowed because the context is unknown.
Unlike kfree_rcu(), kfree_rcu_nolock() only supports a 2-argument
variant because, in the worst case where memory allocation fails,
the caller cannot synchronously wait for the grace period to finish.
kfree_rcu_nolock() tries to acquire the kfree_rcu_cpu spinlock.
When trylock succeeds, get a cached bnode and use it to store the
pointer. Just like the existing 2-argument kvfree_rcu() variant, fall
back if there's no cached bnode available.
If trylock fails, insert the object into the per-CPU lockless list
and defer freeing using an irq_work that calls kvfree_call_rcu() later.
Note that in most cases the context does allow spinning, so it is
worth trying to acquire the lock.
To ensure rcu sheaves are flushed in flush_rcu_all_sheaves() and
flush_rcu_sheaves_on_cache(), deferred objects must be processed before
calling them. Otherwise, an irq_work might insert objects into a sheaf
and end up not flushing it. Implement defer_kvfree_rcu_barrier() and
call it before flushing rcu sheaves.
When kmemleak or debugobjects is enabled, always defer freeing, as
those debug features use spinlocks.
Determine whether work items (page cache worker or delayed monitor) need
to be queued under krcp->lock. If so, use irq_work to defer the actual
work submission. The existing logic prevents excessive irq_work
queueing.
For now, the sheaves layer is bypassed if spinning is not allowed.
Without CONFIG_KVFREE_RCU_BATCHED, all frees in the !allow_spin case are
deferred using irq_work. Move kvfree_rcu_barrier[_on_cache]() to
mm/slab_common.c and let them wait for pending irq_work.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
include/linux/rcupdate.h | 23 ++--
include/linux/slab.h | 16 +--
mm/slab.h | 1 +
mm/slab_common.c | 260 +++++++++++++++++++++++++++++++--------
mm/slub.c | 6 +-
5 files changed, 231 insertions(+), 75 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 3ca82500a19f..8776b2a394bb 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1090,8 +1090,9 @@ static inline void rcu_read_unlock_migrate(void)
* The BUILD_BUG_ON check must not involve any function calls, hence the
* checks are done in macros here.
*/
-#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
-#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
+#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
+#define kfree_rcu_nolock(ptr, rf) kvfree_rcu_arg_2(ptr, rf, false)
+#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
/**
* kfree_rcu_mightsleep() - kfree an object after a grace period.
@@ -1115,35 +1116,35 @@ static inline void rcu_read_unlock_migrate(void)
#ifdef CONFIG_KVFREE_RCU_BATCHED
-void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
-#define kvfree_call_rcu(head, ptr) \
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin);
+#define kvfree_call_rcu(head, ptr, spin) \
_Generic((head), \
struct rcu_head *: kvfree_call_rcu_ptr, \
struct rcu_ptr *: kvfree_call_rcu_ptr, \
void *: kvfree_call_rcu_ptr \
- )((struct rcu_ptr *)(head), (ptr))
+ )((struct rcu_ptr *)(head), (ptr), spin)
#else
-void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin);
static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
-#define kvfree_call_rcu(head, ptr) \
+#define kvfree_call_rcu(head, ptr, spin) \
_Generic((head), \
struct rcu_head *: kvfree_call_rcu_head, \
struct rcu_ptr *: kvfree_call_rcu_head, \
void *: kvfree_call_rcu_head \
- )((struct rcu_head *)(head), (ptr))
+ )((struct rcu_head *)(head), (ptr), spin)
#endif
/*
* The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
* comment of kfree_rcu() for details.
*/
-#define kvfree_rcu_arg_2(ptr, rf) \
+#define kvfree_rcu_arg_2(ptr, rf, spin) \
do { \
typeof (ptr) ___p = (ptr); \
\
if (___p) { \
BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096); \
- kvfree_call_rcu(&((___p)->rf), (void *) (___p)); \
+ kvfree_call_rcu(&((___p)->rf), (void *) (___p), spin); \
} \
} while (0)
@@ -1152,7 +1153,7 @@ do { \
typeof(ptr) ___p = (ptr); \
\
if (___p) \
- kvfree_call_rcu(NULL, (void *) (___p)); \
+ kvfree_call_rcu(NULL, (void *) (___p), true); \
} while (0)
/*
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 15a60b501b95..67528f698fe2 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -1238,23 +1238,13 @@ extern void kvfree_sensitive(const void *addr, size_t len);
unsigned int kmem_cache_size(struct kmem_cache *s);
-#ifndef CONFIG_KVFREE_RCU_BATCHED
-static inline void kvfree_rcu_barrier(void)
-{
- rcu_barrier();
-}
-
-static inline void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
-{
- rcu_barrier();
-}
-
-static inline void kfree_rcu_scheduler_running(void) { }
-#else
void kvfree_rcu_barrier(void);
void kvfree_rcu_barrier_on_cache(struct kmem_cache *s);
+#ifndef CONFIG_KVFREE_RCU_BATCHED
+static inline void kfree_rcu_scheduler_running(void) { }
+#else
void kfree_rcu_scheduler_running(void);
#endif
diff --git a/mm/slab.h b/mm/slab.h
index c735e6b4dddb..ae2e990e8dc2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -412,6 +412,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
void flush_all_rcu_sheaves(void);
void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
+void defer_kvfree_rcu_barrier(void);
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
SLAB_CACHE_DMA32 | SLAB_PANIC | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index cddbf3279c13..e840956233dd 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1311,6 +1311,14 @@ struct kfree_rcu_cpu_work {
* the interactions with the slab allocators.
*/
struct kfree_rcu_cpu {
+ // Objects queued on a lockless linked list, used to free objects
+ // in unknown contexts when trylock fails.
+ struct llist_head defer_head;
+
+ struct irq_work defer_free;
+ struct irq_work sched_delayed_monitor;
+ struct irq_work run_page_cache_worker;
+
// Objects queued on a linked list
struct rcu_ptr *head;
unsigned long head_gp_snap;
@@ -1333,12 +1341,99 @@ struct kfree_rcu_cpu {
struct llist_head bkvcache;
int nr_bkv_objs;
};
+
+static void defer_kfree_rcu_irq_work_fn(struct irq_work *work);
+static void sched_delayed_monitor_irq_work_fn(struct irq_work *work);
+static void run_page_cache_worker_irq_work_fn(struct irq_work *work);
+
+static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
+ .defer_head = LLIST_HEAD_INIT(defer_head),
+ .defer_free = IRQ_WORK_INIT(defer_kfree_rcu_irq_work_fn),
+ .sched_delayed_monitor =
+ IRQ_WORK_INIT_LAZY(sched_delayed_monitor_irq_work_fn),
+ .run_page_cache_worker =
+ IRQ_WORK_INIT_LAZY(run_page_cache_worker_irq_work_fn),
+};
+#else
+struct kfree_rcu_cpu {
+ struct llist_head defer_head;
+ struct irq_work defer_free;
+};
+
+static void defer_kfree_rcu_irq_work_fn(struct irq_work *work);
+
+static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
+ .defer_head = LLIST_HEAD_INIT(defer_head),
+ .defer_free = IRQ_WORK_INIT(defer_kfree_rcu_irq_work_fn),
+};
#endif
-#ifndef CONFIG_KVFREE_RCU_BATCHED
+/* Wait for deferred work from kfree_rcu_nolock() */
+void defer_kvfree_rcu_barrier(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ irq_work_sync(&per_cpu_ptr(&krc, cpu)->defer_free);
+}
+
+static void *object_start_addr(void *ptr)
+{
+ struct slab *slab;
+ void *start;
+
+ if (is_vmalloc_addr(ptr)) {
+ start = (void *)PAGE_ALIGN_DOWN((unsigned long)ptr);
+ } else {
+ slab = virt_to_slab(ptr);
+ if (!slab)
+ start = (void *)PAGE_ALIGN_DOWN((unsigned long)ptr);
+ else if (is_kfence_address(ptr))
+ start = kfence_object_start(ptr);
+ else
+ start = nearest_obj(slab->slab_cache, slab, ptr);
+ }
-void kvfree_call_rcu_head(struct rcu_head *head, void *ptr)
+ return start;
+}
+
+static void defer_kfree_rcu_irq_work_fn(struct irq_work *work)
{
+ struct kfree_rcu_cpu *krcp;
+ struct llist_head *head;
+ struct llist_node *llnode, *pos, *t;
+
+ krcp = container_of(work, struct kfree_rcu_cpu, defer_free);
+ head = &krcp->defer_head;
+
+ if (llist_empty(head))
+ return;
+
+ llnode = llist_del_all(head);
+ llist_for_each_safe(pos, t, llnode) {
+ void *objp;
+ struct rcu_ptr *rcup = (struct rcu_ptr *)pos;
+
+ objp = object_start_addr(rcup);
+ kvfree_call_rcu(rcup, objp, true);
+ }
+}
+
+#ifndef CONFIG_KVFREE_RCU_BATCHED
+void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin)
+{
+ if (!allow_spin) {
+ struct kfree_rcu_cpu *krcp;
+
+ guard(preempt)();
+
+ krcp = this_cpu_ptr(&krc);
+ if (llist_add((struct llist_node *)head, &krcp->defer_head))
+ irq_work_queue(&krcp->defer_free);
+ return;
+ }
+
if (head) {
kasan_record_aux_stack(ptr);
call_rcu(head, kvfree_rcu_cb);
@@ -1356,6 +1451,19 @@ void __init kvfree_rcu_init(void)
{
}
+void kvfree_rcu_barrier(void)
+{
+ defer_kvfree_rcu_barrier();
+ rcu_barrier();
+}
+EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
+
+void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
+{
+ kvfree_rcu_barrier();
+}
+EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
+
#else /* CONFIG_KVFREE_RCU_BATCHED */
/*
@@ -1405,9 +1513,16 @@ struct kvfree_rcu_bulk_data {
#define KVFREE_BULK_MAX_ENTR \
((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
-static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
- .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
-};
+
+static void schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp);
+
+static void sched_delayed_monitor_irq_work_fn(struct irq_work *work)
+{
+ struct kfree_rcu_cpu *krcp;
+
+ krcp = container_of(work, struct kfree_rcu_cpu, sched_delayed_monitor);
+ schedule_delayed_monitor_work(krcp);
+}
static __always_inline void
debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
@@ -1421,13 +1536,18 @@ debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
}
static inline struct kfree_rcu_cpu *
-krc_this_cpu_lock(unsigned long *flags)
+krc_this_cpu_lock(unsigned long *flags, bool allow_spin)
{
struct kfree_rcu_cpu *krcp;
local_irq_save(*flags); // For safely calling this_cpu_ptr().
krcp = this_cpu_ptr(&krc);
- raw_spin_lock(&krcp->lock);
+ if (allow_spin) {
+ raw_spin_lock(&krcp->lock);
+ } else if (!raw_spin_trylock(&krcp->lock)) {
+ local_irq_restore(*flags);
+ return NULL;
+ }
return krcp;
}
@@ -1531,20 +1651,8 @@ kvfree_rcu_list(struct rcu_ptr *head)
for (; head; head = next) {
void *ptr;
unsigned long offset;
- struct slab *slab;
-
- if (is_vmalloc_addr(head)) {
- ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
- } else {
- slab = virt_to_slab(head);
- if (!slab)
- ptr = (void *)PAGE_ALIGN_DOWN((unsigned long)head);
- else if (is_kfence_address(head))
- ptr = kfence_object_start(head);
- else
- ptr = nearest_obj(slab->slab_cache, slab, head);
- }
+ ptr = object_start_addr(head);
offset = (void *)head - ptr;
next = head->next;
debug_rcu_head_unqueue((struct rcu_head *)ptr);
@@ -1663,18 +1771,26 @@ static int krc_count(struct kfree_rcu_cpu *krcp)
}
static void
-__schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
+__schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp, bool allow_spin)
{
long delay, delay_left;
delay = krc_count(krcp) >= KVFREE_BULK_MAX_ENTR ? 1:KFREE_DRAIN_JIFFIES;
if (delayed_work_pending(&krcp->monitor_work)) {
delay_left = krcp->monitor_work.timer.expires - jiffies;
- if (delay < delay_left)
- mod_delayed_work(rcu_reclaim_wq, &krcp->monitor_work, delay);
+ if (delay < delay_left) {
+ if (allow_spin)
+ mod_delayed_work(rcu_reclaim_wq, &krcp->monitor_work, delay);
+ else
+ irq_work_queue(&krcp->sched_delayed_monitor);
+ }
return;
}
- queue_delayed_work(rcu_reclaim_wq, &krcp->monitor_work, delay);
+
+ if (allow_spin)
+ queue_delayed_work(rcu_reclaim_wq, &krcp->monitor_work, delay);
+ else
+ irq_work_queue(&krcp->sched_delayed_monitor);
}
static void
@@ -1683,7 +1799,7 @@ schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
unsigned long flags;
raw_spin_lock_irqsave(&krcp->lock, flags);
- __schedule_delayed_monitor_work(krcp);
+ __schedule_delayed_monitor_work(krcp, true);
raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
@@ -1847,25 +1963,25 @@ static void fill_page_cache_func(struct work_struct *work)
// Returns true if ptr was successfully recorded, else the caller must
// use a fallback.
static inline bool
-add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
- unsigned long *flags, void *ptr, bool can_alloc)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu *krcp,
+ unsigned long *flags, void *ptr, bool can_alloc, bool allow_spin)
{
struct kvfree_rcu_bulk_data *bnode;
int idx;
- *krcp = krc_this_cpu_lock(flags);
- if (unlikely(!(*krcp)->initialized))
+ if (unlikely(!krcp->initialized))
return false;
idx = !!is_vmalloc_addr(ptr);
- bnode = list_first_entry_or_null(&(*krcp)->bulk_head[idx],
+ bnode = list_first_entry_or_null(&krcp->bulk_head[idx],
struct kvfree_rcu_bulk_data, list);
/* Check if a new block is required. */
if (!bnode || bnode->nr_records == KVFREE_BULK_MAX_ENTR) {
- bnode = get_cached_bnode(*krcp);
+ bnode = get_cached_bnode(krcp);
if (!bnode && can_alloc) {
- krc_this_cpu_unlock(*krcp, *flags);
+ krc_this_cpu_unlock(krcp, *flags);
+ VM_WARN_ON_ONCE(!allow_spin);
// __GFP_NORETRY - allows a light-weight direct reclaim
// what is OK from minimizing of fallback hitting point of
@@ -1880,7 +1996,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
// scenarios.
bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
- raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
+ raw_spin_lock_irqsave(&krcp->lock, *flags);
}
if (!bnode)
@@ -1888,14 +2004,14 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
// Initialize the new block and attach it.
bnode->nr_records = 0;
- list_add(&bnode->list, &(*krcp)->bulk_head[idx]);
+ list_add(&bnode->list, &krcp->bulk_head[idx]);
}
// Finally insert and update the GP for this page.
bnode->nr_records++;
bnode->records[bnode->nr_records - 1] = ptr;
get_state_synchronize_rcu_full(&bnode->gp_snap);
- atomic_inc(&(*krcp)->bulk_count[idx]);
+ atomic_inc(&krcp->bulk_count[idx]);
return true;
}
@@ -1911,7 +2027,32 @@ schedule_page_work_fn(struct hrtimer *t)
}
static void
-run_page_cache_worker(struct kfree_rcu_cpu *krcp)
+__run_page_cache_worker(struct kfree_rcu_cpu *krcp)
+{
+ if (atomic_read(&krcp->backoff_page_cache_fill)) {
+ queue_delayed_work(rcu_reclaim_wq,
+ &krcp->page_cache_work,
+ msecs_to_jiffies(rcu_delay_page_cache_fill_msec));
+ } else {
+ hrtimer_setup(&krcp->hrtimer, schedule_page_work_fn, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+ }
+}
+
+static void run_page_cache_worker_irq_work_fn(struct irq_work *work)
+{
+ unsigned long flags;
+ struct kfree_rcu_cpu *krcp =
+ container_of(work, struct kfree_rcu_cpu, run_page_cache_worker);
+
+ raw_spin_lock_irqsave(&krcp->lock, flags);
+ __run_page_cache_worker(krcp);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
+}
+
+static void
+run_page_cache_worker(struct kfree_rcu_cpu *krcp, bool allow_spin)
{
// If cache disabled, bail out.
if (!rcu_min_cached_objs)
@@ -1919,15 +2060,10 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp)
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
!atomic_xchg(&krcp->work_in_progress, 1)) {
- if (atomic_read(&krcp->backoff_page_cache_fill)) {
- queue_delayed_work(rcu_reclaim_wq,
- &krcp->page_cache_work,
- msecs_to_jiffies(rcu_delay_page_cache_fill_msec));
- } else {
- hrtimer_setup(&krcp->hrtimer, schedule_page_work_fn, CLOCK_MONOTONIC,
- HRTIMER_MODE_REL);
- hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
- }
+ if (allow_spin)
+ __run_page_cache_worker(krcp);
+ else
+ irq_work_queue(&krcp->run_page_cache_worker);
}
}
@@ -1955,7 +2091,7 @@ void __init kfree_rcu_scheduler_running(void)
* be free'd in workqueue context. This allows us to: batch requests together to
* reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
*/
-void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
+void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin)
{
unsigned long flags;
struct kfree_rcu_cpu *krcp;
@@ -1971,7 +2107,12 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
if (!head)
might_sleep();
- if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
+ if (!allow_spin && (IS_ENABLED(CONFIG_DEBUG_OBJECTS_RCU_HEAD) ||
+ IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)))
+ goto defer_free;
+
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
+ (allow_spin && kfree_rcu_sheaf(ptr)))
return;
// Queue the object but don't yet schedule the batch.
@@ -1985,9 +2126,14 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
}
kasan_record_aux_stack(ptr);
- success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
+
+ krcp = krc_this_cpu_lock(&flags, allow_spin);
+ if (!krcp)
+ goto defer_free;
+
+ success = add_ptr_to_bulk_krc_lock(krcp, &flags, ptr, !head, allow_spin);
if (!success) {
- run_page_cache_worker(krcp);
+ run_page_cache_worker(krcp, allow_spin);
if (head == NULL)
// Inline if kvfree_rcu(one_arg) call.
@@ -2012,7 +2158,7 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
// Set timer to drain after KFREE_DRAIN_JIFFIES.
if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
- __schedule_delayed_monitor_work(krcp);
+ __schedule_delayed_monitor_work(krcp, allow_spin);
unlock_return:
krc_this_cpu_unlock(krcp, flags);
@@ -2023,10 +2169,22 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
* CPU can pass the QS state.
*/
if (!success) {
+ VM_WARN_ON_ONCE(!allow_spin);
debug_rcu_head_unqueue((struct rcu_head *) ptr);
synchronize_rcu();
kvfree(ptr);
}
+ return;
+
+defer_free:
+ VM_WARN_ON_ONCE(allow_spin);
+ guard(preempt)();
+
+ krcp = this_cpu_ptr(&krc);
+ if (llist_add((struct llist_node *)head, &krcp->defer_head))
+ irq_work_queue(&krcp->defer_free);
+ return;
+
}
EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
@@ -2125,6 +2283,8 @@ EXPORT_SYMBOL_GPL(kvfree_rcu_barrier);
*/
void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
{
+ defer_kvfree_rcu_barrier();
+
if (cache_has_sheaves(s)) {
flush_rcu_sheaves_on_cache(s);
rcu_barrier();
diff --git a/mm/slub.c b/mm/slub.c
index 92362eeb13e5..6f658ec00751 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4018,7 +4018,10 @@ static void flush_rcu_sheaf(struct work_struct *w)
}
-/* needed for kvfree_rcu_barrier() */
+/*
+ * Needed for kvfree_rcu_barrier(). The caller should invoke
+ * defer_kvfree_rcu_barrier() before calling this function.
+ */
void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
{
struct slub_flush_work *sfw;
@@ -4053,6 +4056,7 @@ void flush_all_rcu_sheaves(void)
{
struct kmem_cache *s;
+ defer_kvfree_rcu_barrier();
cpus_read_lock();
mutex_lock(&slab_mutex);
--
2.43.0
* [PATCH 5/8] mm/slab: make kfree_rcu_nolock() work with sheaves
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
` (3 preceding siblings ...)
2026-04-16 9:10 ` [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 6/8] mm/slab: wrap rcu sheaf handling with ifdef Harry Yoo (Oracle)
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
Teach kfree_rcu_sheaf() how to handle the !allow_spin case. Similar to
__pcs_replace_full_main(), try to get an empty sheaf from pcs->spare or
the barn, but don't add !allow_spin support for alloc_empty_sheaf() and
fail early instead.
Since call_rcu() does not support NMI contexts, kfree_rcu_sheaf() fails
in the !allow_spin case when the rcu sheaf becomes full.
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
mm/slab.h | 2 +-
mm/slab_common.c | 7 +++----
mm/slub.c | 14 ++++++++++++--
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index ae2e990e8dc2..d7fd7626e9fe 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -409,7 +409,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
}
-bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin);
void flush_all_rcu_sheaves(void);
void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
void defer_kvfree_rcu_barrier(void);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index e840956233dd..46a2bee1662b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1716,7 +1716,7 @@ static void kfree_rcu_work(struct work_struct *work)
kvfree_rcu_list(head);
}
-static bool kfree_rcu_sheaf(void *obj)
+static bool kfree_rcu_sheaf(void *obj, bool allow_spin)
{
struct kmem_cache *s;
struct slab *slab;
@@ -1730,7 +1730,7 @@ static bool kfree_rcu_sheaf(void *obj)
s = slab->slab_cache;
if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id()))
- return __kfree_rcu_sheaf(s, obj);
+ return __kfree_rcu_sheaf(s, obj, allow_spin);
return false;
}
@@ -2111,8 +2111,7 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin)
IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)))
goto defer_free;
- if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
- (allow_spin && kfree_rcu_sheaf(ptr)))
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr, allow_spin))
return;
// Queue the object but don't yet schedule the batch.
diff --git a/mm/slub.c b/mm/slub.c
index 6f658ec00751..d0db8d070570 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5895,7 +5895,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
*/
static DEFINE_WAIT_OVERRIDE_MAP(kfree_rcu_sheaf_map, LD_WAIT_CONFIG);
-bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *rcu_sheaf;
@@ -5933,7 +5933,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
goto fail;
}
- empty = barn_get_empty_sheaf(barn, true);
+ empty = barn_get_empty_sheaf(barn, allow_spin);
if (empty) {
pcs->rcu_free = empty;
@@ -5942,6 +5942,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
local_unlock(&s->cpu_sheaves->lock);
+ /* It's easier to fall back than trying harder with !allow_spin */
+ if (!allow_spin)
+ goto fail;
+
empty = alloc_empty_sheaf(s, GFP_NOWAIT);
if (!empty)
@@ -5973,6 +5977,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
rcu_sheaf = NULL;
} else {
+ if (unlikely(!allow_spin)) {
+ /* call_rcu() cannot be called in an unknown context */
+ rcu_sheaf->size--;
+ local_unlock(&s->cpu_sheaves->lock);
+ goto fail;
+ }
pcs->rcu_free = NULL;
rcu_sheaf->node = numa_node_id();
}
--
2.43.0
* [PATCH 6/8] mm/slab: wrap rcu sheaf handling with ifdef
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
` (4 preceding siblings ...)
2026-04-16 9:10 ` [PATCH 5/8] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 7/8] mm/slab: introduce deferred submission of rcu sheaves Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 8/8] lib/tests/slub_kunit: add a test case for kfree_rcu_nolock() Harry Yoo (Oracle)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
Freeing objects via rcu sheaves is only done with
CONFIG_KVFREE_RCU_BATCHED. Wrap the related functions and struct
fields with ifdef to make this dependency explicit.
Also remove a TODO about implementing __kvfree_rcu_barrier_on_cache()
for a specific slab cache, as there doesn't seem to be a simple and
effective way to do so.
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
mm/slab.h | 3 +++
mm/slab_common.c | 4 ----
mm/slub.c | 27 +++++++++++++++++++++++++--
3 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index d7fd7626e9fe..bdad5f389490 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -409,9 +409,12 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
}
+#ifdef CONFIG_KVFREE_RCU_BATCHED
bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin);
void flush_all_rcu_sheaves(void);
void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
+#endif
+
void defer_kvfree_rcu_barrier(void);
#define SLAB_CORE_FLAGS (SLAB_HWCACHE_ALIGN | SLAB_CACHE_DMA | \
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46a2bee1662b..347e52f1538c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -2289,10 +2289,6 @@ void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
rcu_barrier();
}
- /*
- * TODO: Introduce a version of __kvfree_rcu_barrier() that works
- * on a specific slab cache.
- */
__kvfree_rcu_barrier();
}
EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
diff --git a/mm/slub.c b/mm/slub.c
index d0db8d070570..91b8827d65da 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -421,7 +421,9 @@ struct slub_percpu_sheaves {
local_trylock_t lock;
struct slab_sheaf *main; /* never NULL when unlocked */
struct slab_sheaf *spare; /* empty or full, may be NULL */
+#ifdef CONFIG_KVFREE_RCU_BATCHED
struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
+#endif
};
/*
@@ -2923,6 +2925,7 @@ static void sheaf_flush_unused(struct kmem_cache *s, struct slab_sheaf *sheaf)
sheaf->size = 0;
}
+#ifdef CONFIG_KVFREE_RCU_BATCHED
static bool __rcu_free_sheaf_prepare(struct kmem_cache *s,
struct slab_sheaf *sheaf)
{
@@ -2965,6 +2968,7 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
free_empty_sheaf(s, sheaf);
}
+#endif
/*
* Caller needs to make sure migration is disabled in order to fully flush
@@ -2978,7 +2982,10 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
static void pcs_flush_all(struct kmem_cache *s)
{
struct slub_percpu_sheaves *pcs;
- struct slab_sheaf *spare, *rcu_free;
+ struct slab_sheaf *spare;
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+ struct slab_sheaf *rcu_free;
+#endif
local_lock(&s->cpu_sheaves->lock);
pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -2986,8 +2993,10 @@ static void pcs_flush_all(struct kmem_cache *s)
spare = pcs->spare;
pcs->spare = NULL;
+#ifdef CONFIG_KVFREE_RCU_BATCHED
rcu_free = pcs->rcu_free;
pcs->rcu_free = NULL;
+#endif
local_unlock(&s->cpu_sheaves->lock);
@@ -2996,8 +3005,10 @@ static void pcs_flush_all(struct kmem_cache *s)
free_empty_sheaf(s, spare);
}
+#ifdef CONFIG_KVFREE_RCU_BATCHED
if (rcu_free)
call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
+#endif
sheaf_flush_main(s);
}
@@ -3016,10 +3027,12 @@ static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
pcs->spare = NULL;
}
+#ifdef CONFIG_KVFREE_RCU_BATCHED
if (pcs->rcu_free) {
call_rcu(&pcs->rcu_free->rcu_head, rcu_free_sheaf_nobarn);
pcs->rcu_free = NULL;
}
+#endif
}
static void pcs_destroy(struct kmem_cache *s)
@@ -3056,7 +3069,9 @@ static void pcs_destroy(struct kmem_cache *s)
*/
WARN_ON(pcs->spare);
+#ifdef CONFIG_KVFREE_RCU_BATCHED
WARN_ON(pcs->rcu_free);
+#endif
if (!WARN_ON(pcs->main->size)) {
free_empty_sheaf(s, pcs->main);
@@ -3937,7 +3952,11 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
- return (pcs->spare || pcs->rcu_free || pcs->main->size);
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+ if (pcs->rcu_free)
+ return true;
+#endif
+ return (pcs->spare || pcs->main->size);
}
/*
@@ -3995,6 +4014,7 @@ static void flush_all(struct kmem_cache *s)
cpus_read_unlock();
}
+#ifdef CONFIG_KVFREE_RCU_BATCHED
static void flush_rcu_sheaf(struct work_struct *w)
{
struct slub_percpu_sheaves *pcs;
@@ -4071,6 +4091,7 @@ void flush_all_rcu_sheaves(void)
rcu_barrier();
}
+#endif /* CONFIG_KVFREE_RCU_BATCHED */
static int slub_cpu_setup(unsigned int cpu)
{
@@ -5825,6 +5846,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
return true;
}
+#ifdef CONFIG_KVFREE_RCU_BATCHED
static void rcu_free_sheaf(struct rcu_head *head)
{
struct slab_sheaf *sheaf;
@@ -6005,6 +6027,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
lock_map_release(&kfree_rcu_sheaf_map);
return false;
}
+#endif /* CONFIG_KVFREE_RCU_BATCHED */
static __always_inline bool can_free_to_pcs(struct slab *slab)
{
--
2.43.0
* [PATCH 7/8] mm/slab: introduce deferred submission of rcu sheaves
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
` (5 preceding siblings ...)
2026-04-16 9:10 ` [PATCH 6/8] mm/slab: wrap rcu sheaf handling with ifdef Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 8/8] lib/tests/slub_kunit: add a test case for kfree_rcu_nolock() Harry Yoo (Oracle)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
Instead of falling back when the rcu sheaf becomes full, implement
deferred submission of rcu sheaves. If kfree_rcu_sheaf() is invoked
from kfree_rcu_nolock() (!allow_spin) and IRQs are disabled, the CPU
might be in the middle of call_rcu(), so defer the call_rcu() invocation
via irq_work.
Submit all deferred rcu sheaves to call_rcu() before calling
rcu_barrier() to preserve the guarantee of kvfree_rcu_barrier().
An alternative approach could be to implement this in the RCU subsystem,
tracking if it's safe to call call_rcu() and allowing falling back to
deferred call_rcu() at the cost of more expensive rcu_barrier() calls.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
mm/slab.h | 2 ++
mm/slab_common.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--
mm/slub.c | 12 ++++--------
3 files changed, 53 insertions(+), 10 deletions(-)
diff --git a/mm/slab.h b/mm/slab.h
index bdad5f389490..9ba3aad1eeb2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -411,6 +411,8 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
#ifdef CONFIG_KVFREE_RCU_BATCHED
bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin);
+void rcu_free_sheaf(struct rcu_head *head);
+void submit_rcu_sheaf(struct rcu_head *head, bool allow_spin);
void flush_all_rcu_sheaves(void);
void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
#endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 347e52f1538c..226009b10c4a 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1314,8 +1314,11 @@ struct kfree_rcu_cpu {
// Objects queued on a lockless linked list, used to free objects
// in unknown contexts when trylock fails.
struct llist_head defer_head;
-
struct irq_work defer_free;
+
+ struct llist_head defer_call_rcu_head;
+ struct irq_work defer_call_rcu;
+
struct irq_work sched_delayed_monitor;
struct irq_work run_page_cache_worker;
@@ -1345,11 +1348,14 @@ struct kfree_rcu_cpu {
static void defer_kfree_rcu_irq_work_fn(struct irq_work *work);
static void sched_delayed_monitor_irq_work_fn(struct irq_work *work);
static void run_page_cache_worker_irq_work_fn(struct irq_work *work);
+static void defer_call_rcu_irq_work_fn(struct irq_work *work);
static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
.lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
.defer_head = LLIST_HEAD_INIT(defer_head),
.defer_free = IRQ_WORK_INIT(defer_kfree_rcu_irq_work_fn),
+ .defer_call_rcu_head = LLIST_HEAD_INIT(defer_call_rcu_head),
+ .defer_call_rcu = IRQ_WORK_INIT(defer_call_rcu_irq_work_fn),
.sched_delayed_monitor =
IRQ_WORK_INIT_LAZY(sched_delayed_monitor_irq_work_fn),
.run_page_cache_worker =
@@ -1374,8 +1380,12 @@ void defer_kvfree_rcu_barrier(void)
{
int cpu;
- for_each_possible_cpu(cpu)
+ for_each_possible_cpu(cpu) {
irq_work_sync(&per_cpu_ptr(&krc, cpu)->defer_free);
+#ifdef CONFIG_KVFREE_RCU_BATCHED
+ irq_work_sync(&per_cpu_ptr(&krc, cpu)->defer_call_rcu);
+#endif
+ }
}
static void *object_start_addr(void *ptr)
@@ -1524,6 +1534,21 @@ static void sched_delayed_monitor_irq_work_fn(struct irq_work *work)
schedule_delayed_monitor_work(krcp);
}
+static void defer_call_rcu_irq_work_fn(struct irq_work *work)
+{
+ struct kfree_rcu_cpu *krcp;
+ struct llist_node *llnode, *pos, *t;
+
+ krcp = container_of(work, struct kfree_rcu_cpu, defer_call_rcu);
+
+ if (llist_empty(&krcp->defer_call_rcu_head))
+ return;
+
+ llnode = llist_del_all(&krcp->defer_call_rcu_head);
+ llist_for_each_safe(pos, t, llnode)
+ call_rcu((struct rcu_head *)pos, rcu_free_sheaf);
+}
+
static __always_inline void
debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
{
@@ -2187,6 +2212,26 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin)
}
EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
+static inline void defer_call_rcu(struct rcu_head *head)
+{
+ struct kfree_rcu_cpu *krcp;
+
+ VM_WARN_ON_ONCE(!irqs_disabled());
+
+ krcp = this_cpu_ptr(&krc);
+ if (llist_add((struct llist_node *)head, &krcp->defer_call_rcu_head))
+ irq_work_queue(&krcp->defer_call_rcu);
+}
+
+void submit_rcu_sheaf(struct rcu_head *head, bool allow_spin)
+{
+ /* Might be in the middle of call_rcu(), defer it */
+ if (unlikely(!allow_spin && irqs_disabled()))
+ defer_call_rcu(head);
+ else
+ call_rcu(head, rcu_free_sheaf);
+}
+
static inline void __kvfree_rcu_barrier(void)
{
struct kfree_rcu_cpu_work *krwp;
diff --git a/mm/slub.c b/mm/slub.c
index 91b8827d65da..1c3451166498 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4152,6 +4152,8 @@ static int slub_cpu_dead(unsigned int cpu)
__pcs_flush_all_cpu(s, cpu);
}
mutex_unlock(&slab_mutex);
+
+ /* pending IRQ work should have been flushed before going offline */
return 0;
}
@@ -5847,7 +5849,7 @@ bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
}
#ifdef CONFIG_KVFREE_RCU_BATCHED
-static void rcu_free_sheaf(struct rcu_head *head)
+void rcu_free_sheaf(struct rcu_head *head)
{
struct slab_sheaf *sheaf;
struct node_barn *barn = NULL;
@@ -5999,12 +6001,6 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
rcu_sheaf = NULL;
} else {
- if (unlikely(!allow_spin)) {
- /* call_rcu() cannot be called in an unknown context */
- rcu_sheaf->size--;
- local_unlock(&s->cpu_sheaves->lock);
- goto fail;
- }
pcs->rcu_free = NULL;
rcu_sheaf->node = numa_node_id();
}
@@ -6014,7 +6010,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
* flush_all_rcu_sheaves() doesn't miss this sheaf
*/
if (rcu_sheaf)
- call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+ submit_rcu_sheaf(&rcu_sheaf->rcu_head, allow_spin);
local_unlock(&s->cpu_sheaves->lock);
--
2.43.0
* [PATCH 8/8] lib/tests/slub_kunit: add a test case for kfree_rcu_nolock()
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
` (6 preceding siblings ...)
2026-04-16 9:10 ` [PATCH 7/8] mm/slab: introduce deferred submission of rcu sheaves Harry Yoo (Oracle)
@ 2026-04-16 9:10 ` Harry Yoo (Oracle)
7 siblings, 0 replies; 9+ messages in thread
From: Harry Yoo (Oracle) @ 2026-04-16 9:10 UTC (permalink / raw)
To: Andrew Morton, Vlastimil Babka
Cc: Christoph Lameter, David Rientjes, Roman Gushchin, Hao Li,
Alexei Starovoitov, Uladzislau Rezki, Paul E . McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Zqiang, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, rcu, linux-mm
Similar to test_kmalloc_kfree_nolock, add a test that allocates objects
via kmalloc_nolock() and frees them via kfree_rcu_nolock() in a perf
overflow handler (NMI or hardirq depending on the arch), while the main
loop allocates and frees objects via kmalloc() and kfree_rcu().
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
lib/tests/slub_kunit.c | 73 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
index fa6d31dbca16..f8d979912246 100644
--- a/lib/tests/slub_kunit.c
+++ b/lib/tests/slub_kunit.c
@@ -367,6 +367,78 @@ static void test_kmalloc_kfree_nolock(struct kunit *test)
kfree(objects[j]);
}
+cleanup:
+ perf_event_disable(ctx.event);
+ perf_event_release_kernel(ctx.event);
+
+ kunit_info(test, "callback_count: %d, alloc_ok: %d, alloc_fail: %d\n",
+ ctx.callback_count, ctx.alloc_ok, ctx.alloc_fail);
+
+ if (alloc_fail)
+ kunit_skip(test, "Allocation failed");
+ KUNIT_EXPECT_EQ(test, 0, slab_errors);
+}
+
+struct dummy_struct {
+ struct rcu_ptr rcu;
+};
+
+static void overflow_handler_test_kfree_rcu_nolock(struct perf_event *event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct dummy_struct *dummy;
+ gfp_t gfp;
+ struct test_nolock_context *ctx = event->overflow_handler_context;
+
+ /* __GFP_ACCOUNT to test kmalloc_nolock() in alloc_slab_obj_exts() */
+ gfp = (ctx->callback_count % 2) ? 0 : __GFP_ACCOUNT;
+ dummy = kmalloc_nolock(sizeof(*dummy), gfp, NUMA_NO_NODE);
+
+ if (dummy) {
+ ctx->alloc_ok++;
+ kfree_rcu_nolock(dummy, rcu);
+ } else {
+ ctx->alloc_fail++;
+ }
+ ctx->callback_count++;
+}
+
+static void test_kfree_rcu_nolock(struct kunit *test)
+{
+ int i, j;
+ struct test_nolock_context ctx = { .test = test };
+ struct perf_event *event;
+ bool alloc_fail = false;
+ struct dummy_struct *dummy;
+
+ if (IS_BUILTIN(CONFIG_SLUB_KUNIT_TEST))
+ kunit_skip(test, "can't do kfree_rcu_nolock() when test is built-in");
+
+ event = perf_event_create_kernel_counter(&hw_attr, -1, current,
+ overflow_handler_test_kfree_rcu_nolock,
+ &ctx);
+ if (IS_ERR(event))
+ kunit_skip(test, "Failed to create perf event");
+ ctx.event = event;
+ perf_event_enable(ctx.event);
+ for (i = 0; i < NR_ITERATIONS; i++) {
+ for (j = 0; j < NR_OBJECTS; j++) {
+ gfp_t gfp = (i % 2) ? GFP_KERNEL : GFP_KERNEL_ACCOUNT;
+
+ objects[j] = kmalloc(sizeof(*dummy), gfp);
+ if (!objects[j]) {
+ j--;
+ while (j >= 0)
+ kfree(objects[j--]);
+ alloc_fail = true;
+ goto cleanup;
+ }
+ }
+ for (j = 0; j < NR_OBJECTS; j++)
+ kfree_rcu((struct dummy_struct *)objects[j], rcu);
+ }
+
cleanup:
perf_event_disable(ctx.event);
perf_event_release_kernel(ctx.event);
@@ -406,6 +478,7 @@ static struct kunit_case test_cases[] = {
KUNIT_CASE(test_krealloc_redzone_zeroing),
#ifdef CONFIG_PERF_EVENTS
KUNIT_CASE_SLOW(test_kmalloc_kfree_nolock),
+ KUNIT_CASE_SLOW(test_kfree_rcu_nolock),
#endif
{}
};
--
2.43.0
2026-04-16 9:10 [RFC PATCH v2 0/8] kvfree_rcu() improvements Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 1/8] mm/slab: introduce k[v]free_rcu() with struct rcu_ptr Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 2/8] fs/dcache: use rcu_ptr instead of rcu_head for external names Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 3/8] mm/slab: move kfree_rcu_cpu[_work] definitions Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 5/8] mm/slab: make kfree_rcu_nolock() work with sheaves Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 6/8] mm/slab: wrap rcu sheaf handling with ifdef Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 7/8] mm/slab: introduce deferred submission of rcu sheaves Harry Yoo (Oracle)
2026-04-16 9:10 ` [PATCH 8/8] lib/tests/slub_kunit: add a test case for kfree_rcu_nolock() Harry Yoo (Oracle)