From: "Harry Yoo (Oracle)" <harry@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Hao Li <hao.li@linux.dev>, Alexei Starovoitov <ast@kernel.org>,
Uladzislau Rezki <urezki@gmail.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Josh Triplett <josh@joshtriplett.org>,
Boqun Feng <boqun@kernel.org>, Zqiang <qiang.zhang@linux.dev>,
Steven Rostedt <rostedt@goodmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Lai Jiangshan <jiangshanlai@gmail.com>,
rcu@vger.kernel.org, linux-mm@kvack.org,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>
Subject: [RFC PATCH v2 0/8] kvfree_rcu() improvements
Date: Thu, 16 Apr 2026 18:10:14 +0900
Message-ID: <20260416091022.36823-1-harry@kernel.org>
This series contains a few improvements to the k[v]free_rcu() API,
suggested by Alexei Starovoitov. It aims to tackle two problems:
1) Allow an 8-byte field to be used as an alternative to
struct rcu_head (16 bytes) for 2-argument kvfree_rcu()
to save memory.
2) Add a kfree_rcu_nolock() API for use in an unknown context.
"Unknown context" means the caller does not know whether spinning
on a lock is safe. For example, a BPF program attached to an
arbitrary kernel function may run while the CPU already holds
krcp->lock, although in practice the lock is not held most of the time.
# Discussion
Now that kmalloc caches have sheaves, most frees go through
the sheaves layer. However, when the rcu sheaf becomes full with
!allow_spin, call_rcu() cannot be called because the context is
unknown (e.g., the caller might have preempted call_rcu()). There are
two possible approaches:
a) Implement a general call_rcu_nolock() in the RCU subsystem that
defers call_rcu() when it's not safe.
b) Handle this as a special case only for rcu sheaf submission
in mm/slab_common.c, without touching the RCU core.
This series takes approach (b), because a general
call_rcu_nolock() would need to flush deferred callbacks before
rcu_barrier() to preserve its guarantee, increasing the cost of
rcu_barrier() for all RCU users, not just kfree_rcu(). By keeping the
deferred call_rcu() logic in the slab subsystem, only
kvfree_rcu_barrier() pays the extra cost.
One downside of the current approach is that slab uses the condition
`!allow_spin && irqs_disabled()` to determine whether it's safe to
call call_rcu(), which creates a dependency on RCU's implementation
details. I'd like to hear thoughts on this.
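A minimal userspace sketch of approach (b), under simplified assumptions: `submit_full_sheaf()` applies the `!allow_spin && irqs_disabled()` rule described above to decide whether a full rcu sheaf may be handed to call_rcu() right away; if not, the sheaf is parked on a slab-private deferred list. Only the hypothetical `sketch_kvfree_rcu_barrier()` flushes that list before waiting, so a plain rcu_barrier() stays untouched. `fake_call_rcu()`, `struct sheaf` and all other names here are illustrative stand-ins, not the series' actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a full rcu sheaf. */
struct sheaf {
	struct sheaf *next;
	int nr_objects;
};

/* Slab-private list of full sheaves whose call_rcu() was deferred. */
static struct sheaf *deferred_sheaves;

/* Number of sheaves handed to the (simulated) RCU core. */
static int submitted;

/* Stand-in for call_rcu(): only counts the submission. */
static void fake_call_rcu(struct sheaf *s)
{
	(void)s;
	submitted++;
}

/*
 * Submit a full sheaf. call_rcu() is treated as unsafe only when
 * spinning is not allowed and IRQs are disabled; in that case the
 * sheaf is deferred (in the kernel, an irq_work would submit it later).
 */
void submit_full_sheaf(struct sheaf *s, bool allow_spin, bool irqs_off)
{
	if (!allow_spin && irqs_off) {
		s->next = deferred_sheaves;
		deferred_sheaves = s;
		return;
	}
	fake_call_rcu(s);
}

/* What the deferred irq_work would do: submit every parked sheaf. */
void flush_deferred_sheaves(void)
{
	while (deferred_sheaves) {
		struct sheaf *s = deferred_sheaves;

		deferred_sheaves = s->next;
		fake_call_rcu(s);
	}
}

/*
 * Only the slab-specific barrier pays for the extra flush; the RCU
 * core's rcu_barrier() is not modified in this approach.
 */
int sketch_kvfree_rcu_barrier(void)
{
	flush_deferred_sheaves();
	/* ... then wait for all submitted callbacks (omitted here). */
	return submitted;
}
```

The flush in the barrier is what preserves its guarantee: without it, a sheaf still parked on the deferred list would not yet be known to RCU, and the barrier could return before those objects are freed.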
# Part 1. Allow an 8-byte field to be used as an alternative to struct rcu_head for 2-argument kvfree_rcu() (patches 1-2)
Technically, objects that are freed with k[v]free_rcu() need
only one pointer to link objects, because we already know that
the callback function is always kvfree(). For this purpose,
struct rcu_head is unnecessarily large (16 bytes on 64-bit).
Allow a smaller, 8-byte field (of type struct rcu_ptr) to be used
with k[v]free_rcu(), saving one pointer per slab object.
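To make the saving concrete, the sketch below compares a two-pointer struct modeled on struct rcu_head with a hypothetical single-pointer struct modeled on struct rcu_ptr, and shows one way the object start can be recovered without storing it: if the field lives inside a contiguous slab of fixed-size objects, the start is the field address rounded down to an object boundary. The struct layouts, `sketch_object_start()` and the fixed-size slab model are assumptions for illustration; the series' own helpers mentioned in the changelog (object_start_addr(), and kfence_object_start() for KFENCE addresses) are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modeled on struct rcu_head: a link plus a callback pointer. */
struct sketch_rcu_head {
	struct sketch_rcu_head *next;
	void (*func)(struct sketch_rcu_head *head);
};

/* Hypothetical single-pointer variant: only the link is kept. */
struct sketch_rcu_ptr {
	struct sketch_rcu_ptr *next;
};

/*
 * Recover the start of the object that contains @field, assuming the
 * object lives in a contiguous slab of @obj_size-byte objects that
 * begins at @slab_base. No per-object pointer to the start is needed.
 */
void *sketch_object_start(void *slab_base, size_t obj_size, void *field)
{
	uintptr_t base = (uintptr_t)slab_base;
	uintptr_t off = (uintptr_t)field - base;

	return (void *)(base + off / obj_size * obj_size);
}
```

On a typical 64-bit build the first struct is 16 bytes and the second is 8. The callback pointer can be dropped because k[v]free_rcu() always frees with kvfree(), so only the link and a way to find the object start are required.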
I have to admit that my naming skill isn't great; hopefully
we'll come up with a better name than `struct rcu_ptr`.
With this feature, either a struct rcu_ptr or rcu_head field
can be used as the second argument of the k[v]free_rcu() API.
Users that only use k[v]free_rcu() may use struct rcu_ptr to save
memory (when there can be many objects). However, some users,
such as the maple tree, use both call_rcu() and k[v]free_rcu() for
objects of the same type. For such users, struct rcu_head remains the
only option.
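One way a single macro can accept either field type is C11 `_Generic` selection on the type of the field's address. The sketch below is a hypothetical illustration of that technique only: `sketch_kfree_rcu()`, its two backend functions and the example structs are invented names, and patch 1 does not necessarily implement the dispatch this way.

```c
#include <assert.h>

struct sketch_rcu_head {
	struct sketch_rcu_head *next;
	void (*func)(struct sketch_rcu_head *head);
};

struct sketch_rcu_ptr {
	struct sketch_rcu_ptr *next;
};

/* Records which backend ran last, so the dispatch can be observed. */
static int last_backend;

static void sketch_free_with_head(void *obj, struct sketch_rcu_head *h)
{
	(void)obj;
	(void)h;
	last_backend = 1;
}

static void sketch_free_with_ptr(void *obj, struct sketch_rcu_ptr *p)
{
	(void)obj;
	(void)p;
	last_backend = 2;
}

/* Select the backend at compile time from the type of &(obj)->field. */
#define sketch_kfree_rcu(obj, field)					\
	_Generic(&(obj)->field,						\
		 struct sketch_rcu_head *: sketch_free_with_head,	\
		 struct sketch_rcu_ptr *: sketch_free_with_ptr)		\
		((obj), &(obj)->field)

/* Example users: one embeds the 16-byte field, the other the 8-byte one. */
struct big_user {
	long payload;
	struct sketch_rcu_head rcu;
};

struct small_user {
	long payload;
	struct sketch_rcu_ptr rcu;
};
```

A type that is also passed to call_rcu() must keep the rcu_head-style field, because call_rcu() needs the callback pointer that the smaller field omits.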
Patch 1 implements the struct rcu_ptr feature (for
CONFIG_KVFREE_RCU_BATCHED), and patch 2 converts fs/dcache external_name
to use struct rcu_ptr as an example user, saving a pointer per
dynamically allocated external file name.
# Part 2. Add kfree_rcu_nolock() for unknown contexts (patches 3-8)
Currently, kfree_rcu() cannot be called from an unknown context, where
spinning on a lock might not be allowed. In such a context, even
calling call_rcu() is not legal, forcing users to implement some
sort of deferred freeing. Let's make users' lives easier with
a new kfree_rcu_nolock() variant.
Note that only the 2-argument variant is supported, since there is
not much we can do when trylock and memory allocation fail.
When spinning on a lock is not allowed, try to acquire the spinlock
using spin_trylock(). If trylock succeeds, do one of the following:
1) Use the rcu sheaf to free the object. Note that call_rcu() cannot
be called in an unknown context, because the caller might have
preempted call_rcu(). If freeing the object makes the rcu sheaf full,
defer the submission of the full sheaf using irq_work
(defer_call_rcu).
2) Use a bnode (struct kvfree_rcu_bulk_data) to store the pointer.
If trylock succeeded but no cached bnode is available, fall back
and queue the page cache worker, just like the normal 2-argument
kvfree_rcu() path.
In rare cases where trylock fails, a non-lazy irq_work is used to
defer calling kvfree_call_rcu().
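The decision flow above can be sketched as a userspace simulation. This is a hypothetical model, not the series' code: a C11 `atomic_flag` stands in for the per-CPU spinlock and spin_trylock(), fixed-size arrays stand in for the rcu sheaf and a cached bnode, and boolean flags record where the kernel would queue an irq_work or the page cache worker. Trying the sheaf before the bnode, and the tiny capacities, are assumptions of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define SHEAF_CAP 4 /* assumed tiny capacities, for illustration only */
#define BNODE_CAP 4

enum free_path { VIA_SHEAF, VIA_BNODE, VIA_FALLBACK, VIA_DEFERRED };

struct sketch_cpu {
	atomic_flag lock;           /* stands in for the per-CPU spinlock */
	bool has_sheaf;             /* does the cache use sheaves? */
	void *sheaf[SHEAF_CAP];     /* stands in for the rcu sheaf */
	int sheaf_nr;
	bool have_cached_bnode;     /* is a cached bnode available? */
	void *bnode[BNODE_CAP];
	int bnode_nr;
	int fallback_nr;            /* objects chained via the embedded field */
	int deferred_nr;            /* objects left for the non-lazy irq_work */
	bool sheaf_submit_deferred; /* kernel: irq_work (defer_call_rcu) */
	bool page_worker_queued;    /* kernel: page cache worker */
	bool nolock_work_queued;    /* kernel: non-lazy irq_work */
};

void sketch_cpu_init(struct sketch_cpu *c, bool has_sheaf, bool cached_bnode)
{
	memset(c, 0, sizeof(*c));
	atomic_flag_clear(&c->lock);
	c->has_sheaf = has_sheaf;
	c->have_cached_bnode = cached_bnode;
}

enum free_path sketch_kfree_rcu_nolock(struct sketch_cpu *c, void *obj)
{
	enum free_path path;

	/* Trylock failed: never spin, hand the object to deferred work. */
	if (atomic_flag_test_and_set(&c->lock)) {
		c->deferred_nr++;
		c->nolock_work_queued = true;
		return VIA_DEFERRED;
	}

	if (c->has_sheaf && c->sheaf_nr < SHEAF_CAP) {
		c->sheaf[c->sheaf_nr++] = obj;
		/* The sheaf just became full: call_rcu() is not safe here. */
		if (c->sheaf_nr == SHEAF_CAP)
			c->sheaf_submit_deferred = true;
		path = VIA_SHEAF;
	} else if (c->have_cached_bnode && c->bnode_nr < BNODE_CAP) {
		c->bnode[c->bnode_nr++] = obj;
		path = VIA_BNODE;
	} else {
		/* Chain the object and let the worker refill bnodes later. */
		c->fallback_nr++;
		c->page_worker_queued = true;
		path = VIA_FALLBACK;
	}

	atomic_flag_clear(&c->lock);
	return path;
}
```

The `deferred_nr` counter glosses over how pending objects are actually stored on the trylock-failure path; whatever holds them must itself be usable without taking a lock.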
When certain debug features (kmemleak, debugobjects) are enabled,
freeing is always deferred because they use spinlocks.
Patch 3 moves code for preparation.
Patch 4 introduces kfree_rcu_nolock().
Patch 5 teaches the rcu sheaf to handle the !allow_spin case.
Patch 6 wraps rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef.
Patch 7 introduces deferred submission of rcu sheaves for the
!allow_spin case when IRQs are disabled.
Patch 8 adds a kunit test case for kfree_rcu_nolock().
Changes since RFC V1 [1]:
- Dropped the kmalloc_nolock() -> kfree[_rcu]() path support
and the objexts_flags cleanup, as they have already landed in mainline.
- Dropped rcu_ptr conversions in mm/ (previous patch 2) and instead
added struct external_name in fs/dcache.c as a user (new patch 2).
- (Fix) Handle kfence addresses correctly using is_kfence_address()
and kfence_object_start().
- Reworked kfree_rcu_nolock() (patch 4):
- When trylock succeeds, now attempts to use cached bnodes
(like the normal 2-argument kvfree_rcu() path) instead of only
inserting into krcp->head.
- Added allow_spin parameter to __schedule_delayed_monitor_work()
and run_page_cache_worker() to defer work submission via
irq_work when spinning is not allowed (Joel).
- (Fix) Introduced defer_kvfree_rcu_barrier() to flush deferred
objects before flushing rcu sheaves, preserving correctness of
kvfree_rcu_barrier().
- (Fix) Moved kvfree_rcu_barrier()/kvfree_rcu_barrier_on_cache()
to slab_common.c for CONFIG_KVFREE_RCU_BATCHED=n, and made them
wait for deferred irq_work even without kvfree_rcu batching.
- Introduced object_start_addr() helper to deduplicate the
start address calculation logic.
- Instead of falling back when the rcu sheaf becomes full,
implemented deferred submission of rcu sheaves using irq_work
(new patch 7) (Vlastimil, Alexei).
- Wrapped rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef
(new patch 6).
- Added a kunit test for kfree_rcu_nolock() (new patch 8).
[1] RFC V1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com
RFC V2 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v2r1
RFC V1 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v1r1
What hasn't changed since RFC v1:
- PREEMPT_RT support for kfree_rcu_sheaf() (Vlastimil): it is worth
addressing and I think it's doable, but it would be too big a change
to include in this series.
- Reducing struct rcu_ptr on !KVFREE_RCU_BATCHED (Vlastimil): I tried,
but I'm still not sure it's worth the complexity for
CONFIG_KVFREE_RCU_BATCHED=n users. Also, this inevitably introduces
some delay in freeing objects, which is against the purpose of
RCU_STRICT_GRACE_PERIOD.
- While writing this cover letter, I realized that I should probably
try to reduce the number of irq_work structures, as pointed out by
Joel (at least down to 2, lazy and non-lazy, instead of 4). I will
explore this in the next version.
Harry Yoo (Oracle) (8):
mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
fs/dcache: use rcu_ptr instead of rcu_head for external names
mm/slab: move kfree_rcu_cpu[_work] definitions
mm/slab: introduce kfree_rcu_nolock()
mm/slab: make kfree_rcu_nolock() work with sheaves
mm/slab: wrap rcu sheaf handling with ifdef
mm/slab: introduce deferred submission of rcu sheaves
lib/tests/slub_kunit: add a test case for kfree_rcu_nolock()
fs/dcache.c | 8 +-
include/linux/rcupdate.h | 64 ++++--
include/linux/slab.h | 16 +-
include/linux/types.h | 9 +
lib/tests/slub_kunit.c | 73 +++++++
mm/slab.h | 8 +-
mm/slab_common.c | 452 +++++++++++++++++++++++++++++----------
mm/slub.c | 47 +++-
8 files changed, 514 insertions(+), 163 deletions(-)
base-commit: 7e0445f673205fd045f3358cacb52b3557627317
--
2.43.0