public inbox for linux-mm@kvack.org
From: "Harry Yoo (Oracle)" <harry@kernel.org>
To: Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Hao Li <hao.li@linux.dev>, Alexei Starovoitov <ast@kernel.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Boqun Feng <boqun@kernel.org>, Zqiang <qiang.zhang@linux.dev>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	rcu@vger.kernel.org, linux-mm@kvack.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>
Subject: [RFC PATCH v2 0/8] kvfree_rcu() improvements
Date: Thu, 16 Apr 2026 18:10:14 +0900
Message-ID: <20260416091022.36823-1-harry@kernel.org>

These are a few improvements for the k[v]free_rcu() API,
suggested by Alexei Starovoitov. The series aims to tackle two problems:

  1) Allow an 8-byte field to be used as an alternative to
     struct rcu_head (16-byte) for 2-argument kvfree_rcu()
     to save memory.

  2) Add a kfree_rcu_nolock() API for unknown contexts.

  "Unknown context" means the caller does not know whether spinning
  on a lock is safe. For example, a BPF program attached to an
  arbitrary kernel function may run while the CPU already holds
  krcp->lock. However, in practice, it's not held most of the time.

# Discussion

Now that we have sheaves for kmalloc caches, most frees go through
the sheaves layer. However, when a sheaf becomes full with !allow_spin,
call_rcu() cannot be called because the context is unknown (e.g., it
might have preempted call_rcu()). There are two possible approaches:

  a) Implement a general call_rcu_nolock() in the RCU subsystem that
     defers call_rcu() when it's not safe.

  b) Handle this as a special case only for rcu sheaf submission
     in mm/slab_common.c, without touching the RCU core.

This series takes approach (b). This is because a general
call_rcu_nolock() would need to flush deferred callbacks before
rcu_barrier() to preserve its guarantee, increasing the cost of
rcu_barrier() for all RCU users, not just kfree_rcu. By keeping the
deferred call_rcu logic in the slab subsystem, only
kvfree_rcu_barrier() pays the extra cost.

One downside of the current approach is that slab uses the condition
`!allow_spin && irqs_disabled()` to determine whether it's safe to
call call_rcu(), which creates a dependency on RCU's implementation
details. I'd like to hear thoughts on this.

# Part 1. Allow an 8-byte field to be used as an alternative to
  struct rcu_head for 2-argument kvfree_rcu()
  (patches 1-2)

Technically, objects that are freed with k[v]free_rcu() need
only one pointer to link objects, because we already know that
the callback function is always kvfree(). For this purpose,
struct rcu_head is unnecessarily large (16 bytes on 64-bit).

Allow a smaller, 8-byte field (of struct rcu_ptr type) to be used
with k[v]free_rcu(). Let's save one pointer per slab object.

I have to admit that my naming skill isn't great; hopefully
we'll come up with a better name than `struct rcu_ptr`.

With this feature, either a struct rcu_ptr or rcu_head field
can be used as the second argument of the k[v]free_rcu() API.
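
To make the saving concrete, here is a minimal userspace sketch. The
mock_* types and RCU_FIELD_KIND() are hypothetical stand-ins: the
one-pointer layout of the smaller field is an assumption based on the
"8-byte" description above, and _Generic is just one way standard C11
can tell the two field types apart at compile time. The series itself
may do this differently.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the shape of struct rcu_head: a link plus a callback. */
struct mock_rcu_head {
	struct mock_rcu_head *next;
	void (*func)(struct mock_rcu_head *head);
};

/* Assumed layout for the smaller field: a single link pointer. */
struct mock_rcu_ptr {
	struct mock_rcu_ptr *next;
};

/* The same hypothetical object, once with each kind of field. */
struct name_with_head {
	struct mock_rcu_head rcu;
	char name[16];
};

struct name_with_ptr {
	struct mock_rcu_ptr rcu;
	char name[16];
};

enum field_kind { FIELD_RCU_PTR = 1, FIELD_RCU_HEAD = 2 };

/* Compile-time dispatch on the type of the named field. */
#define RCU_FIELD_KIND(ptr, field)				\
	_Generic((ptr)->field,					\
		 struct mock_rcu_ptr: FIELD_RCU_PTR,		\
		 struct mock_rcu_head: FIELD_RCU_HEAD)

static_assert(sizeof(struct mock_rcu_ptr) == sizeof(void *),
	      "the smaller field is one pointer wide");

/* Bytes saved per object by switching to the smaller field. */
static size_t bytes_saved_per_object(void)
{
	return sizeof(struct name_with_head) - sizeof(struct name_with_ptr);
}
```

On a typical 64-bit build, bytes_saved_per_object() is one pointer
(8 bytes), which matches the "one pointer per slab object" saving
described above.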

Users that only use k[v]free_rcu() may use struct rcu_ptr to save
memory (if there can be a lot of objects). However, some users,
such as maple tree, may use call_rcu() or k[v]free_rcu() for objects
of the same type. For such users, struct rcu_head remains the only
option.

Patch 1 implements the struct rcu_ptr feature (for
CONFIG_KVFREE_RCU_BATCHED), and patch 2 converts fs/dcache external_name
to use struct rcu_ptr as an example user, saving a pointer per
dynamically allocated external file name.

# Part 2. Add kfree_rcu_nolock() for unknown contexts
  (patches 3-8)

Currently, kfree_rcu() cannot be called when the context is unknown,
which might not allow spinning on a lock. In such a context, even
calling call_rcu() is not legal, forcing users to implement some
sort of deferred freeing. Let's make users' lives easier with
a new kfree_rcu_nolock() variant.

Note that only the 2-argument variant is supported, since there is
not much we can do when both trylock and memory allocation fail.

When spinning on a lock is not allowed, try to acquire the spinlock
using spin_trylock(). When trylock succeeds, do one of the following:

  1) Use the rcu sheaf to free the object. Note that call_rcu() cannot
     be called in an unknown context, because it might have preempted
     call_rcu(). When the rcu sheaf becomes full by freeing the object,
     defer the submission of the full sheaf using irq_work
     (defer_call_rcu).

  2) Use bnode (of struct kvfree_rcu_bulk_data) to store the pointer.
     If trylock succeeded but no cached bnode is available, fall back
     and queue the page cache worker, just like the normal 2-argument
     kvfree_rcu() path.

In rare cases where trylock fails, a non-lazy irq_work is used to
defer calling kvfree_call_rcu().
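
The trylock-or-defer flow above can be sketched in userspace C11 as
follows. This is an illustration under stated assumptions, not the
kernel code: struct batch, free_nolock() and drain_deferred() are
hypothetical names, an atomic flag stands in for krcp->lock, and a
lock-free singly linked list stands in for whatever the irq_work
handler consumes. Sheaves and bnodes are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	struct obj *next;	/* single link pointer */
	int id;
};

struct batch {
	atomic_bool locked;		/* stands in for krcp->lock */
	struct obj *head;		/* objects queued under the lock */
	_Atomic(struct obj *) deferred;	/* lock-free fallback list */
};

static void batch_init(struct batch *b)
{
	atomic_init(&b->locked, false);
	b->head = NULL;
	atomic_init(&b->deferred, (struct obj *)NULL);
}

/* Never spins: returns true only if the lock was free. */
static bool batch_trylock(struct batch *b)
{
	return !atomic_exchange_explicit(&b->locked, true,
					 memory_order_acquire);
}

static void batch_unlock(struct batch *b)
{
	atomic_store_explicit(&b->locked, false, memory_order_release);
}

/* Returns true if queued under the lock, false if deferred. */
static bool free_nolock(struct batch *b, struct obj *o)
{
	struct obj *old;

	assert(o != NULL);
	if (batch_trylock(b)) {
		o->next = b->head;
		b->head = o;
		batch_unlock(b);
		return true;
	}

	/* Lock is held, possibly by the code we interrupted: push lock-free. */
	old = atomic_load_explicit(&b->deferred, memory_order_relaxed);
	do {
		o->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&b->deferred, &old, o,
							memory_order_release,
							memory_order_relaxed));
	return false;
}

/* Runs later from a context where spinning is safe. */
static size_t drain_deferred(struct batch *b)
{
	struct obj *o = atomic_exchange_explicit(&b->deferred,
						 (struct obj *)NULL,
						 memory_order_acquire);
	size_t n = 0;

	while (!batch_trylock(b))
		;	/* spinning is allowed here */
	while (o) {
		struct obj *next = o->next;

		o->next = b->head;
		b->head = o;
		o = next;
		n++;
	}
	batch_unlock(b);
	return n;
}
```

The point of the sketch is that the caller never waits: a failed
trylock costs only a lock-free push, and the deferred objects are moved
to the regular queue later from a safe context.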

When certain debug features (kmemleak, debugobjects) are enabled,
freeing is always deferred because they use spinlocks.

Patch 3 moves code for preparation.
Patch 4 introduces kfree_rcu_nolock().
Patch 5 teaches the rcu sheaf to handle the !allow_spin case.
Patch 6 wraps rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef.
Patch 7 introduces deferred submission of rcu sheaves for the
!allow_spin case when IRQs are disabled.
Patch 8 adds a kunit test case for kfree_rcu_nolock().

Changes since RFC V1 [1]:
  - Dropped the kmalloc_nolock() -> kfree[_rcu]() path support
    and the objexts_flags cleanup as they have already landed in mainline.
  - Dropped rcu_ptr conversions in mm/ (previous patch 2) and instead
    added struct external_name in fs/dcache.c as a user (new patch 2).
  - (Fix) Handle kfence addresses correctly using is_kfence_address()
    and kfence_object_start().
  - Reworked kfree_rcu_nolock() (patch 4):
    - When trylock succeeds, now attempts to use cached bnodes
      (like normal kvfree_rcu 2-arg path) instead of only inserting
      into krcp->head.
    - Added allow_spin parameter to __schedule_delayed_monitor_work()
      and run_page_cache_worker() to defer work submission via
      irq_work when spinning is not allowed (Joel).
    - (Fix) Introduced defer_kvfree_rcu_barrier() to flush deferred
      objects before flushing rcu sheaves, preserving correctness of
      kvfree_rcu_barrier().
    - (Fix) Moved kvfree_rcu_barrier()/kvfree_rcu_barrier_on_cache()
      to slab_common.c on CONFIG_KVFREE_RCU_BATCHED=n, and made them
      wait for deferred irq_works even without kvfree_rcu batching.
    - Introduced object_start_addr() helper to deduplicate the
      start address calculation logic.
  - Instead of falling back when the rcu sheaf becomes full,
    implemented deferred submission of rcu sheaves using irq_work
    (new patch 7) (Vlastimil, Alexei).
  - Wrapped rcu sheaf handling with CONFIG_KVFREE_RCU_BATCHED ifdef
    (new patch 6).
  - Added a kunit test for kfree_rcu_nolock() (new patch 8).

[1] RFC V1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com

RFC V2 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v2r1

RFC V1 branch is available at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kvfree-rcu-improvements-rfc-v1r1

What hasn't changed since RFC V1:

- PREEMPT_RT support for kfree_rcu_sheaf() (Vlastimil): that is worth
  addressing and I think it's doable, but it would be too big a change
  to be part of this series.

- Reducing struct rcu_ptr on !KVFREE_RCU_BATCHED (Vlastimil): I tried,
  but I'm still not sure it's worth the complexity for
  CONFIG_KVFREE_RCU_BATCHED=n users. Also, this inevitably introduces
  some delay in freeing objects which is against the purpose of
  RCU_STRICT_GRACE_PERIOD.

- While writing this cover letter, I realized that I should probably
  try to reduce the number of irq_work structures (pointed out by Joel)
  (at least to 2 for lazy and non-lazy instead of 4). Will explore this
  in the next version.

Harry Yoo (Oracle) (8):
  mm/slab: introduce k[v]free_rcu() with struct rcu_ptr
  fs/dcache: use rcu_ptr instead of rcu_head for external names
  mm/slab: move kfree_rcu_cpu[_work] definitions
  mm/slab: introduce kfree_rcu_nolock()
  mm/slab: make kfree_rcu_nolock() work with sheaves
  mm/slab: wrap rcu sheaf handling with ifdef
  mm/slab: introduce deferred submission of rcu sheaves
  lib/tests/slub_kunit: add a test case for kfree_rcu_nolock()

 fs/dcache.c              |   8 +-
 include/linux/rcupdate.h |  64 ++++--
 include/linux/slab.h     |  16 +-
 include/linux/types.h    |   9 +
 lib/tests/slub_kunit.c   |  73 +++++++
 mm/slab.h                |   8 +-
 mm/slab_common.c         | 452 +++++++++++++++++++++++++++++----------
 mm/slub.c                |  47 +++-
 8 files changed, 514 insertions(+), 163 deletions(-)


base-commit: 7e0445f673205fd045f3358cacb52b3557627317
-- 
2.43.0



