From: Joel Fernandes <joelagnelf@nvidia.com>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Michal Hocko <mhocko@kernel.org>, Hao Li <hao.li@linux.dev>,
Alexei Starovoitov <ast@kernel.org>,
Puranjay Mohan <puranjay@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Amery Hung <ameryhung@gmail.com>,
Catalin Marinas <catalin.marinas@arm.com>,
"Paul E . McKenney" <paulmck@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Josh Triplett <josh@joshtriplett.org>,
Boqun Feng <boqun.feng@gmail.com>,
Uladzislau Rezki <urezki@gmail.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Lai Jiangshan <jiangshanlai@gmail.com>,
Zqiang <qiang.zhang@linux.dev>,
Dave Chinner <david@fromorbit.com>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Muchun Song <muchun.song@linux.dev>,
rcu@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [RFC PATCH 6/7] mm/slab: introduce kfree_rcu_nolock()
Date: Mon, 16 Feb 2026 16:07:55 -0500
Message-ID: <20260216210755.GA1320175@joelbox2>
In-Reply-To: <20260206093410.160622-7-harry.yoo@oracle.com>
Hi Harry,
On Fri, Feb 06, 2026 at 06:34:09PM +0900, Harry Yoo wrote:
> Currently, kfree_rcu() cannot be called in an NMI context.
> In such a context, even calling call_rcu() is not legal,
> forcing users to implement deferred freeing.
>
> Make users' lives easier by introducing kfree_rcu_nolock() variant.
> Unlike kfree_rcu(), kfree_rcu_nolock() only supports a 2-argument
> variant, because, in the worst case where memory allocation fails,
> the caller cannot synchronously wait for the grace period to finish.
>
> Similar to kfree_nolock() implementation, try to acquire kfree_rcu_cpu
> spinlock, and if that fails, insert the object to per-cpu lockless list
> and delay freeing using irq_work that calls kvfree_call_rcu() later.
> In case kmemleak or debugobjects is enabled, always defer freeing as
> those debug features don't support NMI contexts.
>
> When trylock succeeds, avoid consuming bnode and run_page_cache_worker()
> altogether. Instead, insert objects into struct kfree_rcu_cpu.head
> without consuming additional memory.
>
> For now, the sheaves layer is bypassed if spinning is not allowed.
>
> Scheduling delayed monitor work in an NMI context is tricky; use
> irq_work to schedule, but use lazy irq_work to avoid raising self-IPIs.
> That means scheduling delayed monitor work can be delayed up to the
> length of a time slice.
>
> Without CONFIG_KVFREE_RCU_BATCHED, all frees in the !allow_spin case are
> delayed using irq_work.
>
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
> ---
> include/linux/rcupdate.h | 23 ++++---
> mm/slab_common.c | 140 +++++++++++++++++++++++++++++++++------
> 2 files changed, 133 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index db5053a7b0cb..18bb7378b23d 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -1092,8 +1092,9 @@ static inline void rcu_read_unlock_migrate(void)
> * The BUILD_BUG_ON check must not involve any function calls, hence the
> * checks are done in macros here.
> */
> -#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> -#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf)
> +#define kfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
> +#define kfree_rcu_nolock(ptr, rf) kvfree_rcu_arg_2(ptr, rf, false)
> +#define kvfree_rcu(ptr, rf) kvfree_rcu_arg_2(ptr, rf, true)
>
> /**
> * kfree_rcu_mightsleep() - kfree an object after a grace period.
> @@ -1117,35 +1118,35 @@ static inline void rcu_read_unlock_migrate(void)
>
>
> #ifdef CONFIG_KVFREE_RCU_BATCHED
> -void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr);
> -#define kvfree_call_rcu(head, ptr) \
> +void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr, bool allow_spin);
> +#define kvfree_call_rcu(head, ptr, spin) \
> _Generic((head), \
> struct rcu_head *: kvfree_call_rcu_ptr, \
> struct rcu_ptr *: kvfree_call_rcu_ptr, \
> void *: kvfree_call_rcu_ptr \
> - )((struct rcu_ptr *)(head), (ptr))
> + )((struct rcu_ptr *)(head), (ptr), spin)
> #else
> -void kvfree_call_rcu_head(struct rcu_head *head, void *ptr);
> +void kvfree_call_rcu_head(struct rcu_head *head, void *ptr, bool allow_spin);
> static_assert(sizeof(struct rcu_head) == sizeof(struct rcu_ptr));
> -#define kvfree_call_rcu(head, ptr) \
> +#define kvfree_call_rcu(head, ptr, spin) \
> _Generic((head), \
> struct rcu_head *: kvfree_call_rcu_head, \
> struct rcu_ptr *: kvfree_call_rcu_head, \
> void *: kvfree_call_rcu_head \
> - )((struct rcu_head *)(head), (ptr))
> + )((struct rcu_head *)(head), (ptr), spin)
> #endif
>
> /*
> * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
> * comment of kfree_rcu() for details.
> */
> -#define kvfree_rcu_arg_2(ptr, rf) \
> +#define kvfree_rcu_arg_2(ptr, rf, spin) \
> do { \
> typeof (ptr) ___p = (ptr); \
> \
> if (___p) { \
> BUILD_BUG_ON(offsetof(typeof(*(ptr)), rf) >= 4096); \
> - kvfree_call_rcu(&((___p)->rf), (void *) (___p)); \
> + kvfree_call_rcu(&((___p)->rf), (void *) (___p), spin); \
> } \
> } while (0)
>
> @@ -1154,7 +1155,7 @@ do { \
> typeof(ptr) ___p = (ptr); \
> \
> if (___p) \
> - kvfree_call_rcu(NULL, (void *) (___p)); \
> + kvfree_call_rcu(NULL, (void *) (___p), true); \
> } while (0)
>
> /*
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index d232b99a4b52..9d7801e5cb73 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1311,6 +1311,12 @@ struct kfree_rcu_cpu_work {
> * the interactions with the slab allocators.
> */
> struct kfree_rcu_cpu {
> + // Objects queued on a lockless linked list, not protected by the lock.
> + // This allows freeing objects in NMI context, where trylock may fail.
> + struct llist_head llist_head;
> + struct irq_work irq_work;
> + struct irq_work sched_monitor_irq_work;
It would be great if irq_work_queue() could take a lazy flag, or if there were
a new irq_work_queue_lazy() that simply skips irq_work_raise() in the lazy
case. Then we would not need multiple struct irq_work instances doing the same
thing. +PeterZ
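To make the idea concrete, here is a rough user-space model, not kernel code.
Every mock_* name is made up for illustration, and the lists and counter only
stand in for the per-CPU irq_work lists and irq_work_raise(). It also ignores
that, as far as I recall, the real lazy path still raises when the tick is
stopped.

```c
/* User-space model only: mock_* names are made up, nothing here is kernel API. */
#include <stdbool.h>
#include <stddef.h>

struct mock_irq_work {
	struct mock_irq_work *next;
	bool pending;
};

static struct mock_irq_work *raised_list;	/* would run from the self-IPI */
static struct mock_irq_work *lazy_list;		/* would run from the next tick */
static int ipis_raised;				/* counts stand-in irq_work_raise() calls */

/* Hypothetical single entry point: laziness is chosen per call, not per struct. */
static bool mock_irq_work_queue(struct mock_irq_work *work, bool lazy)
{
	struct mock_irq_work **list = lazy ? &lazy_list : &raised_list;
	bool was_empty = (*list == NULL);

	if (work->pending)
		return false;		/* already queued somewhere */
	work->pending = true;

	work->next = *list;
	*list = work;

	/* Only the non-lazy path interrupts, and only for the first entry. */
	if (!lazy && was_empty)
		ipis_raised++;
	return true;
}
```

With something along these lines, a single struct irq_work could serve both the
immediate and the lazy use, with the caller choosing at queue time.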
[...]
> @@ -1979,9 +2059,15 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
> }
>
> kasan_record_aux_stack(ptr);
> - success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head);
> +
> + krcp = krc_this_cpu_lock(&flags, allow_spin);
> + if (!krcp)
> + goto defer_free;
> +
> + success = add_ptr_to_bulk_krc_lock(krcp, &flags, ptr, !head, allow_spin);
> if (!success) {
> - run_page_cache_worker(krcp);
> + if (allow_spin)
> + run_page_cache_worker(krcp);
>
> if (head == NULL)
> // Inline if kvfree_rcu(one_arg) call.
> @@ -2005,8 +2091,12 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
> kmemleak_ignore(ptr);
>
> // Set timer to drain after KFREE_DRAIN_JIFFIES.
> - if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING)
> - __schedule_delayed_monitor_work(krcp);
> + if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING) {
> + if (allow_spin)
> + __schedule_delayed_monitor_work(krcp);
> + else
> + irq_work_queue(&krcp->sched_monitor_irq_work);
Will this irq_work be queued here even if delayed_work_pending() is already
true? If so, that is extra irq_work overhead that is not needed when the
delayed monitor work is already queued.
If delayed_work_pending() is safe to call from NMI, you could call it here
to avoid the unnecessary irq_work queueing. But do double-check that it is.
Also, per [1], I gather that !allow_spin does not always imply NMI. If that is
true, is it better to call in_nmi() instead of relying on allow_spin?
[1] https://lore.kernel.org/all/CAADnVQKk_Bgi0bc-td_3pVpHYXR3CpC3R8rg-NHwdLEDiQSeNg@mail.gmail.com/
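Roughly what I have in mind for the pending check, again as a user-space model
with stubs rather than a proposed diff. The mock_* names are hypothetical, the
monitor_pending flag stands in for delayed_work_pending() on the monitor work,
and the model assumes that check is NMI-safe, which still needs to be verified.

```c
/* User-space model only: mock_* names are made up, nothing here is kernel API. */
#include <stdbool.h>

struct mock_krc {
	bool monitor_pending;	/* stand-in for delayed_work_pending() */
	int irq_work_queued;	/* counts stand-in irq_work_queue() calls */
	int monitor_scheduled;	/* counts direct delayed-work scheduling */
};

static void mock_schedule_monitor(struct mock_krc *krcp, bool allow_spin)
{
	/* Monitor work already queued: nothing to schedule, no irq_work needed. */
	if (krcp->monitor_pending)
		return;

	if (allow_spin) {
		/* Spinning allowed: schedule the delayed work directly. */
		krcp->monitor_scheduled++;
		krcp->monitor_pending = true;
		return;
	}

	/* Cannot spin: bounce through irq_work, which schedules it later. */
	krcp->irq_work_queued++;
}
```

The allow_spin parameter could equally be replaced by an in_nmi() style check
if that turns out to be the more accurate condition.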
Thanks,
--
Joel Fernandes
> + }
>
> unlock_return:
> krc_this_cpu_unlock(krcp, flags);
> @@ -2017,10 +2107,22 @@ void kvfree_call_rcu_ptr(struct rcu_ptr *head, void *ptr)
> * CPU can pass the QS state.
> */
> if (!success) {
> + VM_WARN_ON_ONCE(!allow_spin);
> debug_rcu_head_unqueue((struct rcu_head *) ptr);
> synchronize_rcu();
> kvfree(ptr);
> }
> + return;
> +
> +defer_free:
> + VM_WARN_ON_ONCE(allow_spin);
> + guard(preempt)();
> +
> + krcp = this_cpu_ptr(&krc);
> + if (llist_add((struct llist_node *)head, &krcp->llist_head))
> + irq_work_queue(&krcp->irq_work);
> + return;
> +
> }
> EXPORT_SYMBOL_GPL(kvfree_call_rcu_ptr);
>
> --
> 2.43.0
>