From: Harry Yoo <harry@kernel.org>
To: Uladzislau Rezki <urezki@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@kernel.org>,
Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Hao Li <hao.li@linux.dev>, Alexei Starovoitov <ast@kernel.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Josh Triplett <josh@joshtriplett.org>,
Boqun Feng <boqun@kernel.org>, Zqiang <qiang.zhang@linux.dev>,
Steven Rostedt <rostedt@goodmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Lai Jiangshan <jiangshanlai@gmail.com>,
rcu@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock()
Date: Tue, 19 May 2026 16:44:30 +0900 [thread overview]
Message-ID: <82d2145a-9b41-4ee4-b980-e7bd5d12f035@kernel.org> (raw)
In-Reply-To: <afNG0jNQNYeZ940g@pc636>
[Resending as it's rejected by mailing lists due to my broken email
setup. Apologies for the noise.]
*shows up late again after LSFMM and processing some backlog*
On 4/30/26 9:10 PM, Uladzislau Rezki wrote:
> Hello, Harry!
>
>>
>> Hi Ulad. Apologies for the delayed response.
>> I meant to reply sooner but sidetracked by other issues.
>>
> No problem, sometimes i also can lag because of other tasks :)
>
>> Your questions are fair, but let me try to clarify
>> the current situation.
>>
>> And before diving into details, I would like to reiterate that
>> there are potentially two points to discuss here:
>>
>> Point 1. Can we justify complicating subsystems by passing
>> `allow_spin` parameter all over the place?
>>
> Yes, we can. But as i noted i see some drawbacks :)
>
> - all new incoming patches have to respect that new third argument;
That is true :)
> - the fallback mechanism which uses irq-work is not optimal in my
> opinion:
In most cases it would not fall back because most likely trylock would
succeed. If most of the calls do not fall back, a bit of suboptimality
on the fallback path is acceptable.
> a) We introduce an extra window between queuing a pointer, mark
> irq-work to be executed and then reenter the kfree_rcu() with
> no-sync flag and now we need to wait a GP for them. But the GP
> might be already passed for such pointers. So we potentially
> need more time to offload. This is rather minus.
>
> b) Since it is for BPF, allow_spin is always false, thus only
> fallback path is used. Decoupling comes to mind.
No, allow_spin == true means spinning on a lock is safe. If allow_spin
is false, it would do a trylock instead of spinning, and it is expected
to succeed most of the time. As long as trylock succeeds, it uses the
same data structures as the existing kvfree_rcu batching without fallback.
>
> c)
>
> Why should we mix those? What it is worth to do, is to prevent mixing
> "unknown path which is for BPF/others" with generic kfree_rcu().
Because we want to reuse the existing kvfree_rcu batching infrastructure
without reinventing a new feature to do the same thing.
The intent is to avoid the fallback in most cases when allow_spin is
false, with fallback being there for correctness.
>> Point 2. Can we avoid adding this complexity to kvfree_rcu() and
>> let slab handle it instead? (as mentioned in [4])
>>
> it depends if BPF people want to free a pointer using RCU machinery?
> Do you know if that an intention?
They want to free slab objects after RCU grace period. Freeing slab
objects without involving RCU is already supported by kfree_nolock().
(There are other use cases as well, as recently posted in [1])
I meant RCU sheaves can handle freeing slab objects after RCU grace
period, and kfree_rcu_nolock() users don't need to handle vmalloc pages.
So technically we don't have to add this complexity to kvfree_rcu
batching and handle it in slab.
But to do that, we shouldn't disable kfree_rcu_sheaf() completely on RT.
Apparently Vlastimil has a suggestion to address this, and I'm going to
digest his suggestion and explore that aspect.
[1]
https://lore.kernel.org/linux-mm/esepccfhqg7m6jo76ns2znj2cnuaepx2xvw5zaygtwohq4psma@563ypprp6rr3
[2]
https://lore.kernel.org/linux-mm/6811cc17-8ee4-48c8-8cbf-6bf4d9f98162@kernel.org
>> On Point 1: IMHO it could be justified, but at the same time I hope we
>> end up avoiding more complexity in the long term by working on Point 2.
>>
>> This reply focuses only on Point 1 and explains why it could be
>> justified.
>>
>> On Thu, Apr 23, 2026 at 01:35:25PM +0200, Uladzislau Rezki wrote:
>>> On Thu, Apr 23, 2026 at 01:23:25PM +0900, Harry Yoo (Oracle) wrote:
>>>> On Wed, Apr 22, 2026 at 04:42:28PM +0200, Uladzislau Rezki wrote:
>>>> How much performance do we sacrifice compared to
>>>> letting them go through the kvfree_rcu() fastpath?
>>>
>>> Freeing an object over RCU from
>>> NMI context is a corner case. It is __not_ generic.
>>
>> First, I want to clarify that kfree_rcu_nolock() is not just for NMI
>> context. It is intended to be used when the context is unknown (because
>> it can be called in an arbitrary code locations).
>>
> When we say "unknown" to me it sounds like a worst case, which is NMI :)
If we say "allow_spin = false assumes the most restrictive context, such
as NMI context", that is misleading. It sounds like we always fall back,
but we don't. Even when the context is unknown, fallback isn't required
most of the time.
So I would like to say "the context is unknown", meaning that
technically kvfree_rcu could be re-entered in the middle of kvfree_rcu
and we need to be able to handle that for correctness (although in most
cases there's no re-entry and no fallback).
>> There are two kinds of problematic situations where BPF programs
>> are attached to:
>>
>> - 1) a tracepoint or a function that can be invoked in a critical
>> section (w/ a lock held), or
>>
>> - 2) a function that can be called in an NMI context, which might
>> preempt an arbitrary context holding a lock.
>>
>> While 1) and 2) are not (I think) dominant use cases, and although
>> most of users can legally call kvfree_rcu(), BPF can't use kvfree_rcu()
>> and must consider the most restrictive contexts.
>>
>>> We even do not have(now
>>> in mainline) users because we never support it from NMI,
>>> just like call_rcu().
>>
>> Unfortunately, we've had this use case (of allocating memory for BPF
>> programs) for a long time in the mainline. There are two current
>> approaches to mitigate the limitation:
>>
>> - 1) Pre-allocate all memory. e.g.) allocate all hash table elements
>> when creating a BPF map, rather than allocating them on demand.
>> This ensures correctness but sacrifices memory.
>>
>> - 2) Use the BPF-specific memory allocator [1] [2] to allocate memory
>> on demand and avoid preallocation. While this wastes less memory
>> than 1) and also maintains performance, it is re-inventing yet
>> another memory allocator.
>>
>> Also, the allocator reinvented kfree_rcu batching as well.
>>
>> Now, we're trying to avoid 1) and 2) as much as possible and use
>> kmalloc_nolock() instead [3].
>>
>>> If BPF needs
>>> it, then the first question which comes to mind is not about performance.
>>> It is how to support this case in kfree_rcu() without adding noticeable
>>> complexity or overhead or hacks to the generic path without making it harder
>>> to maintain.
>>
>> Since there will be only few subsystems that needs it, and because
>> they already use it on production systems, I don't see much value in
>> maintaining a simple implementation if that compromises performance
>> (and thus make the transition harder).
>>
>>> Performance wise you noted, you mean:
>>>
>>> a) call latency(this is probably the most important for NMI)?
>>> b) memory footprint?
>>> c) pointer-chasing overhead?
>>
>> I think it's either
>>
>> - The performance of kfree_rcu_nolock() itself (a), or
>> - Not distrubing workloads running on the machine (b and c)
>>
>> depending on what people use BPF for.
>>
> Are you aware of any specific workloads which we can run? To test
> and see what we have when it comes to performance metrics? I mean
> exact uses cases with exact steps who to trigger them?
>
> That would be useful to see on behaviour.
I'll share once I find what BPF folks are using for performance
benchmarks. (Which means I'm not aware at the moment :D)
[+Cc bpf@vger.kernel.org]
Thanks!
--
Cheers,
Harry / Hyeonggon
next parent reply other threads:[~2026-05-19 7:44 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260416091022.36823-1-harry@kernel.org>
[not found] ` <20260416091022.36823-5-harry@kernel.org>
[not found] ` <aejeVK0J_jHSfVhD@milan>
[not found] ` <aemevSofweaUSx0n@hyeyoo>
[not found] ` <aeoD_Ts6hLsgNc9-@pc636>
[not found] ` <3s4jafam3la72a6y3dkfvhtzxk3fsngb2cka3bpfqrirl5m633@pz3vzizefoxb>
[not found] ` <afNG0jNQNYeZ940g@pc636>
2026-05-19 7:44 ` Harry Yoo [this message]
2026-05-20 10:01 ` [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock() Uladzislau Rezki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=82d2145a-9b41-4ee4-b980-e7bd5d12f035@kernel.org \
--to=harry@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=ast@kernel.org \
--cc=boqun@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=cl@gentwo.org \
--cc=frederic@kernel.org \
--cc=hao.li@linux.dev \
--cc=jiangshanlai@gmail.com \
--cc=joelagnelf@nvidia.com \
--cc=josh@joshtriplett.org \
--cc=linux-mm@kvack.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=neeraj.upadhyay@kernel.org \
--cc=paulmck@kernel.org \
--cc=qiang.zhang@linux.dev \
--cc=rcu@vger.kernel.org \
--cc=rientjes@google.com \
--cc=roman.gushchin@linux.dev \
--cc=rostedt@goodmis.org \
--cc=urezki@gmail.com \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox