From: Uladzislau Rezki <urezki@gmail.com>
To: Harry Yoo <harry@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@kernel.org>,
Christoph Lameter <cl@gentwo.org>,
David Rientjes <rientjes@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Hao Li <hao.li@linux.dev>, Alexei Starovoitov <ast@kernel.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Frederic Weisbecker <frederic@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
Joel Fernandes <joelagnelf@nvidia.com>,
Josh Triplett <josh@joshtriplett.org>,
Boqun Feng <boqun@kernel.org>, Zqiang <qiang.zhang@linux.dev>,
Steven Rostedt <rostedt@goodmis.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Lai Jiangshan <jiangshanlai@gmail.com>,
rcu@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock()
Date: Wed, 20 May 2026 12:01:44 +0200 [thread overview]
Message-ID: <ag2GiLJSA2Rguxf3@milan> (raw)
In-Reply-To: <82d2145a-9b41-4ee4-b980-e7bd5d12f035@kernel.org>
On Tue, May 19, 2026 at 04:44:30PM +0900, Harry Yoo wrote:
> [Resending as it's rejected by mailing lists due to my broken email
> setup. Apologies for the noise.]
>
> *shows up late again after LSFMM and processing some backlog*
>
> On 4/30/26 9:10 PM, Uladzislau Rezki wrote:
> > Hello, Harry!
> >
> > >
> > > Hi Ulad. Apologies for the delayed response.
> > > I meant to reply sooner but sidetracked by other issues.
> > >
> > No problem, sometimes i also can lag because of other tasks :)
> >
> > > Your questions are fair, but let me try to clarify
> > > the current situation.
> > >
> > > And before diving into details, I would like to reiterate that
> > > there are potentially two points to discuss here:
> > >
> > > Point 1. Can we justify complicating subsystems by passing
> > > `allow_spin` parameter all over the place?
> > >
> > Yes, we can. But as i noted i see some drawbacks :)
> >
> > - all new incoming patches have to respect that new third argument;
>
> That is true :)
>
This is i would like to avoid :)
> > - the fallback mechanism which uses irq-work is not optimal in my
> > opinion:
>
> In most cases it would not fall back because most likely trylock would
> succeed. If most of the calls do not fall back, a bit of suboptimality
> on the fallback path is acceptable.
>
> > a) We introduce an extra window between queuing a pointer, mark
> > irq-work to be executed and then reenter the kfree_rcu() with
> > no-sync flag and now we need to wait a GP for them. But the GP
> > might be already passed for such pointers. So we potentially
> > need more time to offload. This is rather minus.
> >
> > b) Since it is for BPF, allow_spin is always false, thus only
> > fallback path is used. Decoupling comes to mind.
>
> No, allow_spin == true means spinning on a lock is safe. If allow_spin
> is false, it would do a trylock instead of spinning, and it is expected
> to succeed most of the time. As long as trylock succeeds, it uses the
> same data structures as the existing kvfree_rcu batching without fallback.
>
> >
> > c)
> >
> > Why should we mix those? What it is worth to do, is to prevent mixing
> > "unknown path which is for BPF/others" with generic kfree_rcu().
>
> Because we want to reuse the existing kvfree_rcu batching infrastructure
> without reinventing a new feature to do the same thing.
>
> The intent is to avoid the fallback in most cases when allow_spin is
> false, with fallback being there for correctness.
>
The problem is that, it is a random behaviour with trylocking, i.e.
it is not deterministic. If you apply some noise you end up in kicking
two paths anyway.
If the idea is to reuse "existing kvfree_rcu batching" you need to
access the array in lock-free way. If you can do that, i would agree
with it.
>
> > > Point 2. Can we avoid adding this complexity to kvfree_rcu() and
> > > let slab handle it instead? (as mentioned in [4])
> > >
> > it depends if BPF people want to free a pointer using RCU machinery?
> > Do you know if that an intention?
>
> They want to free slab objects after RCU grace period. Freeing slab
> objects without involving RCU is already supported by kfree_nolock().
> (There are other use cases as well, as recently posted in [1])
>
> I meant RCU sheaves can handle freeing slab objects after RCU grace
> period, and kfree_rcu_nolock() users don't need to handle vmalloc pages.
> So technically we don't have to add this complexity to kvfree_rcu
> batching and handle it in slab.
>
> But to do that, we shouldn't disable kfree_rcu_sheaf() completely on RT.
> Apparently Vlastimil has a suggestion to address this, and I'm going to
> digest his suggestion and explore that aspect.
>
> [1] https://lore.kernel.org/linux-mm/esepccfhqg7m6jo76ns2znj2cnuaepx2xvw5zaygtwohq4psma@563ypprp6rr3
>
> [2] https://lore.kernel.org/linux-mm/6811cc17-8ee4-48c8-8cbf-6bf4d9f98162@kernel.org
>
> > > On Point 1: IMHO it could be justified, but at the same time I hope we
> > > end up avoiding more complexity in the long term by working on Point 2.
> > >
> > > This reply focuses only on Point 1 and explains why it could be
> > > justified.
> > >
> > > On Thu, Apr 23, 2026 at 01:35:25PM +0200, Uladzislau Rezki wrote:
> > > > On Thu, Apr 23, 2026 at 01:23:25PM +0900, Harry Yoo (Oracle) wrote:
> > > > > On Wed, Apr 22, 2026 at 04:42:28PM +0200, Uladzislau Rezki wrote:
> > > > > How much performance do we sacrifice compared to
> > > > > letting them go through the kvfree_rcu() fastpath?
> > > >
> > > > Freeing an object over RCU from
> > > > NMI context is a corner case. It is __not_ generic.
> > >
> > > First, I want to clarify that kfree_rcu_nolock() is not just for NMI
> > > context. It is intended to be used when the context is unknown (because
> > > it can be called in an arbitrary code locations).
> > >
> > When we say "unknown" to me it sounds like a worst case, which is NMI :)
>
> If we say "allow_spin = false assumes the most restrictive context, such
> as NMI context", that is misleading. It sounds like we always fall back,
> but we don't. Even when the context is unknown, fallback isn't required
> most of the time.
>
> So I would like to say "the context is unknown", meaning that
> technically kvfree_rcu could be re-entered in the middle of kvfree_rcu
> and we need to be able to handle that for correctness (although in most
> cases there's no re-entry and no fallback).
>
> > > There are two kinds of problematic situations where BPF programs
> > > are attached to:
> > >
> > > - 1) a tracepoint or a function that can be invoked in a critical
> > > section (w/ a lock held), or
> > >
> > > - 2) a function that can be called in an NMI context, which might
> > > preempt an arbitrary context holding a lock.
> > >
> > > While 1) and 2) are not (I think) dominant use cases, and although
> > > most of users can legally call kvfree_rcu(), BPF can't use kvfree_rcu()
> > > and must consider the most restrictive contexts.
> > >
> > > > We even do not have(now
> > > > in mainline) users because we never support it from NMI,
> > > > just like call_rcu().
> > >
> > > Unfortunately, we've had this use case (of allocating memory for BPF
> > > programs) for a long time in the mainline. There are two current
> > > approaches to mitigate the limitation:
> > >
> > > - 1) Pre-allocate all memory. e.g.) allocate all hash table elements
> > > when creating a BPF map, rather than allocating them on demand.
> > > This ensures correctness but sacrifices memory.
> > >
> > > - 2) Use the BPF-specific memory allocator [1] [2] to allocate memory
> > > on demand and avoid preallocation. While this wastes less memory
> > > than 1) and also maintains performance, it is re-inventing yet
> > > another memory allocator.
> > >
> > > Also, the allocator reinvented kfree_rcu batching as well.
> > >
> > > Now, we're trying to avoid 1) and 2) as much as possible and use
> > > kmalloc_nolock() instead [3].
> > >
> > > > If BPF needs
> > > > it, then the first question which comes to mind is not about performance.
> > > > It is how to support this case in kfree_rcu() without adding noticeable
> > > > complexity or overhead or hacks to the generic path without making it harder
> > > > to maintain.
> > >
> > > Since there will be only few subsystems that needs it, and because
> > > they already use it on production systems, I don't see much value in
> > > maintaining a simple implementation if that compromises performance
> > > (and thus make the transition harder).
> > >
> > > > Performance wise you noted, you mean:
> > > >
> > > > a) call latency(this is probably the most important for NMI)?
> > > > b) memory footprint?
> > > > c) pointer-chasing overhead?
> > >
> > > I think it's either
> > >
> > > - The performance of kfree_rcu_nolock() itself (a), or
> > > - Not distrubing workloads running on the machine (b and c)
> > >
> > > depending on what people use BPF for.
> > >
> > Are you aware of any specific workloads which we can run? To test
> > and see what we have when it comes to performance metrics? I mean
> > exact uses cases with exact steps who to trigger them?
> >
> > That would be useful to see on behaviour.
>
> I'll share once I find what BPF folks are using for performance
> benchmarks. (Which means I'm not aware at the moment :D)
>
Please share.
--
Uladzislau Rezki
prev parent reply other threads:[~2026-05-20 10:01 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260416091022.36823-1-harry@kernel.org>
[not found] ` <20260416091022.36823-5-harry@kernel.org>
[not found] ` <aejeVK0J_jHSfVhD@milan>
[not found] ` <aemevSofweaUSx0n@hyeyoo>
[not found] ` <aeoD_Ts6hLsgNc9-@pc636>
[not found] ` <3s4jafam3la72a6y3dkfvhtzxk3fsngb2cka3bpfqrirl5m633@pz3vzizefoxb>
[not found] ` <afNG0jNQNYeZ940g@pc636>
2026-05-19 7:44 ` [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
2026-05-20 10:01 ` Uladzislau Rezki [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ag2GiLJSA2Rguxf3@milan \
--to=urezki@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=ast@kernel.org \
--cc=boqun@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=cl@gentwo.org \
--cc=frederic@kernel.org \
--cc=hao.li@linux.dev \
--cc=harry@kernel.org \
--cc=jiangshanlai@gmail.com \
--cc=joelagnelf@nvidia.com \
--cc=josh@joshtriplett.org \
--cc=linux-mm@kvack.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=neeraj.upadhyay@kernel.org \
--cc=paulmck@kernel.org \
--cc=qiang.zhang@linux.dev \
--cc=rcu@vger.kernel.org \
--cc=rientjes@google.com \
--cc=roman.gushchin@linux.dev \
--cc=rostedt@goodmis.org \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox