BPF List
 help / color / mirror / Atom feed
From: Uladzislau Rezki <urezki@gmail.com>
To: Harry Yoo <harry@kernel.org>
Cc: Uladzislau Rezki <urezki@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	Christoph Lameter <cl@gentwo.org>,
	David Rientjes <rientjes@google.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Hao Li <hao.li@linux.dev>, Alexei Starovoitov <ast@kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	Josh Triplett <josh@joshtriplett.org>,
	Boqun Feng <boqun@kernel.org>, Zqiang <qiang.zhang@linux.dev>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	rcu@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock()
Date: Wed, 20 May 2026 12:01:44 +0200	[thread overview]
Message-ID: <ag2GiLJSA2Rguxf3@milan> (raw)
In-Reply-To: <82d2145a-9b41-4ee4-b980-e7bd5d12f035@kernel.org>

On Tue, May 19, 2026 at 04:44:30PM +0900, Harry Yoo wrote:
> [Resending as it's rejected by mailing lists due to my broken email
>  setup. Apologies for the noise.]
> 
> *shows up late again after LSFMM and processing some backlog*
> 
> On 4/30/26 9:10 PM, Uladzislau Rezki wrote:
> > Hello, Harry!
> > 
> > > 
> > > Hi Ulad. Apologies for the delayed response.
> > > I meant to reply sooner but sidetracked by other issues.
> > > 
> > No problem, sometimes i also can lag because of other tasks :)
> > 
> > > Your questions are fair, but let me try to clarify
> > > the current situation.
> > > 
> > > And before diving into details, I would like to reiterate that
> > > there are potentially two points to discuss here:
> > > 
> > > Point 1. Can we justify complicating subsystems by passing
> > > 	 `allow_spin` parameter all over the place?
> > > 
> > Yes, we can. But as i noted i see some drawbacks :)
> > 
> > - all new incoming patches have to respect that new third argument;
> 
> That is true :)
> 
This is i would like to avoid :)

> > - the fallback mechanism which uses irq-work is not optimal in my
> >    opinion:
> 
> In most cases it would not fall back because most likely trylock would
> succeed. If most of the calls do not fall back, a bit of suboptimality
> on the fallback path is acceptable.
> 
> >      a) We introduce an extra window between queuing a pointer, mark
> >         irq-work to be executed and then reenter the kfree_rcu() with
> >         no-sync flag and now we need to wait a GP for them. But the GP
> >         might be already passed for such pointers. So we potentially
> >         need more time to offload. This is rather minus.
> > 
> >      b) Since it is for BPF, allow_spin is always false, thus only
> >         fallback path is used. Decoupling comes to mind.
> 
> No, allow_spin == true means spinning on a lock is safe. If allow_spin
> is false, it would do a trylock instead of spinning, and it is expected
> to succeed most of the time. As long as trylock succeeds, it uses the
> same data structures as the existing kvfree_rcu batching without fallback.
> 
> > 
> >      c)
> > 
> > Why should we mix those? What it is worth to do, is to prevent mixing
> > "unknown path which is for BPF/others" with generic kfree_rcu().
> 
> Because we want to reuse the existing kvfree_rcu batching infrastructure
> without reinventing a new feature to do the same thing.
> 
> The intent is to avoid the fallback in most cases when allow_spin is
> false, with fallback being there for correctness.
>
The problem is that, it is a random behaviour with trylocking, i.e.
it is not deterministic. If you apply some noise you end up in kicking
two paths anyway.

If the idea is to reuse "existing kvfree_rcu batching" you need to
access the array in lock-free way. If you can do that, i would agree
with it.

> 
> > > Point 2. Can we avoid adding this complexity to kvfree_rcu() and
> > > 	 let slab handle it instead? (as mentioned in [4])
> > > 
> > it depends if BPF people want to free a pointer using RCU machinery?
> > Do you know if that an intention?
> 
> They want to free slab objects after RCU grace period. Freeing slab
> objects without involving RCU is already supported by kfree_nolock().
> (There are other use cases as well, as recently posted in [1])
> 
> I meant RCU sheaves can handle freeing slab objects after RCU grace
> period, and kfree_rcu_nolock() users don't need to handle vmalloc pages.
> So technically we don't have to add this complexity to kvfree_rcu
> batching and handle it in slab.
> 
> But to do that, we shouldn't disable kfree_rcu_sheaf() completely on RT.
> Apparently Vlastimil has a suggestion to address this, and I'm going to
> digest his suggestion and explore that aspect.
> 
> [1] https://lore.kernel.org/linux-mm/esepccfhqg7m6jo76ns2znj2cnuaepx2xvw5zaygtwohq4psma@563ypprp6rr3
> 
> [2] https://lore.kernel.org/linux-mm/6811cc17-8ee4-48c8-8cbf-6bf4d9f98162@kernel.org
> 
> > > On Point 1: IMHO it could be justified, but at the same time I hope we
> > > end up avoiding more complexity in the long term by working on Point 2.
> > > 
> > > This reply focuses only on Point 1 and explains why it could be
> > > justified.
> > > 
> > > On Thu, Apr 23, 2026 at 01:35:25PM +0200, Uladzislau Rezki wrote:
> > > > On Thu, Apr 23, 2026 at 01:23:25PM +0900, Harry Yoo (Oracle) wrote:
> > > > > On Wed, Apr 22, 2026 at 04:42:28PM +0200, Uladzislau Rezki wrote:
> > > > > How much performance do we sacrifice compared to
> > > > > letting them go through the kvfree_rcu() fastpath?
> > > > 
> > > > Freeing an object over RCU from
> > > > NMI context is a corner case. It is __not_ generic.
> > > 
> > > First, I want to clarify that kfree_rcu_nolock() is not just for NMI
> > > context. It is intended to be used when the context is unknown (because
> > > it can be called in an arbitrary code locations).
> > > 
> > When we say "unknown" to me it sounds like a worst case, which is NMI :)
> 
> If we say "allow_spin = false assumes the most restrictive context, such
> as NMI context", that is misleading. It sounds like we always fall back,
> but we don't. Even when the context is unknown, fallback isn't required
> most of the time.
> 
> So I would like to say "the context is unknown", meaning that
> technically kvfree_rcu could be re-entered in the middle of kvfree_rcu
> and we need to be able to handle that for correctness (although in most
> cases there's no re-entry and no fallback).
> 
> > > There are two kinds of problematic situations where BPF programs
> > > are attached to:
> > > 
> > >    - 1) a tracepoint or a function that can be invoked in a critical
> > >         section (w/ a lock held), or
> > > 
> > >    - 2) a function that can be called in an NMI context, which might
> > >         preempt an arbitrary context holding a lock.
> > > 
> > > While 1) and 2) are not (I think) dominant use cases, and although
> > > most of users can legally call kvfree_rcu(), BPF can't use kvfree_rcu()
> > > and must consider the most restrictive contexts.
> > > 
> > > > We even do not have(now
> > > > in mainline) users because we never support it from NMI,
> > > > just like call_rcu().
> > > 
> > > Unfortunately, we've had this use case (of allocating memory for BPF
> > > programs) for a long time in the mainline. There are two current
> > > approaches to mitigate the limitation:
> > > 
> > >    - 1) Pre-allocate all memory. e.g.) allocate all hash table elements
> > >         when creating a BPF map, rather than allocating them on demand.
> > >         This ensures correctness but sacrifices memory.
> > > 
> > >    - 2) Use the BPF-specific memory allocator [1] [2] to allocate memory
> > >         on demand and avoid preallocation. While this wastes less memory
> > >         than 1) and also maintains performance, it is re-inventing yet
> > >         another memory allocator.
> > > 
> > >         Also, the allocator reinvented kfree_rcu batching as well.
> > > 
> > > Now, we're trying to avoid 1) and 2) as much as possible and use
> > > kmalloc_nolock() instead [3].
> > > 
> > > > If BPF needs
> > > > it, then the first question which comes to mind is not about performance.
> > > > It is how to support this case in kfree_rcu() without adding noticeable
> > > > complexity or overhead or hacks to the generic path without making it harder
> > > > to maintain.
> > > 
> > > Since there will be only few subsystems that needs it, and because
> > > they already use it on production systems, I don't see much value in
> > > maintaining a simple implementation if that compromises performance
> > > (and thus make the transition harder).
> > > 
> > > > Performance wise you noted, you mean:
> > > > 
> > > > a) call latency(this is probably the most important for NMI)?
> > > > b) memory footprint?
> > > > c) pointer-chasing overhead?
> > > 
> > > I think it's either
> > > 
> > > - The performance of kfree_rcu_nolock() itself (a), or
> > > - Not distrubing workloads running on the machine (b and c)
> > > 
> > > depending on what people use BPF for.
> > > 
> > Are you aware of any specific workloads which we can run? To test
> > and see what we have when it comes to performance metrics? I mean
> > exact uses cases with exact steps who to trigger them?
> > 
> > That would be useful to see on behaviour.
> 
> I'll share once I find what BPF folks are using for performance
> benchmarks. (Which means I'm not aware at the moment :D)
> 
Please share.

--
Uladzislau Rezki

      reply	other threads:[~2026-05-20 10:01 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20260416091022.36823-1-harry@kernel.org>
     [not found] ` <20260416091022.36823-5-harry@kernel.org>
     [not found]   ` <aejeVK0J_jHSfVhD@milan>
     [not found]     ` <aemevSofweaUSx0n@hyeyoo>
     [not found]       ` <aeoD_Ts6hLsgNc9-@pc636>
     [not found]         ` <3s4jafam3la72a6y3dkfvhtzxk3fsngb2cka3bpfqrirl5m633@pz3vzizefoxb>
     [not found]           ` <afNG0jNQNYeZ940g@pc636>
2026-05-19  7:44             ` [PATCH 4/8] mm/slab: introduce kfree_rcu_nolock() Harry Yoo
2026-05-20 10:01               ` Uladzislau Rezki [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ag2GiLJSA2Rguxf3@milan \
    --to=urezki@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=ast@kernel.org \
    --cc=boqun@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=cl@gentwo.org \
    --cc=frederic@kernel.org \
    --cc=hao.li@linux.dev \
    --cc=harry@kernel.org \
    --cc=jiangshanlai@gmail.com \
    --cc=joelagnelf@nvidia.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-mm@kvack.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=neeraj.upadhyay@kernel.org \
    --cc=paulmck@kernel.org \
    --cc=qiang.zhang@linux.dev \
    --cc=rcu@vger.kernel.org \
    --cc=rientjes@google.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rostedt@goodmis.org \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox