Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI

BPF List
 help / color / mirror / Atom feed

From: "Alexei Starovoitov" <alexei.starovoitov@gmail.com>
To: "Kumar Kartikeya Dwivedi" <memxor@gmail.com>
Cc: "Justin Suess" <utilityemal77@gmail.com>,
	<sashiko@lists.linux.dev>, "bpf" <bpf@vger.kernel.org>
Subject: Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
Date: Mon, 11 May 2026 19:10:05 -0700	[thread overview]
Message-ID: <DIGBZVSAOXGC.VA20SWFV4R74@gmail.com> (raw)
In-Reply-To: <CAP01T75y1CuPQ9af9SgYqY=c+3q7ZXTKJJu=znAE3HtEEBMwOw@mail.gmail.com>

On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
>> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>> >>
>> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
>> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
>> >> > <alexei.starovoitov@gmail.com> wrote:
>> >> > >
>> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
>> >> > > >   [   21.604660] Call Trace:
>> >> > > >   [   21.604662]  <TASK>
>> >> > > >   [   21.604663]  dump_stack_lvl+0x5d/0x80
>> >> > > >   [   21.604666]  print_usage_bug.part.0+0x22b/0x2c0
>> >> > > >   [   21.604669]  lock_acquire+0x295/0x2e0
>> >> > > >   [   21.604671]  ? terminate_walk+0x33/0x160
>> >> > > >   [   21.604674]  ? __call_rcu_common.constprop.0+0x309/0x730
>> >> > > >   [   21.604679]  _raw_spin_lock+0x30/0x40
>> >> > > >   [   21.604680]  ? __call_rcu_common.constprop.0+0x309/0x730
>> >> > > >   [   21.604682]  __call_rcu_common.constprop.0+0x309/0x730
>> >> > > >   [   21.604686]  bpf_obj_free_fields+0x118/0x250
>> >> > > >   [   21.604691]  free_htab_elem+0x85/0xd0
>> >> > > >   [   21.604694]  htab_map_delete_elem+0x168/0x230
>> >> > > >   [   21.604698]  bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
>> >> > > >   [   21.604700]  bpf_trace_run3+0x126/0x430
>> >> > >
>> >> > > that's better.
>> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
>> >> > > but left check_and_free_fields() in free_htab_elem().
>> >> > >
>> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
>> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
>> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>> >> > >
>> >> > > Kumar,
>> >> > >
>> >> > > thoughts?
>> >> > >
>> >> > >
>> >> >
>> >> > Yeah, removing it from the path that helpers can invoke seems simpler.
>> >> > Remember though, this splat is just for hashtab, we have similar
>> >> > bpf_obj_free_fields() in array map on update. I think fundamentally
>> >> > the main issue here is that we logically free special fields when a
>> >> > map value is freed or deleted. When updating array maps we logically
>> >> > 'free' and then 'update' the same map value together. For hashtab, it
>> >> > happens on update/delete.
>> >> >
>> >> > We could relax this behavior to avoid eagerly freeing these special
>> >> > fields on update or deletion. The only worry is how this would impact
>> >> > programs that have come to rely on the existing behavior. There are
>> >> > patterns where people expect kptr to be NULL on some new map value,
>> >> > which causes programs to return errors when that expectation is not
>> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the
>> >> > surprise side-effect. We need to decide upon this first before
>> >> > discussing the shape of the solution.
>> >> >
>> >> > This is the theoretical concern; In practice, I think most people who
>> >> > depend on such behavior use kptr in local storage maps (in
>> >> > schedulers). So it probably won't be a problem in practice, even
>> >> > though we can't judge this ahead of time. Also, we eagerly reuse map
>> >> > values when using memalloc, so the guarantees are already pretty weak
>> >> > I guess.
>> >> >
>> >> > So, if we are not going to go through a grace period (like local
>> >> > storage) and free back to kernel allocator before reuse, we should
>> >> > relax field freeing behavior. At best, we should cancel work for
>> >> > timer, wq, task_work, and task_work, leaving other items as-is. E.g.
>> >> > BPF_UPTR is used in task storage which I think is accessible to
>> >> > tracing programs, I am not sure how safe unpin_user_page() is when
>> >> > called from random reentrant contexts. We might have more cases in the
>> >> > future, we cannot guarantee we can handle everything in NMIs
>> >> > universally.
>> >> >
>> >> > So the best course of action seems to be relaxing
>> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
>> >> > on async work (timer, wq, task_work) for delete / update and let other
>> >> > fields be as-is. We likely need to do bpf_obj_free_fields()
>> >> > additionally before prealloc_destroy() now, but that should be simple.
>> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
>> >> > separate change.
>> >> >
>> >> > This should hopefully resolve the issue, unless I missed other cases.
>> >> This does sound good, so you'd set the bpf_obj_free_fields up in the
>> >> htab allocator dtor for the final free and rely on the allocators
>> >> existing nmi deferral?
>> >
>> > It is already set, except for prealloc maps. But we can call it before
>> > destroying the pcpu freelist etc.
>>
>> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
>> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
>>
>> >>
>> >> The missing piece is whether to handle this differently in NMI or just
>> >> always do it with the deferral. Also the prealloc question needs
>> >> answering.
>> >
>> > There is no deferral here. I'm saying that we just cancel for timer,
>> > wq, task work, and leave other fields as is. So we don't have active
>> > work pending for async items.
>> >
>> > So as long as the item keeps getting recycled in the allocator, we
>> > don't free these fields. Once the memalloc is destroyed, the dtor runs
>> > in a known safe context where we can assume bpf_obj_free_fields won't
>> > deadlock or run into any problems.
>>
>> So the plan is to do
>> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
>> just ignore it?
>> And no other changes anywhere at all?
>>
>> That would be too good to be true :)
>
> I don't know whether in_nmi() would be sufficient, we likely need
> irqs_disabled()? 

fair irqs_disabled() is safer.

> At that point, why not always ignore it, since
> freeing the fields is dependent on where you're running. I would still
> cancel async fields, since they're already any-context safe.

you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
regardless of running context and rely on final cleanup?
That's an idea to consider, but I suspect some rbtree, link list
tests will fail.

next prev parent reply	other threads:[~2026-05-12  2:10 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 18:43   ` bot+bpf-ci
2026-05-07 18:52     ` Justin Suess
2026-05-07 23:45   ` sashiko-bot
2026-05-10 15:13     ` Justin Suess
2026-05-10 22:38       ` Alexei Starovoitov
2026-05-11  1:49         ` Justin Suess
2026-05-11 15:51           ` Alexei Starovoitov
2026-05-11 16:38             ` Justin Suess
2026-05-11 17:18               ` Alexei Starovoitov
2026-05-11 20:10                 ` Kumar Kartikeya Dwivedi
2026-05-12  1:43                   ` Justin Suess
2026-05-12  1:46                     ` Kumar Kartikeya Dwivedi
2026-05-12  1:55                       ` Alexei Starovoitov
2026-05-12  2:03                         ` Kumar Kartikeya Dwivedi
2026-05-12  2:10                           ` Alexei Starovoitov [this message]
2026-05-12  2:13                             ` Kumar Kartikeya Dwivedi
2026-05-12  2:07                         ` Justin Suess
2026-05-12  2:08                           ` Kumar Kartikeya Dwivedi
2026-05-11 19:22             ` Justin Suess
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-08  0:03   ` sashiko-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DIGBZVSAOXGC.VA20SWFV4R74@gmail.com \
    --to=alexei.starovoitov@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=memxor@gmail.com \
    --cc=sashiko@lists.linux.dev \
    --cc=utilityemal77@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox