From: Justin Suess <utilityemal77@gmail.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
sashiko@lists.linux.dev, bpf <bpf@vger.kernel.org>
Subject: Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
Date: Mon, 11 May 2026 21:43:00 -0400 [thread overview]
Message-ID: <agKBE4Ci4_LmJ1Dc@zenbox> (raw)
In-Reply-To: <CAP01T75942V+wKJD0yhs8yb7U5vj+y9s6efJ+UdpTMfzP31CWg@mail.gmail.com>
On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > > [ 21.604660] Call Trace:
> > > [ 21.604662] <TASK>
> > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > > [ 21.604669] lock_acquire+0x295/0x2e0
> > > [ 21.604671] ? terminate_walk+0x33/0x160
> > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604679] _raw_spin_lock+0x30/0x40
> > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > > [ 21.604691] free_htab_elem+0x85/0xd0
> > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > > [ 21.604700] bpf_trace_run3+0x126/0x430
> >
> > that's better.
> > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> > but left check_and_free_fields() in free_htab_elem().
> >
> > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> > and fallback to bpf_mem_alloc at map create time when map has kptrs
> > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> >
> > Kumar,
> >
> > thoughts?
> >
> >
>
> Yeah, removing it from the path that helpers can invoke seems simpler.
> Remember though, this splat is just for hashtab, we have similar
> bpf_obj_free_fields() in array map on update. I think fundamentally
> the main issue here is that we logically free special fields when a
> map value is freed or deleted. When updating array maps we logically
> 'free' and then 'update' the same map value together. For hashtab, it
> happens on update/delete.
>
> We could relax this behavior to avoid eagerly freeing these special
> fields on update or deletion. The only worry is how this would impact
> programs that have come to rely on the existing behavior. There are
> patterns where people expect kptr to be NULL on some new map value,
> which causes programs to return errors when that expectation is not
> met. Just doing the skip when irqs_disabled() doesn't save us from the
> surprise side-effect. We need to decide upon this first before
> discussing the shape of the solution.
>
> This is the theoretical concern; In practice, I think most people who
> depend on such behavior use kptr in local storage maps (in
> schedulers). So it probably won't be a problem in practice, even
> though we can't judge this ahead of time. Also, we eagerly reuse map
> values when using memalloc, so the guarantees are already pretty weak
> I guess.
>
> So, if we are not going to go through a grace period (like local
> storage) and free back to kernel allocator before reuse, we should
> relax field freeing behavior. At best, we should cancel work for
> timer, wq, task_work, and task_work, leaving other items as-is. E.g.
> BPF_UPTR is used in task storage which I think is accessible to
> tracing programs, I am not sure how safe unpin_user_page() is when
> called from random reentrant contexts. We might have more cases in the
> future, we cannot guarantee we can handle everything in NMIs
> universally.
>
> So the best course of action seems to be relaxing
> bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> on async work (timer, wq, task_work) for delete / update and let other
> fields be as-is. We likely need to do bpf_obj_free_fields()
> additionally before prealloc_destroy() now, but that should be simple.
> Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> separate change.
>
> This should hopefully resolve the issue, unless I missed other cases.
This does sound good, so you'd set the bpf_obj_free_fields up in the
htab allocator dtor for the final free and rely on the allocators
existing nmi deferral?
The missing piece is whether to handle this differently in NMI or just
always do it with the deferral. Also the prealloc question needs
answering.
This fix would preserve the xchg inlining and allow for failure with
ENOMEM on exhaustion and not need weird locking or atomic counting.
Appreciate the time from both of you learned a lot more than I wanted to
about NMI than I bargained for writing this. There's probably more weird
NMI cases. FWIW there are zero tp_btf/nmi_handler progs anywhere I can
find on the internet so this is a weird edge case.
Since I brought this issue up I'll finish it out and send patches next
week.
Even if generic file kptr isn't possible hopefully this can stop sashiko
from complaining about every new kptr dtor getting introduced being unsafe in NMI :)
next prev parent reply other threads:[~2026-05-12 1:43 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 18:43 ` bot+bpf-ci
2026-05-07 18:52 ` Justin Suess
2026-05-07 23:45 ` sashiko-bot
2026-05-10 15:13 ` Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess [this message]
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
2026-05-11 19:22 ` Justin Suess
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-08 0:03 ` sashiko-bot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=agKBE4Ci4_LmJ1Dc@zenbox \
--to=utilityemal77@gmail.com \
--cc=alexei.starovoitov@gmail.com \
--cc=bpf@vger.kernel.org \
--cc=memxor@gmail.com \
--cc=sashiko@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox