From: "Alexei Starovoitov" <alexei.starovoitov@gmail.com>
To: "Kumar Kartikeya Dwivedi" <memxor@gmail.com>,
"Justin Suess" <utilityemal77@gmail.com>
Cc: <sashiko@lists.linux.dev>, "bpf" <bpf@vger.kernel.org>
Subject: Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
Date: Mon, 11 May 2026 18:55:50 -0700 [thread overview]
Message-ID: <DIGBOYWXZ95S.3GAGVQGUJNQD4@gmail.com> (raw)
In-Reply-To: <CAP01T76i3S93kvXnQvK+NU1ee82s+ka2jQfmf-yL86r9AS2tuQ@mail.gmail.com>
On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
>> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
>> > <alexei.starovoitov@gmail.com> wrote:
>> > >
>> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
>> > > > [ 21.604660] Call Trace:
>> > > > [ 21.604662] <TASK>
>> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
>> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
>> > > > [ 21.604669] lock_acquire+0x295/0x2e0
>> > > > [ 21.604671] ? terminate_walk+0x33/0x160
>> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
>> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
>> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
>> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
>> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
>> > > > [ 21.604691] free_htab_elem+0x85/0xd0
>> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
>> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
>> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
>> > >
>> > > that's better.
>> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
>> > > but left check_and_free_fields() in free_htab_elem().
>> > >
>> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
>> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
>> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>> > >
>> > > Kumar,
>> > >
>> > > thoughts?
>> > >
>> > >
>> >
>> > Yeah, removing it from the path that helpers can invoke seems simpler.
>> > Remember though, this splat is just for hashtab, we have similar
>> > bpf_obj_free_fields() in array map on update. I think fundamentally
>> > the main issue here is that we logically free special fields when a
>> > map value is freed or deleted. When updating array maps we logically
>> > 'free' and then 'update' the same map value together. For hashtab, it
>> > happens on update/delete.
>> >
>> > We could relax this behavior to avoid eagerly freeing these special
>> > fields on update or deletion. The only worry is how this would impact
>> > programs that have come to rely on the existing behavior. There are
>> > patterns where people expect kptr to be NULL on some new map value,
>> > which causes programs to return errors when that expectation is not
>> > met. Just doing the skip when irqs_disabled() doesn't save us from the
>> > surprise side-effect. We need to decide upon this first before
>> > discussing the shape of the solution.
>> >
>> > This is the theoretical concern; In practice, I think most people who
>> > depend on such behavior use kptr in local storage maps (in
>> > schedulers). So it probably won't be a problem in practice, even
>> > though we can't judge this ahead of time. Also, we eagerly reuse map
>> > values when using memalloc, so the guarantees are already pretty weak
>> > I guess.
>> >
>> > So, if we are not going to go through a grace period (like local
>> > storage) and free back to kernel allocator before reuse, we should
>> > relax field freeing behavior. At best, we should cancel work for
>> > timer, wq, task_work, and task_work, leaving other items as-is. E.g.
>> > BPF_UPTR is used in task storage which I think is accessible to
>> > tracing programs, I am not sure how safe unpin_user_page() is when
>> > called from random reentrant contexts. We might have more cases in the
>> > future, we cannot guarantee we can handle everything in NMIs
>> > universally.
>> >
>> > So the best course of action seems to be relaxing
>> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
>> > on async work (timer, wq, task_work) for delete / update and let other
>> > fields be as-is. We likely need to do bpf_obj_free_fields()
>> > additionally before prealloc_destroy() now, but that should be simple.
>> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
>> > separate change.
>> >
>> > This should hopefully resolve the issue, unless I missed other cases.
>> This does sound good, so you'd set the bpf_obj_free_fields up in the
>> htab allocator dtor for the final free and rely on the allocators
>> existing nmi deferral?
>
> It is already set, except for prealloc maps. But we can call it before
> destroying the pcpu freelist etc.
htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
>>
>> The missing piece is whether to handle this differently in NMI or just
>> always do it with the deferral. Also the prealloc question needs
>> answering.
>
> There is no deferral here. I'm saying that we just cancel for timer,
> wq, task work, and leave other fields as is. So we don't have active
> work pending for async items.
>
> So as long as the item keeps getting recycled in the allocator, we
> don't free these fields. Once the memalloc is destroyed, the dtor runs
> in a known safe context where we can assume bpf_obj_free_fields won't
> deadlock or run into any problems.
So the plan is to do
if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
just ignore it?
And no other changes anywhere at all?
That would be too good to be true :)
next prev parent reply other threads:[~2026-05-12 1:55 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 18:43 ` bot+bpf-ci
2026-05-07 18:52 ` Justin Suess
2026-05-07 23:45 ` sashiko-bot
2026-05-10 15:13 ` Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov [this message]
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
2026-05-11 19:22 ` Justin Suess
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-08 0:03 ` sashiko-bot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DIGBOYWXZ95S.3GAGVQGUJNQD4@gmail.com \
--to=alexei.starovoitov@gmail.com \
--cc=bpf@vger.kernel.org \
--cc=memxor@gmail.com \
--cc=sashiko@lists.linux.dev \
--cc=utilityemal77@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.