From: Justin Suess <utilityemal77@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: sashiko@lists.linux.dev, bpf <bpf@vger.kernel.org>
Subject: Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
Date: Sun, 10 May 2026 21:49:49 -0400 [thread overview]
Message-ID: <agEnc0aYMhuIeDlq@zenbox> (raw)
In-Reply-To: <CAADnVQJdFCFJcg0Pqdnx=D2h5vHj_q4mrwSJYxRC5RW0NTFNvQ@mail.gmail.com>
On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
> >
> >
> > Any help or guidance on this would be appreciated!
>
> sorry for the delay. Everyone was at lsfmmbpf for a week+.
>
No worries! I hope it was an enjoyable trip and I look forward to
hearing about the conference.
> All of the solutions so far are way too complicated.
> bpf_kptr_xchg() has to remain inlined as single atomic xchg
> without slowpath otherwise it ruins the concept
> and makes usage unpredictable.
>
> Let's step back.
> What is the issue you're trying to solve?
>
> the commit log say:
>
> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
> > swap out referenced kptrs from NMI context. Today that runs the kptr
> > destructor inline. Destructors such as bpf_cpumask_release() can take
> > RCU-related locks, so running them from NMI can deadlock the system.
>
> and looking at selftest from patch 2 you do:
>
> old = bpf_kptr_xchg(&value->mask, old);
> if (old)
> bpf_cpumask_release(old);
>
> so?
> bpf_cpumask_release() is fine to call from any context,
> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
>
My mistake on that. I picked a bad example for the test, but the test is
just exercising the nmi dtor path, and doesn't rely on the particular
type of kptr. I just picked one that was easy to acquire a reference to.
This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are
not. I've annotated why here:
crypto_ctx:
__bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
{
if (refcount_dec_and_test(&ctx->usage))
call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
}
__bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
{
bpf_crypto_ctx_release(ctx);
}
task_struct:
__bpf_kfunc void bpf_task_release(struct task_struct *p)
{
put_task_struct_rcu_user(p);
}
__bpf_kfunc void bpf_task_release_dtor(void *p)
{
put_task_struct_rcu_user(p);
}
void put_task_struct_rcu_user(struct task_struct *task)
{
if (refcount_dec_and_test(&task->rcu_users))
call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
*/
}
cgroup_release_dtor is more complex, goes through ultimately through
callbacks to:
static void css_release(struct percpu_ref *ref)
{
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
workqueue */
}
More generally, unless it's a BPF allocated object or doesn't rely on
locking/call_rcu or workqueues, the dtor is unsafe.
> hashtab introduced dtor in bpf_mem_alloc,
> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
> are called from valid context.
>
> What is the problematic sequence?
So from the beginning stepping back.
The problematic sequence:
1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map.
2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map
with that referenced kptr field.
3. dtor runs in the nmi context
4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock.
You can see this demonstrated in my task_struct reproducer [1].
It causes a deadlock by deliberately releasing the last reference to a
task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
The typical solution to this is to run the nmi unsafe code in non-nmi
context by offloading to NMI work, as you proposed.
The problem is we need space to for the jobs we enqueue. The required
information to run the dtor is the dtor function and the original object
pointer. The number of dtors that can run in a single tp_btf/nmi_handler
prog is nearly unbounded.
The other problem is even though bpf_mem_alloc is safe in NMI generally,
we cannot allocate in path that destroys an object. If the allocation
fails due to memory pressure, we leak the object.
There are a few options, all with drawbacks.
1. Dynamically allocate the job. Non-starter, failing to allocate is
unrecoverable, memory pressure means we can't ever schedule the dtor to
run.
2. Store job ntrusively in the object : Requires a safe place to place
it within the object. Bad because not all objects have a space we can write to.
Non-starter.
3. Within the map slot (after actual kptr): Taken with my initial approach in v1.
Significant complexity and requires per-map changes. Feasible but very
complex and would need DCAS or locking to make updating the map slot and
our job information atomic.
4. Wrapping the kptr in a box and storing it in place of the kptr [2] :
Proposed by Mykyta. Would break direct load access to kptr objects.
5. Make every dtor nmi safe individually. This would require a lot of
duplicated code and require updating every destructor invididually.
Feasible technically, but seems brittle.
6. One that would be the least complex, would be forbidding xchg operations
that can run the dtor in NMI context. That would preserve the inlining fix,
but limit our usage of referenced kptrs in BPF programs that run in NMI context.
The approach here:
7. Allocate a new spot for a free job every time we xchg into the map
and put it in a global list. When in NMI and we run a dtor, we pop a
job from that slot and use it to offload our work via irq_work. If
we're not in NMI we run normally. Downside is this breaks inlining for
ref kptrs.
...
I may be missing something critical, but everything I've looked at
points to this problem being much more complex that it initially seemed.
I'm happy to discuss further and do a respin.
Justin
[1] https://lore.kernel.org/bpf/383afba6-f732-49bc-a197-147b9d8b1a29@gmail.com/
[2] https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/
next prev parent reply other threads:[~2026-05-11 1:49 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-07 17:54 [bpf-next v3 0/2] bpf: Fix deadlock in kptr dtor in nmi Justin Suess
2026-05-07 17:54 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-07 18:43 ` bot+bpf-ci
2026-05-07 18:52 ` Justin Suess
2026-05-07 23:45 ` sashiko-bot
2026-05-10 15:13 ` Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess [this message]
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
2026-05-11 19:22 ` Justin Suess
2026-05-07 17:54 ` [bpf-next v3 2/2] selftests/bpf: Add kptr destructor NMI exerciser Justin Suess
2026-05-08 0:03 ` sashiko-bot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=agEnc0aYMhuIeDlq@zenbox \
--to=utilityemal77@gmail.com \
--cc=alexei.starovoitov@gmail.com \
--cc=bpf@vger.kernel.org \
--cc=sashiko@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox