* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
[not found] ` <20260507234520.646C4C2BCB2@smtp.kernel.org>
@ 2026-05-10 15:13 ` Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
0 siblings, 1 reply; 16+ messages in thread
From: Justin Suess @ 2026-05-10 15:13 UTC (permalink / raw)
To: sashiko; +Cc: bpf
On Thu, May 07, 2026 at 11:45:19PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] TOCTOU race in `bpf_dtor_kptr_release_one()` can lead to NMI deadlocks.
> - [Low] Memory leak on initialization failure in `bpf_dtor_kptr_init()`.
> --
>
> commit 488cdc390b90427a591d47211600c09addc95fcd
> Author: Justin Suess <utilityemal77@gmail.com>
> Date: Thu May 7 17:54:52 2026 +0000
>
> bpf: Offload kptr destructors that run from NMI
>
> This commit queues destructor-backed kptr teardowns to irq_work when
> running from NMI context. It maintains an idle-slot surplus counter to
> ensure offload slots are preallocated to prevent deadlocks.
>
> Closes: https://lore.kernel.org/bpf/20260421201035.1729473-1-utilityemal77@gmail.com/
>
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 3b1f0ba02f613..d34fdb99eb8a9 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> [ ... ]
> > @@ -65,6 +69,111 @@ static DEFINE_SPINLOCK(map_idr_lock);
> > +static void bpf_dtor_kptr_release_one(void)
> > +{
> > + struct bpf_dtor_kptr_work *job;
> > + long surplus;
> > +
> > + for (;;) {
> > + surplus = atomic_long_read(&bpf_dtor_kptr_surplus);
> > + if (surplus <= 0)
> > + return;
> > +
> > + job = bpf_dtor_kptr_pop_idle();
>
> Could this sequence lead to a race condition that drains the idle pool
> too early?
>
> If multiple threads read a positive surplus value here, they might all call
> bpf_dtor_kptr_pop_idle() before the cmpxchg executes. This could temporarily
> drain the idle pool below the required number of reserved slots.
>
> If an NMI fires during this window and attempts to tear down a kptr, would
> bpf_dtor_kptr_offload() find the idle pool empty? If so, it looks like it
> would trigger the fallback warning and run the destructor inline from the
> NMI context, which might reintroduce the deadlocks this patch aims to
> prevent.
>
Yes, this is real.
But there's a bigger problem: the push side of pcpu_freelist is not
NMI-reentrant. If an NMI arrives while we're mid-push, the
pcpu_freelist is locked by the interrupted push, so the NMI cannot pop
from it and we deadlock again.
I think we technically have exclusivity here on our percpu head
(we can't be in NMI and process context simultaneously on the same
cpu), so maybe it is safe to forcibly acquire that specific lock? I'm
not 100% sure.
llist won't work since we can't have multiple llist_del_first
consumers, and llist_del_all would leave the list empty and unusable
if an NMI jumps to it while we're borrowing it.
Per-cpu llists won't work either, since the cpu we allocate the job on
might not be the one the dtor runs on, so we'd need cross-cpu
stealing, which runs into the same multiple-consumers problem when
threads steal from the same cpu head.
So this code is not ready either way.
A 128-bit cmpxchg could technically solve this, but I don't think we
can rely on it on all architectures BPF supports.
Been stuck on this for several weeks...
The construct I need is a data structure with atomic push, pop, and
move between sets. No specific ordering is needed, but updates from
producers must be immediately visible to all consumers. I've looked
into Treiber stacks.
Any help or guidance on this would be appreciated!
> > + if (!job)
> > + return;
> > +
> > + if (!atomic_long_try_cmpxchg(&bpf_dtor_kptr_surplus, &surplus,
> > + surplus - 1)) {
> > + pcpu_freelist_push(&bpf_dtor_kptr_idle, &job->fnode);
> > + continue;
> > + }
> > +
> > + bpf_mem_free(&bpf_global_ma, job);
> > + return;
> > + }
> > +}
> [ ... ]
> > +static int __init bpf_dtor_kptr_init(void)
> > +{
> > + int err;
> > +
> > + err = pcpu_freelist_init(&bpf_dtor_kptr_idle);
> > + if (err)
> > + return err;
> > +
> > + err = pcpu_freelist_init(&bpf_dtor_kptr_jobs);
> > + if (err)
> > + return err;
>
> Does this error path leak the per-cpu memory allocated for the idle list?
>
> If the second pcpu_freelist_init() fails, the function returns immediately.
> Should there be a cleanup path here to free the already initialized
> bpf_dtor_kptr_idle list?
>
> > +
> > + return 0;
> > +}
> > +late_initcall(bpf_dtor_kptr_init);
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260507175453.1140400-1-utilityemal77@gmail.com?part=1
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-10 15:13 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
@ 2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2026-05-10 22:38 UTC (permalink / raw)
To: Justin Suess; +Cc: sashiko, bpf
On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
>
>
> Any help or guidance on this would be appreciated!
sorry for the delay. Everyone was at lsfmmbpf for a week+.
All of the solutions so far are way too complicated.
bpf_kptr_xchg() has to remain inlined as single atomic xchg
without slowpath otherwise it ruins the concept
and makes usage unpredictable.
Let's step back.
What is the issue you're trying to solve?
the commit log says:
> A BPF program attached to tp_btf/nmi_handler can delete map entries or
> swap out referenced kptrs from NMI context. Today that runs the kptr
> destructor inline. Destructors such as bpf_cpumask_release() can take
> RCU-related locks, so running them from NMI can deadlock the system.
and looking at selftest from patch 2 you do:
old = bpf_kptr_xchg(&value->mask, old);
if (old)
bpf_cpumask_release(old);
so?
bpf_cpumask_release() is fine to call from any context,
because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
hashtab introduced dtor in bpf_mem_alloc,
so bpf_obj_free_fields() and corresponding dtor's of kptr-s
are called from valid context.
What is the problematic sequence?
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-10 22:38 ` Alexei Starovoitov
@ 2026-05-11 1:49 ` Justin Suess
2026-05-11 15:51 ` Alexei Starovoitov
0 siblings, 1 reply; 16+ messages in thread
From: Justin Suess @ 2026-05-11 1:49 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: sashiko, bpf
On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
> >
> >
> > Any help or guidance on this would be appreciated!
>
> sorry for the delay. Everyone was at lsfmmbpf for a week+.
>
No worries! I hope it was an enjoyable trip and I look forward to
hearing about the conference.
> All of the solutions so far are way too complicated.
> bpf_kptr_xchg() has to remain inlined as single atomic xchg
> without slowpath otherwise it ruins the concept
> and makes usage unpredictable.
>
> Let's step back.
> What is the issue you're trying to solve?
>
> the commit log say:
>
> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
> > swap out referenced kptrs from NMI context. Today that runs the kptr
> > destructor inline. Destructors such as bpf_cpumask_release() can take
> > RCU-related locks, so running them from NMI can deadlock the system.
>
> and looking at selftest from patch 2 you do:
>
> old = bpf_kptr_xchg(&value->mask, old);
> if (old)
> bpf_cpumask_release(old);
>
> so?
> bpf_cpumask_release() is fine to call from any context,
> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
>
My mistake on that. I picked a bad example for the test, but the test is
just exercising the nmi dtor path, and doesn't rely on the particular
type of kptr. I just picked one that was easy to acquire a reference to.
This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are
not. I've annotated why here:
crypto_ctx:
__bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
{
if (refcount_dec_and_test(&ctx->usage))
call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
}
__bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
{
bpf_crypto_ctx_release(ctx);
}
task_struct:
__bpf_kfunc void bpf_task_release(struct task_struct *p)
{
put_task_struct_rcu_user(p);
}
__bpf_kfunc void bpf_task_release_dtor(void *p)
{
put_task_struct_rcu_user(p);
}
void put_task_struct_rcu_user(struct task_struct *task)
{
if (refcount_dec_and_test(&task->rcu_users))
call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
*/
}
cgroup_release_dtor is more complex; it ultimately goes through
callbacks to:
static void css_release(struct percpu_ref *ref)
{
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);
INIT_WORK(&css->destroy_work, css_release_work_fn);
queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
workqueue */
}
More generally, unless the object is BPF-allocated or its dtor avoids
locking, call_rcu, and workqueues, the dtor is NMI-unsafe.
> hashtab introduced dtor in bpf_mem_alloc,
> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
> are called from valid context.
>
> What is the problematic sequence?
So, stepping back to the beginning.
The problematic sequence:
1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map.
2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map
with that referenced kptr field.
3. dtor runs in the nmi context
4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock.
You can see this demonstrated in my task_struct reproducer [1].
It causes a deadlock by deliberately releasing the last reference to a
task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
The typical solution to this is to run the nmi unsafe code in non-nmi
context by offloading to NMI work, as you proposed.
The problem is we need space for the jobs we enqueue. The required
information to run the dtor is the dtor function and the original object
pointer. The number of dtors that can run in a single tp_btf/nmi_handler
prog is nearly unbounded.
The other problem is even though bpf_mem_alloc is safe in NMI generally,
we cannot allocate in path that destroys an object. If the allocation
fails due to memory pressure, we leak the object.
There are a few options, all with drawbacks.
1. Dynamically allocate the job. Non-starter: failing to allocate is
unrecoverable; under memory pressure we can never schedule the dtor to
run.
2. Store the job intrusively in the object: requires a safe place for
it within the object. Bad because not all objects have space we can
write to. Non-starter.
3. Within the map slot (after the actual kptr): taken in my initial v1
approach. Requires per-map changes, and would need DCAS or locking to
make updating the map slot and the job information atomic. Feasible
but very complex.
4. Wrapping the kptr in a box and storing it in place of the kptr [2] :
Proposed by Mykyta. Would break direct load access to kptr objects.
5. Make every dtor NMI-safe individually. Requires a lot of duplicated
code and touching every destructor. Technically feasible, but brittle.
6. One that would be the least complex, would be forbidding xchg operations
that can run the dtor in NMI context. That would preserve the inlining fix,
but limit our usage of referenced kptrs in BPF programs that run in NMI context.
The approach here:
7. Allocate a new slot for a job every time we xchg into the map and
put it in a global list. When we run a dtor in NMI, we pop a job from
that list and use it to offload our work via irq_work. If we're not in
NMI we run normally. Downside: this breaks inlining for ref kptrs.
...
I may be missing something critical, but everything I've looked at
points to this problem being much more complex than it initially seemed.
I'm happy to discuss further and do a respin.
Justin
[1] https://lore.kernel.org/bpf/383afba6-f732-49bc-a197-147b9d8b1a29@gmail.com/
[2] https://lore.kernel.org/bpf/51a054a0-e57f-49dc-9527-36da0535087c@gmail.com/
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 1:49 ` Justin Suess
@ 2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 19:22 ` Justin Suess
0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2026-05-11 15:51 UTC (permalink / raw)
To: Justin Suess; +Cc: sashiko, bpf
On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
>> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
>> >
>> >
>> > Any help or guidance on this would be appreciated!
>>
>> sorry for the delay. Everyone was at lsfmmbpf for a week+.
>>
>
> No worries! I hope it was an enjoyable trip and I look forward to
> hearing about the conference.
>
>> All of the solutions so far are way too complicated.
>> bpf_kptr_xchg() has to remain inlined as single atomic xchg
>> without slowpath otherwise it ruins the concept
>> and makes usage unpredictable.
>>
>> Let's step back.
>> What is the issue you're trying to solve?
>>
>> the commit log say:
>>
>> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
>> > swap out referenced kptrs from NMI context. Today that runs the kptr
>> > destructor inline. Destructors such as bpf_cpumask_release() can take
>> > RCU-related locks, so running them from NMI can deadlock the system.
>>
>> and looking at selftest from patch 2 you do:
>>
>> old = bpf_kptr_xchg(&value->mask, old);
>> if (old)
>> bpf_cpumask_release(old);
>>
>> so?
>> bpf_cpumask_release() is fine to call from any context,
>> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
>>
>
> My mistake on that. I picked a bad example for the test, but the test is
> just exercising the nmi dtor path, and doesn't rely on the particular
> type of kptr. I just picked one that was easy to acquire a reference to.
>
> This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are
> not. I've annotated why here:
>
> crypto_ctx:
>
> __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
> {
> if (refcount_dec_and_test(&ctx->usage))
> call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
> }
>
> __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
> {
> bpf_crypto_ctx_release(ctx);
> }
bpf_crypto_ctx_release() is only allowed in syscall prog types
and dtor via hashmap free will execute in safe context as well.
So not an issue.
> task_struct:
>
> __bpf_kfunc void bpf_task_release(struct task_struct *p)
> {
> put_task_struct_rcu_user(p);
> }
>
> __bpf_kfunc void bpf_task_release_dtor(void *p)
> {
> put_task_struct_rcu_user(p);
> }
>
> void put_task_struct_rcu_user(struct task_struct *task)
> {
> if (refcount_dec_and_test(&task->rcu_users))
> call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
> */
> }
In theory. I don't think there is a reproducer.
> cgroup_release_dtor is more complex, goes through ultimately through
> callbacks to:
>
> static void css_release(struct percpu_ref *ref)
> {
> struct cgroup_subsys_state *css =
> container_of(ref, struct cgroup_subsys_state, refcnt);
>
> INIT_WORK(&css->destroy_work, css_release_work_fn);
> queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
> workqueue */
> }
similar to task_struct. I don't think it's exploitable.
> More generally, unless it's a BPF allocated object or doesn't rely on
> locking/call_rcu or workqueues, the dtor is unsafe.
>
>> hashtab introduced dtor in bpf_mem_alloc,
>> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
>> are called from valid context.
>>
>> What is the problematic sequence?
>
> So from the beginning stepping back.
>
> The problematic sequence:
>
> 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map.
>
> 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map
> with that referenced kptr field.
>
> 3. dtor runs in the nmi context
>
> 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock.
>
> You can see this demonstrated in my task_struct reproducer [1].
did you?
That link points to your v2 with cpumask.
I don't recall seeing task_struct repro.
> It causes a deadlock by deliberately releasing the last reference to a
> task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
>
> The typical solution to this is to run the nmi unsafe code in non-nmi
> context by offloading to NMI work, as you proposed.
>
> The problem is we need space to for the jobs we enqueue. The required
> information to run the dtor is the dtor function and the original object
> pointer. The number of dtors that can run in a single tp_btf/nmi_handler
> prog is nearly unbounded.
>
> The other problem is even though bpf_mem_alloc is safe in NMI generally,
> we cannot allocate in path that destroys an object. If the allocation
> fails due to memory pressure, we leak the object.
>
> There are a few options, all with drawbacks.
>
> 1. Dynamically allocate the job. Non-starter, failing to allocate is
> unrecoverable, memory pressure means we can't ever schedule the dtor to
> run.
>
> 2. Store job ntrusively in the object : Requires a safe place to place
> it within the object. Bad because not all objects have a space we can write to.
> Non-starter.
>
> 3. Within the map slot (after actual kptr): Taken with my initial approach in v1.
> Significant complexity and requires per-map changes. Feasible but very
> complex and would need DCAS or locking to make updating the map slot and
> our job information atomic.
>
> 4. Wrapping the kptr in a box and storing it in place of the kptr [2] :
> Proposed by Mykyta. Would break direct load access to kptr objects.
>
> 5. Make every dtor nmi safe individually. This would require a lot of
> duplicated code and require updating every destructor invididually.
> Feasible technically, but seems brittle.
>
> 6. One that would be the least complex, would be forbidding xchg operations
> that can run the dtor in NMI context. That would preserve the inlining fix,
> but limit our usage of referenced kptrs in BPF programs that run in NMI context.
>
> The approach here:
>
> 7. Allocate a new spot for a free job every time we xchg into the map
> and put it in a global list. When in NMI and we run a dtor, we pop a
> job from that slot and use it to offload our work via irq_work. If
> we're not in NMI we run normally. Downside is this breaks inlining for
> ref kptrs.
>
> ...
>
> I may be missing something critical, but everything I've looked at
> points to this problem being much more complex that it initially seemed.
yes. it is complex. all 7 options are not good.
I recall the whole thing started with desire to add bpf_put_file_dtor().
This was discussed with VFS maintainers and they didn't like the idea,
since it needs a ton of work to make it safe:
. umount notifier to make sure stashed file doesn't hold umount
. potential circular refcnt issue if file to bpf map stashed into the same map
. scm_rights-like facility with garbage collection
So generic file stash is really no go.
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 15:51 ` Alexei Starovoitov
@ 2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 19:22 ` Justin Suess
1 sibling, 1 reply; 16+ messages in thread
From: Justin Suess @ 2026-05-11 16:38 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: sashiko, bpf
On Mon, May 11, 2026 at 08:51:53AM -0700, Alexei Starovoitov wrote:
> On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
> >> >
> >> >
> >> > Any help or guidance on this would be appreciated!
> >>
> >> sorry for the delay. Everyone was at lsfmmbpf for a week+.
> >>
> >
> > No worries! I hope it was an enjoyable trip and I look forward to
> > hearing about the conference.
> >
> >> All of the solutions so far are way too complicated.
> >> bpf_kptr_xchg() has to remain inlined as single atomic xchg
> >> without slowpath otherwise it ruins the concept
> >> and makes usage unpredictable.
> >>
> >> Let's step back.
> >> What is the issue you're trying to solve?
> >>
> >> the commit log say:
> >>
> >> > A BPF program attached to tp_btf/nmi_handler can delete map entries or
> >> > swap out referenced kptrs from NMI context. Today that runs the kptr
> >> > destructor inline. Destructors such as bpf_cpumask_release() can take
> >> > RCU-related locks, so running them from NMI can deadlock the system.
> >>
> >> and looking at selftest from patch 2 you do:
> >>
> >> old = bpf_kptr_xchg(&value->mask, old);
> >> if (old)
> >> bpf_cpumask_release(old);
> >>
> >> so?
> >> bpf_cpumask_release() is fine to call from any context,
> >> because bpf_mem_cache_free_rcu() is safe everywhere including NMI.
> >>
> >
> > My mistake on that. I picked a bad example for the test, but the test is
> > just exercising the nmi dtor path, and doesn't rely on the particular
> > type of kptr. I just picked one that was easy to acquire a reference to.
> >
> > This dtor is safe. task_struct dtor, cgroup dtor, crypto_ctx dtor are
> > not. I've annotated why here:
> >
> > crypto_ctx:
> >
> > __bpf_kfunc void bpf_crypto_ctx_release(struct bpf_crypto_ctx *ctx)
> > {
> > if (refcount_dec_and_test(&ctx->usage))
> > call_rcu(&ctx->rcu, crypto_free_cb); /* UNSAFE: call_rcu */
> > }
> >
> > __bpf_kfunc void bpf_crypto_ctx_release_dtor(void *ctx)
> > {
> > bpf_crypto_ctx_release(ctx);
> > }
>
> bpf_crypto_ctx_release() is only allowed in syscall prog types
> and dtor via hashmap free will execute in safe context as well.
> So not an issue.
>
It doesn't matter that bpf_crypto_ctx_release() is only allowed in
syscall program types; we don't need to call the kfunc to reach the
dtor from a tracing program. All we have to do is delete a map element
that already has that type of kptr in it.
The release function is not the issue; it's bpf_map_delete_elem, which
ultimately invokes bpf_obj_free_fields and then the dtor.
Hashmap free doesn't execute in safe context. See the kernel splat
below.
> > task_struct:
> >
> > __bpf_kfunc void bpf_task_release(struct task_struct *p)
> > {
> > put_task_struct_rcu_user(p);
> > }
> >
> > __bpf_kfunc void bpf_task_release_dtor(void *p)
> > {
> > put_task_struct_rcu_user(p);
> > }
> >
> > void put_task_struct_rcu_user(struct task_struct *task)
> > {
> > if (refcount_dec_and_test(&task->rcu_users))
> > call_rcu(&task->rcu, delayed_put_task_struct); /* UNSAFE: call_rcu
> > */
> > }
>
> In theory. I don't think there is a reproducer.
>
I sent the wrong link in my reply, apologies.
This does indeed reproduce, even on bpf-next/master.
Using this reproducer:
https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch
On the latest bpf-next/master : "7e033543a2ab: Merge
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc3":
With these kconfig options set (not necessary, but increase
reproducibility):
[justin@zenbox linux-next (repro $%)]$ rg 'RCU_NOCB|RCU_EXPERT' .config
171:CONFIG_RCU_EXPERT=y
188:CONFIG_RCU_NOCB_CPU=y
Run ./test_progs -t task_kptr_nmi_deadlock_repro
[ 21.604612] ================================
[ 21.604612] WARNING: inconsistent lock state
[ 21.604613] 7.1.0-rc2-g7e033543a2ab #126 Tainted: G OE
[ 21.604614] --------------------------------
[ 21.604615] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 21.604616] test_progs/494 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 21.604617] ffffa2077d66f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x309/0x730
[ 21.604625] {INITIAL USE} state was registered at:
[ 21.604625] lock_acquire+0xbf/0x2e0
[ 21.604628] _raw_spin_lock+0x30/0x40
[ 21.604632] rcu_nocb_gp_kthread+0x13f/0xb90
[ 21.604633] kthread+0xf4/0x130
[ 21.604636] ret_from_fork+0x264/0x330
[ 21.604639] ret_from_fork_asm+0x1a/0x30
[ 21.604642] irq event stamp: 194
[ 21.604642] hardirqs last enabled at (193): [<ffffffff9700012f>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 21.604644] hardirqs last disabled at (194): [<ffffffff980e7f4f>] exc_nmi+0x7f/0x110
[ 21.604646] softirqs last enabled at (0): [<ffffffff972d9838>] copy_process+0xfd8/0x2b80
[ 21.604648] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 21.604650]
other info that might help us debug this:
[ 21.604651] Possible unsafe locking scenario:
[ 21.604651] CPU0
[ 21.604651] ----
[ 21.604652] lock(&rdp->nocb_lock);
[ 21.604653] <Interrupt>
[ 21.604653] lock(&rdp->nocb_lock);
[ 21.604654]
*** DEADLOCK ***
[ 21.604654] no locks held by test_progs/494.
[ 21.604655]
stack backtrace:
[ 21.604657] CPU: 1 UID: 0 PID: 494 Comm: test_progs Tainted: G OE 7.1.0-rc2-g7e033543a2ab #126 PREEMPT(full)
[ 21.604659] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 21.604660] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 21.604660] Call Trace:
[ 21.604662] <TASK>
[ 21.604663] dump_stack_lvl+0x5d/0x80
[ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
[ 21.604669] lock_acquire+0x295/0x2e0
[ 21.604671] ? terminate_walk+0x33/0x160
[ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
[ 21.604679] _raw_spin_lock+0x30/0x40
[ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
[ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
[ 21.604686] bpf_obj_free_fields+0x118/0x250
[ 21.604691] free_htab_elem+0x85/0xd0
[ 21.604694] htab_map_delete_elem+0x168/0x230
[ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
[ 21.604700] bpf_trace_run3+0x126/0x430
[ 21.604703] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 21.604707] nmi_handle.part.0+0x15b/0x250
[ 21.604710] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 21.604712] default_do_nmi+0x120/0x180
[ 21.604715] exc_nmi+0xe3/0x110
[ 21.604717] asm_exc_nmi+0xb7/0x100
[ 21.604722] RIP: 0033:0x7fd27529a7ee
[ 21.604724] Code: ff 0f 1f 00 49 89 ca 48 8b 44 24 20 0f 05 48 83 c4 18 c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 10 ff 74 24 18 e8 63 ff ff ff 5a <59> 48 3d 00 f0 ff ff 77 09 48 83 c4 08 c3 0f 1f 40 00 48 8b 15 e9
[ 21.604725] RSP: 002b:00007ffe5e3eca08 EFLAGS: 00000202
[ 21.604726] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fd2756cef48
[ 21.604727] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 000000000000000c
[ 21.604728] RBP: 00007ffe5e3ecae0 R08: 0000000000000000 R09: 0000000000000000
[ 21.604728] R10: 0000000000000000 R11: 00007fd2756cec40 R12: 0000000000000003
[ 21.604729] R13: 00007fd275887000 R14: 00007ffe5e3eceb8 R15: 000055cd1271d8d0
[ 21.604734] </TASK>
[ 21.666237] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 21.691290] perf: interrupt took too long (3144 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[ 21.730227] perf: interrupt took too long (3932 > 3930), lowering kernel.perf_event_max_sample_rate to 50000
[ 21.805475] perf: interrupt took too long (4916 > 4915), lowering kernel.perf_event_max_sample_rate to 40000
> > cgroup_release_dtor is more complex, goes through ultimately through
> > callbacks to:
> >
> > static void css_release(struct percpu_ref *ref)
> > {
> > struct cgroup_subsys_state *css =
> > container_of(ref, struct cgroup_subsys_state, refcnt);
> >
> > INIT_WORK(&css->destroy_work, css_release_work_fn);
> > queue_work(cgroup_release_wq, &css->destroy_work); /* UNSAFE:
> > workqueue */
> > }
>
> similar to task_struct. I don't think it's exploitable.
>
> > More generally, unless it's a BPF allocated object or doesn't rely on
> > locking/call_rcu or workqueues, the dtor is unsafe.
> >
> >> hashtab introduced dtor in bpf_mem_alloc,
> >> so bpf_obj_free_fields() and corresponding dtor's of kptr-s
> >> are called from valid context.
> >>
> >> What is the problematic sequence?
> >
> > So from the beginning stepping back.
> >
> > The problematic sequence:
> >
> > 1. ref kptr (i.e task_struct, cgroup, crypto_ctx) xchg'd into map.
> >
> > 2. in a tp_btf/nmi_handler (NMI CTX) program we drop the item from the map
> > with that referenced kptr field.
> >
> > 3. dtor runs in the nmi context
> >
> > 4. dtor runs call_rcu/queue_work/some bad thing in nmi, causing deadlock.
> >
> > You can see this demonstrated in my task_struct reproducer [1].
>
> did you?
> That link points to your v2 with cpumask.
> I don't recall seeing task_struct repro.
>
I misplaced the link. See above
> > It causes a deadlock by deliberately releasing the last reference to a
> > task_struct via a ref kptr in nmi, getting it to call_rcu and deadlock.
> >
> > The typical solution to this is to run the nmi unsafe code in non-nmi
> > context by offloading to NMI work, as you proposed.
> >
> > The problem is we need space to for the jobs we enqueue. The required
> > information to run the dtor is the dtor function and the original object
> > pointer. The number of dtors that can run in a single tp_btf/nmi_handler
> > prog is nearly unbounded.
> >
> > The other problem is even though bpf_mem_alloc is safe in NMI generally,
> > we cannot allocate in path that destroys an object. If the allocation
> > fails due to memory pressure, we leak the object.
> >
> > There are a few options, all with drawbacks.
> >
> > 1. Dynamically allocate the job. Non-starter, failing to allocate is
> > unrecoverable, memory pressure means we can't ever schedule the dtor to
> > run.
> >
> > 2. Store job ntrusively in the object : Requires a safe place to place
> > it within the object. Bad because not all objects have a space we can write to.
> > Non-starter.
> >
> > 3. Within the map slot (after actual kptr): Taken with my initial approach in v1.
> > Significant complexity and requires per-map changes. Feasible but very
> > complex and would need DCAS or locking to make updating the map slot and
> > our job information atomic.
> >
> > 4. Wrapping the kptr in a box and storing it in place of the kptr [2] :
> > Proposed by Mykyta. Would break direct load access to kptr objects.
> >
> > 5. Make every dtor nmi safe individually. This would require a lot of
> > duplicated code and require updating every destructor invididually.
> > Feasible technically, but seems brittle.
> >
> > 6. One that would be the least complex, would be forbidding xchg operations
> > that can run the dtor in NMI context. That would preserve the inlining fix,
> > but limit our usage of referenced kptrs in BPF programs that run in NMI context.
> >
> > The approach here:
> >
> > 7. Allocate a new spot for a free job every time we xchg into the map
> > and put it in a global list. When in NMI and we run a dtor, we pop a
> > job from that slot and use it to offload our work via irq_work. If
> > we're not in NMI we run normally. Downside is this breaks inlining for
> > ref kptrs.
> >
> > ...
> >
> > I may be missing something critical, but everything I've looked at
> > points to this problem being much more complex that it initially seemed.
>
> yes. it is complex. all 7 options are not good.
>
Agreed. None of them are good but I just don't see any good
alternatives.
> I recall the whole thing started with desire to add bpf_put_file_dtor().
> This was discussed with VFS maintainers and they didn't like the idea,
> since it needs a ton of work to make it safe:
> . umount notifier to make sure stashed file doesn't hold umount
> . potential circular refcnt issue if file to bpf map stashed into the same map
> . scm_rights-like facility with garbage collection
>
> So generic file stash is really no go.
That may be the case, but this is a separate issue. This would be a
bug that could apply to future new refcounted kptrs as well.
Thanks,
Justin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 16:38 ` Justin Suess
@ 2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2026-05-11 17:18 UTC (permalink / raw)
To: Justin Suess; +Cc: sashiko, bpf
On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> [ 21.604660] Call Trace:
> [ 21.604662] <TASK>
> [ 21.604663] dump_stack_lvl+0x5d/0x80
> [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> [ 21.604669] lock_acquire+0x295/0x2e0
> [ 21.604671] ? terminate_walk+0x33/0x160
> [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> [ 21.604679] _raw_spin_lock+0x30/0x40
> [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> [ 21.604686] bpf_obj_free_fields+0x118/0x250
> [ 21.604691] free_htab_elem+0x85/0xd0
> [ 21.604694] htab_map_delete_elem+0x168/0x230
> [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> [ 21.604700] bpf_trace_run3+0x126/0x430
that's better.
Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
but left check_and_free_fields() in free_htab_elem().
I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
and fallback to bpf_mem_alloc at map create time when map has kptrs
with dtors. Even when BPF_F_NO_PREALLOC is not specified.
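Something like this toy model of the create-time check (illustrative only;
the stub types and the has_dtor_kptr flag are assumptions, not the real
btf_record layout):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-ins for illustration; the real kernel types differ. */
#define BPF_F_NO_PREALLOC (1U << 0)

struct btf_record { bool has_dtor_kptr; };	/* hypothetical flag */
struct bpf_map {
	unsigned int map_flags;
	struct btf_record *record;
};

/*
 * Sketch of the proposed rule: fall back to bpf_mem_alloc (i.e. treat
 * the map as non-preallocated) whenever the value type carries kptrs
 * with dtors, even when BPF_F_NO_PREALLOC was not requested.
 */
static bool htab_use_bpf_ma(const struct bpf_map *map)
{
	if (map->map_flags & BPF_F_NO_PREALLOC)
		return true;
	return map->record && map->record->has_dtor_kptr;
}
```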
Kumar,
thoughts?
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
@ 2026-05-11 19:22 ` Justin Suess
1 sibling, 0 replies; 16+ messages in thread
From: Justin Suess @ 2026-05-11 19:22 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: sashiko, bpf
On Mon, May 11, 2026 at 08:51:53AM -0700, Alexei Starovoitov wrote:
> On Sun May 10, 2026 at 6:49 PM PDT, Justin Suess wrote:
> > On Sun, May 10, 2026 at 03:38:08PM -0700, Alexei Starovoitov wrote:
> >> On Sun, May 10, 2026 at 8:14 AM Justin Suess <utilityemal77@gmail.com> wrote:
Here's a reproducer for the cgroup case:
https://gist.githubusercontent.com/RazeLighter777/5f77cdfe035a4e22ee2642ae7db6387d/raw/10898d27040a07098cccc5d0785d9ad6620344e7/cgroup_kptr_nmi_deadlock_repro
Hacked together with an AI prompt but functional.
Exercises a different path, but more consistently splats even without
CONFIG_RCU_NOCB_CPU / CONFIG_RCU_EXPERT since this dtor uses workqueue.
Had to use an fexit hook to get the timing condition right to release
the last cgroup reference.
But this lets you see the deadlock is indeed in the dtor in NMI.
This is on the same bpf-next/master
7e033543a2ab4c72319201298ed458e3bbddd82f:
[ 15.160694] ================================
[ 15.160695] WARNING: inconsistent lock state
[ 15.160695] 7.1.0-rc2-g7e033543a2ab-dirty #130 Not tainted
[ 15.160697] --------------------------------
[ 15.160697] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 15.160698] test_progs/434 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 15.160700] ffff9096fd66ced8 (&pool->lock){-.-.}-{2:2}, at: __queue_work+0xde/0x720
[ 15.160707] {INITIAL USE} state was registered at:
[ 15.160708] lock_acquire+0xbf/0x2e0
[ 15.160711] _raw_spin_lock+0x30/0x40
[ 15.160715] __queue_work+0xde/0x720
[ 15.160716] queue_work_on+0x54/0xa0
[ 15.160716] start_poll_synchronize_rcu_expedited+0xaf/0x110
[ 15.160719] rcu_init+0x958/0x990
[ 15.160722] start_kernel+0x746/0x980
[ 15.160725] x86_64_start_reservations+0x24/0x30
[ 15.160727] __pfx_reserve_bios_regions+0x0/0x10
[ 15.160729] common_startup_64+0x12c/0x138
[ 15.160731] irq event stamp: 18704
[ 15.160732] hardirqs last enabled at (18703): [<ffffffffa200148a>] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 15.160734] hardirqs last disabled at (18704): [<ffffffffa30e3f8f>] exc_nmi+0x7f/0x110
[ 15.160737] softirqs last enabled at (18698): [<ffffffffa22e5800>] __irq_exit_rcu+0xc0/0x100
[ 15.160739] softirqs last disabled at (18687): [<ffffffffa22e5800>] __irq_exit_rcu+0xc0/0x100
[ 15.160741]
[ 15.160741] other info that might help us debug this:
[ 15.160741] Possible unsafe locking scenario:
[ 15.160741]
[ 15.160742] CPU0
[ 15.160742] ----
[ 15.160742] lock(&pool->lock);
[ 15.160743] <Interrupt>
[ 15.160743] lock(&pool->lock);
[ 15.160744]
[ 15.160744] *** DEADLOCK ***
[ 15.160744]
[ 15.160744] no locks held by test_progs/434.
[ 15.160745]
[ 15.160745] stack backtrace:
[ 15.160747] CPU: 1 UID: 0 PID: 434 Comm: test_progs Not tainted 7.1.0-rc2-g7e033543a2ab-dirty #130 PREEMPT(full)
[ 15.160749] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 15.160750] Call Trace:
[ 15.160751] <TASK>
[ 15.160753] dump_stack_lvl+0x5d/0x80
[ 15.160757] print_usage_bug.part.0+0x22b/0x2c0
[ 15.160760] lock_acquire+0x295/0x2e0
[ 15.160762] ? srso_alias_return_thunk+0x5/0xfbef5
[ 15.160763] ? __queue_work+0xde/0x720
[ 15.160767] _raw_spin_lock+0x30/0x40
[ 15.160768] ? __queue_work+0xde/0x720
[ 15.160769] __queue_work+0xde/0x720
[ 15.160772] queue_work_on+0x54/0xa0
[ 15.160774] bpf_cgroup_release_dtor+0x12e/0x140
[ 15.160778] bpf_obj_free_fields+0x118/0x250
[ 15.160782] free_htab_elem+0x85/0xd0
[ 15.160785] htab_map_delete_elem+0x168/0x230
[ 15.160790] bpf_prog_23fcbbeb395ac6b4_clear_cgroup_kptrs_from_nmi+0x54/0x74
[ 15.160792] bpf_trace_run3+0x126/0x430
[ 15.160795] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 15.160799] nmi_handle.part.0+0x15b/0x250
[ 15.160802] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 15.160804] default_do_nmi+0x120/0x180
[ 15.160807] exc_nmi+0xe3/0x110
[ 15.160809] asm_exc_nmi+0xb7/0x100
[ 15.160810] RIP: 0033:0x5607a669541b
[ 15.160813] Code: c7 45 f0 00 00 00 00 eb 1a 8b 55 f0 8b 45 f4 01 d0 48 63 d0 48 8b 45 a8 48 01 d0 48 89 45 a8 83 45 f0 01 81 7d f0 3f 42 0f 00 <7e> dd e8 7e f5 ff ff 48 89 45 f8 48 8b 45 f8 48 3b 45 e8 73 16 83
[ 15.160814] RSP: 002b:00007ffdb09c1dc0 EFLAGS: 00000293
[ 15.160816] RAX: 0000003aced4e2f4 RBX: 00007f1d8d574000 RCX: 000000000000000f
[ 15.160816] RDX: 00000000000ad857 RSI: 00007f1d8d577000 RDI: 0000000000000001
[ 15.160817] RBP: 00007ffdb09c1e30 R08: 00007ffdb09c1da0 R09: 00007f1d8d577010
[ 15.160818] R10: 0000000000001614 R11: 0009718b9187183f R12: 0000000000000003
[ 15.160818] R13: 00007f1d8d5b6000 R14: 00007ffdb09c3358 R15: 00005607a9daf890
[ 15.160824] </TASK>
[ 15.214040] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 15.246002] perf: interrupt took too long (3135 > 3126), lowering kernel.perf_event_max_sample_rate to 63000
[ 15.308032] perf: interrupt took too long (3928 > 3918), lowering kernel.perf_event_max_sample_rate to 50000
[ 15.500072] perf: interrupt took too long (4912 > 4910), lowering kernel.perf_event_max_sample_rate to 40000
Justin
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 17:18 ` Alexei Starovoitov
@ 2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess
0 siblings, 1 reply; 16+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-11 20:10 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf
On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > [ 21.604660] Call Trace:
> > [ 21.604662] <TASK>
> > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > [ 21.604669] lock_acquire+0x295/0x2e0
> > [ 21.604671] ? terminate_walk+0x33/0x160
> > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > [ 21.604679] _raw_spin_lock+0x30/0x40
> > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > [ 21.604691] free_htab_elem+0x85/0xd0
> > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > [ 21.604700] bpf_trace_run3+0x126/0x430
>
> that's better.
> Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> but left check_and_free_fields() in free_htab_elem().
>
> I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> and fallback to bpf_mem_alloc at map create time when map has kptrs
> with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>
> Kumar,
>
> thoughts?
>
>
Yeah, removing it from the path that helpers can invoke seems simpler.
Remember though, this splat is just for hashtab, we have similar
bpf_obj_free_fields() in array map on update. I think fundamentally
the main issue here is that we logically free special fields when a
map value is freed or deleted. When updating array maps we logically
'free' and then 'update' the same map value together. For hashtab, it
happens on update/delete.
We could relax this behavior to avoid eagerly freeing these special
fields on update or deletion. The only worry is how this would impact
programs that have come to rely on the existing behavior. There are
patterns where people expect kptr to be NULL on some new map value,
which causes programs to return errors when that expectation is not
met. Just doing the skip when irqs_disabled() doesn't save us from the
surprise side-effect. We need to decide upon this first before
discussing the shape of the solution.
This is the theoretical concern; In practice, I think most people who
depend on such behavior use kptr in local storage maps (in
schedulers). So it probably won't be a problem in practice, even
though we can't judge this ahead of time. Also, we eagerly reuse map
values when using memalloc, so the guarantees are already pretty weak
I guess.
So, if we are not going to go through a grace period (like local
storage) and free back to kernel allocator before reuse, we should
relax field freeing behavior. At best, we should cancel work for
> timer, wq, and task_work, leaving other items as-is. E.g.
BPF_UPTR is used in task storage which I think is accessible to
tracing programs, I am not sure how safe unpin_user_page() is when
called from random reentrant contexts. We might have more cases in the
future, we cannot guarantee we can handle everything in NMIs
universally.
So the best course of action seems to be relaxing
bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
on async work (timer, wq, task_work) for delete / update and let other
fields be as-is. We likely need to do bpf_obj_free_fields()
additionally before prealloc_destroy() now, but that should be simple.
Whether or not to use bpf_ma when kptrs are used in prealloc map is a
separate change.
This should hopefully resolve the issue, unless I missed other cases.
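To make the shape concrete, a toy model of that split (names and types are
illustrative, not the real btf_field API): delete/update only cancels async
work, and kptrs / graph roots are left for the final free in a safe context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative field kinds, loosely mirroring the btf_field_type flavors. */
enum toy_field_kind {
	TOY_TIMER, TOY_WQ, TOY_TASK_WORK,	/* async work: any-context cancel */
	TOY_KPTR, TOY_LIST_HEAD, TOY_RB_ROOT,	/* deferred to final free */
};

struct toy_field {
	enum toy_field_kind kind;
	bool cancelled;
	bool freed;
};

/* Delete/update path (may run in NMI): cancel async work only. */
static void toy_obj_cancel_fields(struct toy_field *f, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		switch (f[i].kind) {
		case TOY_TIMER:
		case TOY_WQ:
		case TOY_TASK_WORK:
			f[i].cancelled = true;
			break;
		default:
			break;	/* kptrs, list heads, rb roots stay as-is */
		}
	}
}

/* Allocator-destroy path, known safe context: free everything. */
static void toy_obj_free_fields(struct toy_field *f, size_t n)
{
	toy_obj_cancel_fields(f, n);
	for (size_t i = 0; i < n; i++)
		f[i].freed = true;
}
```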
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 1:43 ` Justin Suess
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 16+ messages in thread
From: Justin Suess @ 2026-05-12 1:43 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: Alexei Starovoitov, sashiko, bpf
On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > > [ 21.604660] Call Trace:
> > > [ 21.604662] <TASK>
> > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > > [ 21.604669] lock_acquire+0x295/0x2e0
> > > [ 21.604671] ? terminate_walk+0x33/0x160
> > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604679] _raw_spin_lock+0x30/0x40
> > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > > [ 21.604691] free_htab_elem+0x85/0xd0
> > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > > [ 21.604700] bpf_trace_run3+0x126/0x430
> >
> > that's better.
> > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> > but left check_and_free_fields() in free_htab_elem().
> >
> > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> > and fallback to bpf_mem_alloc at map create time when map has kptrs
> > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> >
> > Kumar,
> >
> > thoughts?
> >
> >
>
> Yeah, removing it from the path that helpers can invoke seems simpler.
> Remember though, this splat is just for hashtab, we have similar
> bpf_obj_free_fields() in array map on update. I think fundamentally
> the main issue here is that we logically free special fields when a
> map value is freed or deleted. When updating array maps we logically
> 'free' and then 'update' the same map value together. For hashtab, it
> happens on update/delete.
>
> We could relax this behavior to avoid eagerly freeing these special
> fields on update or deletion. The only worry is how this would impact
> programs that have come to rely on the existing behavior. There are
> patterns where people expect kptr to be NULL on some new map value,
> which causes programs to return errors when that expectation is not
> met. Just doing the skip when irqs_disabled() doesn't save us from the
> surprise side-effect. We need to decide upon this first before
> discussing the shape of the solution.
>
> This is the theoretical concern; In practice, I think most people who
> depend on such behavior use kptr in local storage maps (in
> schedulers). So it probably won't be a problem in practice, even
> though we can't judge this ahead of time. Also, we eagerly reuse map
> values when using memalloc, so the guarantees are already pretty weak
> I guess.
>
> So, if we are not going to go through a grace period (like local
> storage) and free back to kernel allocator before reuse, we should
> relax field freeing behavior. At best, we should cancel work for
> timer, wq, task_work, and task_work, leaving other items as-is. E.g.
> BPF_UPTR is used in task storage which I think is accessible to
> tracing programs, I am not sure how safe unpin_user_page() is when
> called from random reentrant contexts. We might have more cases in the
> future, we cannot guarantee we can handle everything in NMIs
> universally.
>
> So the best course of action seems to be relaxing
> bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> on async work (timer, wq, task_work) for delete / update and let other
> fields be as-is. We likely need to do bpf_obj_free_fields()
> additionally before prealloc_destroy() now, but that should be simple.
> Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> separate change.
>
> This should hopefully resolve the issue, unless I missed other cases.
This does sound good. So you'd set up bpf_obj_free_fields() in the
htab allocator dtor for the final free and rely on the allocator's
existing NMI deferral?
The missing piece is whether to handle this differently in NMI or just
always do it with the deferral. Also the prealloc question needs
answering.
This fix would preserve the xchg inlining, allow failing with -ENOMEM
on exhaustion, and avoid any weird locking or atomic counting.
Appreciate the time from both of you; I learned a lot more about NMI
than I bargained for writing this. There are probably more weird NMI
cases. FWIW, there are zero tp_btf/nmi_handler progs anywhere I can
find on the internet, so this is a weird edge case.
Since I brought this issue up I'll finish it out and send patches next
week.
Even if a generic file kptr isn't possible, hopefully this can stop sashiko
from complaining that every newly introduced kptr dtor is unsafe in NMI :)
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:43 ` Justin Suess
@ 2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov
0 siblings, 1 reply; 16+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 1:46 UTC (permalink / raw)
To: Justin Suess; +Cc: Alexei Starovoitov, sashiko, bpf
On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> > > > [ 21.604660] Call Trace:
> > > > [ 21.604662] <TASK>
> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> > > > [ 21.604669] lock_acquire+0x295/0x2e0
> > > > [ 21.604671] ? terminate_walk+0x33/0x160
> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> > > > [ 21.604691] free_htab_elem+0x85/0xd0
> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
> > >
> > > that's better.
> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> > > but left check_and_free_fields() in free_htab_elem().
> > >
> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> > >
> > > Kumar,
> > >
> > > thoughts?
> > >
> > >
> >
> > Yeah, removing it from the path that helpers can invoke seems simpler.
> > Remember though, this splat is just for hashtab, we have similar
> > bpf_obj_free_fields() in array map on update. I think fundamentally
> > the main issue here is that we logically free special fields when a
> > map value is freed or deleted. When updating array maps we logically
> > 'free' and then 'update' the same map value together. For hashtab, it
> > happens on update/delete.
> >
> > We could relax this behavior to avoid eagerly freeing these special
> > fields on update or deletion. The only worry is how this would impact
> > programs that have come to rely on the existing behavior. There are
> > patterns where people expect kptr to be NULL on some new map value,
> > which causes programs to return errors when that expectation is not
> > met. Just doing the skip when irqs_disabled() doesn't save us from the
> > surprise side-effect. We need to decide upon this first before
> > discussing the shape of the solution.
> >
> > This is the theoretical concern; In practice, I think most people who
> > depend on such behavior use kptr in local storage maps (in
> > schedulers). So it probably won't be a problem in practice, even
> > though we can't judge this ahead of time. Also, we eagerly reuse map
> > values when using memalloc, so the guarantees are already pretty weak
> > I guess.
> >
> > So, if we are not going to go through a grace period (like local
> > storage) and free back to kernel allocator before reuse, we should
> > relax field freeing behavior. At best, we should cancel work for
> > timer, wq, task_work, and task_work, leaving other items as-is. E.g.
> > BPF_UPTR is used in task storage which I think is accessible to
> > tracing programs, I am not sure how safe unpin_user_page() is when
> > called from random reentrant contexts. We might have more cases in the
> > future, we cannot guarantee we can handle everything in NMIs
> > universally.
> >
> > So the best course of action seems to be relaxing
> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> > on async work (timer, wq, task_work) for delete / update and let other
> > fields be as-is. We likely need to do bpf_obj_free_fields()
> > additionally before prealloc_destroy() now, but that should be simple.
> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> > separate change.
> >
> > This should hopefully resolve the issue, unless I missed other cases.
> This does sound good, so you'd set the bpf_obj_free_fields up in the
> htab allocator dtor for the final free and rely on the allocators
> existing nmi deferral?
It is already set, except for prealloc maps. But we can call it before
destroying the pcpu freelist etc.
>
> The missing piece is whether to handle this differently in NMI or just
> always do it with the deferral. Also the prealloc question needs
> answering.
There is no deferral here. I'm saying that we just cancel for timer,
wq, task work, and leave other fields as is. So we don't have active
work pending for async items.
So as long as the item keeps getting recycled in the allocator, we
don't free these fields. Once the memalloc is destroyed, the dtor runs
in a known safe context where we can assume bpf_obj_free_fields won't
deadlock or run into any problems.
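In other words, the element lifecycle would look roughly like this toy model
(illustrative names; recycling inside the allocator never touches special
fields, only the final teardown does):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy element: 'fields_live' models kptrs/graph roots still owned. */
struct toy_elem {
	bool in_allocator;	/* sitting on the bpf_mem_alloc freelist */
	bool fields_live;	/* special fields not yet freed */
};

/* Map delete from any context (even NMI): element goes back to the
 * allocator's freelist; special fields are deliberately left alone. */
static void toy_recycle(struct toy_elem *e)
{
	e->in_allocator = true;	/* fields_live stays true */
}

/* Reuse for a new value: still no field freeing on this path. */
static void toy_reuse(struct toy_elem *e)
{
	e->in_allocator = false;
}

/* Allocator destruction, known safe (non-NMI) context: only here do
 * the dtors actually run, so they may take locks, call_rcu(), etc. */
static void toy_ma_destroy(struct toy_elem *e)
{
	e->fields_live = false;
}
```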
>
> [...]
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2026-05-12 1:55 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Justin Suess; +Cc: sashiko, bpf
On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>>
>> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
>> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
>> > <alexei.starovoitov@gmail.com> wrote:
>> > >
>> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
>> > > > [ 21.604660] Call Trace:
>> > > > [ 21.604662] <TASK>
>> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
>> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
>> > > > [ 21.604669] lock_acquire+0x295/0x2e0
>> > > > [ 21.604671] ? terminate_walk+0x33/0x160
>> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
>> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
>> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
>> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
>> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
>> > > > [ 21.604691] free_htab_elem+0x85/0xd0
>> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
>> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
>> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
>> > >
>> > > that's better.
>> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
>> > > but left check_and_free_fields() in free_htab_elem().
>> > >
>> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
>> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
>> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>> > >
>> > > Kumar,
>> > >
>> > > thoughts?
>> > >
>> > >
>> >
>> > Yeah, removing it from the path that helpers can invoke seems simpler.
>> > Remember though, this splat is just for hashtab, we have similar
>> > bpf_obj_free_fields() in array map on update. I think fundamentally
>> > the main issue here is that we logically free special fields when a
>> > map value is freed or deleted. When updating array maps we logically
>> > 'free' and then 'update' the same map value together. For hashtab, it
>> > happens on update/delete.
>> >
>> > We could relax this behavior to avoid eagerly freeing these special
>> > fields on update or deletion. The only worry is how this would impact
>> > programs that have come to rely on the existing behavior. There are
>> > patterns where people expect kptr to be NULL on some new map value,
>> > which causes programs to return errors when that expectation is not
>> > met. Just doing the skip when irqs_disabled() doesn't save us from the
>> > surprise side-effect. We need to decide upon this first before
>> > discussing the shape of the solution.
>> >
>> > This is the theoretical concern; In practice, I think most people who
>> > depend on such behavior use kptr in local storage maps (in
>> > schedulers). So it probably won't be a problem in practice, even
>> > though we can't judge this ahead of time. Also, we eagerly reuse map
>> > values when using memalloc, so the guarantees are already pretty weak
>> > I guess.
>> >
>> > So, if we are not going to go through a grace period (like local
>> > storage) and free back to kernel allocator before reuse, we should
>> > relax field freeing behavior. At best, we should cancel work for
>> > timer, wq, task_work, and task_work, leaving other items as-is. E.g.
>> > BPF_UPTR is used in task storage which I think is accessible to
>> > tracing programs, I am not sure how safe unpin_user_page() is when
>> > called from random reentrant contexts. We might have more cases in the
>> > future, we cannot guarantee we can handle everything in NMIs
>> > universally.
>> >
>> > So the best course of action seems to be relaxing
>> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
>> > on async work (timer, wq, task_work) for delete / update and let other
>> > fields be as-is. We likely need to do bpf_obj_free_fields()
>> > additionally before prealloc_destroy() now, but that should be simple.
>> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
>> > separate change.
>> >
>> > This should hopefully resolve the issue, unless I missed other cases.
>> This does sound good, so you'd set the bpf_obj_free_fields up in the
>> htab allocator dtor for the final free and rely on the allocators
>> existing nmi deferral?
>
> It is already set, except for prealloc maps. But we can call it before
> destroying the pcpu freelist etc.
htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
>>
>> The missing piece is whether to handle this differently in NMI or just
>> always do it with the deferral. Also the prealloc question needs
>> answering.
>
> There is no deferral here. I'm saying that we just cancel for timer,
> wq, task work, and leave other fields as is. So we don't have active
> work pending for async items.
>
> So as long as the item keeps getting recycled in the allocator, we
> don't free these fields. Once the memalloc is destroyed, the dtor runs
> in a known safe context where we can assume bpf_obj_free_fields won't
> deadlock or run into any problems.
So the plan is to do
if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
just ignore it?
And no other changes anywhere at all?
That would be too good to be true :)
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:55 ` Alexei Starovoitov
@ 2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:07 ` Justin Suess
1 sibling, 1 reply; 16+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:03 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf
On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> >>
> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> >> > <alexei.starovoitov@gmail.com> wrote:
> >> > >
> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> >> > > > [ 21.604660] Call Trace:
> >> > > > [ 21.604662] <TASK>
> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> >> > > > [ 21.604669] lock_acquire+0x295/0x2e0
> >> > > > [ 21.604671] ? terminate_walk+0x33/0x160
> >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
> >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> >> > > > [ 21.604691] free_htab_elem+0x85/0xd0
> >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
> >> > >
> >> > > that's better.
> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> >> > > but left check_and_free_fields() in free_htab_elem().
> >> > >
> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> >> > >
> >> > > Kumar,
> >> > >
> >> > > thoughts?
> >> > >
> >> > >
> >> >
> >> > Yeah, removing it from the path that helpers can invoke seems simpler.
> >> > Remember though, this splat is just for hashtab, we have similar
> >> > bpf_obj_free_fields() in array map on update. I think fundamentally
> >> > the main issue here is that we logically free special fields when a
> >> > map value is freed or deleted. When updating array maps we logically
> >> > 'free' and then 'update' the same map value together. For hashtab, it
> >> > happens on update/delete.
> >> >
> >> > We could relax this behavior to avoid eagerly freeing these special
> >> > fields on update or deletion. The only worry is how this would impact
> >> > programs that have come to rely on the existing behavior. There are
> >> > patterns where people expect kptr to be NULL on some new map value,
> >> > which causes programs to return errors when that expectation is not
> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the
> >> > surprise side-effect. We need to decide upon this first before
> >> > discussing the shape of the solution.
> >> >
> >> > This is the theoretical concern; In practice, I think most people who
> >> > depend on such behavior use kptr in local storage maps (in
> >> > schedulers). So it probably won't be a problem in practice, even
> >> > though we can't judge this ahead of time. Also, we eagerly reuse map
> >> > values when using memalloc, so the guarantees are already pretty weak
> >> > I guess.
> >> >
> >> > So, if we are not going to go through a grace period (like local
> >> > storage) and free back to kernel allocator before reuse, we should
> >> > relax field freeing behavior. At best, we should cancel work for
> >> > timer, wq, task_work, and task_work, leaving other items as-is. E.g.
> >> > BPF_UPTR is used in task storage which I think is accessible to
> >> > tracing programs, I am not sure how safe unpin_user_page() is when
> >> > called from random reentrant contexts. We might have more cases in the
> >> > future, we cannot guarantee we can handle everything in NMIs
> >> > universally.
> >> >
> >> > So the best course of action seems to be relaxing
> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> >> > on async work (timer, wq, task_work) for delete / update and let other
> >> > fields be as-is. We likely need to do bpf_obj_free_fields()
> >> > additionally before prealloc_destroy() now, but that should be simple.
> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> >> > separate change.
> >> >
> >> > This should hopefully resolve the issue, unless I missed other cases.
> >> This does sound good, so you'd set up bpf_obj_free_fields() in the
> >> htab allocator dtor for the final free and rely on the allocator's
> >> existing NMI deferral?
> >
> > It is already set, except for prealloc maps. But we can call it before
> > destroying the pcpu freelist etc.
>
> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
>
> >>
> >> The missing piece is whether to handle this differently in NMI or just
> >> always do it with the deferral. Also the prealloc question needs
> >> answering.
> >
> > There is no deferral here. I'm saying that we just cancel for timer,
> > wq, task work, and leave other fields as is. So we don't have active
> > work pending for async items.
> >
> > So as long as the item keeps getting recycled in the allocator, we
> > don't free these fields. Once the memalloc is destroyed, the dtor runs
> > in a known safe context where we can assume bpf_obj_free_fields won't
> > deadlock or run into any problems.
>
> So the plan is to do
> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> just ignore it?
> And no other changes anywhere at all?
>
> That would be too good to be true :)
I don't know whether in_nmi() would be sufficient; we likely need
irqs_disabled()? At that point, why not always ignore it, since whether
freeing the fields is safe depends on where you're running. I would still
cancel async fields, since they're already any-context safe.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 16+ messages in thread
From: Justin Suess @ 2026-05-12 2:07 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Kumar Kartikeya Dwivedi, sashiko, bpf
On Mon, May 11, 2026 at 06:55:50PM -0700, Alexei Starovoitov wrote:
> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> >>
> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> >> > <alexei.starovoitov@gmail.com> wrote:
> > There is no deferral here. I'm saying that we just cancel for timer,
> > wq, task work, and leave other fields as is. So we don't have active
> > work pending for async items.
> >
> > So as long as the item keeps getting recycled in the allocator, we
> > don't free these fields. Once the memalloc is destroyed, the dtor runs
> > in a known safe context where we can assume bpf_obj_free_fields won't
> > deadlock or run into any problems.
>
> So the plan is to do
> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> just ignore it?
> And no other changes anywhere at all?
>
> That would be too good to be true :)
I guess the last question I have is what would prevent an xchg() out of
a map on a recycled kptr from causing a UAF?
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 2:07 ` Justin Suess
@ 2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 16+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:08 UTC (permalink / raw)
To: Justin Suess; +Cc: Alexei Starovoitov, sashiko, bpf
On Tue, 12 May 2026 at 04:07, Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Mon, May 11, 2026 at 06:55:50PM -0700, Alexei Starovoitov wrote:
> > On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> > >>
> > >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> > >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> > >> > <alexei.starovoitov@gmail.com> wrote:
> > > There is no deferral here. I'm saying that we just cancel for timer,
> > > wq, task work, and leave other fields as is. So we don't have active
> > > work pending for async items.
> > >
> > > So as long as the item keeps getting recycled in the allocator, we
> > > don't free these fields. Once the memalloc is destroyed, the dtor runs
> > > in a known safe context where we can assume bpf_obj_free_fields won't
> > > deadlock or run into any problems.
> >
> > So the plan is to do
> > if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> > just ignore it?
> > And no other changes anywhere at all?
> >
> > That would be too good to be true :)
> I guess the last question I have is what would prevent an xchg() out of
> a map on a recycled kptr from causing a UAF?
xchg() is atomic, so either the program now has the value or the map
still does; whoever holds it will end up freeing it. If the program has
it, then the program frees it. I don't see why there would be a UAF.
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2026-05-12 2:10 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi; +Cc: Justin Suess, sashiko, bpf
On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
>> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
>> >>
>> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
>> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
>> >> > <alexei.starovoitov@gmail.com> wrote:
>> >> > >
>> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
>> >> > > > [ 21.604660] Call Trace:
>> >> > > > [ 21.604662] <TASK>
>> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
>> >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
>> >> > > > [ 21.604669] lock_acquire+0x295/0x2e0
>> >> > > > [ 21.604671] ? terminate_walk+0x33/0x160
>> >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
>> >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
>> >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
>> >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
>> >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
>> >> > > > [ 21.604691] free_htab_elem+0x85/0xd0
>> >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
>> >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
>> >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
>> >> > >
>> >> > > that's better.
>> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
>> >> > > but left check_and_free_fields() in free_htab_elem().
>> >> > >
>> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
>> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
>> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
>> >> > >
>> >> > > Kumar,
>> >> > >
>> >> > > thoughts?
>> >> > >
>> >> > >
>> >> >
>> >> > Yeah, removing it from the path that helpers can invoke seems simpler.
>> >> > Remember though, this splat is just for hashtab, we have similar
>> >> > bpf_obj_free_fields() in array map on update. I think fundamentally
>> >> > the main issue here is that we logically free special fields when a
>> >> > map value is freed or deleted. When updating array maps we logically
>> >> > 'free' and then 'update' the same map value together. For hashtab, it
>> >> > happens on update/delete.
>> >> >
>> >> > We could relax this behavior to avoid eagerly freeing these special
>> >> > fields on update or deletion. The only worry is how this would impact
>> >> > programs that have come to rely on the existing behavior. There are
>> >> > patterns where people expect kptr to be NULL on some new map value,
>> >> > which causes programs to return errors when that expectation is not
>> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the
>> >> > surprise side-effect. We need to decide upon this first before
>> >> > discussing the shape of the solution.
>> >> >
>> >> > This is the theoretical concern; in practice, I think most people who
>> >> > depend on such behavior use kptr in local storage maps (in
>> >> > schedulers). So it probably won't be a problem in practice, even
>> >> > though we can't judge this ahead of time. Also, we eagerly reuse map
>> >> > values when using memalloc, so the guarantees are already pretty weak
>> >> > I guess.
>> >> >
>> >> > So, if we are not going to go through a grace period (like local
>> >> > storage) and free back to kernel allocator before reuse, we should
>> >> > relax field freeing behavior. At best, we should cancel work for
>> >> > timer, wq, and task_work, leaving other items as-is. E.g.
>> >> > BPF_UPTR is used in task storage which I think is accessible to
>> >> > tracing programs, I am not sure how safe unpin_user_page() is when
>> >> > called from random reentrant contexts. We might have more cases in the
>> >> > future, we cannot guarantee we can handle everything in NMIs
>> >> > universally.
>> >> >
>> >> > So the best course of action seems to be relaxing
>> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
>> >> > on async work (timer, wq, task_work) for delete / update and let other
>> >> > fields be as-is. We likely need to do bpf_obj_free_fields()
>> >> > additionally before prealloc_destroy() now, but that should be simple.
>> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
>> >> > separate change.
>> >> >
>> >> > This should hopefully resolve the issue, unless I missed other cases.
>> >> This does sound good, so you'd set up bpf_obj_free_fields() in the
>> >> htab allocator dtor for the final free and rely on the allocator's
>> >> existing NMI deferral?
>> >
>> > It is already set, except for prealloc maps. But we can call it before
>> > destroying the pcpu freelist etc.
>>
>> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
>> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
>>
>> >>
>> >> The missing piece is whether to handle this differently in NMI or just
>> >> always do it with the deferral. Also the prealloc question needs
>> >> answering.
>> >
>> > There is no deferral here. I'm saying that we just cancel for timer,
>> > wq, task work, and leave other fields as is. So we don't have active
>> > work pending for async items.
>> >
>> > So as long as the item keeps getting recycled in the allocator, we
>> > don't free these fields. Once the memalloc is destroyed, the dtor runs
>> > in a known safe context where we can assume bpf_obj_free_fields won't
>> > deadlock or run into any problems.
>>
>> So the plan is to do
>> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
>> just ignore it?
>> And no other changes anywhere at all?
>>
>> That would be too good to be true :)
>
> I don't know whether in_nmi() would be sufficient, we likely need
> irqs_disabled()?
Fair, irqs_disabled() is safer.
> At that point, why not always ignore it, since
> freeing the fields is dependent on where you're running. I would still
> cancel async fields, since they're already any-context safe.
you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
regardless of running context and rely on final cleanup?
That's an idea to consider, but I suspect some rbtree, linked list
tests will fail.
* Re: [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI
2026-05-12 2:10 ` Alexei Starovoitov
@ 2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 16+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:13 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Justin Suess, sashiko, bpf
On Tue, 12 May 2026 at 04:10, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 7:03 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 12 May 2026 at 03:55, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Mon May 11, 2026 at 6:46 PM PDT, Kumar Kartikeya Dwivedi wrote:
> >> > On Tue, 12 May 2026 at 03:43, Justin Suess <utilityemal77@gmail.com> wrote:
> >> >>
> >> >> On Mon, May 11, 2026 at 10:10:07PM +0200, Kumar Kartikeya Dwivedi wrote:
> >> >> > On Mon, 11 May 2026 at 19:29, Alexei Starovoitov
> >> >> > <alexei.starovoitov@gmail.com> wrote:
> >> >> > >
> >> >> > > On Mon May 11, 2026 at 9:38 AM PDT, Justin Suess wrote:
> >> >> > > > [ 21.604660] Call Trace:
> >> >> > > > [ 21.604662] <TASK>
> >> >> > > > [ 21.604663] dump_stack_lvl+0x5d/0x80
> >> >> > > > [ 21.604666] print_usage_bug.part.0+0x22b/0x2c0
> >> >> > > > [ 21.604669] lock_acquire+0x295/0x2e0
> >> >> > > > [ 21.604671] ? terminate_walk+0x33/0x160
> >> >> > > > [ 21.604674] ? __call_rcu_common.constprop.0+0x309/0x730
> >> >> > > > [ 21.604679] _raw_spin_lock+0x30/0x40
> >> >> > > > [ 21.604680] ? __call_rcu_common.constprop.0+0x309/0x730
> >> >> > > > [ 21.604682] __call_rcu_common.constprop.0+0x309/0x730
> >> >> > > > [ 21.604686] bpf_obj_free_fields+0x118/0x250
> >> >> > > > [ 21.604691] free_htab_elem+0x85/0xd0
> >> >> > > > [ 21.604694] htab_map_delete_elem+0x168/0x230
> >> >> > > > [ 21.604698] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xeb/0x144
> >> >> > > > [ 21.604700] bpf_trace_run3+0x126/0x430
> >> >> > >
> >> >> > > that's better.
> >> >> > > Looks like we moved bpf_obj_free_fields() into htab_mem_dtor(),
> >> >> > > but left check_and_free_fields() in free_htab_elem().
> >> >> > >
> >> >> > > I think the fix is to remove check_and_free_fields() from ma path in free_htab_elem()
> >> >> > > and fallback to bpf_mem_alloc at map create time when map has kptrs
> >> >> > > with dtors. Even when BPF_F_NO_PREALLOC is not specified.
> >> >> > >
> >> >> > > Kumar,
> >> >> > >
> >> >> > > thoughts?
> >> >> > >
> >> >> > >
> >> >> >
> >> >> > Yeah, removing it from the path that helpers can invoke seems simpler.
> >> >> > Remember though, this splat is just for hashtab, we have similar
> >> >> > bpf_obj_free_fields() in array map on update. I think fundamentally
> >> >> > the main issue here is that we logically free special fields when a
> >> >> > map value is freed or deleted. When updating array maps we logically
> >> >> > 'free' and then 'update' the same map value together. For hashtab, it
> >> >> > happens on update/delete.
> >> >> >
> >> >> > We could relax this behavior to avoid eagerly freeing these special
> >> >> > fields on update or deletion. The only worry is how this would impact
> >> >> > programs that have come to rely on the existing behavior. There are
> >> >> > patterns where people expect kptr to be NULL on some new map value,
> >> >> > which causes programs to return errors when that expectation is not
> >> >> > met. Just doing the skip when irqs_disabled() doesn't save us from the
> >> >> > surprise side-effect. We need to decide upon this first before
> >> >> > discussing the shape of the solution.
> >> >> >
> >> >> > This is the theoretical concern; in practice, I think most people who
> >> >> > depend on such behavior use kptr in local storage maps (in
> >> >> > schedulers). So it probably won't be a problem in practice, even
> >> >> > though we can't judge this ahead of time. Also, we eagerly reuse map
> >> >> > values when using memalloc, so the guarantees are already pretty weak
> >> >> > I guess.
> >> >> >
> >> >> > So, if we are not going to go through a grace period (like local
> >> >> > storage) and free back to kernel allocator before reuse, we should
> >> >> > relax field freeing behavior. At best, we should cancel work for
> >> >> > timer, wq, and task_work, leaving other items as-is. E.g.
> >> >> > BPF_UPTR is used in task storage which I think is accessible to
> >> >> > tracing programs, I am not sure how safe unpin_user_page() is when
> >> >> > called from random reentrant contexts. We might have more cases in the
> >> >> > future, we cannot guarantee we can handle everything in NMIs
> >> >> > universally.
> >> >> >
> >> >> > So the best course of action seems to be relaxing
> >> >> > bpf_obj_free_fields() to bpf_obj_cancel_fields() that just does cancel
> >> >> > on async work (timer, wq, task_work) for delete / update and let other
> >> >> > fields be as-is. We likely need to do bpf_obj_free_fields()
> >> >> > additionally before prealloc_destroy() now, but that should be simple.
> >> >> > Whether or not to use bpf_ma when kptrs are used in prealloc map is a
> >> >> > separate change.
> >> >> >
> >> >> > This should hopefully resolve the issue, unless I missed other cases.
> >> >> This does sound good, so you'd set up bpf_obj_free_fields() in the
> >> >> htab allocator dtor for the final free and rely on the allocator's
> >> >> existing NMI deferral?
> >> >
> >> > It is already set, except for prealloc maps. But we can call it before
> >> > destroying the pcpu freelist etc.
> >>
> >> htab_map_free->htab_free_prealloced_fields does bpf_obj_free_fields already.
> >> So scratch my suggestion to force bpf_mem_alloc on preallocated hash maps.
> >>
> >> >>
> >> >> The missing piece is whether to handle this differently in NMI or just
> >> >> always do it with the deferral. Also the prealloc question needs
> >> >> answering.
> >> >
> >> > There is no deferral here. I'm saying that we just cancel for timer,
> >> > wq, task work, and leave other fields as is. So we don't have active
> >> > work pending for async items.
> >> >
> >> > So as long as the item keeps getting recycled in the allocator, we
> >> > don't free these fields. Once the memalloc is destroyed, the dtor runs
> >> > in a known safe context where we can assume bpf_obj_free_fields won't
> >> > deadlock or run into any problems.
> >>
> >> So the plan is to do
> >> if (in_nmi()) && case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> >> just ignore it?
> >> And no other changes anywhere at all?
> >>
> >> That would be too good to be true :)
> >
> > I don't know whether in_nmi() would be sufficient, we likely need
> > irqs_disabled()?
>
> Fair, irqs_disabled() is safer.
>
> > At that point, why not always ignore it, since
> > freeing the fields is dependent on where you're running. I would still
> > cancel async fields, since they're already any-context safe.
>
> you mean never touch case BPF_KPTR* | BPF_LIST_HEAD | BPF_RB_ROOT
> regardless of running context and rely on final cleanup?
> That's an idea to consider, but I suspect some rbtree, linked list
> tests will fail.
Yeah, but we should see which ones do; some of them were written with the
expectation of testing kernel cleanup on map value delete / update etc.,
so those don't count as much. Unless this change trips up a real example
in the tests, it will likely go unnoticed (famous last words).
end of thread, other threads:[~2026-05-12 2:13 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260507175453.1140400-2-utilityemal77@gmail.com>
[not found] ` <20260507234520.646C4C2BCB2@smtp.kernel.org>
2026-05-10 15:13 ` [bpf-next v3 1/2] bpf: Offload kptr destructors that run from NMI Justin Suess
2026-05-10 22:38 ` Alexei Starovoitov
2026-05-11 1:49 ` Justin Suess
2026-05-11 15:51 ` Alexei Starovoitov
2026-05-11 16:38 ` Justin Suess
2026-05-11 17:18 ` Alexei Starovoitov
2026-05-11 20:10 ` Kumar Kartikeya Dwivedi
2026-05-12 1:43 ` Justin Suess
2026-05-12 1:46 ` Kumar Kartikeya Dwivedi
2026-05-12 1:55 ` Alexei Starovoitov
2026-05-12 2:03 ` Kumar Kartikeya Dwivedi
2026-05-12 2:10 ` Alexei Starovoitov
2026-05-12 2:13 ` Kumar Kartikeya Dwivedi
2026-05-12 2:07 ` Justin Suess
2026-05-12 2:08 ` Kumar Kartikeya Dwivedi
2026-05-11 19:22 ` Justin Suess
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox