* [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
@ 2026-04-21 20:10 Justin Suess
2026-04-21 20:23 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 7+ messages in thread
From: Justin Suess @ 2026-04-21 20:10 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, yonghong.song,
jolsa
Hello,
I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
It was found after further investigation from a Sashiko report on my patch:
https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
The issue is reproducible with a BPF selftest-derived reproducer that:
1. Stores exited task references in a BPF hash map as refcounted task kptrs.
2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
3. Runs on an `rcu_nocbs` CPU.
In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
`perf_sched_delayed`
` -> static_key_disable()`
` -> arch_jump_label_transform_apply()`
` -> smp_text_poke_batch_finish()`
` -> on_each_cpu_cond_mask()`
` -> smp_call_function_many_cond()`
The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on an offloaded RCU callback CPU.
Affected code path is a dtor triggered by deleting the last reference to a task_struct kptr:
`bpf_map_delete_elem()`
` -> htab_map_delete_elem()`
` -> free_htab_elem()`
` -> bpf_obj_free_fields()`
` -> bpf_task_release_dtor()`
` -> put_task_struct_rcu_user()`
` -> call_rcu()`
This is triggered from:
`tp_btf/nmi_handler`
` -> clear_task_kptrs_from_nmi` (reproducer bpf prog)
Environment
- x86_64 QEMU VM
- PREEMPT(full)
- `CONFIG_RCU_EXPERT=y`
- `CONFIG_RCU_NOCB_CPU=y`
- booted with `rcu_nocbs=1-7`
(`CONFIG_RCU_NOCB_CPU` with offloaded CPUs makes this easier to reproduce)
Observed result
- watchdog reports a soft lockup
- kernel panics with `Kernel panic - not syncing: softlockup: hung tasks`
- the stuck task is a kworker running `perf_sched_delayed`
Logs:
env TASK_KPTR_NMI_DEADLOCK_REPRO=1 ./test_progs -t task_kptr_nmi_deadlock_repro
[ 1.336781] bpf_testmod: loading out-of-tree module taints kernel.
[ 1.336961] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
[ 1.358431]
[ 1.358433] ================================
[ 1.358433] WARNING: inconsistent lock state
[ 1.358434] 7.0.0-11169-ge4ef174588b8-dirty #16 Tainted: G OE
[ 1.358435] --------------------------------
[ 1.358436] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 1.358436] test_progs/134 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 1.358438] ffff8ae3bbc6f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x316/0x740
[ 1.358445] {INITIAL USE} state was registered at:
[ 1.358445] lock_acquire+0xc0/0x2d0
[ 1.358448] _raw_spin_lock+0x33/0x50
[ 1.358451] rcu_nocb_gp_kthread+0x13b/0xbb0
[ 1.358453] kthread+0x10d/0x140
[ 1.358456] ret_from_fork+0x26d/0x330
[ 1.358458] ret_from_fork_asm+0x1a/0x30
[ 1.358465] irq event stamp: 47398
[ 1.358465] hardirqs last enabled at (47397): [<ffffffff8a1397cb>] _raw_spin_unlock_irqrestore+0x4b/0x60
[ 1.358467] hardirqs last disabled at (47398): [<ffffffff8a12ca1d>] __schedule+0xb1d/0x13e0
[ 1.358469] softirqs last enabled at (47298): [<ffffffff8928d770>] fpu_clone+0x80/0x210
[ 1.358471] softirqs last disabled at (47296): [<ffffffff8928d740>] fpu_clone+0x50/0x210
[ 1.358473]
[ 1.358473] other info that might help us debug this:
[ 1.358473] Possible unsafe locking scenario:
[ 1.358473]
[ 1.358473] CPU0
[ 1.358474] ----
[ 1.358474] lock(&rdp->nocb_lock);
[ 1.358475] <Interrupt>
[ 1.358475] lock(&rdp->nocb_lock);
[ 1.358476]
[ 1.358476] *** DEADLOCK ***
[ 1.358476]
[ 1.358476] 1 lock held by test_progs/134:
[ 1.358477] #0: ffff8ae3bbc6dfe0 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x119/0x13e0
[ 1.358480]
[ 1.358480] stack backtrace:
[ 1.358482] CPU: 1 UID: 0 PID: 134 Comm: test_progs Tainted: G OE 7.0.0-11169-ge4ef174588b8-dirty #16 PREEMPT(full)
[ 1.358484] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1.358484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 1.358486] Call Trace:
[ 1.358487] <NMI>
[ 1.358488] dump_stack_lvl+0x63/0x90
[ 1.358491] print_usage_bug.part.0+0x233/0x2d0
[ 1.358495] lock_acquire+0x28b/0x2d0
[ 1.358497] ? __call_rcu_common.constprop.0+0x316/0x740
[ 1.358501] _raw_spin_lock+0x33/0x50
[ 1.358502] ? __call_rcu_common.constprop.0+0x316/0x740
[ 1.358505] __call_rcu_common.constprop.0+0x316/0x740
[ 1.358509] bpf_obj_free_fields+0x129/0x260
[ 1.358514] free_htab_elem+0x8d/0xe0
[ 1.358518] htab_map_delete_elem+0x16b/0x240
[ 1.358522] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xb3/0x144
[ 1.358524] bpf_trace_run3+0x11b/0x2f0
[ 1.358527] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 1.358530] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 1.358531] nmi_handle.part.0+0x15c/0x260
[ 1.358536] default_do_nmi+0x12f/0x190
[ 1.358539] exc_nmi+0xeb/0x120
[ 1.358542] end_repeat_nmi+0xf/0x53
[ 1.358544] RIP: 0010:dequeue_entities+0x7ba/0xd70
[ 1.358547] Code: fa ff ff f6 45 b4 01 0f 84 75 fa ff ff 4c 89 ff e8 6b 41 ff ff 4d 8b ad a8 00 00 00 49 83 3f 00 0f 84 6d fa ff ff 49 8b 47 40 <49> 8b 57 50 48 c7 c3 ff ff ff ff 48 85 c0 0f 84 54 05 00 00 48 85
[ 1.358548] RSP: 0018:ffffa83480a77c30 EFLAGS: 00000006
[ 1.358549] RAX: ffff8ae382fe4590 RBX: 0000000000000009 RCX: ffff8ae382fe45c8
[ 1.358550] RDX: ffff8ae380dd80c8 RSI: 0000000000000000 RDI: ffff8ae383d82300
[ 1.358551] RBP: ffffa83480a77ca8 R08: 000000000001084f R09: ffff8ae383d82300
[ 1.358551] R10: 0000000000000002 R11: 0000000008264572 R12: 0000000000000000
[ 1.358552] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8ae3bbc6e040
[ 1.358558] ? dequeue_entities+0x7ba/0xd70
[ 1.358561] ? dequeue_entities+0x7ba/0xd70
[ 1.358563] </NMI>
[ 1.358563] <TASK>
[ 1.358567] dequeue_task_fair+0xf2/0x480
[ 1.358570] __schedule+0x998/0x13e0
[ 1.358574] ? do_wait+0x63/0x1a0
[ 1.358577] schedule+0x3e/0x140
[ 1.358578] do_wait+0x7b/0x1a0
[ 1.358581] kernel_wait4+0xc0/0x170
[ 1.358584] ? __pfx_child_wait_callback+0x10/0x10
[ 1.358588] __do_sys_wait4+0xa7/0xc0
[ 1.358594] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1.358596] do_syscall_64+0xa1/0x5f0
[ 1.358598] ? irq_exit_rcu+0x12/0x20
[ 1.358601] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1.358602] RIP: 0033:0x7fc342ca6922
[ 1.358603] Code: 08 0f 85 51 36 ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66
[ 1.358604] RSP: 002b:00007fff75f85138 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
[ 1.358605] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc342ca6922
[ 1.358606] RDX: 0000000000000000 RSI: 00007fff75f851c8 RDI: 0000000000000090
[ 1.358606] RBP: 00007fff75f85160 R08: 0000000000000000 R09: 0000000000000000
[ 1.358607] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff75f85618
[ 1.358607] R13: 0000000000000003 R14: 00007fc343411000 R15: 000055f2f8140a30
[ 1.358613] </TASK>
[ 1.427470] perf: interrupt took too long (2528 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1.450479] perf: interrupt took too long (3179 > 3160), lowering kernel.perf_event_max_sample_rate to 62000
[ 1.489552] perf: interrupt took too long (3984 > 3973), lowering kernel.perf_event_max_sample_rate to 50000
[ 1.567488] perf: interrupt took too long (4990 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
[ 1.696694] tsc: Refined TSC clocksource calibration: 4191.351 MHz
[ 1.696873] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x3c6a77879d2, max_idle_ns: 440795420607 ns
[ 1.697225] clocksource: Switched to clocksource tsc
#466 task_kptr_nmi_deadlock_repro:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
[ 2.017327] smc: removing smcd device lo
[ 2.018518] ACPI: PM: Preparing to enter system sleep state S5
[ 2.018712] reboot: Power down
I reduced this to a dedicated selftest-style reproducer.
Reproducer: https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch
This looks like an NMI-unsafety issue in the task kptr destructor path.
This may also apply to the cgroup release dtor, which I believe also can use call_rcu in this path. I haven't tried to make a reproducer for that case.
This should be fixed regardless of that Kconfig option: call_rcu() is not intended to be called from an NMI handler at all, and doing so can result in corruption.
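To illustrate why taking a non-reentrant lock from NMI context is a problem, here is a minimal user-space sketch. It is not kernel code: `toy_lock`, `toy_trylock()`, `toy_unlock()` and `nmi_would_deadlock()` are hypothetical names, and the lock only stands in for `rdp->nocb_lock` from the lockdep splat above. An NMI handler that spins on a lock already held by the context it interrupted can never make progress, because the holder cannot run until the handler returns.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy non-reentrant lock; a stand-in for rdp->nocb_lock, not the kernel type. */
struct toy_lock {
	atomic_bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
	/* exchange returns the previous value: false means we took the lock */
	return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_store_explicit(&l->held, false, memory_order_release);
}

/*
 * Models an NMI arriving on the CPU that may already hold the lock.
 * Returns true when a spinning acquire in the handler would never
 * succeed, because the interrupted holder cannot run to release it.
 */
static bool nmi_would_deadlock(struct toy_lock *l)
{
	if (toy_trylock(l)) {
		toy_unlock(l);
		return false;
	}
	return true;
}
```

With the lock free, `nmi_would_deadlock()` returns false; once the "interrupted" context has taken the lock it returns true, which is the `{INITIAL USE} -> {IN-NMI}` pattern lockdep flags.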
Thanks,
Justin Suess
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
2026-04-21 20:10 [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU Justin Suess
@ 2026-04-21 20:23 ` Kumar Kartikeya Dwivedi
2026-04-21 21:34 ` Justin Suess
0 siblings, 1 reply; 7+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-04-21 20:23 UTC (permalink / raw)
To: Justin Suess
Cc: bpf, ast, daniel, andrii, eddyz87, martin.lau, yonghong.song,
jolsa
[-- Attachment #1: Type: text/plain, Size: 1882 bytes --]
On Tue, 21 Apr 2026 at 22:10, Justin Suess <utilityemal77@gmail.com> wrote:
> [...]
Makes sense. I think the reasonable path is to just close usage in the
NMI context, otherwise we must address each case. Could you try the
attached diff and let me know if it successfully rejects kptr usage
here? Thanks.
[-- Attachment #2: kptr.diff --]
[-- Type: application/octet-stream, Size: 710 bytes --]
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 185210b73385..5a6903432ca4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18037,9 +18037,10 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
}
if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
- btf_record_has_field(map->record, BPF_RB_ROOT)) {
+ btf_record_has_field(map->record, BPF_RB_ROOT) ||
+ btf_record_has_field(map->record, BPF_KPTR)) {
if (is_tracing_prog_type(prog_type)) {
- verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
+ verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} or kptr yet\n");
return -EINVAL;
}
}
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
2026-04-21 20:23 ` Kumar Kartikeya Dwivedi
@ 2026-04-21 21:34 ` Justin Suess
2026-04-21 21:44 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 7+ messages in thread
From: Justin Suess @ 2026-04-21 21:34 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, ast, daniel, andrii, eddyz87, martin.lau, yonghong.song,
jolsa
On Tue, Apr 21, 2026 at 10:23:56PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue, 21 Apr 2026 at 22:10, Justin Suess <utilityemal77@gmail.com> wrote:
> >
> > Hello,
> >
> > I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
> >
> > It was found after further investigation from a Sashiko report on my patch:
> > https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
> >
> > The issue is reproducible with a BPF selftest-derived reproducer that:
> >
> > 1. Stores exited task references in a BPF hash map as refcounted task kptrs.
> > 2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
> > 3. Runs on an `rcu_nocbs` CPU.
> >
> > In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
> >
> > `perf_sched_delayed`
> > ` -> static_key_disable()`
> > ` -> arch_jump_label_transform_apply()`
> > ` -> smp_text_poke_batch_finish()`
> > ` -> on_each_cpu_cond_mask()`
> > ` -> smp_call_function_many_cond()`
> >
> > The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on an offloaded RCU callback CPU.
> >
> > Affected code path is a dtor triggered by deleting the last reference to a task_struct kptr:
> >
> > `bpf_map_delete_elem()`
> > ` -> htab_map_delete_elem()`
> > ` -> free_htab_elem()`
> > ` -> bpf_obj_free_fields()`
> > ` -> bpf_task_release_dtor()`
> > ` -> put_task_struct_rcu_user()`
> > ` -> call_rcu()`
> >
> > This is triggered from:
> >
> > `tp_btf/nmi_handler`
> > ` -> clear_task_kptrs_from_nmi` (reproducer bpf prog)
> >
> > Environment
> >
> > - x86_64 QEMU VM
> > - PREEMPT(full)
> > - `CONFIG_RCU_EXPERT=y`
> > - `CONFIG_RCU_NOCB_CPU=y`
> > - booted with `rcu_nocbs=1-7`
> >
> > [...]
>
> Makes sense. I think the reasonable path is to just close usage in the
> NMI context, otherwise we must address each case. Could you try the
> attached diff and let me know if it successfully rejects kptr usage
> here? Thanks.
Didn't work for me.
is_tracing_prog_type, despite the name, does not return true
for BPF_PROG_TYPE_TRACING. Only BPF_PROG_TYPE_TRACEPOINT.
I'm honestly still not sure what the difference is, but they are
different [1]
Would you rather do this or just reject the dtors with a
kfunc filter for this program type?
Or teach the verifier that the kptr ops need to be offloaded with
bpf_task_work_schedule_resume_impl?
[1]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_TRACING/
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
2026-04-21 21:34 ` Justin Suess
@ 2026-04-21 21:44 ` Kumar Kartikeya Dwivedi
2026-04-22 11:58 ` Justin Suess
2026-04-22 14:39 ` Justin Suess
0 siblings, 2 replies; 7+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-04-21 21:44 UTC (permalink / raw)
To: Justin Suess
Cc: bpf, ast, daniel, andrii, eddyz87, martin.lau, yonghong.song,
jolsa
On Tue, 21 Apr 2026 at 23:34, Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 10:23:56PM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 21 Apr 2026 at 22:10, Justin Suess <utilityemal77@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
> > >
> > > It was found after further investigation from a Sashiko report on my patch:
> > > https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
> > >
> > > The issue is reproducible with a BPF selftest-derived reproducer that:
> > >
> > > 1. Stores exited task references in a BPF hash map as refcounted task kptrs.
> > > 2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
> > > 3. Runs on an `rcu_nocbs` CPU.
> > >
> > > In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
> > >
> > > `perf_sched_delayed`
> > > ` -> static_key_disable()`
> > > ` -> arch_jump_label_transform_apply()`
> > > ` -> smp_text_poke_batch_finish()`
> > > ` -> on_each_cpu_cond_mask()`
> > > ` -> smp_call_function_many_cond()`
> > >
> > > The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on an offloaded RCU callback CPU.
> > >
> > > Affected code path is a dtor triggered by deleting the last reference to a task_struct kptr:
> > >
> > > `bpf_map_delete_elem()`
> > > ` -> htab_map_delete_elem()`
> > > ` -> free_htab_elem()`
> > > ` -> bpf_obj_free_fields()`
> > > ` -> bpf_task_release_dtor()`
> > > ` -> put_task_struct_rcu_user()`
> > > ` -> call_rcu()`
> > >
> > > This is triggered from:
> > >
> > > `tp_btf/nmi_handler`
> > > ` -> clear_task_kptrs_from_nmi` (reproducer bpf prog)
> > >
> > > Environment
> > >
> > > - x86_64 QEMU VM
> > > - PREEMPT(full)
> > > - `CONFIG_RCU_EXPERT=y`
> > > - `CONFIG_RCU_NOCB_CPU=y`
> > > - booted with `rcu_nocbs=1-7`
> > >
> > > [...]
> >
> > Makes sense. I think the reasonable path is to just close usage in the
> > NMI context, otherwise we must address each case. Could you try the
> > attached diff and let me know if it successfully rejects kptr usage
> > here? Thanks.
> Didn't work for me.
>
> is_tracing_prog_type, despite the name, does not return true
> for BPF_PROG_TYPE_TRACING. Only BPF_PROG_TYPE_TRACEPOINT.
>
We can add it but return false when expected_attach_type ==
BPF_TRACE_ITER. For all other cases, allowing it doesn't make sense
because these might potentially run in NMI context.
Please let me know if you'd like to send a fix + tests, otherwise I
can follow up. Feel free to fold the diff I sent into your fix, no
attribution needed.
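The rule proposed above can be sketched as a small standalone predicate. This is only an illustration of the idea, not the verifier's code: the `toy_*` enums and `toy_is_tracing_prog()` are hypothetical stand-ins for `enum bpf_prog_type`, `enum bpf_attach_type` and `is_tracing_prog_type()`, and they cover only the cases discussed in this thread.

```c
#include <stdbool.h>

/* Local stand-ins for the kernel enums; names and values are illustrative. */
enum toy_prog_type {
	TOY_PROG_TYPE_OTHER,
	TOY_PROG_TYPE_TRACEPOINT,
	TOY_PROG_TYPE_TRACING,
};

enum toy_attach_type {
	TOY_TRACE_RAW_TP,	/* e.g. a tp_btf program */
	TOY_TRACE_ITER,		/* iterator programs */
};

/*
 * Treat a program as "tracing" (and so possibly running in NMI context)
 * unless it is a TRACING program attached as an iterator.
 */
static bool toy_is_tracing_prog(enum toy_prog_type type,
				enum toy_attach_type attach)
{
	switch (type) {
	case TOY_PROG_TYPE_TRACEPOINT:
		return true;
	case TOY_PROG_TYPE_TRACING:
		return attach != TOY_TRACE_ITER;
	default:
		return false;
	}
}
```

Under this sketch a `tp_btf` program counts as tracing and would be rejected by the map compatibility check, while an iterator program would not.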
> I'm honestly still not sure what the difference is, but they are
> different [1]
>
> Would you rather do this or just reject the dtors with a
> kfunc filter for this program type?
>
> Or teach the verifier that the kptr ops need to be offloaded with
> bpf_task_work_schedule_resume_impl?
>
> [1]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_TRACING/
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
2026-04-21 21:44 ` Kumar Kartikeya Dwivedi
@ 2026-04-22 11:58 ` Justin Suess
2026-04-22 14:39 ` Justin Suess
1 sibling, 0 replies; 7+ messages in thread
From: Justin Suess @ 2026-04-22 11:58 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, ast, daniel, andrii, eddyz87, martin.lau, yonghong.song,
jolsa
On Tue, Apr 21, 2026 at 11:44:42PM +0200, Kumar Kartikeya Dwivedi wrote:
> > [...]
>
> We can add it but return false when expected_attach_type ==
> BPF_TRACE_ITER. For all other cases, allowing it doesn't make sense
> because these might potentially run in NMI context.
>
> Please let me know if you'd like to send a fix + tests, otherwise I
> can follow up. Feel free to fold the diff I sent into your fix, no
> attribution needed.
>
I'll send a fix and selftest.
There needs to be some documentation matrix of prog/attach type to
kptr / other features to prevent bugs like this, but this can be a
future patch.
I may make a series improving documentation as there are many other
verifier "quirks" I had to discover by reading the kernel source. :)
Justin
> > I'm honestly still not sure what the difference is, but they are
> > different [1]
> >
> > Would you rather do this or just reject the dtors with a
> > kfunc filter for this program type?
> >
> > Or teach the verifier that the kptr ops need to be offloaded with
> > bpf_task_work_schedule_resume_impl?
> >
> > [1]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_TRACING/
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
2026-04-21 21:44 ` Kumar Kartikeya Dwivedi
2026-04-22 11:58 ` Justin Suess
@ 2026-04-22 14:39 ` Justin Suess
2026-04-22 20:47 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 7+ messages in thread
From: Justin Suess @ 2026-04-22 14:39 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, ast, daniel, andrii, eddyz87, martin.lau, yonghong.song,
jolsa
On Tue, Apr 21, 2026 at 11:44:42PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Tue, 21 Apr 2026 at 23:34, Justin Suess <utilityemal77@gmail.com> wrote:
> >
> > On Tue, Apr 21, 2026 at 10:23:56PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Tue, 21 Apr 2026 at 22:10, Justin Suess <utilityemal77@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
> > > >
> > > > It was found after further investigation from a Sashiko report on my patch:
> > > > https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
> > > >
> > > > The issue is reproducible with a BPF selftest-derived reproducer that:
> > > >
> > > > 1. Stores exited task references in a BPF hash map as refcounted task kptrs.
> > > > 2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
> > > > 3. Runs on an `rcu_nocbs` CPU.
> > > >
> > > > In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
> > > >
> > > > `perf_sched_delayed`
> > > > ` -> static_key_disable()`
> > > > ` -> arch_jump_label_transform_apply()`
> > > > ` -> smp_text_poke_batch_finish()`
> > > > ` -> on_each_cpu_cond_mask()`
> > > > ` -> smp_call_function_many_cond()`
> > > >
> > > > The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on an offloaded RCU callback CPU.
> > > >
> > > > Affected code path is a dtor triggered by deleting the last reference to a task_struct kptr:
> > > >
> > > > `bpf_map_delete_elem()`
> > > > ` -> htab_map_delete_elem()`
> > > > ` -> free_htab_elem()`
> > > > ` -> bpf_obj_free_fields()`
> > > > ` -> bpf_task_release_dtor()`
> > > > ` -> put_task_struct_rcu_user()`
> > > > ` -> call_rcu()`
> > > >
> > > > This is triggered from:
> > > >
> > > > `tp_btf/nmi_handler`
> > > > ` -> clear_task_kptrs_from_nmi` (reproducer bpf prog)
> > > >
> > > > Environment
> > > >
> > > > - x86_64 QEMU VM
> > > > - PREEMPT(full)
> > > > - `CONFIG_RCU_EXPERT=y`
> > > > - `CONFIG_RCU_NOCB_CPU=y`
> > > > - booted with `rcu_nocbs=1-7`
> > > >
> > > > [...]
> > >
> > > Makes sense. I think the reasonable path is to just close usage in the
> > > NMI context, otherwise we must address each case. Could you try the
> > > attached diff and let me know if it successfully rejects kptr usage
> > > here? Thanks.
> > Didn't work for me.
> >
> > is_tracing_prog_type, despite the name, does not return true
> > for BPF_PROG_TYPE_TRACING. Only BPF_PROG_TYPE_TRACEPOINT.
> >
>
> We can add it but return false when expected_attach_type ==
> BPF_TRACE_ITER. For all other cases, allowing it doesn't make sense
> because these might potentially run in NMI context.
>
> Please let me know if you'd like to send a fix + tests, otherwise I
> can follow up. Feel free to fold the diff I sent into your fix, no
> attribution needed.
>
> > I'm honestly still not sure what the difference is, but they are
> > different [1]
> >
> > Would you rather do this or just reject the dtors with a
> > kfunc filter for this program type?
> >
> > Or teach the verifier that the kptr ops need to be offloaded with
> > bpf_task_work_schedule_resume_impl?
> >
> > [1]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_TRACING/
Sorry for the double tap but the change you're requesting for the fix
will cause breakage.
This will at a minimum break test_bpf_ma and percpu_alloc_array tests.
More importantly, this will break existing progs that use kptrs in
tracepoints.
Would a narrower fix that filters the dtor kfuncs specifically be a
better option? Or better fix the kfuncs that use irq_work? I think
the real fix is to make bpf smarter about when it's running under
nmi, but that may be non-trivial.
If the breakage is acceptable, should I just remove those tests?
Justin
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
2026-04-22 14:39 ` Justin Suess
@ 2026-04-22 20:47 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 7+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-04-22 20:47 UTC (permalink / raw)
To: Justin Suess
Cc: bpf, ast, daniel, andrii, eddyz87, martin.lau, yonghong.song,
jolsa
On Wed, 22 Apr 2026 at 16:39, Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Tue, Apr 21, 2026 at 11:44:42PM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Tue, 21 Apr 2026 at 23:34, Justin Suess <utilityemal77@gmail.com> wrote:
> > >
> > > On Tue, Apr 21, 2026 at 10:23:56PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > On Tue, 21 Apr 2026 at 22:10, Justin Suess <utilityemal77@gmail.com> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
> > > > >
> > > > > It was found after further investigation from a Sashiko report on my patch:
> > > > > https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
> > > > >
> > > > > The issue is reproducible with a BPF selftest-derived reproducer that:
> > > > >
> > > > > 1. Stores exited task references in a BPF hash map as refcounted task kptrs.
> > > > > 2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
> > > > > 3. Runs on an `rcu_nocbs` CPU.
> > > > >
> > > > > In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
> > > > >
> > > > > `perf_sched_delayed`
> > > > > ` -> static_key_disable()`
> > > > > ` -> arch_jump_label_transform_apply()`
> > > > > ` -> smp_text_poke_batch_finish()`
> > > > > ` -> on_each_cpu_cond_mask()`
> > > > > ` -> smp_call_function_many_cond()`
> > > > >
> > > > > The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on an offloaded RCU callback CPU.
> > > > >
> > > > > Affected code path is a dtor triggered by deleting the last reference to a task_struct kptr:
> > > > >
> > > > > `bpf_map_delete_elem()`
> > > > > ` -> htab_map_delete_elem()`
> > > > > ` -> free_htab_elem()`
> > > > > ` -> bpf_obj_free_fields()`
> > > > > ` -> bpf_task_release_dtor()`
> > > > > ` -> put_task_struct_rcu_user()`
> > > > > ` -> call_rcu()`
> > > > >
> > > > > This is triggered from:
> > > > >
> > > > > `tp_btf/nmi_handler`
> > > > > ` -> clear_task_kptrs_from_nmi` (reproducer bpf prog)
> > > > >
> > > > > Environment
> > > > >
> > > > > - x86_64 QEMU VM
> > > > > - PREEMPT(full)
> > > > > - `CONFIG_RCU_EXPERT=y`
> > > > > - `CONFIG_RCU_NOCB_CPU=y`
> > > > > - booted with `rcu_nocbs=1-7`
> > > > >
> > > > > [...]
> > > >
> > > > Makes sense. I think the reasonable path is to just close usage in the
> > > > NMI context, otherwise we must address each case. Could you try the
> > > > attached diff and let me know if it successfully rejects kptr usage
> > > > here? Thanks.
> > > Didn't work for me.
> > >
> > > is_tracing_prog_type, despite the name, does not return true
> > > for BPF_PROG_TYPE_TRACING. Only BPF_PROG_TYPE_TRACEPOINT.
> > >
> >
> > We can add it but return false when expected_attach_type ==
> > BPF_TRACE_ITER. For all other cases, allowing it doesn't make sense
> > because these might potentially run in NMI context.
> >
> > Please let me know if you'd like to send a fix + tests, otherwise I
> > can follow up. Feel free to fold the diff I sent into your fix, no
> > attribution needed.
> >
> > > I'm honestly still not sure what the difference is, but they are
> > > different [1]
> > >
> > > Would you rather do this or just reject the dtors with a
> > > kfunc filter for this program type?
> > >
> > > Or teach the verifier that the kptr ops need to be offloaded with
> > > bpf_task_work_schedule_resume_impl?
> > >
> > > [1]: https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_TRACING/
>
> Sorry for the double tap but the change you're requesting for the fix
> will cause breakage.
>
> This will at a minimum break test_bpf_ma and percpu_alloc_array tests.
>
> More importantly, this will break existing progs that use kptrs in
> tracepoints.
>
> Would a narrower fix that filters the dtor kfuncs specifically be a
> better option? Or better fix the kfuncs that use irq_work? I think
> the real fix is to make bpf smarter about when it's running under
> nmi, but that may be non-trivial.
>
> If the breakage is acceptable, should I just remove those tests?
>
I think let's be conservative and let it be for now, given this issue
was found by the bot, and not yet reported in practice.
I think breakage would definitely be problematic; tracing programs may
also run from non-NMI context, so a blanket ban may not work in
practice.
We can think through the proper solution (deferring the operation in
NMI context) later on, when it comes up again.
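As a rough picture of what deferral could look like, here is a user-space sketch under stated assumptions; it is not a proposed patch. The NMI side only pushes the object onto a lock-free list, and a later, safe context detaches the whole list and performs the real release. All `toy_*` names are hypothetical, and `released = 1` stands in for the actual put that may call `call_rcu()`. The shape is similar in spirit to the kernel's `llist_add()` / `llist_del_all()` pairing.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *next;
	int released;
};

struct toy_deferred {
	_Atomic(struct toy_node *) head;
};

/* NMI side: lock-free push; takes no lock and never blocks. */
static void toy_defer_release(struct toy_deferred *d, struct toy_node *n)
{
	struct toy_node *old =
		atomic_load_explicit(&d->head, memory_order_relaxed);

	do {
		n->next = old;	/* old is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak_explicit(&d->head, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/*
 * Safe (non-NMI) side: detach the whole list in one step, then do the
 * real release on each node. Returns how many nodes were processed.
 */
static int toy_drain(struct toy_deferred *d)
{
	struct toy_node *n = atomic_exchange_explicit(&d->head,
						      (struct toy_node *)NULL,
						      memory_order_acquire);
	int count = 0;

	while (n) {
		struct toy_node *next = n->next;

		n->released = 1;	/* stand-in for the real put */
		count++;
		n = next;
	}
	return count;
}
```

The design point is that the only operation left in NMI context is a compare-and-swap loop, which cannot self-deadlock against the interrupted context the way a spinlock acquire can.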
Please just resend your file kptr series as is.
> Justin
^ permalink raw reply [flat|nested] 7+ messages in thread