public inbox for bpf@vger.kernel.org
* [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
From: Justin Suess @ 2026-04-21 20:10 UTC (permalink / raw)
  To: bpf; +Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, yonghong.song,
	jolsa

Hello,

I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.

I found it while investigating a Sashiko report on my patch:
https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t

The issue is reproducible with a BPF selftest-derived reproducer that:

1. Stores exited task references in a BPF hash map as refcounted task kptrs.
2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
3. Runs on an `rcu_nocbs` CPU.

In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:

`perf_sched_delayed`
`  -> static_key_disable()`
`  -> arch_jump_label_transform_apply()`
`  -> smp_text_poke_batch_finish()`
`  -> on_each_cpu_cond_mask()`
`  -> smp_call_function_many_cond()`

The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on a CPU whose RCU callbacks are offloaded (`rcu_nocbs`).

The affected code path is the dtor triggered by deleting the last reference to a task_struct kptr:

`bpf_map_delete_elem()`
`  -> htab_map_delete_elem()`
`  -> free_htab_elem()`
`  -> bpf_obj_free_fields()`
`  -> bpf_task_release_dtor()`
`  -> put_task_struct_rcu_user()`
`  -> call_rcu()`

This is triggered from:

`tp_btf/nmi_handler`
`  -> clear_task_kptrs_from_nmi` (reproducer bpf prog)

Environment

- x86_64 QEMU VM
- PREEMPT(full)
- `CONFIG_RCU_EXPERT=y`
- `CONFIG_RCU_NOCB_CPU=y`
- booted with `rcu_nocbs=1-7`

(`CONFIG_RCU_NOCB_CPU=y` makes this more likely to reproduce.)

Observed result

- watchdog reports a soft lockup
- kernel panics with `Kernel panic - not syncing: softlockup: hung tasks`
- the stuck task is a kworker running `perf_sched_delayed`

Logs:

env TASK_KPTR_NMI_DEADLOCK_REPRO=1 ./test_progs -t task_kptr_nmi_deadlock_repro
[    1.336781] bpf_testmod: loading out-of-tree module taints kernel.
[    1.336961] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
[    1.358431] 
[    1.358433] ================================
[    1.358433] WARNING: inconsistent lock state
[    1.358434] 7.0.0-11169-ge4ef174588b8-dirty #16 Tainted: G           OE      
[    1.358435] --------------------------------
[    1.358436] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[    1.358436] test_progs/134 [HC1[1]:SC0[0]:HE0:SE1] takes:
[    1.358438] ffff8ae3bbc6f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x316/0x740
[    1.358445] {INITIAL USE} state was registered at:
[    1.358445]   lock_acquire+0xc0/0x2d0
[    1.358448]   _raw_spin_lock+0x33/0x50
[    1.358451]   rcu_nocb_gp_kthread+0x13b/0xbb0
[    1.358453]   kthread+0x10d/0x140
[    1.358456]   ret_from_fork+0x26d/0x330
[    1.358458]   ret_from_fork_asm+0x1a/0x30
[    1.358465] irq event stamp: 47398
[    1.358465] hardirqs last  enabled at (47397): [<ffffffff8a1397cb>] _raw_spin_unlock_irqrestore+0x4b/0x60
[    1.358467] hardirqs last disabled at (47398): [<ffffffff8a12ca1d>] __schedule+0xb1d/0x13e0
[    1.358469] softirqs last  enabled at (47298): [<ffffffff8928d770>] fpu_clone+0x80/0x210
[    1.358471] softirqs last disabled at (47296): [<ffffffff8928d740>] fpu_clone+0x50/0x210
[    1.358473] 
[    1.358473] other info that might help us debug this:
[    1.358473]  Possible unsafe locking scenario:
[    1.358473] 
[    1.358473]        CPU0
[    1.358474]        ----
[    1.358474]   lock(&rdp->nocb_lock);
[    1.358475]   <Interrupt>
[    1.358475]     lock(&rdp->nocb_lock);
[    1.358476] 
[    1.358476]  *** DEADLOCK ***
[    1.358476] 
[    1.358476] 1 lock held by test_progs/134:
[    1.358477]  #0: ffff8ae3bbc6dfe0 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x119/0x13e0
[    1.358480] 
[    1.358480] stack backtrace:
[    1.358482] CPU: 1 UID: 0 PID: 134 Comm: test_progs Tainted: G           OE       7.0.0-11169-ge4ef174588b8-dirty #16 PREEMPT(full) 
[    1.358484] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[    1.358484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[    1.358486] Call Trace:
[    1.358487]  <NMI>
[    1.358488]  dump_stack_lvl+0x63/0x90
[    1.358491]  print_usage_bug.part.0+0x233/0x2d0
[    1.358495]  lock_acquire+0x28b/0x2d0
[    1.358497]  ? __call_rcu_common.constprop.0+0x316/0x740
[    1.358501]  _raw_spin_lock+0x33/0x50
[    1.358502]  ? __call_rcu_common.constprop.0+0x316/0x740
[    1.358505]  __call_rcu_common.constprop.0+0x316/0x740
[    1.358509]  bpf_obj_free_fields+0x129/0x260
[    1.358514]  free_htab_elem+0x8d/0xe0
[    1.358518]  htab_map_delete_elem+0x16b/0x240
[    1.358522]  bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xb3/0x144
[    1.358524]  bpf_trace_run3+0x11b/0x2f0
[    1.358527]  ? __pfx_perf_event_nmi_handler+0x10/0x10
[    1.358530]  ? __pfx_perf_event_nmi_handler+0x10/0x10
[    1.358531]  nmi_handle.part.0+0x15c/0x260
[    1.358536]  default_do_nmi+0x12f/0x190
[    1.358539]  exc_nmi+0xeb/0x120
[    1.358542]  end_repeat_nmi+0xf/0x53
[    1.358544] RIP: 0010:dequeue_entities+0x7ba/0xd70
[    1.358547] Code: fa ff ff f6 45 b4 01 0f 84 75 fa ff ff 4c 89 ff e8 6b 41 ff ff 4d 8b ad a8 00 00 00 49 83 3f 00 0f 84 6d fa ff ff 49 8b 47 40 <49> 8b 57 50 48 c7 c3 ff ff ff ff 48 85 c0 0f 84 54 05 00 00 48 85
[    1.358548] RSP: 0018:ffffa83480a77c30 EFLAGS: 00000006
[    1.358549] RAX: ffff8ae382fe4590 RBX: 0000000000000009 RCX: ffff8ae382fe45c8
[    1.358550] RDX: ffff8ae380dd80c8 RSI: 0000000000000000 RDI: ffff8ae383d82300
[    1.358551] RBP: ffffa83480a77ca8 R08: 000000000001084f R09: ffff8ae383d82300
[    1.358551] R10: 0000000000000002 R11: 0000000008264572 R12: 0000000000000000
[    1.358552] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8ae3bbc6e040
[    1.358558]  ? dequeue_entities+0x7ba/0xd70
[    1.358561]  ? dequeue_entities+0x7ba/0xd70
[    1.358563]  </NMI>
[    1.358563]  <TASK>
[    1.358567]  dequeue_task_fair+0xf2/0x480
[    1.358570]  __schedule+0x998/0x13e0
[    1.358574]  ? do_wait+0x63/0x1a0
[    1.358577]  schedule+0x3e/0x140
[    1.358578]  do_wait+0x7b/0x1a0
[    1.358581]  kernel_wait4+0xc0/0x170
[    1.358584]  ? __pfx_child_wait_callback+0x10/0x10
[    1.358588]  __do_sys_wait4+0xa7/0xc0
[    1.358594]  ? srso_alias_return_thunk+0x5/0xfbef5
[    1.358596]  do_syscall_64+0xa1/0x5f0
[    1.358598]  ? irq_exit_rcu+0x12/0x20
[    1.358601]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    1.358602] RIP: 0033:0x7fc342ca6922
[    1.358603] Code: 08 0f 85 51 36 ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66
[    1.358604] RSP: 002b:00007fff75f85138 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
[    1.358605] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc342ca6922
[    1.358606] RDX: 0000000000000000 RSI: 00007fff75f851c8 RDI: 0000000000000090
[    1.358606] RBP: 00007fff75f85160 R08: 0000000000000000 R09: 0000000000000000
[    1.358607] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff75f85618
[    1.358607] R13: 0000000000000003 R14: 00007fc343411000 R15: 000055f2f8140a30
[    1.358613]  </TASK>
[    1.427470] perf: interrupt took too long (2528 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[    1.450479] perf: interrupt took too long (3179 > 3160), lowering kernel.perf_event_max_sample_rate to 62000
[    1.489552] perf: interrupt took too long (3984 > 3973), lowering kernel.perf_event_max_sample_rate to 50000
[    1.567488] perf: interrupt took too long (4990 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
[    1.696694] tsc: Refined TSC clocksource calibration: 4191.351 MHz
[    1.696873] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x3c6a77879d2, max_idle_ns: 440795420607 ns
[    1.697225] clocksource: Switched to clocksource tsc
#466     task_kptr_nmi_deadlock_repro:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
[    2.017327] smc: removing smcd device lo
[    2.018518] ACPI: PM: Preparing to enter system sleep state S5
[    2.018712] reboot: Power down


I reduced this to a dedicated selftest-style reproducer.

Reproducer: https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch

This looks like an NMI-unsafety issue in the task kptr destructor path.

This may also apply to the cgroup release dtor, which I believe can also reach `call_rcu()` in this path. I haven't tried to write a reproducer for that case.

This should be fixed regardless of that specific kconfig: `call_rcu()` is not intended to be called from an NMI handler at all, and doing so can result in corruption.
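
To illustrate the "Possible unsafe locking scenario" from the lockdep report above, here is a minimal userspace sketch of the pattern. It is not kernel code: `toy_lock`, `toy_enqueue()` and `nmi_enqueue_succeeds()` are hypothetical names standing in for `rdp->nocb_lock`, the enqueue step of `call_rcu()`, and the NMI handler interrupting a lock holder on the same CPU. A trylock is used so that the sketch reports the failure instead of spinning forever the way a real spinlock would.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-reentrant spinlock such as rdp->nocb_lock. */
struct toy_lock {
	atomic_flag held;
};

static bool toy_trylock(struct toy_lock *l)
{
	/* test_and_set returns the previous value; false means we took the lock. */
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(struct toy_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Models an enqueue that must take the lock (assumption: like call_rcu()
 * on an offloaded CPU). Returns false where a real spin_lock() would hang. */
static bool toy_enqueue(struct toy_lock *l, int *queued)
{
	if (!toy_trylock(l))
		return false;
	(*queued)++;
	toy_unlock(l);
	return true;
}

/* Models an NMI handler running on top of code on the same CPU.
 * The interrupted code cannot make progress until the handler returns,
 * so if it holds the lock, the handler can never acquire it. */
bool nmi_enqueue_succeeds(bool lock_held_by_interrupted_code)
{
	struct toy_lock lock = { ATOMIC_FLAG_INIT };
	int queued = 0;
	bool ok;

	if (lock_held_by_interrupted_code) {
		bool took = toy_trylock(&lock);

		assert(took);
		(void)took;
	}

	/* The "NMI" fires here and tries to enqueue a callback. */
	ok = toy_enqueue(&lock, &queued);

	if (lock_held_by_interrupted_code)
		toy_unlock(&lock);

	assert(queued == (ok ? 1 : 0));
	return ok;
}
```

In this sketch `nmi_enqueue_succeeds(true)` returns false (the handler would self-deadlock on a real spinlock), while `nmi_enqueue_succeeds(false)` returns true. The model only covers the lock re-acquisition shown by lockdep, not any other NMI-safety problem in the real path.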


Thanks,

Justin Suess
