From: Justin Suess <utilityemal77@gmail.com>
To: bpf@vger.kernel.org
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
yonghong.song@linux.dev, jolsa@kernel.org
Subject: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
Date: Tue, 21 Apr 2026 16:10:33 -0400
Message-ID: <20260421201035.1729473-1-utilityemal77@gmail.com>
Hello,
I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
I found it while following up on a Sashiko report on my patch:
https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
The issue is reproducible with a BPF selftest-derived reproducer that:
1. Stores exited task references in a BPF hash map as refcounted task kptrs.
2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
3. Runs on an `rcu_nocbs` CPU.
In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
`perf_sched_delayed`
` -> static_key_disable()`
` -> arch_jump_label_transform_apply()`
` -> smp_text_poke_batch_finish()`
` -> on_each_cpu_cond_mask()`
` -> smp_call_function_many_cond()`
The triggering condition appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-ref `put_task_struct_rcu_user()` path on a CPU whose RCU callbacks are offloaded (`rcu_nocbs`).
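To make the hazard concrete, here is a small user-space model of it. It is only a sketch: `model_lock`, `model_call_rcu()` and `model_nmi_call_rcu_completes()` are hypothetical names, not kernel code, and the lock is a plain flag standing in for the per-CPU `rdp->nocb_lock` named in the lockdep report in the logs. The assumption is that `call_rcu()` on an offloaded CPU needs that lock, so an NMI that arrives while the interrupted code holds it can never make progress.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU rdp->nocb_lock; not a kernel type. */
struct model_lock {
	bool held;
};

/* Try to take the lock once; returns false if it is already held. */
static bool model_trylock(struct model_lock *lock)
{
	if (lock->held)
		return false;
	lock->held = true;
	return true;
}

static void model_unlock(struct model_lock *lock)
{
	lock->held = false;
}

/*
 * Models the locking done by call_rcu() on an offloaded CPU.
 * Returns false in the case where the real code would spin on the
 * lock forever, because the holder is the code the NMI interrupted.
 */
static bool model_call_rcu(struct model_lock *nocb_lock)
{
	if (!model_trylock(nocb_lock))
		return false;
	/* The callback would be queued here. */
	model_unlock(nocb_lock);
	return true;
}

/*
 * Models an NMI handler that calls call_rcu() on the CPU it interrupted.
 * lock_held_by_interrupted_code says whether that CPU was inside the
 * nocb_lock critical section when the NMI arrived.
 */
static bool model_nmi_call_rcu_completes(bool lock_held_by_interrupted_code)
{
	struct model_lock nocb_lock = { .held = false };
	bool completed;

	if (lock_held_by_interrupted_code)
		(void)model_trylock(&nocb_lock);

	completed = model_call_rcu(&nocb_lock);

	if (lock_held_by_interrupted_code)
		model_unlock(&nocb_lock);
	return completed;
}
```

In this model the call completes only when the interrupted code was outside the critical section. Disabling interrupts around the lock does not help in the real case, since an NMI cannot be masked that way.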
The affected code path is the destructor that runs when the last reference to a task_struct kptr is deleted:
`bpf_map_delete_elem()`
` -> htab_map_delete_elem()`
` -> free_htab_elem()`
` -> bpf_obj_free_fields()`
` -> bpf_task_release_dtor()`
` -> put_task_struct_rcu_user()`
` -> call_rcu()`
This is triggered from:
`tp_btf/nmi_handler`
` -> clear_task_kptrs_from_nmi` (reproducer bpf prog)
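The reason the reproducer stores references to exited tasks can be sketched with a user-space model of the refcount drop. This is an illustration under stated assumptions, not kernel code: `model_task`, `model_ctx` and the `model_*` functions are hypothetical names, and the model assumes only that the kptr destructor runs inline in the deleting context and that just the final put reaches `call_rcu()`.

```c
#include <stdbool.h>

/* Hypothetical execution contexts for the model. */
enum model_ctx { MODEL_CTX_TASK, MODEL_CTX_NMI };

/* Hypothetical stand-in for the parts of task_struct that matter here. */
struct model_task {
	int rcu_users;               /* models task->rcu_users */
	bool call_rcu_reached;       /* set once the last put would call call_rcu() */
	enum model_ctx call_rcu_ctx; /* context in which that happened */
};

/* Models put_task_struct_rcu_user(): only the final put queues an RCU callback. */
static void model_put_task_struct_rcu_user(struct model_task *task,
					   enum model_ctx ctx)
{
	if (--task->rcu_users == 0) {
		task->call_rcu_reached = true;
		task->call_rcu_ctx = ctx;
	}
}

/*
 * Models deleting a map element that holds a task kptr: the destructor
 * runs inline, in whatever context the delete was issued from.
 */
static void model_map_delete_elem(struct model_task *stored,
				  enum model_ctx ctx)
{
	model_put_task_struct_rcu_user(stored, ctx);
}
```

If the map holds the only remaining reference, which is the case the reproducer aims for by storing tasks that have already exited, a delete issued from the NMI program is the put that reaches `call_rcu()`.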
Environment
- x86_64 QEMU VM
- PREEMPT(full)
- `CONFIG_RCU_EXPERT=y`
- `CONFIG_RCU_NOCB_CPU=y`
- booted with `rcu_nocbs=1-7`
(`CONFIG_RCU_NOCB_CPU` makes this more likely to reproduce)
Observed result
- watchdog reports a soft lockup
- kernel panics with `Kernel panic - not syncing: softlockup: hung tasks`
- the stuck task is a kworker running `perf_sched_delayed`
Logs:
env TASK_KPTR_NMI_DEADLOCK_REPRO=1 ./test_progs -t task_kptr_nmi_deadlock_repro
[ 1.336781] bpf_testmod: loading out-of-tree module taints kernel.
[ 1.336961] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
[ 1.358431]
[ 1.358433] ================================
[ 1.358433] WARNING: inconsistent lock state
[ 1.358434] 7.0.0-11169-ge4ef174588b8-dirty #16 Tainted: G OE
[ 1.358435] --------------------------------
[ 1.358436] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 1.358436] test_progs/134 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 1.358438] ffff8ae3bbc6f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x316/0x740
[ 1.358445] {INITIAL USE} state was registered at:
[ 1.358445] lock_acquire+0xc0/0x2d0
[ 1.358448] _raw_spin_lock+0x33/0x50
[ 1.358451] rcu_nocb_gp_kthread+0x13b/0xbb0
[ 1.358453] kthread+0x10d/0x140
[ 1.358456] ret_from_fork+0x26d/0x330
[ 1.358458] ret_from_fork_asm+0x1a/0x30
[ 1.358465] irq event stamp: 47398
[ 1.358465] hardirqs last enabled at (47397): [<ffffffff8a1397cb>] _raw_spin_unlock_irqrestore+0x4b/0x60
[ 1.358467] hardirqs last disabled at (47398): [<ffffffff8a12ca1d>] __schedule+0xb1d/0x13e0
[ 1.358469] softirqs last enabled at (47298): [<ffffffff8928d770>] fpu_clone+0x80/0x210
[ 1.358471] softirqs last disabled at (47296): [<ffffffff8928d740>] fpu_clone+0x50/0x210
[ 1.358473]
[ 1.358473] other info that might help us debug this:
[ 1.358473] Possible unsafe locking scenario:
[ 1.358473]
[ 1.358473] CPU0
[ 1.358474] ----
[ 1.358474] lock(&rdp->nocb_lock);
[ 1.358475] <Interrupt>
[ 1.358475] lock(&rdp->nocb_lock);
[ 1.358476]
[ 1.358476] *** DEADLOCK ***
[ 1.358476]
[ 1.358476] 1 lock held by test_progs/134:
[ 1.358477] #0: ffff8ae3bbc6dfe0 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x119/0x13e0
[ 1.358480]
[ 1.358480] stack backtrace:
[ 1.358482] CPU: 1 UID: 0 PID: 134 Comm: test_progs Tainted: G OE 7.0.0-11169-ge4ef174588b8-dirty #16 PREEMPT(full)
[ 1.358484] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1.358484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 1.358486] Call Trace:
[ 1.358487] <NMI>
[ 1.358488] dump_stack_lvl+0x63/0x90
[ 1.358491] print_usage_bug.part.0+0x233/0x2d0
[ 1.358495] lock_acquire+0x28b/0x2d0
[ 1.358497] ? __call_rcu_common.constprop.0+0x316/0x740
[ 1.358501] _raw_spin_lock+0x33/0x50
[ 1.358502] ? __call_rcu_common.constprop.0+0x316/0x740
[ 1.358505] __call_rcu_common.constprop.0+0x316/0x740
[ 1.358509] bpf_obj_free_fields+0x129/0x260
[ 1.358514] free_htab_elem+0x8d/0xe0
[ 1.358518] htab_map_delete_elem+0x16b/0x240
[ 1.358522] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xb3/0x144
[ 1.358524] bpf_trace_run3+0x11b/0x2f0
[ 1.358527] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 1.358530] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 1.358531] nmi_handle.part.0+0x15c/0x260
[ 1.358536] default_do_nmi+0x12f/0x190
[ 1.358539] exc_nmi+0xeb/0x120
[ 1.358542] end_repeat_nmi+0xf/0x53
[ 1.358544] RIP: 0010:dequeue_entities+0x7ba/0xd70
[ 1.358547] Code: fa ff ff f6 45 b4 01 0f 84 75 fa ff ff 4c 89 ff e8 6b 41 ff ff 4d 8b ad a8 00 00 00 49 83 3f 00 0f 84 6d fa ff ff 49 8b 47 40 <49> 8b 57 50 48 c7 c3 ff ff ff ff 48 85 c0 0f 84 54 05 00 00 48 85
[ 1.358548] RSP: 0018:ffffa83480a77c30 EFLAGS: 00000006
[ 1.358549] RAX: ffff8ae382fe4590 RBX: 0000000000000009 RCX: ffff8ae382fe45c8
[ 1.358550] RDX: ffff8ae380dd80c8 RSI: 0000000000000000 RDI: ffff8ae383d82300
[ 1.358551] RBP: ffffa83480a77ca8 R08: 000000000001084f R09: ffff8ae383d82300
[ 1.358551] R10: 0000000000000002 R11: 0000000008264572 R12: 0000000000000000
[ 1.358552] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8ae3bbc6e040
[ 1.358558] ? dequeue_entities+0x7ba/0xd70
[ 1.358561] ? dequeue_entities+0x7ba/0xd70
[ 1.358563] </NMI>
[ 1.358563] <TASK>
[ 1.358567] dequeue_task_fair+0xf2/0x480
[ 1.358570] __schedule+0x998/0x13e0
[ 1.358574] ? do_wait+0x63/0x1a0
[ 1.358577] schedule+0x3e/0x140
[ 1.358578] do_wait+0x7b/0x1a0
[ 1.358581] kernel_wait4+0xc0/0x170
[ 1.358584] ? __pfx_child_wait_callback+0x10/0x10
[ 1.358588] __do_sys_wait4+0xa7/0xc0
[ 1.358594] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1.358596] do_syscall_64+0xa1/0x5f0
[ 1.358598] ? irq_exit_rcu+0x12/0x20
[ 1.358601] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1.358602] RIP: 0033:0x7fc342ca6922
[ 1.358603] Code: 08 0f 85 51 36 ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66
[ 1.358604] RSP: 002b:00007fff75f85138 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
[ 1.358605] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc342ca6922
[ 1.358606] RDX: 0000000000000000 RSI: 00007fff75f851c8 RDI: 0000000000000090
[ 1.358606] RBP: 00007fff75f85160 R08: 0000000000000000 R09: 0000000000000000
[ 1.358607] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff75f85618
[ 1.358607] R13: 0000000000000003 R14: 00007fc343411000 R15: 000055f2f8140a30
[ 1.358613] </TASK>
[ 1.427470] perf: interrupt took too long (2528 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1.450479] perf: interrupt took too long (3179 > 3160), lowering kernel.perf_event_max_sample_rate to 62000
[ 1.489552] perf: interrupt took too long (3984 > 3973), lowering kernel.perf_event_max_sample_rate to 50000
[ 1.567488] perf: interrupt took too long (4990 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
[ 1.696694] tsc: Refined TSC clocksource calibration: 4191.351 MHz
[ 1.696873] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x3c6a77879d2, max_idle_ns: 440795420607 ns
[ 1.697225] clocksource: Switched to clocksource tsc
#466 task_kptr_nmi_deadlock_repro:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
[ 2.017327] smc: removing smcd device lo
[ 2.018518] ACPI: PM: Preparing to enter system sleep state S5
[ 2.018712] reboot: Power down
I reduced this to a dedicated selftest-style reproducer.
Reproducer: https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch
This looks like an NMI-unsafety issue in the task kptr destructor path.
This may also apply to the cgroup release dtor, which I believe can also reach `call_rcu()` in this path. I haven't tried to make a reproducer for that case.
This should be fixed regardless of that specific Kconfig option: `call_rcu()` is not intended to be called from an NMI handler at all, and doing so can result in corruption.
Thanks,
Justin Suess