From: Justin Suess <utilityemal77@gmail.com>
To: bpf@vger.kernel.org
Cc: ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
eddyz87@gmail.com, memxor@gmail.com, martin.lau@linux.dev,
yonghong.song@linux.dev, jolsa@kernel.org
Subject: [BUG] bpf: Soft lockup / panic triggered by bpf_task_release_dtor from NMI on rcu_nocbs CPU
Date: Tue, 21 Apr 2026 16:10:33 -0400
Message-ID: <20260421201035.1729473-1-utilityemal77@gmail.com>
Hello,
I found a reproducible soft lockup / panic involving BPF task kptr destruction from NMI context.
I found it while investigating a Sashiko report on my patch:
https://lore.kernel.org/bpf/20260420203306.3107246-1-utilityemal77@gmail.com/T/#t
A reproducer derived from the BPF selftests triggers the issue. It:
1. Stores exited task references in a BPF hash map as refcounted task kptrs.
2. Deletes those kptrs from a `tp_btf/nmi_handler` program.
3. Runs on an `rcu_nocbs` CPU.
In my setup this eventually triggers a soft lockup and panic in a workqueue thread stuck in:
`perf_sched_delayed`
` -> static_key_disable()`
` -> arch_jump_label_transform_apply()`
` -> smp_text_poke_batch_finish()`
` -> on_each_cpu_cond_mask()`
` -> smp_call_function_many_cond()`
The trigger appears to be that `bpf_task_release_dtor()` can run in NMI context and reach the last-reference `put_task_struct_rcu_user()` path, which calls `call_rcu()`, on a CPU whose RCU callbacks are offloaded.
The affected code path is the destructor that runs when the last reference to a `task_struct` kptr is deleted:
`bpf_map_delete_elem()`
` -> htab_map_delete_elem()`
` -> free_htab_elem()`
` -> bpf_obj_free_fields()`
` -> bpf_task_release_dtor()`
` -> put_task_struct_rcu_user()`
` -> call_rcu()`
This is triggered from:
`tp_btf/nmi_handler`
` -> clear_task_kptrs_from_nmi` (the reproducer's BPF program)
Environment
- x86_64 QEMU VM
- PREEMPT(full)
- `CONFIG_RCU_EXPERT=y`
- `CONFIG_RCU_NOCB_CPU=y`
- booted with `rcu_nocbs=1-7`
(enabling `CONFIG_RCU_NOCB_CPU` makes the issue easier to reproduce)
Observed result
- watchdog reports a soft lockup
- kernel panics with `Kernel panic - not syncing: softlockup: hung tasks`
- the stuck task is a kworker running `perf_sched_delayed`
Logs:
env TASK_KPTR_NMI_DEADLOCK_REPRO=1 ./test_progs -t task_kptr_nmi_deadlock_repro
[ 1.336781] bpf_testmod: loading out-of-tree module taints kernel.
[ 1.336961] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
[ 1.358431]
[ 1.358433] ================================
[ 1.358433] WARNING: inconsistent lock state
[ 1.358434] 7.0.0-11169-ge4ef174588b8-dirty #16 Tainted: G OE
[ 1.358435] --------------------------------
[ 1.358436] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 1.358436] test_progs/134 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 1.358438] ffff8ae3bbc6f0e8 (&rdp->nocb_lock){....}-{2:2}, at: __call_rcu_common.constprop.0+0x316/0x740
[ 1.358445] {INITIAL USE} state was registered at:
[ 1.358445] lock_acquire+0xc0/0x2d0
[ 1.358448] _raw_spin_lock+0x33/0x50
[ 1.358451] rcu_nocb_gp_kthread+0x13b/0xbb0
[ 1.358453] kthread+0x10d/0x140
[ 1.358456] ret_from_fork+0x26d/0x330
[ 1.358458] ret_from_fork_asm+0x1a/0x30
[ 1.358465] irq event stamp: 47398
[ 1.358465] hardirqs last enabled at (47397): [<ffffffff8a1397cb>] _raw_spin_unlock_irqrestore+0x4b/0x60
[ 1.358467] hardirqs last disabled at (47398): [<ffffffff8a12ca1d>] __schedule+0xb1d/0x13e0
[ 1.358469] softirqs last enabled at (47298): [<ffffffff8928d770>] fpu_clone+0x80/0x210
[ 1.358471] softirqs last disabled at (47296): [<ffffffff8928d740>] fpu_clone+0x50/0x210
[ 1.358473]
[ 1.358473] other info that might help us debug this:
[ 1.358473] Possible unsafe locking scenario:
[ 1.358473]
[ 1.358473] CPU0
[ 1.358474] ----
[ 1.358474] lock(&rdp->nocb_lock);
[ 1.358475] <Interrupt>
[ 1.358475] lock(&rdp->nocb_lock);
[ 1.358476]
[ 1.358476] *** DEADLOCK ***
[ 1.358476]
[ 1.358476] 1 lock held by test_progs/134:
[ 1.358477] #0: ffff8ae3bbc6dfe0 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x119/0x13e0
[ 1.358480]
[ 1.358480] stack backtrace:
[ 1.358482] CPU: 1 UID: 0 PID: 134 Comm: test_progs Tainted: G OE 7.0.0-11169-ge4ef174588b8-dirty #16 PREEMPT(full)
[ 1.358484] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1.358484] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[ 1.358486] Call Trace:
[ 1.358487] <NMI>
[ 1.358488] dump_stack_lvl+0x63/0x90
[ 1.358491] print_usage_bug.part.0+0x233/0x2d0
[ 1.358495] lock_acquire+0x28b/0x2d0
[ 1.358497] ? __call_rcu_common.constprop.0+0x316/0x740
[ 1.358501] _raw_spin_lock+0x33/0x50
[ 1.358502] ? __call_rcu_common.constprop.0+0x316/0x740
[ 1.358505] __call_rcu_common.constprop.0+0x316/0x740
[ 1.358509] bpf_obj_free_fields+0x129/0x260
[ 1.358514] free_htab_elem+0x8d/0xe0
[ 1.358518] htab_map_delete_elem+0x16b/0x240
[ 1.358522] bpf_prog_f6a7136050cb5431_clear_task_kptrs_from_nmi+0xb3/0x144
[ 1.358524] bpf_trace_run3+0x11b/0x2f0
[ 1.358527] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 1.358530] ? __pfx_perf_event_nmi_handler+0x10/0x10
[ 1.358531] nmi_handle.part.0+0x15c/0x260
[ 1.358536] default_do_nmi+0x12f/0x190
[ 1.358539] exc_nmi+0xeb/0x120
[ 1.358542] end_repeat_nmi+0xf/0x53
[ 1.358544] RIP: 0010:dequeue_entities+0x7ba/0xd70
[ 1.358547] Code: fa ff ff f6 45 b4 01 0f 84 75 fa ff ff 4c 89 ff e8 6b 41 ff ff 4d 8b ad a8 00 00 00 49 83 3f 00 0f 84 6d fa ff ff 49 8b 47 40 <49> 8b 57 50 48 c7 c3 ff ff ff ff 48 85 c0 0f 84 54 05 00 00 48 85
[ 1.358548] RSP: 0018:ffffa83480a77c30 EFLAGS: 00000006
[ 1.358549] RAX: ffff8ae382fe4590 RBX: 0000000000000009 RCX: ffff8ae382fe45c8
[ 1.358550] RDX: ffff8ae380dd80c8 RSI: 0000000000000000 RDI: ffff8ae383d82300
[ 1.358551] RBP: ffffa83480a77ca8 R08: 000000000001084f R09: ffff8ae383d82300
[ 1.358551] R10: 0000000000000002 R11: 0000000008264572 R12: 0000000000000000
[ 1.358552] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8ae3bbc6e040
[ 1.358558] ? dequeue_entities+0x7ba/0xd70
[ 1.358561] ? dequeue_entities+0x7ba/0xd70
[ 1.358563] </NMI>
[ 1.358563] <TASK>
[ 1.358567] dequeue_task_fair+0xf2/0x480
[ 1.358570] __schedule+0x998/0x13e0
[ 1.358574] ? do_wait+0x63/0x1a0
[ 1.358577] schedule+0x3e/0x140
[ 1.358578] do_wait+0x7b/0x1a0
[ 1.358581] kernel_wait4+0xc0/0x170
[ 1.358584] ? __pfx_child_wait_callback+0x10/0x10
[ 1.358588] __do_sys_wait4+0xa7/0xc0
[ 1.358594] ? srso_alias_return_thunk+0x5/0xfbef5
[ 1.358596] do_syscall_64+0xa1/0x5f0
[ 1.358598] ? irq_exit_rcu+0x12/0x20
[ 1.358601] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1.358602] RIP: 0033:0x7fc342ca6922
[ 1.358603] Code: 08 0f 85 51 36 ff ff 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 66
[ 1.358604] RSP: 002b:00007fff75f85138 EFLAGS: 00000246 ORIG_RAX: 000000000000003d
[ 1.358605] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc342ca6922
[ 1.358606] RDX: 0000000000000000 RSI: 00007fff75f851c8 RDI: 0000000000000090
[ 1.358606] RBP: 00007fff75f85160 R08: 0000000000000000 R09: 0000000000000000
[ 1.358607] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff75f85618
[ 1.358607] R13: 0000000000000003 R14: 00007fc343411000 R15: 000055f2f8140a30
[ 1.358613] </TASK>
[ 1.427470] perf: interrupt took too long (2528 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 1.450479] perf: interrupt took too long (3179 > 3160), lowering kernel.perf_event_max_sample_rate to 62000
[ 1.489552] perf: interrupt took too long (3984 > 3973), lowering kernel.perf_event_max_sample_rate to 50000
[ 1.567488] perf: interrupt took too long (4990 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
[ 1.696694] tsc: Refined TSC clocksource calibration: 4191.351 MHz
[ 1.696873] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x3c6a77879d2, max_idle_ns: 440795420607 ns
[ 1.697225] clocksource: Switched to clocksource tsc
#466 task_kptr_nmi_deadlock_repro:OK
Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
[ 2.017327] smc: removing smcd device lo
[ 2.018518] ACPI: PM: Preparing to enter system sleep state S5
[ 2.018712] reboot: Power down
I reduced this to a dedicated selftest-style reproducer.
Reproducer: https://gist.githubusercontent.com/RazeLighter777/5539336d79ab1854f9e9550c6dcab118/raw/082f1eeb2dd445936e64dd3a33861764690bde82/task_struct_dtor_deadlock.patch
This looks like an NMI-unsafety issue in the task kptr destructor path.
It may also apply to the cgroup release dtor, which I believe can also reach `call_rcu()` on this path. I have not tried to write a reproducer for that case.
This should be fixed regardless of that specific Kconfig option: `call_rcu()` is never meant to be called from an NMI handler, and doing so can result in corruption.
Thanks,
Justin Suess