From: Andrea Righi <arighi@nvidia.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Amery Hung <ameryhung@gmail.com>,
Tejun Heo <tj@kernel.org>, Emil Tsalapatis <emil@etsalapatis.com>,
bpf@vger.kernel.org, sched-ext@lists.linux.dev,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] bpf: Always defer local storage free
Date: Tue, 17 Mar 2026 07:25:03 +0100 [thread overview]
Message-ID: <abjzvz_tL_siV17s@gpd4> (raw)
In-Reply-To: <CAP01T75eKpvw+95NqNWg9P-1+kzVzojpN0NLat+28SF1B9wQQQ@mail.gmail.com>
Hi Kumar,
On Tue, Mar 17, 2026 at 12:39:00AM +0100, Kumar Kartikeya Dwivedi wrote:
> On Mon, 16 Mar 2026 at 23:28, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > bpf_task_storage_delete() can be invoked from contexts that hold a raw
> > spinlock, such as sched_ext's ops.exit_task() callback, which runs
> > with the rq lock held.
> >
> > The delete path eventually calls bpf_selem_unlink(), which frees the
> > element via bpf_selem_free_list() -> bpf_selem_free(). For task storage
> > with use_kmalloc_nolock, call_rcu_tasks_trace() is used, which is not
> > safe from raw spinlock context, triggering the following:
> >
>
> Paul posted [0] to fix it in SRCU. It was always safe to call
> call_rcu_tasks_trace() under a raw spin lock, but it became problematic
> on RT with the recent conversion that uses SRCU underneath, so please
> give [0] a spin. While I couldn't reproduce the warning using
> scx_cosmos, I verified that it goes away for me when calling the path
> from atomic context.
>
> [0]: https://lore.kernel.org/rcu/841c8a0b-0f50-4617-98b2-76523e13b910@paulmck-laptop
With this applied I get the following:
[ 26.986798] ======================================================
[ 26.986883] WARNING: possible circular locking dependency detected
[ 26.986957] 7.0.0-rc4-virtme #15 Not tainted
[ 26.987020] ------------------------------------------------------
[ 26.987094] schbench/532 is trying to acquire lock:
[ 26.987155] ffffffff9cd70d90 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}, at: raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.987313]
[ 26.987313] but task is already holding lock:
[ 26.987394] ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
[ 26.987512]
[ 26.987512] which lock already depends on the new lock.
[ 26.987512]
[ 26.987598]
[ 26.987598] the existing dependency chain (in reverse order) is:
[ 26.987704]
[ 26.987704] -> #3 (&rq->__lock){-.-.}-{2:2}:
[ 26.987779] lock_acquire+0xcf/0x310
[ 26.987844] _raw_spin_lock_nested+0x2e/0x40
[ 26.987911] raw_spin_rq_lock_nested+0x24/0xb0
[ 26.987973] ___task_rq_lock+0x42/0x110
[ 26.988034] wake_up_new_task+0x198/0x440
[ 26.988099] kernel_clone+0x118/0x3c0
[ 26.988149] user_mode_thread+0x61/0x90
[ 26.988222] rest_init+0x1e/0x160
[ 26.988272] start_kernel+0x7a2/0x7b0
[ 26.988329] x86_64_start_reservations+0x24/0x30
[ 26.988392] x86_64_start_kernel+0xd1/0xe0
[ 26.988451] common_startup_64+0x13e/0x148
[ 26.988523]
[ 26.988523] -> #2 (&p->pi_lock){-.-.}-{2:2}:
[ 26.988598] lock_acquire+0xcf/0x310
[ 26.988650] _raw_spin_lock_irqsave+0x39/0x60
[ 26.988718] try_to_wake_up+0x57/0xbb0
[ 26.988779] create_worker+0x17e/0x200
[ 26.988839] workqueue_init+0x28d/0x300
[ 26.988902] kernel_init_freeable+0x134/0x2b0
[ 26.988964] kernel_init+0x1a/0x130
[ 26.989016] ret_from_fork+0x2bd/0x370
[ 26.989079] ret_from_fork_asm+0x1a/0x30
[ 26.989143]
[ 26.989143] -> #1 (&pool->lock){-.-.}-{2:2}:
[ 26.989217] lock_acquire+0xcf/0x310
[ 26.989263] _raw_spin_lock+0x30/0x40
[ 26.989315] __queue_work+0xdb/0x6d0
[ 26.989367] queue_delayed_work_on+0xc7/0xe0
[ 26.989427] srcu_gp_start_if_needed+0x3cc/0x540
[ 26.989507] __synchronize_srcu+0xf6/0x1b0
[ 26.989567] rcu_init_tasks_generic+0xfe/0x120
[ 26.989626] do_one_initcall+0x6f/0x300
[ 26.989691] kernel_init_freeable+0x24b/0x2b0
[ 26.989750] kernel_init+0x1a/0x130
[ 26.989797] ret_from_fork+0x2bd/0x370
[ 26.989857] ret_from_fork_asm+0x1a/0x30
[ 26.989916]
[ 26.989916] -> #0 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}:
[ 26.990015] check_prev_add+0xe1/0xd30
[ 26.990076] __lock_acquire+0x1561/0x1de0
[ 26.990137] lock_acquire+0xcf/0x310
[ 26.990182] _raw_spin_lock_irqsave+0x39/0x60
[ 26.990240] raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.990312] srcu_gp_start_if_needed+0x92/0x540
[ 26.990370] bpf_selem_unlink+0x267/0x5c0
[ 26.990430] bpf_task_storage_delete+0x3a/0x90
[ 26.990495] bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
[ 26.990566] bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
[ 26.990636] bpf__sched_ext_ops_exit_task+0x4b/0xa7
[ 26.990694] scx_exit_task+0x17a/0x230
[ 26.990753] sched_ext_dead+0xb2/0x120
[ 26.990811] finish_task_switch.isra.0+0x305/0x370
[ 26.990870] __schedule+0x576/0x1d60
[ 26.990917] schedule+0x3a/0x130
[ 26.990962] futex_do_wait+0x4a/0xa0
[ 26.991008] __futex_wait+0x8e/0xf0
[ 26.991054] futex_wait+0x78/0x120
[ 26.991099] do_futex+0xc5/0x190
[ 26.991144] __x64_sys_futex+0x12d/0x220
[ 26.991202] do_syscall_64+0x117/0xf80
[ 26.991260] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 26.991318]
[ 26.991318] other info that might help us debug this:
[ 26.991318]
[ 26.991400] Chain exists of:
[ 26.991400] rcu_tasks_trace_srcu_struct_srcu_usage.lock --> &p->pi_lock --> &rq->__lock
[ 26.991400]
[ 26.991524] Possible unsafe locking scenario:
[ 26.991524]
[ 26.991592] CPU0 CPU1
[ 26.991647] ---- ----
[ 26.991702] lock(&rq->__lock);
[ 26.991747] lock(&p->pi_lock);
[ 26.991816] lock(&rq->__lock);
[ 26.991884] lock(rcu_tasks_trace_srcu_struct_srcu_usage.lock);
[ 26.991953]
[ 26.991953] *** DEADLOCK ***
[ 26.991953]
[ 26.992021] 3 locks held by schbench/532:
[ 26.992065] #0: ffff8df7cc154f18 (&p->pi_lock){-.-.}-{2:2}, at: _task_rq_lock+0x2c/0x100
[ 26.992151] #1: ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
[ 26.992250] #2: ffffffff9cd71b20 (rcu_read_lock){....}-{1:3}, at: __bpf_prog_enter+0x64/0x110
[ 26.992348]
[ 26.992348] stack backtrace:
[ 26.992406] CPU: 7 UID: 1000 PID: 532 Comm: schbench Not tainted 7.0.0-rc4-virtme #15 PREEMPT(full)
[ 26.992409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   26.992411] Sched_ext: cosmos_1.1.0_g0949d453c_x86_64_unknown_linux_gnu (enabled+all), task: runnable_at=+0ms
[ 26.992412] Call Trace:
[   26.992414]  <TASK>
[ 26.992415] dump_stack_lvl+0x6f/0xb0
[ 26.992418] print_circular_bug.cold+0x18b/0x1d6
[ 26.992422] check_noncircular+0x165/0x190
[ 26.992425] check_prev_add+0xe1/0xd30
[ 26.992428] __lock_acquire+0x1561/0x1de0
[ 26.992430] lock_acquire+0xcf/0x310
[ 26.992431] ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.992434] _raw_spin_lock_irqsave+0x39/0x60
[ 26.992435] ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.992437] raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.992439] srcu_gp_start_if_needed+0x92/0x540
[ 26.992441] bpf_selem_unlink+0x267/0x5c0
[ 26.992443] bpf_task_storage_delete+0x3a/0x90
[ 26.992445] bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
[ 26.992447] bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
[ 26.992448] bpf__sched_ext_ops_exit_task+0x4b/0xa7
[ 26.992449] scx_exit_task+0x17a/0x230
[ 26.992451] sched_ext_dead+0xb2/0x120
[ 26.992453] finish_task_switch.isra.0+0x305/0x370
[ 26.992455] __schedule+0x576/0x1d60
[ 26.992457] ? find_held_lock+0x2b/0x80
[ 26.992460] schedule+0x3a/0x130
[ 26.992462] futex_do_wait+0x4a/0xa0
[ 26.992463] __futex_wait+0x8e/0xf0
[ 26.992465] ? __pfx_futex_wake_mark+0x10/0x10
[ 26.992468] futex_wait+0x78/0x120
[ 26.992469] ? find_held_lock+0x2b/0x80
[ 26.992472] do_futex+0xc5/0x190
[ 26.992473] __x64_sys_futex+0x12d/0x220
[ 26.992474] ? restore_fpregs_from_fpstate+0x48/0xd0
[ 26.992477] do_syscall_64+0x117/0xf80
[ 26.992478] ? __irq_exit_rcu+0x38/0xc0
[ 26.992481] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 26.992482] RIP: 0033:0x7fe20e52eb1d
I can easily reproduce this (or the previous one) inside virtme-ng:
$ cat << EOF > /tmp/config
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_KALLSYMS_ALL=y
CONFIG_FUNCTION_TRACER=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SCHED_CORE=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_PROVE_LOCKING=y
CONFIG_BPF_EVENTS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DEBUG_FS=y
CONFIG_IKHEADERS=y
CONFIG_IKCONFIG_PROC=y
CONFIG_IKCONFIG=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_EXT_GROUP_SCHED=y
CONFIG_DEBUG_INFO=y
EOF
$ vng -vb --config /tmp/config
$ vng -v -- "scx_cosmos & schbench -L -m 4 -t 48 -n 0"
Thanks,
-Andrea
>
> > =============================
> > [ BUG: Invalid wait context ]
> > 7.0.0-rc1-virtme #1 Not tainted
> > -----------------------------
> > (udev-worker)/115 is trying to lock:
> > ffffffffa6970dd0 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{3:3}, at: spin_lock_irqsave_ssp_contention+0x54/0x90
> > other info that might help us debug this:
> > context-{5:5}
> > 3 locks held by (udev-worker)/115:
> > #0: ffff8e16c634ce58 (&p->pi_lock){-.-.}-{2:2}, at: _task_rq_lock+0x2c/0x100
> > #1: ffff8e16fbdbdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
> > #2: ffffffffa6971b60 (rcu_read_lock){....}-{1:3}, at: __bpf_prog_enter+0x64/0x110
> > ...
> > Sched_ext: cosmos_1.0.7_g780e898fc_dirty_x86_64_unknown_linux_gnu (enabled+all), task: runnable_at=-2ms
> > Call Trace:
> > dump_stack_lvl+0x6f/0xb0
> > __lock_acquire+0xf86/0x1de0
> > lock_acquire+0xcf/0x310
> > _raw_spin_lock_irqsave+0x39/0x60
> > spin_lock_irqsave_ssp_contention+0x54/0x90
> > srcu_gp_start_if_needed+0x2a7/0x490
> > bpf_selem_unlink+0x24b/0x590
> > bpf_task_storage_delete+0x3a/0x90
> > bpf_prog_3b623b4be76cfb86_scx_pmu_task_fini+0x26/0x2a
> > bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
> > bpf__sched_ext_ops_exit_task+0x4b/0xa7
> > __scx_disable_and_exit_task+0x10a/0x200
> > scx_disable_and_exit_task+0xe/0x60
> >
> > Fix by deferring memory deallocation to ensure it occurs outside the raw
> > spinlock context.
> >
> > Fixes: f484f4a3e058 ("bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage")
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > [...]
Thread overview: 8+ messages
2026-03-16 22:27 [PATCH] bpf: Always defer local storage free Andrea Righi
2026-03-16 23:07 ` bot+bpf-ci
2026-03-16 23:39 ` Kumar Kartikeya Dwivedi
2026-03-17 6:25 ` Andrea Righi [this message]
2026-03-17 8:15 ` Andrea Righi
2026-03-17 18:46 ` Cheng-Yang Chou
2026-03-17 18:53 ` Kumar Kartikeya Dwivedi
2026-03-17 18:57 ` Andrea Righi