From: Andrea Righi <arighi@nvidia.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
John Fastabend <john.fastabend@gmail.com>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Amery Hung <ameryhung@gmail.com>,
Tejun Heo <tj@kernel.org>, Emil Tsalapatis <emil@etsalapatis.com>,
bpf@vger.kernel.org, sched-ext@lists.linux.dev,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] bpf: Always defer local storage free
Date: Tue, 17 Mar 2026 07:25:03 +0100
Message-ID: <abjzvz_tL_siV17s@gpd4>
In-Reply-To: <CAP01T75eKpvw+95NqNWg9P-1+kzVzojpN0NLat+28SF1B9wQQQ@mail.gmail.com>
Hi Kumar,
On Tue, Mar 17, 2026 at 12:39:00AM +0100, Kumar Kartikeya Dwivedi wrote:
> On Mon, 16 Mar 2026 at 23:28, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > bpf_task_storage_delete() can be invoked from contexts that hold a raw
> > spinlock, such as sched_ext's ops.exit_task() callback, which runs
> > with the rq lock held.
> >
> > The delete path eventually calls bpf_selem_unlink(), which frees the
> > element via bpf_selem_free_list() -> bpf_selem_free(). For task storage
> > with use_kmalloc_nolock, call_rcu_tasks_trace() is used, which is not
> > safe from raw spinlock context, triggering the following:
> >
>
> Paul posted [0] to fix it in SRCU. It was always safe to call
> call_rcu_tasks_trace() under a raw spinlock, but it became problematic
> on RT with the recent conversion that uses SRCU underneath. Please
> give [0] a spin. While I couldn't reproduce the warning using
> scx_cosmos, I verified that it goes away for me when calling the path
> from atomic context.
>
> [0]: https://lore.kernel.org/rcu/841c8a0b-0f50-4617-98b2-76523e13b910@paulmck-laptop
With this applied, I get the following:
[ 26.986798] ======================================================
[ 26.986883] WARNING: possible circular locking dependency detected
[ 26.986957] 7.0.0-rc4-virtme #15 Not tainted
[ 26.987020] ------------------------------------------------------
[ 26.987094] schbench/532 is trying to acquire lock:
[ 26.987155] ffffffff9cd70d90 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}, at: raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.987313]
[ 26.987313] but task is already holding lock:
[ 26.987394] ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
[ 26.987512]
[ 26.987512] which lock already depends on the new lock.
[ 26.987512]
[ 26.987598]
[ 26.987598] the existing dependency chain (in reverse order) is:
[ 26.987704]
[ 26.987704] -> #3 (&rq->__lock){-.-.}-{2:2}:
[ 26.987779] lock_acquire+0xcf/0x310
[ 26.987844] _raw_spin_lock_nested+0x2e/0x40
[ 26.987911] raw_spin_rq_lock_nested+0x24/0xb0
[ 26.987973] ___task_rq_lock+0x42/0x110
[ 26.988034] wake_up_new_task+0x198/0x440
[ 26.988099] kernel_clone+0x118/0x3c0
[ 26.988149] user_mode_thread+0x61/0x90
[ 26.988222] rest_init+0x1e/0x160
[ 26.988272] start_kernel+0x7a2/0x7b0
[ 26.988329] x86_64_start_reservations+0x24/0x30
[ 26.988392] x86_64_start_kernel+0xd1/0xe0
[ 26.988451] common_startup_64+0x13e/0x148
[ 26.988523]
[ 26.988523] -> #2 (&p->pi_lock){-.-.}-{2:2}:
[ 26.988598] lock_acquire+0xcf/0x310
[ 26.988650] _raw_spin_lock_irqsave+0x39/0x60
[ 26.988718] try_to_wake_up+0x57/0xbb0
[ 26.988779] create_worker+0x17e/0x200
[ 26.988839] workqueue_init+0x28d/0x300
[ 26.988902] kernel_init_freeable+0x134/0x2b0
[ 26.988964] kernel_init+0x1a/0x130
[ 26.989016] ret_from_fork+0x2bd/0x370
[ 26.989079] ret_from_fork_asm+0x1a/0x30
[ 26.989143]
[ 26.989143] -> #1 (&pool->lock){-.-.}-{2:2}:
[ 26.989217] lock_acquire+0xcf/0x310
[ 26.989263] _raw_spin_lock+0x30/0x40
[ 26.989315] __queue_work+0xdb/0x6d0
[ 26.989367] queue_delayed_work_on+0xc7/0xe0
[ 26.989427] srcu_gp_start_if_needed+0x3cc/0x540
[ 26.989507] __synchronize_srcu+0xf6/0x1b0
[ 26.989567] rcu_init_tasks_generic+0xfe/0x120
[ 26.989626] do_one_initcall+0x6f/0x300
[ 26.989691] kernel_init_freeable+0x24b/0x2b0
[ 26.989750] kernel_init+0x1a/0x130
[ 26.989797] ret_from_fork+0x2bd/0x370
[ 26.989857] ret_from_fork_asm+0x1a/0x30
[ 26.989916]
[ 26.989916] -> #0 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}:
[ 26.990015] check_prev_add+0xe1/0xd30
[ 26.990076] __lock_acquire+0x1561/0x1de0
[ 26.990137] lock_acquire+0xcf/0x310
[ 26.990182] _raw_spin_lock_irqsave+0x39/0x60
[ 26.990240] raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.990312] srcu_gp_start_if_needed+0x92/0x540
[ 26.990370] bpf_selem_unlink+0x267/0x5c0
[ 26.990430] bpf_task_storage_delete+0x3a/0x90
[ 26.990495] bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
[ 26.990566] bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
[ 26.990636] bpf__sched_ext_ops_exit_task+0x4b/0xa7
[ 26.990694] scx_exit_task+0x17a/0x230
[ 26.990753] sched_ext_dead+0xb2/0x120
[ 26.990811] finish_task_switch.isra.0+0x305/0x370
[ 26.990870] __schedule+0x576/0x1d60
[ 26.990917] schedule+0x3a/0x130
[ 26.990962] futex_do_wait+0x4a/0xa0
[ 26.991008] __futex_wait+0x8e/0xf0
[ 26.991054] futex_wait+0x78/0x120
[ 26.991099] do_futex+0xc5/0x190
[ 26.991144] __x64_sys_futex+0x12d/0x220
[ 26.991202] do_syscall_64+0x117/0xf80
[ 26.991260] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 26.991318]
[ 26.991318] other info that might help us debug this:
[ 26.991318]
[ 26.991400] Chain exists of:
[ 26.991400] rcu_tasks_trace_srcu_struct_srcu_usage.lock --> &p->pi_lock --> &rq->__lock
[ 26.991400]
[ 26.991524] Possible unsafe locking scenario:
[ 26.991524]
[ 26.991592] CPU0 CPU1
[ 26.991647] ---- ----
[ 26.991702] lock(&rq->__lock);
[ 26.991747] lock(&p->pi_lock);
[ 26.991816] lock(&rq->__lock);
[ 26.991884] lock(rcu_tasks_trace_srcu_struct_srcu_usage.lock);
[ 26.991953]
[ 26.991953] *** DEADLOCK ***
[ 26.991953]
[ 26.992021] 3 locks held by schbench/532:
[ 26.992065] #0: ffff8df7cc154f18 (&p->pi_lock){-.-.}-{2:2}, at: _task_rq_lock+0x2c/0x100
[ 26.992151] #1: ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
[ 26.992250] #2: ffffffff9cd71b20 (rcu_read_lock){....}-{1:3}, at: __bpf_prog_enter+0x64/0x110
[ 26.992348]
[ 26.992348] stack backtrace:
[ 26.992406] CPU: 7 UID: 1000 PID: 532 Comm: schbench Not tainted 7.0.0-rc4-virtme #15 PREEMPT(full)
[ 26.992409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 26.992411] Sched_ext: cosmos_1.1.0_g0949d453c_x86_64_unknown_linux_gnu (enabled+all), task: runnable_at=+0ms
[ 26.992412] Call Trace:
[ 26.992414]  <TASK>
[ 26.992415] dump_stack_lvl+0x6f/0xb0
[ 26.992418] print_circular_bug.cold+0x18b/0x1d6
[ 26.992422] check_noncircular+0x165/0x190
[ 26.992425] check_prev_add+0xe1/0xd30
[ 26.992428] __lock_acquire+0x1561/0x1de0
[ 26.992430] lock_acquire+0xcf/0x310
[ 26.992431] ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.992434] _raw_spin_lock_irqsave+0x39/0x60
[ 26.992435] ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.992437] raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
[ 26.992439] srcu_gp_start_if_needed+0x92/0x540
[ 26.992441] bpf_selem_unlink+0x267/0x5c0
[ 26.992443] bpf_task_storage_delete+0x3a/0x90
[ 26.992445] bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
[ 26.992447] bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
[ 26.992448] bpf__sched_ext_ops_exit_task+0x4b/0xa7
[ 26.992449] scx_exit_task+0x17a/0x230
[ 26.992451] sched_ext_dead+0xb2/0x120
[ 26.992453] finish_task_switch.isra.0+0x305/0x370
[ 26.992455] __schedule+0x576/0x1d60
[ 26.992457] ? find_held_lock+0x2b/0x80
[ 26.992460] schedule+0x3a/0x130
[ 26.992462] futex_do_wait+0x4a/0xa0
[ 26.992463] __futex_wait+0x8e/0xf0
[ 26.992465] ? __pfx_futex_wake_mark+0x10/0x10
[ 26.992468] futex_wait+0x78/0x120
[ 26.992469] ? find_held_lock+0x2b/0x80
[ 26.992472] do_futex+0xc5/0x190
[ 26.992473] __x64_sys_futex+0x12d/0x220
[ 26.992474] ? restore_fpregs_from_fpstate+0x48/0xd0
[ 26.992477] do_syscall_64+0x117/0xf80
[ 26.992478] ? __irq_exit_rcu+0x38/0xc0
[ 26.992481] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 26.992482] RIP: 0033:0x7fe20e52eb1d
I can easily reproduce this (or the previous one) inside virtme-ng:
$ cat << EOF > /tmp/config
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_KALLSYMS_ALL=y
CONFIG_FUNCTION_TRACER=y
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SCHED_CORE=y
CONFIG_SCHED_MC=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_PROVE_LOCKING=y
CONFIG_BPF_EVENTS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DEBUG_FS=y
CONFIG_IKHEADERS=y
CONFIG_IKCONFIG_PROC=y
CONFIG_IKCONFIG=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_EXT_GROUP_SCHED=y
CONFIG_DEBUG_INFO=y
EOF
$ vng -vb --config /tmp/config
$ vng -v -- "scx_cosmos & schbench -L -m 4 -t 48 -n 0"
Thanks,
-Andrea
>
> > =============================
> > [ BUG: Invalid wait context ]
> > 7.0.0-rc1-virtme #1 Not tainted
> > -----------------------------
> > (udev-worker)/115 is trying to lock:
> > ffffffffa6970dd0 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{3:3}, at: spin_lock_irqsave_ssp_contention+0x54/0x90
> > other info that might help us debug this:
> > context-{5:5}
> > 3 locks held by (udev-worker)/115:
> > #0: ffff8e16c634ce58 (&p->pi_lock){-.-.}-{2:2}, at: _task_rq_lock+0x2c/0x100
> > #1: ffff8e16fbdbdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
> > #2: ffffffffa6971b60 (rcu_read_lock){....}-{1:3}, at: __bpf_prog_enter+0x64/0x110
> > ...
> > Sched_ext: cosmos_1.0.7_g780e898fc_dirty_x86_64_unknown_linux_gnu (enabled+all), task: runnable_at=-2ms
> > Call Trace:
> > dump_stack_lvl+0x6f/0xb0
> > __lock_acquire+0xf86/0x1de0
> > lock_acquire+0xcf/0x310
> > _raw_spin_lock_irqsave+0x39/0x60
> > spin_lock_irqsave_ssp_contention+0x54/0x90
> > srcu_gp_start_if_needed+0x2a7/0x490
> > bpf_selem_unlink+0x24b/0x590
> > bpf_task_storage_delete+0x3a/0x90
> > bpf_prog_3b623b4be76cfb86_scx_pmu_task_fini+0x26/0x2a
> > bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
> > bpf__sched_ext_ops_exit_task+0x4b/0xa7
> > __scx_disable_and_exit_task+0x10a/0x200
> > scx_disable_and_exit_task+0xe/0x60
> >
> > Fix by deferring memory deallocation to ensure it occurs outside the raw
> > spinlock context.
> >
> > Fixes: f484f4a3e058 ("bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage")
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > [...]
2026-03-16 22:27 [PATCH] bpf: Always defer local storage free Andrea Righi
2026-03-16 23:07 ` bot+bpf-ci
2026-03-16 23:39 ` Kumar Kartikeya Dwivedi
2026-03-17 6:25 ` Andrea Righi [this message]
2026-03-17 8:15 ` Andrea Righi
2026-03-17 18:46 ` Cheng-Yang Chou
2026-03-17 18:53 ` Kumar Kartikeya Dwivedi
2026-03-17 18:57 ` Andrea Righi