From: Andrea Righi <arighi@nvidia.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Amery Hung <ameryhung@gmail.com>,
	Tejun Heo <tj@kernel.org>, Emil Tsalapatis <emil@etsalapatis.com>,
	bpf@vger.kernel.org, sched-ext@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] bpf: Always defer local storage free
Date: Tue, 17 Mar 2026 09:15:49 +0100	[thread overview]
Message-ID: <abkNtYPMARqxXibW@gpd4> (raw)
In-Reply-To: <abjzvz_tL_siV17s@gpd4>

On Tue, Mar 17, 2026 at 07:25:18AM +0100, Andrea Righi wrote:
> Hi Kumar,
> 
> On Tue, Mar 17, 2026 at 12:39:00AM +0100, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 16 Mar 2026 at 23:28, Andrea Righi <arighi@nvidia.com> wrote:
> > >
> > > bpf_task_storage_delete() can be invoked from contexts that hold a raw
> > > spinlock, such as sched_ext's ops.exit_task() callback, which runs
> > > with the rq lock held.
> > >
> > > The delete path eventually calls bpf_selem_unlink(), which frees the
> > > element via bpf_selem_free_list() -> bpf_selem_free(). For task storage
> > > with use_kmalloc_nolock, call_rcu_tasks_trace() is used, which is not
> > > safe from raw spinlock context, triggering the following:
> > >
> > 
> > Paul posted [0] to fix this in SRCU. It was always safe to call
> > call_rcu_tasks_trace() under a raw spin lock, but it became problematic
> > on RT with the recent conversion that uses SRCU underneath; please give
> > [0] a spin. While I couldn't reproduce the warning using scx_cosmos, I
> > verified that the warning goes away for me when calling the path from
> > atomic context.
> > 
> >   [0]: https://lore.kernel.org/rcu/841c8a0b-0f50-4617-98b2-76523e13b910@paulmck-laptop
> 
> With this applied I get the following:
> 
> [   26.986798] ======================================================
> [   26.986883] WARNING: possible circular locking dependency detected
> [   26.986957] 7.0.0-rc4-virtme #15 Not tainted
> [   26.987020] ------------------------------------------------------
> [   26.987094] schbench/532 is trying to acquire lock:
> [   26.987155] ffffffff9cd70d90 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}, at: raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> [   26.987313]
> [   26.987313] but task is already holding lock:
> [   26.987394] ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
> [   26.987512]
> [   26.987512] which lock already depends on the new lock.
> [   26.987512]
> [   26.987598]
> [   26.987598] the existing dependency chain (in reverse order) is:
> [   26.987704]
> [   26.987704] -> #3 (&rq->__lock){-.-.}-{2:2}:
> [   26.987779]        lock_acquire+0xcf/0x310
> [   26.987844]        _raw_spin_lock_nested+0x2e/0x40
> [   26.987911]        raw_spin_rq_lock_nested+0x24/0xb0
> [   26.987973]        ___task_rq_lock+0x42/0x110
> [   26.988034]        wake_up_new_task+0x198/0x440
> [   26.988099]        kernel_clone+0x118/0x3c0
> [   26.988149]        user_mode_thread+0x61/0x90
> [   26.988222]        rest_init+0x1e/0x160
> [   26.988272]        start_kernel+0x7a2/0x7b0
> [   26.988329]        x86_64_start_reservations+0x24/0x30
> [   26.988392]        x86_64_start_kernel+0xd1/0xe0
> [   26.988451]        common_startup_64+0x13e/0x148
> [   26.988523]
> [   26.988523] -> #2 (&p->pi_lock){-.-.}-{2:2}:
> [   26.988598]        lock_acquire+0xcf/0x310
> [   26.988650]        _raw_spin_lock_irqsave+0x39/0x60
> [   26.988718]        try_to_wake_up+0x57/0xbb0
> [   26.988779]        create_worker+0x17e/0x200
> [   26.988839]        workqueue_init+0x28d/0x300
> [   26.988902]        kernel_init_freeable+0x134/0x2b0
> [   26.988964]        kernel_init+0x1a/0x130
> [   26.989016]        ret_from_fork+0x2bd/0x370
> [   26.989079]        ret_from_fork_asm+0x1a/0x30
> [   26.989143]
> [   26.989143] -> #1 (&pool->lock){-.-.}-{2:2}:
> [   26.989217]        lock_acquire+0xcf/0x310
> [   26.989263]        _raw_spin_lock+0x30/0x40
> [   26.989315]        __queue_work+0xdb/0x6d0
> [   26.989367]        queue_delayed_work_on+0xc7/0xe0
> [   26.989427]        srcu_gp_start_if_needed+0x3cc/0x540
> [   26.989507]        __synchronize_srcu+0xf6/0x1b0
> [   26.989567]        rcu_init_tasks_generic+0xfe/0x120
> [   26.989626]        do_one_initcall+0x6f/0x300
> [   26.989691]        kernel_init_freeable+0x24b/0x2b0
> [   26.989750]        kernel_init+0x1a/0x130
> [   26.989797]        ret_from_fork+0x2bd/0x370
> [   26.989857]        ret_from_fork_asm+0x1a/0x30
> [   26.989916]
> [   26.989916] -> #0 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}:
> [   26.990015]        check_prev_add+0xe1/0xd30
> [   26.990076]        __lock_acquire+0x1561/0x1de0
> [   26.990137]        lock_acquire+0xcf/0x310
> [   26.990182]        _raw_spin_lock_irqsave+0x39/0x60
> [   26.990240]        raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> [   26.990312]        srcu_gp_start_if_needed+0x92/0x540
> [   26.990370]        bpf_selem_unlink+0x267/0x5c0
> [   26.990430]        bpf_task_storage_delete+0x3a/0x90
> [   26.990495]        bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
> [   26.990566]        bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
> [   26.990636]        bpf__sched_ext_ops_exit_task+0x4b/0xa7
> [   26.990694]        scx_exit_task+0x17a/0x230
> [   26.990753]        sched_ext_dead+0xb2/0x120
> [   26.990811]        finish_task_switch.isra.0+0x305/0x370
> [   26.990870]        __schedule+0x576/0x1d60
> [   26.990917]        schedule+0x3a/0x130
> [   26.990962]        futex_do_wait+0x4a/0xa0
> [   26.991008]        __futex_wait+0x8e/0xf0
> [   26.991054]        futex_wait+0x78/0x120
> [   26.991099]        do_futex+0xc5/0x190
> [   26.991144]        __x64_sys_futex+0x12d/0x220
> [   26.991202]        do_syscall_64+0x117/0xf80
> [   26.991260]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   26.991318]
> [   26.991318] other info that might help us debug this:
> [   26.991318]
> [   26.991400] Chain exists of:
> [   26.991400]   rcu_tasks_trace_srcu_struct_srcu_usage.lock --> &p->pi_lock --> &rq->__lock
> [   26.991400]
> [   26.991524]  Possible unsafe locking scenario:
> [   26.991524]
> [   26.991592]        CPU0                    CPU1
> [   26.991647]        ----                    ----
> [   26.991702]   lock(&rq->__lock);
> [   26.991747]                                lock(&p->pi_lock);
> [   26.991816]                                lock(&rq->__lock);
> [   26.991884]   lock(rcu_tasks_trace_srcu_struct_srcu_usage.lock);
> [   26.991953]
> [   26.991953]  *** DEADLOCK ***
> [   26.991953]
> [   26.992021] 3 locks held by schbench/532:
> [   26.992065]  #0: ffff8df7cc154f18 (&p->pi_lock){-.-.}-{2:2}, at: _task_rq_lock+0x2c/0x100
> [   26.992151]  #1: ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
> [   26.992250]  #2: ffffffff9cd71b20 (rcu_read_lock){....}-{1:3}, at: __bpf_prog_enter+0x64/0x110
> [   26.992348]
> [   26.992348] stack backtrace:
> [   26.992406] CPU: 7 UID: 1000 PID: 532 Comm: schbench Not tainted 7.0.0-rc4-virtme #15 PREEMPT(full)
> [   26.992409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   26.992411] Sched_ext: cosmos_1.1.0_g0949d453c_x86_64_unknown_linux_gnu (enabled+all), task: runnable_at=+0ms
> [   26.992412] Call Trace:
> [   26.992414]  <TASK>
> [   26.992415]  dump_stack_lvl+0x6f/0xb0
> [   26.992418]  print_circular_bug.cold+0x18b/0x1d6
> [   26.992422]  check_noncircular+0x165/0x190
> [   26.992425]  check_prev_add+0xe1/0xd30
> [   26.992428]  __lock_acquire+0x1561/0x1de0
> [   26.992430]  lock_acquire+0xcf/0x310
> [   26.992431]  ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> [   26.992434]  _raw_spin_lock_irqsave+0x39/0x60
> [   26.992435]  ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> [   26.992437]  raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> [   26.992439]  srcu_gp_start_if_needed+0x92/0x540
> [   26.992441]  bpf_selem_unlink+0x267/0x5c0
> [   26.992443]  bpf_task_storage_delete+0x3a/0x90
> [   26.992445]  bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
> [   26.992447]  bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
> [   26.992448]  bpf__sched_ext_ops_exit_task+0x4b/0xa7
> [   26.992449]  scx_exit_task+0x17a/0x230
> [   26.992451]  sched_ext_dead+0xb2/0x120
> [   26.992453]  finish_task_switch.isra.0+0x305/0x370
> [   26.992455]  __schedule+0x576/0x1d60
> [   26.992457]  ? find_held_lock+0x2b/0x80
> [   26.992460]  schedule+0x3a/0x130
> [   26.992462]  futex_do_wait+0x4a/0xa0
> [   26.992463]  __futex_wait+0x8e/0xf0
> [   26.992465]  ? __pfx_futex_wake_mark+0x10/0x10
> [   26.992468]  futex_wait+0x78/0x120
> [   26.992469]  ? find_held_lock+0x2b/0x80
> [   26.992472]  do_futex+0xc5/0x190
> [   26.992473]  __x64_sys_futex+0x12d/0x220
> [   26.992474]  ? restore_fpregs_from_fpstate+0x48/0xd0
> [   26.992477]  do_syscall_64+0x117/0xf80
> [   26.992478]  ? __irq_exit_rcu+0x38/0xc0
> [   26.992481]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   26.992482] RIP: 0033:0x7fe20e52eb1d

With the following applied on top, everything looks good on my side. Let me
know what you think.

Thanks,
-Andrea

From: Andrea Righi <arighi@nvidia.com>
Subject: [PATCH] bpf: Avoid circular lock dependency when deleting local
 storage

Calling bpf_task_storage_delete() from a context that holds the runqueue
lock (e.g., sched_ext's ops.exit_task() callback) can lead to a circular
lock dependency:

 WARNING: possible circular locking dependency detected
 ...
 Chain exists of:
   rcu_tasks_trace_srcu_struct_srcu_usage.lock --> &p->pi_lock --> &rq->__lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&rq->__lock);
                                lock(&p->pi_lock);
                                lock(&rq->__lock);
   lock(rcu_tasks_trace_srcu_struct_srcu_usage.lock);

  *** DEADLOCK ***

Fix this by adding a reuse_now flag to bpf_selem_unlink(), with the same
meaning as in bpf_selem_free() and bpf_local_storage_free(). A task in
the TASK_DEAD state will never run sleepable BPF again, so its storage
can safely be freed immediately via call_rcu() instead of
call_rcu_tasks_trace(), which avoids the circular lock dependency.

The other local storage types (sk, cgrp, inode) pass reuse_now=false and
keep waiting for sleepable BPF programs to finish before freeing.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 include/linux/bpf_local_storage.h | 2 +-
 kernel/bpf/bpf_cgrp_storage.c     | 2 +-
 kernel/bpf/bpf_inode_storage.c    | 2 +-
 kernel/bpf/bpf_local_storage.c    | 6 +++---
 kernel/bpf/bpf_task_storage.c     | 7 ++++++-
 net/core/bpf_sk_storage.c         | 2 +-
 6 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 8157e8da61d40..f5d4159646a83 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -184,7 +184,7 @@ int bpf_local_storage_map_check_btf(struct bpf_map *map,
 void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
 				   struct bpf_local_storage_elem *selem);
 
-int bpf_selem_unlink(struct bpf_local_storage_elem *selem);
+int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now);
 
 int bpf_selem_link_map(struct bpf_local_storage_map *smap,
 		       struct bpf_local_storage *local_storage,
diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
index c2a2ead1f466d..853183eead2c2 100644
--- a/kernel/bpf/bpf_cgrp_storage.c
+++ b/kernel/bpf/bpf_cgrp_storage.c
@@ -89,7 +89,7 @@ static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
-	return bpf_selem_unlink(SELEM(sdata));
+	return bpf_selem_unlink(SELEM(sdata), false);
 }
 
 static long bpf_cgrp_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index e86734609f3d2..470f4b02c79ea 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -110,7 +110,7 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
-	return bpf_selem_unlink(SELEM(sdata));
+	return bpf_selem_unlink(SELEM(sdata), false);
 }
 
 static long bpf_fd_inode_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 9c96a4477f81a..caa1aa5bc17c7 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -385,7 +385,7 @@ static void bpf_selem_link_map_nolock(struct bpf_local_storage_map_bucket *b,
  * Unlink an selem from map and local storage with lock held.
  * This is the common path used by local storages to delete an selem.
  */
-int bpf_selem_unlink(struct bpf_local_storage_elem *selem)
+int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
 {
 	struct bpf_local_storage *local_storage;
 	bool free_local_storage = false;
@@ -419,10 +419,10 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem)
 out:
 	raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
 
-	bpf_selem_free_list(&selem_free_list, false);
+	bpf_selem_free_list(&selem_free_list, reuse_now);
 
 	if (free_local_storage)
-		bpf_local_storage_free(local_storage, false);
+		bpf_local_storage_free(local_storage, reuse_now);
 
 	return err;
 }
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index 605506792b5b4..0311e2cd3f3e6 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -134,7 +134,12 @@ static int task_storage_delete(struct task_struct *task, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
-	return bpf_selem_unlink(SELEM(sdata));
+	/*
+	 * When the task is dead it won't run sleepable BPF again, so it is
+	 * safe to reuse storage immediately.
+	 */
+	return bpf_selem_unlink(SELEM(sdata),
+				READ_ONCE(task->__state) == TASK_DEAD);
 }
 
 static long bpf_pid_task_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index f8338acebf077..d20b4b5c99ef7 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -40,7 +40,7 @@ static int bpf_sk_storage_del(struct sock *sk, struct bpf_map *map)
 	if (!sdata)
 		return -ENOENT;
 
-	return bpf_selem_unlink(SELEM(sdata));
+	return bpf_selem_unlink(SELEM(sdata), false);
 }
 
 /* Called by __sk_destruct() & bpf_sk_storage_clone() */
-- 
2.53.0


Thread overview: 8+ messages
2026-03-16 22:27 [PATCH] bpf: Always defer local storage free Andrea Righi
2026-03-16 23:07 ` bot+bpf-ci
2026-03-16 23:39 ` Kumar Kartikeya Dwivedi
2026-03-17  6:25   ` Andrea Righi
2026-03-17  8:15     ` Andrea Righi [this message]
2026-03-17 18:46       ` Cheng-Yang Chou
2026-03-17 18:53       ` Kumar Kartikeya Dwivedi
2026-03-17 18:57         ` Andrea Righi
