public inbox for linux-kernel@vger.kernel.org
From: Andrea Righi <arighi@nvidia.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Amery Hung <ameryhung@gmail.com>,
	Tejun Heo <tj@kernel.org>, Emil Tsalapatis <emil@etsalapatis.com>,
	bpf@vger.kernel.org, sched-ext@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] bpf: Always defer local storage free
Date: Tue, 17 Mar 2026 19:57:38 +0100	[thread overview]
Message-ID: <abmkIt5LD9VGiWh_@gpd4> (raw)
In-Reply-To: <CAP01T75+2GWPbuda+NUBn+oon8Zdq=_45Me8BUE8B9=xcCoqyg@mail.gmail.com>

Hi Kumar,

On Tue, Mar 17, 2026 at 07:53:04PM +0100, Kumar Kartikeya Dwivedi wrote:
> On Tue, 17 Mar 2026 at 09:16, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On Tue, Mar 17, 2026 at 07:25:18AM +0100, Andrea Righi wrote:
> > > Hi Kumar,
> > >
> > > On Tue, Mar 17, 2026 at 12:39:00AM +0100, Kumar Kartikeya Dwivedi wrote:
> > > > On Mon, 16 Mar 2026 at 23:28, Andrea Righi <arighi@nvidia.com> wrote:
> > > > >
> > > > > bpf_task_storage_delete() can be invoked from contexts that hold a
> > > > > raw spinlock, such as sched_ext's ops.exit_task() callback, which
> > > > > runs with the rq lock held.
> > > > >
> > > > > The delete path eventually calls bpf_selem_unlink(), which frees the
> > > > > element via bpf_selem_free_list() -> bpf_selem_free(). For task storage
> > > > > with use_kmalloc_nolock, call_rcu_tasks_trace() is used, which is not
> > > > > safe from raw spinlock context, triggering the following:
> > > > >
> > > >
> > > > Paul posted [0] to fix it in SRCU. It was always safe to call
> > > > call_rcu_tasks_trace() under a raw spinlock, but it became problematic
> > > > on RT with the recent conversion that uses SRCU underneath; please give
> > > > [0] a spin. While I couldn't reproduce the warning using scx_cosmos, I
> > > > verified that the warning goes away for me when exercising this path
> > > > from atomic context.
> > > >
> > > >   [0]: https://lore.kernel.org/rcu/841c8a0b-0f50-4617-98b2-76523e13b910@paulmck-laptop
> > >
> > > With this applied I get the following:
> > >
> > > [   26.986798] ======================================================
> > > [   26.986883] WARNING: possible circular locking dependency detected
> > > [   26.986957] 7.0.0-rc4-virtme #15 Not tainted
> > > [   26.987020] ------------------------------------------------------
> > > [   26.987094] schbench/532 is trying to acquire lock:
> > > [   26.987155] ffffffff9cd70d90 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}, at: raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> > > [   26.987313]
> > > [   26.987313] but task is already holding lock:
> > > [   26.987394] ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
> > > [   26.987512]
> > > [   26.987512] which lock already depends on the new lock.
> > > [   26.987512]
> > > [   26.987598]
> > > [   26.987598] the existing dependency chain (in reverse order) is:
> > > [   26.987704]
> > > [   26.987704] -> #3 (&rq->__lock){-.-.}-{2:2}:
> > > [   26.987779]        lock_acquire+0xcf/0x310
> > > [   26.987844]        _raw_spin_lock_nested+0x2e/0x40
> > > [   26.987911]        raw_spin_rq_lock_nested+0x24/0xb0
> > > [   26.987973]        ___task_rq_lock+0x42/0x110
> > > [   26.988034]        wake_up_new_task+0x198/0x440
> > > [   26.988099]        kernel_clone+0x118/0x3c0
> > > [   26.988149]        user_mode_thread+0x61/0x90
> > > [   26.988222]        rest_init+0x1e/0x160
> > > [   26.988272]        start_kernel+0x7a2/0x7b0
> > > [   26.988329]        x86_64_start_reservations+0x24/0x30
> > > [   26.988392]        x86_64_start_kernel+0xd1/0xe0
> > > [   26.988451]        common_startup_64+0x13e/0x148
> > > [   26.988523]
> > > [   26.988523] -> #2 (&p->pi_lock){-.-.}-{2:2}:
> > > [   26.988598]        lock_acquire+0xcf/0x310
> > > [   26.988650]        _raw_spin_lock_irqsave+0x39/0x60
> > > [   26.988718]        try_to_wake_up+0x57/0xbb0
> > > [   26.988779]        create_worker+0x17e/0x200
> > > [   26.988839]        workqueue_init+0x28d/0x300
> > > [   26.988902]        kernel_init_freeable+0x134/0x2b0
> > > [   26.988964]        kernel_init+0x1a/0x130
> > > [   26.989016]        ret_from_fork+0x2bd/0x370
> > > [   26.989079]        ret_from_fork_asm+0x1a/0x30
> > > [   26.989143]
> > > [   26.989143] -> #1 (&pool->lock){-.-.}-{2:2}:
> > > [   26.989217]        lock_acquire+0xcf/0x310
> > > [   26.989263]        _raw_spin_lock+0x30/0x40
> > > [   26.989315]        __queue_work+0xdb/0x6d0
> > > [   26.989367]        queue_delayed_work_on+0xc7/0xe0
> > > [   26.989427]        srcu_gp_start_if_needed+0x3cc/0x540
> > > [   26.989507]        __synchronize_srcu+0xf6/0x1b0
> > > [   26.989567]        rcu_init_tasks_generic+0xfe/0x120
> > > [   26.989626]        do_one_initcall+0x6f/0x300
> > > [   26.989691]        kernel_init_freeable+0x24b/0x2b0
> > > [   26.989750]        kernel_init+0x1a/0x130
> > > [   26.989797]        ret_from_fork+0x2bd/0x370
> > > [   26.989857]        ret_from_fork_asm+0x1a/0x30
> > > [   26.989916]
> > > [   26.989916] -> #0 (rcu_tasks_trace_srcu_struct_srcu_usage.lock){....}-{2:2}:
> > > [   26.990015]        check_prev_add+0xe1/0xd30
> > > [   26.990076]        __lock_acquire+0x1561/0x1de0
> > > [   26.990137]        lock_acquire+0xcf/0x310
> > > [   26.990182]        _raw_spin_lock_irqsave+0x39/0x60
> > > [   26.990240]        raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> > > [   26.990312]        srcu_gp_start_if_needed+0x92/0x540
> > > [   26.990370]        bpf_selem_unlink+0x267/0x5c0
> > > [   26.990430]        bpf_task_storage_delete+0x3a/0x90
> > > [   26.990495]        bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
> > > [   26.990566]        bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
> > > [   26.990636]        bpf__sched_ext_ops_exit_task+0x4b/0xa7
> > > [   26.990694]        scx_exit_task+0x17a/0x230
> > > [   26.990753]        sched_ext_dead+0xb2/0x120
> > > [   26.990811]        finish_task_switch.isra.0+0x305/0x370
> > > [   26.990870]        __schedule+0x576/0x1d60
> > > [   26.990917]        schedule+0x3a/0x130
> > > [   26.990962]        futex_do_wait+0x4a/0xa0
> > > [   26.991008]        __futex_wait+0x8e/0xf0
> > > [   26.991054]        futex_wait+0x78/0x120
> > > [   26.991099]        do_futex+0xc5/0x190
> > > [   26.991144]        __x64_sys_futex+0x12d/0x220
> > > [   26.991202]        do_syscall_64+0x117/0xf80
> > > [   26.991260]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > [   26.991318]
> > > [   26.991318] other info that might help us debug this:
> > > [   26.991318]
> > > [   26.991400] Chain exists of:
> > > [   26.991400]   rcu_tasks_trace_srcu_struct_srcu_usage.lock --> &p->pi_lock --> &rq->__lock
> > > [   26.991400]
> > > [   26.991524]  Possible unsafe locking scenario:
> > > [   26.991524]
> > > [   26.991592]        CPU0                    CPU1
> > > [   26.991647]        ----                    ----
> > > [   26.991702]   lock(&rq->__lock);
> > > [   26.991747]                                lock(&p->pi_lock);
> > > [   26.991816]                                lock(&rq->__lock);
> > > [   26.991884]   lock(rcu_tasks_trace_srcu_struct_srcu_usage.lock);
> > > [   26.991953]
> > > [   26.991953]  *** DEADLOCK ***
> > > [   26.991953]
> > > [   26.992021] 3 locks held by schbench/532:
> > > [   26.992065]  #0: ffff8df7cc154f18 (&p->pi_lock){-.-.}-{2:2}, at: _task_rq_lock+0x2c/0x100
> > > [   26.992151]  #1: ffff8df7fb9bdae0 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x24/0xb0
> > > [   26.992250]  #2: ffffffff9cd71b20 (rcu_read_lock){....}-{1:3}, at: __bpf_prog_enter+0x64/0x110
> > > [   26.992348]
> > > [   26.992348] stack backtrace:
> > > [   26.992406] CPU: 7 UID: 1000 PID: 532 Comm: schbench Not tainted 7.0.0-rc4-virtme #15 PREEMPT(full)
> > > [   26.992409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > [   26.992411] Sched_ext: cosmos_1.1.0_g0949d453c_x86_64_unknown_linux_gnu (enabled+all), task: runnable_at=+0ms
> > > [   26.992412] Call Trace:
> > > [   26.992414]  <TASK>
> > > [   26.992415]  dump_stack_lvl+0x6f/0xb0
> > > [   26.992418]  print_circular_bug.cold+0x18b/0x1d6
> > > [   26.992422]  check_noncircular+0x165/0x190
> > > [   26.992425]  check_prev_add+0xe1/0xd30
> > > [   26.992428]  __lock_acquire+0x1561/0x1de0
> > > [   26.992430]  lock_acquire+0xcf/0x310
> > > [   26.992431]  ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> > > [   26.992434]  _raw_spin_lock_irqsave+0x39/0x60
> > > [   26.992435]  ? raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> > > [   26.992437]  raw_spin_lock_irqsave_sdp_contention+0x5b/0xe0
> > > [   26.992439]  srcu_gp_start_if_needed+0x92/0x540
> > > [   26.992441]  bpf_selem_unlink+0x267/0x5c0
> > > [   26.992443]  bpf_task_storage_delete+0x3a/0x90
> > > [   26.992445]  bpf_prog_134dba630b11d3b7_scx_pmu_task_fini+0x26/0x2a
> > > [   26.992447]  bpf_prog_4b1530d9d9852432_cosmos_exit_task+0x1d/0x1f
> > > [   26.992448]  bpf__sched_ext_ops_exit_task+0x4b/0xa7
> > > [   26.992449]  scx_exit_task+0x17a/0x230
> > > [   26.992451]  sched_ext_dead+0xb2/0x120
> > > [   26.992453]  finish_task_switch.isra.0+0x305/0x370
> > > [   26.992455]  __schedule+0x576/0x1d60
> > > [   26.992457]  ? find_held_lock+0x2b/0x80
> > > [   26.992460]  schedule+0x3a/0x130
> > > [   26.992462]  futex_do_wait+0x4a/0xa0
> > > [   26.992463]  __futex_wait+0x8e/0xf0
> > > [   26.992465]  ? __pfx_futex_wake_mark+0x10/0x10
> > > [   26.992468]  futex_wait+0x78/0x120
> > > [   26.992469]  ? find_held_lock+0x2b/0x80
> > > [   26.992472]  do_futex+0xc5/0x190
> > > [   26.992473]  __x64_sys_futex+0x12d/0x220
> > > [   26.992474]  ? restore_fpregs_from_fpstate+0x48/0xd0
> > > [   26.992477]  do_syscall_64+0x117/0xf80
> > > [   26.992478]  ? __irq_exit_rcu+0x38/0xc0
> > > [   26.992481]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > > [   26.992482] RIP: 0033:0x7fe20e52eb1d
> >
> > With the following on top everything looks good on my side, let me know
> > what you think.
> >
> > Thanks,
> > -Andrea
> >
> > From: Andrea Righi <arighi@nvidia.com>
> > Subject: [PATCH] bpf: Avoid circular lock dependency when deleting local
> >  storage
> >
> > Calling bpf_task_storage_delete() from a context that holds the runqueue
> > lock (e.g., sched_ext's ops.exit_task() callback) can lead to a circular
> > lock dependency:
> >
> >  WARNING: possible circular locking dependency detected
> >  ...
> >  Chain exists of:
> >    rcu_tasks_trace_srcu_struct_srcu_usage.lock --> &p->pi_lock --> &rq->__lock
> >
> >   Possible unsafe locking scenario:
> >
> >         CPU0                    CPU1
> >         ----                    ----
> >    lock(&rq->__lock);
> >                                 lock(&p->pi_lock);
> >                                 lock(&rq->__lock);
> >    lock(rcu_tasks_trace_srcu_struct_srcu_usage.lock);
> >
> >   *** DEADLOCK ***
> >
> > Fix this by adding a reuse_now flag to bpf_selem_unlink(), with the
> > same meaning as in bpf_selem_free() and bpf_local_storage_free(). When
> > the task is in the TASK_DEAD state it will never run sleepable BPF
> > again, so it is safe to free the storage immediately via call_rcu()
> > instead of call_rcu_tasks_trace(), which avoids the circular lock
> > dependency.
> >
> > Other local storage types (sk, cgrp, inode) keep reuse_now=false and
> > continue to wait for sleepable BPF programs before freeing.
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > [...]
> 
> Thanks for the report Andrea. The bug noted by lockdep looks real, and
> Paul agrees it is something to fix, which he will look into.
> 
> https://lore.kernel.org/rcu/fe28d664-3872-40f6-83c6-818627ad5b7d@paulmck-laptop

Thanks!

> 
> The fix you provided below unfortunately can't work: we cannot free
> the selem immediately, because the program may have formed pointers to
> the local storage before calling delete. So even if the task is dead
> (which is task-specific anyway; it doesn't cover the other local
> storage types), we can still have a use-after-free after we return
> from bpf_task_storage_delete() back to the program. We discussed this
> "instant free" optimization several times in the past for local
> storage, to reduce call_rcu() pressure, and realized it cannot work
> correctly.
> 
> So the right fix would again be in SRCU: defer the work that acquires
> the pi->lock -> rq->lock chain in call_srcu() when irqs_disabled() is
> true. This should address the circular deadlock when calling it under
> the protection of rq->lock, such as in the case you hit.

Sure, I sent that "fix" just to provide more details on the issue. :)

-Andrea

      reply	other threads:[~2026-03-17 18:57 UTC|newest]

Thread overview: 8+ messages
2026-03-16 22:27 [PATCH] bpf: Always defer local storage free Andrea Righi
2026-03-16 23:07 ` bot+bpf-ci
2026-03-16 23:39 ` Kumar Kartikeya Dwivedi
2026-03-17  6:25   ` Andrea Righi
2026-03-17  8:15     ` Andrea Righi
2026-03-17 18:46       ` Cheng-Yang Chou
2026-03-17 18:53       ` Kumar Kartikeya Dwivedi
2026-03-17 18:57         ` Andrea Righi [this message]
