From: Andrea Righi <arighi@nvidia.com>
To: Samuele Mariotti <smariotti@disroot.org>
Cc: tj@kernel.org, void@manifault.com, changwoo@igalia.com,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
Paolo Valente <paolo.valente@unimore.it>
Subject: Re: [PATCH V2 1/1] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
Date: Thu, 21 May 2026 18:45:25 +0200 [thread overview]
Message-ID: <ag82paswEla1JYa8@gpd4> (raw)
In-Reply-To: <20260521105911.1814606-2-smariotti@disroot.org>
Hi Samuele,
On Thu, May 21, 2026 at 12:59:11PM +0200, Samuele Mariotti wrote:
> ops_dequeue() can race with finish_dispatch() and spuriously trigger the
> "queued task must be in BPF scheduler's custody" warning.
>
> ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire()
> and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY
> is set. The two reads are not atomic w.r.t. a concurrent
> finish_dispatch() running on another CPU:
>
> CPU 1 CPU 2
> ===== =====
> dequeue_task_scx()
> ops_dequeue()
> opss = read_acquire(ops_state)
> = SCX_OPSS_QUEUED
> finish_dispatch()
> cmpxchg ops_state:
> SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING [succeeds]
> dispatch_enqueue(SCX_DSQ_GLOBAL,
> SCX_ENQ_CLEAR_OPSS)
> call_task_dequeue()
> p->scx.flags &= ~SCX_TASK_IN_CUSTODY
> WARN_ON_ONCE(!(p->scx.flags &
> SCX_TASK_IN_CUSTODY))
> /* opss is stale: QUEUED,
> * but task already claimed */
> set_release(ops_state, SCX_OPSS_NONE)
>
> The race has been observed via two distinct call chains: the most common
> goes through sched_setaffinity(), a rarer variant through
> sched_change_begin().
>
> For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears
> SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE
> (intentional, to avoid concurrent non-atomic RMW of p->scx.flags against
> ops_dequeue()). The window between those two writes is exactly what
> ops_dequeue() observes as "QUEUED without custody".
>
> The observed state is not actually inconsistent, it just means CPU 1 has
> already claimed the task and the QUEUED value held by CPU 2 is stale.
> Re-read ops_state in that case; the next read is guaranteed to return
> SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch
> cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has
> already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED
> would require re-enqueue under p's rq lock, which CPU 2 holds.
>
> Changes in v2:
> - Use READ_ONCE() for p->scx.flags to ensure fresh reads and prevent
> compiler reordering in the lockless path
> - Add cpu_relax() to reduce power consumption and improve performance
> during the spin-wait
> - Use unlikely() to optimize branch prediction for the common case
> - Expand the in-code comment to document the race condition and
> bounded retry guarantee
>
> Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
> Suggested-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Samuele Mariotti <smariotti@disroot.org>
> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Looks good to me.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
> ---
> kernel/sched/ext.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 547ca398f646..c1762420cc35 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2078,6 +2078,7 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> /* dequeue is always temporary, don't reset runnable_at */
> clr_task_runnable(p, false);
>
> +retry:
> /* acquire ensures that we see the preceding updates on QUEUED */
> opss = atomic_long_read_acquire(&p->scx.ops_state);
>
> @@ -2091,8 +2092,20 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> */
> BUG();
> case SCX_OPSS_QUEUED:
> - /* A queued task must always be in BPF scheduler's custody */
> - WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
> + /*
> + * A queued task must always be in BPF scheduler's custody. If
> + * SCX_TASK_IN_CUSTODY is clear, finish_dispatch() on another
> + * CPU has already passed call_task_dequeue() (which clears the
> + * flag), but has not yet written SCX_OPSS_NONE. That final
> + * store does not require this rq's lock, so retrying with
> + * cpu_relax() is bounded: we will observe NONE (or DISPATCHING,
> + * handled by the fallthrough) on a subsequent iteration.
> + */
> + if (unlikely(!(READ_ONCE(p->scx.flags) & SCX_TASK_IN_CUSTODY))) {
> + cpu_relax();
> + goto retry;
> + }
> +
> if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> SCX_OPSS_NONE))
> break;
> --
> 2.54.0
>
next prev parent reply other threads:[~2026-05-21 16:45 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-21 10:59 [PATCH V2 0/1] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() Samuele Mariotti
2026-05-21 10:59 ` [PATCH V2 1/1] " Samuele Mariotti
2026-05-21 16:45 ` Andrea Righi [this message]
2026-05-21 16:29 ` [PATCH V2 0/1] " Tejun Heo
2026-05-21 16:30 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ag82paswEla1JYa8@gpd4 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=paolo.valente@unimore.it \
--cc=sched-ext@lists.linux.dev \
--cc=smariotti@disroot.org \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.