[PATCH] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Samuele Mariotti <smariotti@disroot.org>
To: arighi@nvidia.com, tj@kernel.org, void@manifault.com,
	changwoo@igalia.com
Cc: sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
	Samuele Mariotti <smariotti@disroot.org>,
	Paolo Valente <paolo.valente@unimore.it>
Subject: [PATCH] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
Date: Wed, 13 May 2026 11:53:29 +0200	[thread overview]
Message-ID: <20260513095329.4029345-1-smariotti@disroot.org> (raw)

ops_dequeue() can race with finish_dispatch() and spuriously trigger the
"queued task must be in BPF scheduler's custody" warning.

ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire()
and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY
is set. The two reads are not atomic w.r.t. a concurrent
finish_dispatch() running on another CPU:

CPU 1                                    CPU 2
=====                                    =====
                                         dequeue_task_scx()
                                           ops_dequeue()
                                             opss = read_acquire(ops_state)
                                                  = SCX_OPSS_QUEUED
finish_dispatch()
  cmpxchg ops_state:
    SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING  [succeeds]
  dispatch_enqueue(SCX_DSQ_GLOBAL,
                   SCX_ENQ_CLEAR_OPSS)
    call_task_dequeue()
      p->scx.flags &= ~SCX_TASK_IN_CUSTODY
                                             WARN_ON_ONCE(!(p->scx.flags &
                                                     SCX_TASK_IN_CUSTODY))
                                            /* opss is stale: QUEUED,
                                             * but task already claimed */
    set_release(ops_state, SCX_OPSS_NONE)

The race has been observed via two distinct call chains: the most common
goes through sched_setaffinity(), a rarer variant through
sched_change_begin().

For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears
SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE
(intentional, to avoid concurrent non-atomic RMW of p->scx.flags against
ops_dequeue()). The window between those two writes is exactly what
ops_dequeue() observes as "QUEUED without custody".

The observed state is not actually inconsistent, it just means CPU 1 has
already claimed the task and the QUEUED value held by CPU 2 is stale.
Re-read ops_state in that case; the next read is guaranteed to return
SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch
cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has
already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED
would require re-enqueue under p's rq lock, which CPU 2 holds.

Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Samuele Mariotti <smariotti@disroot.org>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
---
 kernel/sched/ext.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 23f7b3f63b09..d285e37f2177 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2078,6 +2078,7 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 	/* dequeue is always temporary, don't reset runnable_at */
 	clr_task_runnable(p, false);

+retry:
 	/* acquire ensures that we see the preceding updates on QUEUED */
 	opss = atomic_long_read_acquire(&p->scx.ops_state);

@@ -2092,7 +2093,9 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		BUG();
 	case SCX_OPSS_QUEUED:
 		/* A queued task must always be in BPF scheduler's custody */
-		WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
+		if (!(p->scx.flags & SCX_TASK_IN_CUSTODY))
+			goto retry;
+
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
 			break;
-- 
2.54.0

next             reply	other threads:[~2026-05-13  9:56 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-13  9:53 Samuele Mariotti [this message]
2026-05-13 14:26 ` [PATCH] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() Andrea Righi
2026-05-13 16:41   ` Samuele Mariotti
2026-05-13 16:49     ` Andrea Righi
2026-05-13 20:01     ` Tejun Heo
2026-05-14  9:13       ` Samuele Mariotti
2026-05-14 20:08         ` Andrea Righi
2026-05-14  4:00 ` sashiko-bot

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:23f7b3f63b0 dfblob:d285e37f217 )
 OR (
bs:"[PATCH] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260513095329.4029345-1-smariotti@disroot.org \
    --to=smariotti@disroot.org \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paolo.valente@unimore.it \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.