From: Samuele Mariotti <smariotti@disroot.org>
To: Andrea Righi <arighi@nvidia.com>
Cc: tj@kernel.org, void@manifault.com, changwoo@igalia.com,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
Paolo Valente <paolo.valente@unimore.it>
Subject: Re: [PATCH] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
Date: Wed, 13 May 2026 18:41:26 +0200 [thread overview]
Message-ID: <agSpf9QRx4zqhZxx@cachyos> (raw)
In-Reply-To: <agSKJbtmpRPLgYJW@gpd4>
Hi Andrea,
On 13/05/2026 16:26, Andrea Righi wrote:
>Hi Samuele,
>
>On Wed, May 13, 2026 at 11:53:29AM +0200, Samuele Mariotti wrote:
>> ops_dequeue() can race with finish_dispatch() and spuriously trigger the
>> "queued task must be in BPF scheduler's custody" warning.
>>
>> ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire()
>> and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY
>> is set. The two reads are not atomic w.r.t. a concurrent
>> finish_dispatch() running on another CPU:
>>
>> CPU 1 CPU 2
>> ===== =====
>> dequeue_task_scx()
>> ops_dequeue()
>> opss = read_acquire(ops_state)
>> = SCX_OPSS_QUEUED
>> finish_dispatch()
>> cmpxchg ops_state:
>> SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING [succeeds]
>> dispatch_enqueue(SCX_DSQ_GLOBAL,
>> SCX_ENQ_CLEAR_OPSS)
>> call_task_dequeue()
>> p->scx.flags &= ~SCX_TASK_IN_CUSTODY
>> WARN_ON_ONCE(!(p->scx.flags &
>> SCX_TASK_IN_CUSTODY))
>> /* opss is stale: QUEUED,
>> * but task already claimed */
>> set_release(ops_state, SCX_OPSS_NONE)
>>
>> The race has been observed via two distinct call chains: the most common
>> goes through sched_setaffinity(), a rarer variant through
>> sched_change_begin().
>>
>> For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears
>> SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE
>> (intentional, to avoid concurrent non-atomic RMW of p->scx.flags against
>> ops_dequeue()). The window between those two writes is exactly what
>> ops_dequeue() observes as "QUEUED without custody".
>>
>> The observed state is not actually inconsistent; it just means CPU 1 has
>> already claimed the task and the QUEUED value held by CPU 2 is stale.
>> Re-read ops_state in that case; the next read is guaranteed to return
>> SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch
>> cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has
>> already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED
>> would require re-enqueue under p's rq lock, which CPU 2 holds.
>>
>> Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
>> Suggested-by: Andrea Righi <arighi@nvidia.com>
>> Signed-off-by: Samuele Mariotti <smariotti@disroot.org>
>> Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
>> ---
>> kernel/sched/ext.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 23f7b3f63b09..d285e37f2177 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -2078,6 +2078,7 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>> /* dequeue is always temporary, don't reset runnable_at */
>> clr_task_runnable(p, false);
>>
>> +retry:
>> /* acquire ensures that we see the preceding updates on QUEUED */
>> opss = atomic_long_read_acquire(&p->scx.ops_state);
>>
>> @@ -2092,7 +2093,9 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>> BUG();
>> case SCX_OPSS_QUEUED:
>> /* A queued task must always be in BPF scheduler's custody */
>> - WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
>> + if (!(p->scx.flags & SCX_TASK_IN_CUSTODY))
>> + goto retry;
>
>Can we add a cpu_relax() before the goto? A hot spin polling two cachelines from
>another CPU could be very unkind to SMT siblings and bus traffic.
>
>Moreover, we completely lose the original WARN_ON_ONCE(), so we don't catch the
>case where the invariant QUEUED -> IN_CUSTODY is violated by a real bug. How
>about adding a max retries as well, i.e., something like this:
>
> int retries = 0;
>
> ...
>retry:
> ...
> if (!(p->scx.flags & SCX_TASK_IN_CUSTODY) &&
> !WARN_ON_ONCE(retries++ >= 128)) {
> cpu_relax();
> goto retry;
> }
>
>> +
>> if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>> SCX_OPSS_NONE))
>> break;
>> --
>> 2.54.0
>>
>
>Thanks,
>-Andrea
Thanks for the suggestion. I agree with adding cpu_relax() and the
retry limit to preserve the original WARN_ON_ONCE() as a safety net
for real bugs.
While touching this path, I would also make the read of p->scx.flags a
READ_ONCE(), preventing the compiler from caching the value across retries
and ensuring each iteration observes the latest value written by the
concurrent finish_dispatch(). I would also lower the retry limit from 128
to 4: the maximum number of retries observed empirically is 1, so 4 gives
a comfortable safety margin without spinning for unnecessarily long.
Something like this:
if (!(READ_ONCE(p->scx.flags) & SCX_TASK_IN_CUSTODY) &&
!WARN_ON_ONCE(retries++ >= 4)) {
cpu_relax();
goto retry;
}
Let me know if this looks good to you.
Thanks,
Samuele
Thread overview: 9+ messages
2026-05-13 9:53 [PATCH] sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() Samuele Mariotti
2026-05-13 14:26 ` Andrea Righi
2026-05-13 16:41 ` Samuele Mariotti [this message]
2026-05-13 16:49 ` Andrea Righi
2026-05-13 20:01 ` Tejun Heo
2026-05-14 9:13 ` Samuele Mariotti
2026-05-14 20:08 ` Andrea Righi
2026-05-15 10:12 ` Samuele Mariotti
2026-05-14 4:00 ` sashiko-bot