From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Christian Loehle <christian.loehle@arm.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Daniel Hodges <hodgesd@meta.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
Date: Fri, 6 Feb 2026 09:43:20 +0100 [thread overview]
Message-ID: <aYWpqDtzjci42qa0@gpd4> (raw)
In-Reply-To: <aYUgQGvzJsoEaDZL@slm.duckdns.org>
On Thu, Feb 05, 2026 at 12:57:04PM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote:
> ...
> > > It shouldn't be returned, right? set_cpus_allowed() dequeues and
> > > re-enqueues. What the seq invalidation detected is dequeue racing the async
> > > dispatch and the invalidation means that the task was dequeued while on the
> > > async buffer (to be re-enqueued once the property change is complete). It
> > > should just be ignored.
> >
> > Yeah, the only downside is that the scheduler doesn't know that the task
> > has been re-enqueued due to a failed dispatch, but that's probably fine for
> > now.
>
> Yeah, but does that matter? Consider the following three scenarios:
>
> A. Task gets dispatched into local DSQ, CPU mask gets updated while in async
> buffer, the dispatch is ignored and then the task gets re-enqueued later.
>
> B. The same as A but the CPU mask update happens after the task lands in the
> local DSQ but before starts executing.
>
> C. Task gets dispatched into local DSQ and starts running, CPU mask gets
> updated so that the task can't run on the current CPU anymore, migration
> task preempts the task and it gets enqueued.
>
> A and B woould be indistinguishible from BPF sched's POV. C would be a bit
> different in that the task would transition through ops->running/stopping().
>
> I don't see anything significantly different across the three scenarios -
> the task was dispatched but cpumask got updated and the scheduler needs to
> place it again.
Ack, knowing that an enqueue is coming from a failed dispatch or a regular
running/stopping transition probably doesn't matter. And the scheduler can
probably try to infer this information, if needed.
>
> ...
> > > Now, maybe we want to allow BPF schedulre to be lax about ops.dequeue()
> > > synchronization and let things slide (probably optionally w/ an OPS flag),
> > > but for that, falling back to global DSQ is fine, no?
> >
> > I think the problem with the global DSQ fallback is that we're essentially
> > ignoring a request from the BPF scheduler to dispatch a task to a specific
> > CPU. Moreover, the global DSQ can potentially introduce starvation: if a
> > task is silently dispatched to the global DSQ and the BPF scheduler keeps
> > dispatching tasks to the local DSQs, the task waiting in the global DSQ
> > will never be consumed.
>
> While starvation is possible, it's not very likely:
>
> - ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs
> unless they're idle.
>
> - ops.dispatch() is only called after global DSQ is drained.
>
> If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other
> tasks waiting, it's gonna stall whether we fall back to global DSQ or not.
Well, if a scheduler is only using DD's with SCX_DSQ_LOCAL[_ON] and a
per-CPU task ends up in the global DSQ the chance of starvation is not that
remote.
If we re-enqueue the task in the regular enqueue path, without the global
DSQ fallback, the task will be dispatched again by the BPF scheduler via
SCX_DSQ_LOCAL[_ON] and it's less likely to be starved.
I think if we just drop the dispatch without the global DSQ fallback
everything should work.
>
> But, taking a step back, the sloppy fallback behavior is secondary. What
> really matters is once we fix ops.dequeue(), can the BPF scheduler properly
> synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask
> or migration disabled state mismatches? If so, ops.dequeue() would be the
> primary way to deal with these issues.
Agreed, we should handle this in ops_dequeue().
>
> Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that
> indicates the scheduler isn't taking property changes into account at all,
> but that's really secondary. Let's first focus on making ops.dequeue()
> working properly so that the BPF scheduler can synchronize correctly.
Ack.
>
> ...
> > > I wonder whether we should define an invalid qseq and use that instead. The
> > > queueing instance really is invalid after this and it would help catching
> > > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> > > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> > > enqueued instances? While the symptom we most immediately see are through
> > > cpumask changes, the underlying problem is dequeue not shooting down
> > > existing enqueued tasks.
> >
> > I think I like the idea of having an INVALID_QSEQ or similar, it'd also
> > make debugging easier.
> >
> > I'm not sure about moving the logic to dequeue_task_scx(), more exactly,
> > I'm not sure if there're nasty locking implications. I'll do some
> > experiments, if it works, sure, dequeue would be a better place to cancel
> > invalid enqueued instances.
>
> I was confused while writing above. All of the above is already happening.
> When a task is dequeued, it's OPSS is cleared and the task won't be eligible
> for dispatching anymore. The only "confused" case is where the task finishes
> reenqueueing before the previous dispatch attempt is finished, which the BPF
> scheduler should be able to handle once ops.dequeue() is fixed.
Yeah I think we can handle this in the dequeue path. I think I have a new
working (maybe) patch. Will run some tests and send the new version later
today.
Thanks,
-Andrea
prev parent reply other threads:[~2026-02-06 8:43 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi
2026-02-04 13:20 ` Kuba Piecuch
2026-02-04 15:36 ` Andrea Righi
2026-02-04 16:58 ` Kuba Piecuch
2026-02-04 17:56 ` Andrea Righi
2026-02-05 17:20 ` Kuba Piecuch
2026-02-05 17:37 ` Andrea Righi
2026-02-04 15:07 ` Christian Loehle
2026-02-04 23:31 ` Tejun Heo
2026-02-05 1:15 ` Tejun Heo
2026-02-05 16:40 ` Andrea Righi
2026-02-05 22:57 ` Tejun Heo
2026-02-06 8:43 ` Andrea Righi [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aYWpqDtzjci42qa0@gpd4 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=christian.loehle@arm.com \
--cc=emil@etsalapatis.com \
--cc=hodgesd@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.