Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
Date: Thu, 5 Feb 2026 17:40:05 +0100	[thread overview]
Message-ID: <aYTH5XT0YFkznqvv@gpd4> (raw)
In-Reply-To: <aYPW121Y6vGtmtST@slm.duckdns.org>

On Wed, Feb 04, 2026 at 01:31:35PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 04, 2026 at 12:06:39AM +0100, Andrea Righi wrote:
> ...
> >   CPU0                                      CPU1
> >   ----                                      ----
> >                                             task_rq_lock(p)
> >   if (cpumask_test_cpu(cpu, p->cpus_ptr))
> >                                             set_cpus_allowed_scx(p, new_mask)
> >                                             task_rq_unlock(p)
> >       scx_bpf_dsq_insert(p,
> >               SCX_DSQ_LOCAL_ON | cpu, 0)
> > 
> > Fix this by extending the existing qseq invalidation mechanism to also
> > cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> > occurring between dispatch decision and finalization.
> > 
> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> > re-dispatched using up-to-date affinity information.
> 
> It shouldn't be returned, right? set_cpus_allowed() dequeues and
> re-enqueues. What the seq invalidation detected is dequeue racing the async
> dispatch and the invalidation means that the task was dequeued while on the
> async buffer (to be re-enqueued once the property change is complete). It
> should just be ignored.

Yeah, the only downside is that the scheduler doesn't know that the task
has been re-enqueued due to a failed dispatch, but that's probably fine for
now.

> 
> >  static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  					 struct task_struct *p, u64 enq_flags,
> > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  	if (dst_dsq->id == SCX_DSQ_LOCAL) {
> >  		dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> >  		if (src_rq != dst_rq &&
> > -		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > -			dst_dsq = find_global_dsq(sch, p);
> > -			dst_rq = src_rq;
> > +		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > +			/*
> > +			 * Task affinity changed after dispatch decision:
> > +			 * drop the dispatch, caller will handle returning
> > +			 * the task to its original DSQ.
> > +			 */
> > +			return NULL;
> ...
> > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> >  	}
> >  
> >  	if (src_rq != dst_rq &&
> > -	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > -		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> > -				 enq_flags | SCX_ENQ_CLEAR_OPSS);
> > +	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > +		/*
> > +		 * Task affinity changed after dispatch decision: drop the
> > +		 * dispatch, task remains in its current state and will be
> > +		 * dispatched again in a future cycle.
> > +		 */
> > +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> > +					(atomic_long_read(&p->scx.ops_state) &
> > +					 SCX_OPSS_QSEQ_MASK));
> 
> I don't quite follow why we need the above slippery behavior. The qseq
> invalidation, if reliable, means that there's no race window if the BPF
> scheduler correctly implements ops.dequeue() (after the kernel side fixes it
> of course).
> 
> ie. The BPF scheduler is reponsible for synchronizing its
> scx_bpf_dsq_insert() call against whatever it needs to do in ops.dequeue().
> If ops.dequeue() wins, the task shouldn't be inserted in the first place. If
> ops.dequeue() loses, the qseq invalidation should kill it while on async
> buffer if it wins over finish_dispatch(). If finish_dispatch() wins, the
> task will just be dequeued from the inserted DSQ or the property change will
> happen while the task is running.
> 
> Now, maybe we want to allow BPF schedulre to be lax about ops.dequeue()
> synchronization and let things slide (probably optionally w/ an OPS flag),
> but for that, falling back to global DSQ is fine, no?

I think the problem with the global DSQ fallback is that we're essentially
ignoring a request from the BPF scheduler to dispatch a task to a specific
CPU. Moreover, the global DSQ can potentially introduce starvation: if a
task is silently dispatched to the global DSQ and the BPF scheduler keeps
dispatching tasks to the local DSQs, the task waiting in the global DSQ
will never be consumed.

> 
> > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >  				 struct affinity_context *ac)
> >  {
> >  	struct scx_sched *sch = scx_root;
> > +	struct rq *rq = task_rq(p);
> > +
> > +	lockdep_assert_rq_held(rq);
> >  
> >  	set_cpus_allowed_common(p, ac);
> >  
> >  	if (unlikely(!sch))
> >  		return;
> >  
> > +	/*
> > +	 * Affinity changes invalidate any pending dispatch decisions made
> > +	 * with the old affinity. Increment the runqueue's ops_qseq and
> > +	 * update the task's qseq to invalidate in-flight dispatches.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_QUEUED) {
> > +		unsigned long opss;
> > +
> > +		rq->scx.ops_qseq++;
> > +		opss = atomic_long_read(&p->scx.ops_state);
> > +		atomic_long_set(&p->scx.ops_state,
> > +				(opss & SCX_OPSS_STATE_MASK) |
> > +				(rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> > +	}
> > +
> 
> I wonder whether we should define an invalid qseq and use that instead. The
> queueing instance really is invalid after this and it would help catching
> cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't
> dequeue_task_scx() or ops_dequeue() be a better place to shoot down the
> enqueued instances? While the symptom we most immediately see are through
> cpumask changes, the underlying problem is dequeue not shooting down
> existing enqueued tasks.

I think I like the idea of having an INVALID_QSEQ or similar, it'd also
make debugging easier.

I'm not sure about moving the logic to dequeue_task_scx(), more exactly,
I'm not sure if there're nasty locking implications. I'll do some
experiments, if it works, sure, dequeue would be a better place to cancel
invalid enqueued instances.

Thanks,
-Andrea

next prev parent reply	other threads:[~2026-02-05 16:40 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi
2026-02-04 13:20 ` Kuba Piecuch
2026-02-04 15:36   ` Andrea Righi
2026-02-04 16:58     ` Kuba Piecuch
2026-02-04 17:56       ` Andrea Righi
2026-02-05 17:20         ` Kuba Piecuch
2026-02-05 17:37           ` Andrea Righi
2026-02-04 15:07 ` Christian Loehle
2026-02-04 23:31 ` Tejun Heo
2026-02-05  1:15   ` Tejun Heo
2026-02-05 16:40   ` Andrea Righi [this message]
2026-02-05 22:57     ` Tejun Heo
2026-02-06  8:43       ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYTH5XT0YFkznqvv@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.