public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
Date: Wed, 4 Feb 2026 16:36:56 +0100	[thread overview]
Message-ID: <aYNnmI26oS7YNuMP@gpd4> (raw)
In-Reply-To: <DG67IMOVU2GX.2CKP769EXJS12@google.com>

Hi Kuba,

On Wed, Feb 04, 2026 at 01:20:52PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote:
> > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> > target CPU. However, task affinity can change between the dispatch
> > decision and its finalization in finish_dispatch(). When this happens,
> > the scheduler may attempt to dispatch a task to a CPU that is no longer
> > allowed, resulting in fatal errors such as:
> >
> >  EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
> >
> > This race exists because ops.dispatch() runs without holding the task's
> > run queue lock, allowing a concurrent set_cpus_allowed() to update
> > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> > then finalized using stale affinity information.
> >
> > Example timeline:
> >
> >   CPU0                                      CPU1
> >   ----                                      ----
> >                                             task_rq_lock(p)
> >   if (cpumask_test_cpu(cpu, p->cpus_ptr))
> >                                             set_cpus_allowed_scx(p, new_mask)
> >                                             task_rq_unlock(p)
> >       scx_bpf_dsq_insert(p,
> >               SCX_DSQ_LOCAL_ON | cpu, 0)
> >
> > Fix this by extending the existing qseq invalidation mechanism to also
> > cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> > occurring between dispatch decision and finalization.
> 
> I'm not convinced this will prevent the exact race scenario you describe.
> For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs
> to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the
> dispatch buffer entry.
> 
> We can still have the cpumask change and qseq increment on CPU1 happen between
> cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be
> detected because the dispatch buffer entry will contain the incremented value.

Yeah, I think you're right, it makes things better, but it's still racy.

> 
> >
> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> > re-dispatched using up-to-date affinity information.
> 
> How will the scheduler know that the dispatch was dropped? Is the scheduler
> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> on CPU1?

The idea was that, if the dispatch is dropped, we'll see another
ops.enqueue() for the task, so at least the task is not "lost" and the
BPF scheduler gets another chance what to do with it. In this case it'd be
useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
the enqueue resulted from a dropped dispatch.

> 
> >
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 47 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0ab994180f655..6128a2529a7c7 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> >   * will change. As @p's task_rq is locked, this function doesn't need to use the
> >   * holding_cpu mechanism.
> >   *
> > - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
> > - * return value, is locked.
> > + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
> > + * return value, is locked. On failure (affinity change invalidated the move),
> > + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
> >   */
> >  static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  					 struct task_struct *p, u64 enq_flags,
> > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  	if (dst_dsq->id == SCX_DSQ_LOCAL) {
> >  		dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
> >  		if (src_rq != dst_rq &&
> > -		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > -			dst_dsq = find_global_dsq(sch, p);
> > -			dst_rq = src_rq;
> > +		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > +			/*
> > +			 * Task affinity changed after dispatch decision:
> > +			 * drop the dispatch, caller will handle returning
> > +			 * the task to its original DSQ.
> > +			 */
> > +			return NULL;
> >  		}
> >  	} else {
> >  		/* no need to migrate if destination is a non-local DSQ */
> > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> >  	}
> >  
> >  	if (src_rq != dst_rq &&
> > -	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> > -		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> > -				 enq_flags | SCX_ENQ_CLEAR_OPSS);
> > +	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> > +		/*
> > +		 * Task affinity changed after dispatch decision: drop the
> > +		 * dispatch, task remains in its current state and will be
> > +		 * dispatched again in a future cycle.
> > +		 */
> > +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> > +					(atomic_long_read(&p->scx.ops_state) &
> > +					 SCX_OPSS_QSEQ_MASK));
> >  		return;
> >  	}
> >  
> > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
> >  				 struct affinity_context *ac)
> >  {
> >  	struct scx_sched *sch = scx_root;
> > +	struct rq *rq = task_rq(p);
> > +
> > +	lockdep_assert_rq_held(rq);
> >  
> >  	set_cpus_allowed_common(p, ac);
> >  
> >  	if (unlikely(!sch))
> >  		return;
> >  
> > +	/*
> > +	 * Affinity changes invalidate any pending dispatch decisions made
> > +	 * with the old affinity. Increment the runqueue's ops_qseq and
> > +	 * update the task's qseq to invalidate in-flight dispatches.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_QUEUED) {
> > +		unsigned long opss;
> > +
> > +		rq->scx.ops_qseq++;
> > +		opss = atomic_long_read(&p->scx.ops_state);
> > +		atomic_long_set(&p->scx.ops_state,
> > +				(opss & SCX_OPSS_STATE_MASK) |
> > +				(rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> > +	}
> 
> Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always
> called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the
> task is on a runqueue, set_cpus_allowed_scx() will always be preceded by
> dequeue_task_scx(), which clears SCX_TASK_QUEUED.

I think you're right, we can probably remove this.

> 
> > +
> >  	/*
> >  	 * The effective cpumask is stored in @p->cpus_ptr which may temporarily
> >  	 * differ from the configured one in @p->cpus_mask. Always tell the bpf
> > @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
> >  
> >  	/* execute move */
> >  	locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
> > -	dispatched = true;
> > +	if (locked_rq) {
> > +		dispatched = true;
> > +	} else {
> > +		/* Move failed: task stays in src_dsq */
> > +		raw_spin_unlock(&src_dsq->lock);
> > +		locked_rq = in_balance ? this_rq : NULL;
> > +	}
> >  out:
> >  	if (in_balance) {
> >  		if (this_rq != locked_rq) {
> > -			raw_spin_rq_unlock(locked_rq);
> > +			if (locked_rq)
> > +				raw_spin_rq_unlock(locked_rq);
> >  			raw_spin_rq_lock(this_rq);
> >  		}
> > -	} else {
> > +	} else if (locked_rq) {
> >  		raw_spin_rq_unlock_irqrestore(locked_rq, flags);
> >  	}
> >  
> 
> Thanks,
> Kuba
> 

Thanks,
-Andrea

  reply	other threads:[~2026-02-04 15:37 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi
2026-02-04 13:20 ` Kuba Piecuch
2026-02-04 15:36   ` Andrea Righi [this message]
2026-02-04 16:58     ` Kuba Piecuch
2026-02-04 17:56       ` Andrea Righi
2026-02-05 17:20         ` Kuba Piecuch
2026-02-05 17:37           ` Andrea Righi
2026-02-04 15:07 ` Christian Loehle
2026-02-04 23:31 ` Tejun Heo
2026-02-05  1:15   ` Tejun Heo
2026-02-05 16:40   ` Andrea Righi
2026-02-05 22:57     ` Tejun Heo
2026-02-06  8:43       ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYNnmI26oS7YNuMP@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox