public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes
@ 2026-03-19  8:35 Andrea Righi
  2026-03-19 10:31 ` Kuba Piecuch
  2026-03-19 15:18 ` Kuba Piecuch
  0 siblings, 2 replies; 8+ messages in thread
From: Andrea Righi @ 2026-03-19  8:35 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
target CPU. However, task affinity can change between the dispatch
decision and its finalization in finish_dispatch(). When this happens,
the scheduler may attempt to dispatch a task to a CPU that is no longer
allowed, resulting in fatal errors such as:

 EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])

This race exists because ops.dispatch() runs without holding the task's
run queue lock, allowing a concurrent set_cpus_allowed() to update
p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
then finalized using stale affinity information.

Example timeline:

  CPU0                                      CPU1
  ----                                      ----
                                            task_rq_lock(p)
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                            set_cpus_allowed_scx(p, new_mask)
                                            task_rq_unlock(p)
      scx_bpf_dsq_insert(p,
              SCX_DSQ_LOCAL_ON | cpu, 0)

With commit ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics"), BPF
schedulers can avoid the affinity race by tracking task state and
handling %SCX_DEQ_SCHED_CHANGE in ops.dequeue(): when a task is dequeued
due to a property change, the scheduler can update the task state and
skip the direct dispatch from ops.dispatch() for non-queued tasks.

However, schedulers that do not implement task state tracking and
dispatch directly to a local DSQ from ops.dispatch() may trigger the
scx_error() condition when the kernel validates the destination in
dispatch_to_local_dsq().

Improve this by shooting down in-flight dispatches from the dequeue path
in the sched_ext core, instead of using the global DSQ as a fallback.
When a QUEUED task is dequeued, increment the runqueue's ops_qseq before
transitioning the task's ops_state to NONE. A finish_dispatch() that
runs after the transition sees NONE and drops the dispatch; one that
runs later, after the task has been re-enqueued (with the new qseq),
sees a qseq mismatch and also drops. Either way the stale dispatch is
discarded and the task is already, or will be, handled by the scheduler
again.

Since this change removes the global DSQ fallback, also drop
%SCX_ENQ_GDSQ_FALLBACK, which is now unused.

This allows reducing boilerplate in BPF schedulers for task state
tracking and simplifies their implementation.

Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
Changes in v2:
 - Rework the patch based on the new ops.dequeue() semantics
 - Drop SCX_ENQ_GDSQ_FALLBACK
 - Link to v1: https://lore.kernel.org/all/20260203230639.1259869-1-arighi@nvidia.com

 kernel/sched/ext.c          | 55 +++++++++++++++++++++++++++++--------
 kernel/sched/ext_internal.h |  1 -
 2 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 94548ee9ad858..8c199c548b27e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1382,10 +1382,8 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
 	 * e.g. SAVE/RESTORE cycles and slice extensions.
 	 */
 	if (enq_flags & SCX_ENQ_IMMED) {
-		if (unlikely(dsq->id != SCX_DSQ_LOCAL)) {
-			WARN_ON_ONCE(!(enq_flags & SCX_ENQ_GDSQ_FALLBACK));
+		if (unlikely(dsq->id != SCX_DSQ_LOCAL))
 			return;
-		}
 		p->scx.flags |= SCX_TASK_IMMED;
 	}
 
@@ -2043,6 +2041,13 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
+		/*
+		 * Invalidate any in-flight dispatches for this task. The
+		 * task is leaving the runqueue, so any dispatch decision
+		 * made while it was queued is stale.
+		 */
+		rq->scx.ops_qseq++;
+
 		/* A queued task must always be in BPF scheduler's custody */
 		WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
@@ -2390,8 +2395,10 @@ static bool consume_remote_task(struct rq *this_rq,
  * will change. As @p's task_rq is locked, this function doesn't need to use the
  * holding_cpu mechanism.
  *
- * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
- * return value, is locked.
+ * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
+ * return value, is locked. On failure (affinity change invalidated the
+ * move), returns NULL with @src_dsq still locked and task remaining in
+ * @src_dsq.
  */
 static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 					 struct task_struct *p, u64 enq_flags,
@@ -2408,9 +2415,11 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
 		if (src_rq != dst_rq &&
 		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-			dst_dsq = find_global_dsq(sch, task_cpu(p));
-			dst_rq = src_rq;
-			enq_flags |= SCX_ENQ_GDSQ_FALLBACK;
+			/*
+			 * Affinity changed after dispatch: abort the move,
+			 * task stays on src_dsq.
+			 */
+			return NULL;
 		}
 	} else {
 		/* no need to migrate if destination is a non-local DSQ */
@@ -2537,9 +2546,26 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	}
 
 	if (src_rq != dst_rq &&
-	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
-				 enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
+	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
+		/*
+		 * Affinity changed after dispatch decision and the task
+		 * can't run anymore on the destination rq.
+		 *
+		 * Drop the dispatch, the task will be re-enqueued. Set the
+		 * task back to QUEUED so dequeue (if waiting) can proceed
+		 * using current qseq from the task's rq.
+		 */
+		if (src_rq != rq) {
+			raw_spin_rq_unlock(rq);
+			raw_spin_rq_lock(src_rq);
+		}
+		atomic_long_set_release(&p->scx.ops_state,
+			       SCX_OPSS_QUEUED |
+			       (src_rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
+		if (src_rq != rq) {
+			raw_spin_rq_unlock(src_rq);
+			raw_spin_rq_lock(rq);
+		}
 		return;
 	}
 
@@ -8112,7 +8138,12 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 
 	/* execute move */
 	locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
-	dispatched = true;
+	if (locked_rq) {
+		dispatched = true;
+	} else {
+		raw_spin_unlock(&src_dsq->lock);
+		locked_rq = src_rq;
+	}
 out:
 	if (in_balance) {
 		if (this_rq != locked_rq) {
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index b4f36d8b9c1dd..49cef302b26bd 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1145,7 +1145,6 @@ enum scx_enq_flags {
 	SCX_ENQ_CLEAR_OPSS	= 1LLU << 56,
 	SCX_ENQ_DSQ_PRIQ	= 1LLU << 57,
 	SCX_ENQ_NESTED		= 1LLU << 58,
-	SCX_ENQ_GDSQ_FALLBACK	= 1LLU << 59,	/* fell back to global DSQ */
 };
 
 enum scx_deq_flags {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-23 23:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-19  8:35 [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi
2026-03-19 10:31 ` Kuba Piecuch
2026-03-19 13:54   ` Kuba Piecuch
2026-03-19 21:09   ` Andrea Righi
2026-03-20  9:18     ` Kuba Piecuch
2026-03-23 23:13       ` Tejun Heo
2026-03-19 15:18 ` Kuba Piecuch
2026-03-19 19:01   ` Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox