* [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
@ 2026-02-03 23:06 Andrea Righi
2026-02-04 13:20 ` Kuba Piecuch
0 siblings, 3 replies; 13+ messages in thread
From: Andrea Righi @ 2026-02-03 23:06 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext,
linux-kernel
A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
target CPU. However, task affinity can change between the dispatch
decision and its finalization in finish_dispatch(). When this happens,
the scheduler may attempt to dispatch a task to a CPU that is no longer
allowed, resulting in fatal errors such as:
EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
This race exists because ops.dispatch() runs without holding the task's
run queue lock, allowing a concurrent set_cpus_allowed() to update
p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
then finalized using stale affinity information.
Example timeline:
  CPU0                                       CPU1
  ----                                       ----
                                             task_rq_lock(p)
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                             set_cpus_allowed_scx(p, new_mask)
                                             task_rq_unlock(p)
  scx_bpf_dsq_insert(p,
                     SCX_DSQ_LOCAL_ON | cpu, 0)
Fix this by extending the existing qseq invalidation mechanism so that, in
addition to task dequeues/re-enqueues, it also covers CPU affinity changes
that occur between the dispatch decision and its finalization.
When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
re-dispatched using up-to-date affinity information.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
1 file changed, 47 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0ab994180f655..6128a2529a7c7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
* will change. As @p's task_rq is locked, this function doesn't need to use the
* holding_cpu mechanism.
*
- * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
- * return value, is locked.
+ * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
+ * return value, is locked. On failure (affinity change invalidated the move),
+ * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
*/
static struct rq *move_task_between_dsqs(struct scx_sched *sch,
struct task_struct *p, u64 enq_flags,
@@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
if (dst_dsq->id == SCX_DSQ_LOCAL) {
dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
if (src_rq != dst_rq &&
- unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dst_dsq = find_global_dsq(sch, p);
- dst_rq = src_rq;
+ unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
+ /*
+ * Task affinity changed after dispatch decision:
+ * drop the dispatch, caller will handle returning
+ * the task to its original DSQ.
+ */
+ return NULL;
}
} else {
/* no need to migrate if destination is a non-local DSQ */
@@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
}
if (src_rq != dst_rq &&
- unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dispatch_enqueue(sch, find_global_dsq(sch, p), p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
+ unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
+ /*
+ * Task affinity changed after dispatch decision: drop the
+ * dispatch, task remains in its current state and will be
+ * dispatched again in a future cycle.
+ */
+ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
+ (atomic_long_read(&p->scx.ops_state) &
+ SCX_OPSS_QSEQ_MASK));
return;
}
@@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
struct affinity_context *ac)
{
struct scx_sched *sch = scx_root;
+ struct rq *rq = task_rq(p);
+
+ lockdep_assert_rq_held(rq);
set_cpus_allowed_common(p, ac);
if (unlikely(!sch))
return;
+ /*
+ * Affinity changes invalidate any pending dispatch decisions made
+ * with the old affinity. Increment the runqueue's ops_qseq and
+ * update the task's qseq to invalidate in-flight dispatches.
+ */
+ if (p->scx.flags & SCX_TASK_QUEUED) {
+ unsigned long opss;
+
+ rq->scx.ops_qseq++;
+ opss = atomic_long_read(&p->scx.ops_state);
+ atomic_long_set(&p->scx.ops_state,
+ (opss & SCX_OPSS_STATE_MASK) |
+ (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
+ }
+
/*
* The effective cpumask is stored in @p->cpus_ptr which may temporarily
* differ from the configured one in @p->cpus_mask. Always tell the bpf
@@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
/* execute move */
locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
- dispatched = true;
+ if (locked_rq) {
+ dispatched = true;
+ } else {
+ /* Move failed: task stays in src_dsq */
+ raw_spin_unlock(&src_dsq->lock);
+ locked_rq = in_balance ? this_rq : NULL;
+ }
out:
if (in_balance) {
if (this_rq != locked_rq) {
- raw_spin_rq_unlock(locked_rq);
+ if (locked_rq)
+ raw_spin_rq_unlock(locked_rq);
raw_spin_rq_lock(this_rq);
}
- } else {
+ } else if (locked_rq) {
raw_spin_rq_unlock_irqrestore(locked_rq, flags);
}
--
2.52.0
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi @ 2026-02-04 13:20 ` Kuba Piecuch 2026-02-04 15:36 ` Andrea Righi 2026-02-04 15:07 ` Christian Loehle 2026-02-04 23:31 ` Tejun Heo 2 siblings, 1 reply; 13+ messages in thread From: Kuba Piecuch @ 2026-02-04 13:20 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min Cc: Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hi Andrea, On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote: > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a > target CPU. However, task affinity can change between the dispatch > decision and its finalization in finish_dispatch(). When this happens, > the scheduler may attempt to dispatch a task to a CPU that is no longer > allowed, resulting in fatal errors such as: > > EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565]) > > This race exists because ops.dispatch() runs without holding the task's > run queue lock, allowing a concurrent set_cpus_allowed() to update > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is > then finalized using stale affinity information. > > Example timeline: > > CPU0 CPU1 > ---- ---- > task_rq_lock(p) > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > set_cpus_allowed_scx(p, new_mask) > task_rq_unlock(p) > scx_bpf_dsq_insert(p, > SCX_DSQ_LOCAL_ON | cpu, 0) > > Fix this by extending the existing qseq invalidation mechanism to also > cover CPU affinity changes, in addition to task dequeues/re-enqueues, > occurring between dispatch decision and finalization. I'm not convinced this will prevent the exact race scenario you describe. 
For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the dispatch buffer entry. We can still have the cpumask change and qseq increment on CPU1 happen between cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be detected because the dispatch buffer entry will contain the incremented value. > > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be > re-dispatched using up-to-date affinity information. How will the scheduler know that the dispatch was dropped? Is the scheduler expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx() on CPU1? > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++--------- > 1 file changed, 47 insertions(+), 11 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 0ab994180f655..6128a2529a7c7 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, > * will change. As @p's task_rq is locked, this function doesn't need to use the > * holding_cpu mechanism. > * > - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the > - * return value, is locked. > + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the > + * return value, is locked. On failure (affinity change invalidated the move), > + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq. 
> */ > static struct rq *move_task_between_dsqs(struct scx_sched *sch, > struct task_struct *p, u64 enq_flags, > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, > if (dst_dsq->id == SCX_DSQ_LOCAL) { > dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); > if (src_rq != dst_rq && > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > - dst_dsq = find_global_dsq(sch, p); > - dst_rq = src_rq; > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > + /* > + * Task affinity changed after dispatch decision: > + * drop the dispatch, caller will handle returning > + * the task to its original DSQ. > + */ > + return NULL; > } > } else { > /* no need to migrate if destination is a non-local DSQ */ > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > } > > if (src_rq != dst_rq && > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > - dispatch_enqueue(sch, find_global_dsq(sch, p), p, > - enq_flags | SCX_ENQ_CLEAR_OPSS); > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > + /* > + * Task affinity changed after dispatch decision: drop the > + * dispatch, task remains in its current state and will be > + * dispatched again in a future cycle. > + */ > + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | > + (atomic_long_read(&p->scx.ops_state) & > + SCX_OPSS_QSEQ_MASK)); > return; > } > > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p, > struct affinity_context *ac) > { > struct scx_sched *sch = scx_root; > + struct rq *rq = task_rq(p); > + > + lockdep_assert_rq_held(rq); > > set_cpus_allowed_common(p, ac); > > if (unlikely(!sch)) > return; > > + /* > + * Affinity changes invalidate any pending dispatch decisions made > + * with the old affinity. Increment the runqueue's ops_qseq and > + * update the task's qseq to invalidate in-flight dispatches. 
> + */ > + if (p->scx.flags & SCX_TASK_QUEUED) { > + unsigned long opss; > + > + rq->scx.ops_qseq++; > + opss = atomic_long_read(&p->scx.ops_state); > + atomic_long_set(&p->scx.ops_state, > + (opss & SCX_OPSS_STATE_MASK) | > + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT)); > + } Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the task is on a runqueue, set_cpus_allowed_scx() will always be preceded by dequeue_task_scx(), which clears SCX_TASK_QUEUED. > + > /* > * The effective cpumask is stored in @p->cpus_ptr which may temporarily > * differ from the configured one in @p->cpus_mask. Always tell the bpf > @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, > > /* execute move */ > locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq); > - dispatched = true; > + if (locked_rq) { > + dispatched = true; > + } else { > + /* Move failed: task stays in src_dsq */ > + raw_spin_unlock(&src_dsq->lock); > + locked_rq = in_balance ? this_rq : NULL; > + } > out: > if (in_balance) { > if (this_rq != locked_rq) { > - raw_spin_rq_unlock(locked_rq); > + if (locked_rq) > + raw_spin_rq_unlock(locked_rq); > raw_spin_rq_lock(this_rq); > } > - } else { > + } else if (locked_rq) { > raw_spin_rq_unlock_irqrestore(locked_rq, flags); > } > Thanks, Kuba ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-04 13:20 ` Kuba Piecuch @ 2026-02-04 15:36 ` Andrea Righi 2026-02-04 16:58 ` Kuba Piecuch 0 siblings, 1 reply; 13+ messages in thread From: Andrea Righi @ 2026-02-04 15:36 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hi Kuba, On Wed, Feb 04, 2026 at 01:20:52PM +0000, Kuba Piecuch wrote: > Hi Andrea, > > On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote: > > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a > > target CPU. However, task affinity can change between the dispatch > > decision and its finalization in finish_dispatch(). When this happens, > > the scheduler may attempt to dispatch a task to a CPU that is no longer > > allowed, resulting in fatal errors such as: > > > > EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565]) > > > > This race exists because ops.dispatch() runs without holding the task's > > run queue lock, allowing a concurrent set_cpus_allowed() to update > > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is > > then finalized using stale affinity information. > > > > Example timeline: > > > > CPU0 CPU1 > > ---- ---- > > task_rq_lock(p) > > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > > set_cpus_allowed_scx(p, new_mask) > > task_rq_unlock(p) > > scx_bpf_dsq_insert(p, > > SCX_DSQ_LOCAL_ON | cpu, 0) > > > > Fix this by extending the existing qseq invalidation mechanism to also > > cover CPU affinity changes, in addition to task dequeues/re-enqueues, > > occurring between dispatch decision and finalization. > > I'm not convinced this will prevent the exact race scenario you describe. > For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs > to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the > dispatch buffer entry. 
> > We can still have the cpumask change and qseq increment on CPU1 happen between > cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be > detected because the dispatch buffer entry will contain the incremented value. Yeah, I think you're right, it makes things better, but it's still racy. > > > > > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped > > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be > > re-dispatched using up-to-date affinity information. > > How will the scheduler know that the dispatch was dropped? Is the scheduler > expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx() > on CPU1? The idea was that, if the dispatch is dropped, we'll see another ops.enqueue() for the task, so at least the task is not "lost" and the BPF scheduler gets another chance what to do with it. In this case it'd be useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that the enqueue resulted from a dropped dispatch. > > > > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > --- > > kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++--------- > > 1 file changed, 47 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index 0ab994180f655..6128a2529a7c7 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p, > > * will change. As @p's task_rq is locked, this function doesn't need to use the > > * holding_cpu mechanism. > > * > > - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the > > - * return value, is locked. > > + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the > > + * return value, is locked. On failure (affinity change invalidated the move), > > + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq. 
> > */ > > static struct rq *move_task_between_dsqs(struct scx_sched *sch, > > struct task_struct *p, u64 enq_flags, > > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, > > if (dst_dsq->id == SCX_DSQ_LOCAL) { > > dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); > > if (src_rq != dst_rq && > > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > > - dst_dsq = find_global_dsq(sch, p); > > - dst_rq = src_rq; > > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > > + /* > > + * Task affinity changed after dispatch decision: > > + * drop the dispatch, caller will handle returning > > + * the task to its original DSQ. > > + */ > > + return NULL; > > } > > } else { > > /* no need to migrate if destination is a non-local DSQ */ > > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > > } > > > > if (src_rq != dst_rq && > > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > > - dispatch_enqueue(sch, find_global_dsq(sch, p), p, > > - enq_flags | SCX_ENQ_CLEAR_OPSS); > > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > > + /* > > + * Task affinity changed after dispatch decision: drop the > > + * dispatch, task remains in its current state and will be > > + * dispatched again in a future cycle. > > + */ > > + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | > > + (atomic_long_read(&p->scx.ops_state) & > > + SCX_OPSS_QSEQ_MASK)); > > return; > > } > > > > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p, > > struct affinity_context *ac) > > { > > struct scx_sched *sch = scx_root; > > + struct rq *rq = task_rq(p); > > + > > + lockdep_assert_rq_held(rq); > > > > set_cpus_allowed_common(p, ac); > > > > if (unlikely(!sch)) > > return; > > > > + /* > > + * Affinity changes invalidate any pending dispatch decisions made > > + * with the old affinity. 
Increment the runqueue's ops_qseq and > > + * update the task's qseq to invalidate in-flight dispatches. > > + */ > > + if (p->scx.flags & SCX_TASK_QUEUED) { > > + unsigned long opss; > > + > > + rq->scx.ops_qseq++; > > + opss = atomic_long_read(&p->scx.ops_state); > > + atomic_long_set(&p->scx.ops_state, > > + (opss & SCX_OPSS_STATE_MASK) | > > + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT)); > > + } > > Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always > called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the > task is on a runqueue, set_cpus_allowed_scx() will always be preceded by > dequeue_task_scx(), which clears SCX_TASK_QUEUED. I think you're right, we can probably remove this. > > > + > > /* > > * The effective cpumask is stored in @p->cpus_ptr which may temporarily > > * differ from the configured one in @p->cpus_mask. Always tell the bpf > > @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit, > > > > /* execute move */ > > locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq); > > - dispatched = true; > > + if (locked_rq) { > > + dispatched = true; > > + } else { > > + /* Move failed: task stays in src_dsq */ > > + raw_spin_unlock(&src_dsq->lock); > > + locked_rq = in_balance ? this_rq : NULL; > > + } > > out: > > if (in_balance) { > > if (this_rq != locked_rq) { > > - raw_spin_rq_unlock(locked_rq); > > + if (locked_rq) > > + raw_spin_rq_unlock(locked_rq); > > raw_spin_rq_lock(this_rq); > > } > > - } else { > > + } else if (locked_rq) { > > raw_spin_rq_unlock_irqrestore(locked_rq, flags); > > } > > > > Thanks, > Kuba > Thanks, -Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
  2026-02-04 15:36   ` Andrea Righi
@ 2026-02-04 16:58     ` Kuba Piecuch
  2026-02-04 17:56       ` Andrea Righi
  0 siblings, 1 reply; 13+ messages in thread
From: Kuba Piecuch @ 2026-02-04 16:58 UTC (permalink / raw)
To: Andrea Righi, Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
    Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
>> >
>> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
>> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
>> > re-dispatched using up-to-date affinity information.
>>
>> How will the scheduler know that the dispatch was dropped? Is the scheduler
>> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
>> on CPU1?
>
> The idea was that, if the dispatch is dropped, we'll see another
> ops.enqueue() for the task, so at least the task is not "lost" and the
> BPF scheduler gets another chance what to do with it. In this case it'd be
> useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
> the enqueue resulted from a dropped dispatch.

I think SCX_ENQ_REENQ is enough for now, we can always add a dedicated flag
if a need for it arises.

I still worry about the scenario you described. In particular, I think it can
lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.

  CPU0                                       CPU1
  ----                                       ----
  if (cpumask_test_cpu(cpu, p->cpus_ptr))
                                             task_rq_lock(p)
                                             dequeue_task_scx(p, ...)
                                               (remove p from internal queues)
                                             set_cpus_allowed_scx(p, new_mask)
                                             enqueue_task_scx(p, ...)
                                               (add p to internal queues)
                                             task_rq_unlock(p)
  (remove p from internal queues)
  scx_bpf_dsq_insert(p,
                     SCX_DSQ_LOCAL_ON | cpu, 0)

In this scenario, the ops.enqueue() which is supposed to notify the BPF
scheduler about the failed dispatch actually happens _before_ the actual
dispatch, so once the dispatch fails, the task won't be re-enqueued.

There are two problems here:

1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
2. Even if it is detected and the dispatch aborted, the task won't be
   re-enqueued.

The way we deal with the first problem in ghOSt (Google's equivalent of
sched_ext) is we expose the per-task sequence number to the BPF scheduler.
On the dispatch path, when the BPF scheduler has a candidate task,
it retrieves its seqnum, re-checks the task state to ensure that it's still
eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
for verification. If the kernel detects that the seqnum has changed already,
it synchronously fails the dispatch attempt (dispatch always happens
synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
we also need to do the same check later in finish_dispatch(), comparing
the current qseq against the qseq passed by the BPF scheduler.

To fix the second problem, we would need to explicitly call ops.enqueue()
from finish_dispatch() and the other places where we abort dispatch if the
qseq is out of date.

Either that, or just add locking to the BPF scheduler to prevent the race from
happening in the first place.

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-04 16:58 ` Kuba Piecuch @ 2026-02-04 17:56 ` Andrea Righi 2026-02-05 17:20 ` Kuba Piecuch 0 siblings, 1 reply; 13+ messages in thread From: Andrea Righi @ 2026-02-04 17:56 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Wed, Feb 04, 2026 at 04:58:47PM +0000, Kuba Piecuch wrote: > On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote: > >> > > >> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped > >> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be > >> > re-dispatched using up-to-date affinity information. > >> > >> How will the scheduler know that the dispatch was dropped? Is the scheduler > >> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx() > >> on CPU1? > > > > The idea was that, if the dispatch is dropped, we'll see another > > ops.enqueue() for the task, so at least the task is not "lost" and the > > BPF scheduler gets another chance what to do with it. In this case it'd be > > useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that > > the enqueue resulted from a dropped dispatch. > > I think SCX_ENQ_REENQ is enough for now, we can always add a dedicated flag > if a need for it arises. > > I still worry about the scenario you described. In particular, I think it can > lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch. > > CPU0 CPU1 > ---- ---- > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > task_rq_lock(p) > dequeue_task_scx(p, ...) > (remove p from internal queues) > set_cpus_allowed_scx(p, new_mask) > enqueue_task_scx(p, ...) 
> (add p to internal queues) > task_rq_unlock(p) > (remove p from internal queues) > scx_bpf_dsq_insert(p, > SCX_DSQ_LOCAL_ON | cpu, 0) > > In this scenario, the ops.enqueue() which is supposed to notify the BPF > scheduler about the failed dispatch actually happens _before_ the actual > dispatch, so once the dispatch fails, the task won't be re-enqueued. > > There are two problems here: > > 1. CPU0 makes a scheduling decision based on stale data and it isn't detected. > 2. Even if it is detected and the dispatch aborted, the task won't be > re-enqueued. Right. At this point I think we can just rely on the affinity validation via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable and just drop invalid dispatches. And to prevent dropped tasks, I was wondering if we could just insert the task into a per-rq fallback DSQ, that can be consumed from balance_scx() to re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the re-enqueue problem avoiding the locking complexity of calling ops.enqueue() directly from finish_dispatch(). Thoughts? > > The way we deal with the first problem in ghOSt (Google's equivalent of > sched_ext) is we expose the per-task sequence number to the BPF scheduler. > On the dispatch path, when the BPF scheduler has a candidate task, > it retrieves its seqnum, re-checks the task state to ensure that it's still > eligible for dispatch, and passes the seqnum to the kernel's dispatch helper > for verification. If the kernel detects that the seqnum has changed already, > it synchronously fails the dispatch attempt (dispatch always happens > synchronously in ghOSt). In sched_ext, we could do the synchronous check, but > we also need to do the same check later in finish_dispatch(), comparing > the current qseq against the qseq passed by the BPF scheduler. > > To fix the second problem, we would need to explicitly call ops.enqueue() > from finish_dispatch() and the other places where we abort dispatch if the > qseq is out of date. 
> > Either that, or just add locking to the BPF scheduler to prevent the race from > happening in the first place. Thanks, -Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-04 17:56 ` Andrea Righi @ 2026-02-05 17:20 ` Kuba Piecuch 2026-02-05 17:37 ` Andrea Righi 0 siblings, 1 reply; 13+ messages in thread From: Kuba Piecuch @ 2026-02-05 17:20 UTC (permalink / raw) To: Andrea Righi, Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Wed Feb 4, 2026 at 5:56 PM UTC, Andrea Righi wrote: > Right. At this point I think we can just rely on the affinity validation > via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable and > just drop invalid dispatches. > > And to prevent dropped tasks, I was wondering if we could just insert the > task into a per-rq fallback DSQ, that can be consumed from balance_scx() to > re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the > re-enqueue problem avoiding the locking complexity of calling ops.enqueue() > directly from finish_dispatch(). > > Thoughts? How would these fallback DSQs work? 1. Would inserting the task into the fallback DSQ trigger ops.dequeue(), so that we can later balance it with the re-enqueue? 2. Which rq's fallback DSQ will the task be inserted into? The one belonging to the CPU doing the dispatch? 3. Is the re-enqueue going to happen inside the same call to balance_one() that tried to dispatch the task? I'm not opposed to the idea, I'm curious to see how it works in practice. Thanks, Kuba ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-05 17:20 ` Kuba Piecuch @ 2026-02-05 17:37 ` Andrea Righi 0 siblings, 0 replies; 13+ messages in thread From: Andrea Righi @ 2026-02-05 17:37 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Thu, Feb 05, 2026 at 05:20:11PM +0000, Kuba Piecuch wrote: > On Wed Feb 4, 2026 at 5:56 PM UTC, Andrea Righi wrote: > > Right. At this point I think we can just rely on the affinity validation > > via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable and > > just drop invalid dispatches. > > > > And to prevent dropped tasks, I was wondering if we could just insert the > > task into a per-rq fallback DSQ, that can be consumed from balance_scx() to > > re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the > > re-enqueue problem avoiding the locking complexity of calling ops.enqueue() > > directly from finish_dispatch(). > > > > Thoughts? > > How would these fallback DSQs work? > > 1. Would inserting the task into the fallback DSQ trigger ops.dequeue(), so > that we can later balance it with the re-enqueue? Yeah, but ... see below. > > 2. Which rq's fallback DSQ will the task be inserted into? The one belonging to > the CPU doing the dispatch? I was thinking the task's rq. > > 3. Is the re-enqueue going to happen inside the same call to balance_one() that > tried to dispatch the task? balance_one(). > > I'm not opposed to the idea, I'm curious to see how it works in practice. Thinking more about this, it's a bit problematic, when set_cpus_allowed_scx() triggers dequeue+enqueue we get another enqueue without the SCX_ENQ_REENQ flag and it's a bit tricky to manage that with the fallback DSQ. So, I'm back to the drawing board, trying to explore the qseq approach... -Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi 2026-02-04 13:20 ` Kuba Piecuch @ 2026-02-04 15:07 ` Christian Loehle 2026-02-04 23:31 ` Tejun Heo 2 siblings, 0 replies; 13+ messages in thread From: Christian Loehle @ 2026-02-04 15:07 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On 2/3/26 23:06, Andrea Righi wrote: > A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a > target CPU. However, task affinity can change between the dispatch > decision and its finalization in finish_dispatch(). When this happens, > the scheduler may attempt to dispatch a task to a CPU that is no longer > allowed, resulting in fatal errors such as: > > EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565]) > > This race exists because ops.dispatch() runs without holding the task's > run queue lock, allowing a concurrent set_cpus_allowed() to update > p->cpus_ptr while the BPF scheduler is still using it. The dispatch is > then finalized using stale affinity information. > > Example timeline: > > CPU0 CPU1 > ---- ---- > task_rq_lock(p) > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > set_cpus_allowed_scx(p, new_mask) > task_rq_unlock(p) > scx_bpf_dsq_insert(p, > SCX_DSQ_LOCAL_ON | cpu, 0) > > Fix this by extending the existing qseq invalidation mechanism to also > cover CPU affinity changes, in addition to task dequeues/re-enqueues, > occurring between dispatch decision and finalization. > > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be > re-dispatched using up-to-date affinity information. 
> Hi Andrea, so this fixes the default scx_storm insert / dequeue race indeed, let me go review in-depth and create some more tests... ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi 2026-02-04 13:20 ` Kuba Piecuch 2026-02-04 15:07 ` Christian Loehle @ 2026-02-04 23:31 ` Tejun Heo 2026-02-05 1:15 ` Tejun Heo 2026-02-05 16:40 ` Andrea Righi 2 siblings, 2 replies; 13+ messages in thread From: Tejun Heo @ 2026-02-04 23:31 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, On Wed, Feb 04, 2026 at 12:06:39AM +0100, Andrea Righi wrote: ... > CPU0 CPU1 > ---- ---- > task_rq_lock(p) > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > set_cpus_allowed_scx(p, new_mask) > task_rq_unlock(p) > scx_bpf_dsq_insert(p, > SCX_DSQ_LOCAL_ON | cpu, 0) > > Fix this by extending the existing qseq invalidation mechanism to also > cover CPU affinity changes, in addition to task dequeues/re-enqueues, > occurring between dispatch decision and finalization. > > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be > re-dispatched using up-to-date affinity information. It shouldn't be returned, right? set_cpus_allowed() dequeues and re-enqueues. What the seq invalidation detected is dequeue racing the async dispatch and the invalidation means that the task was dequeued while on the async buffer (to be re-enqueued once the property change is complete). It should just be ignored. 
> static struct rq *move_task_between_dsqs(struct scx_sched *sch, > struct task_struct *p, u64 enq_flags, > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, > if (dst_dsq->id == SCX_DSQ_LOCAL) { > dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); > if (src_rq != dst_rq && > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > - dst_dsq = find_global_dsq(sch, p); > - dst_rq = src_rq; > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > + /* > + * Task affinity changed after dispatch decision: > + * drop the dispatch, caller will handle returning > + * the task to its original DSQ. > + */ > + return NULL; ... > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > } > > if (src_rq != dst_rq && > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > - dispatch_enqueue(sch, find_global_dsq(sch, p), p, > - enq_flags | SCX_ENQ_CLEAR_OPSS); > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > + /* > + * Task affinity changed after dispatch decision: drop the > + * dispatch, task remains in its current state and will be > + * dispatched again in a future cycle. > + */ > + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | > + (atomic_long_read(&p->scx.ops_state) & > + SCX_OPSS_QSEQ_MASK)); I don't quite follow why we need the above slippery behavior. The qseq invalidation, if reliable, means that there's no race window if the BPF scheduler correctly implements ops.dequeue() (after the kernel side fixes it of course). ie. The BPF scheduler is responsible for synchronizing its scx_bpf_dsq_insert() call against whatever it needs to do in ops.dequeue(). If ops.dequeue() wins, the task shouldn't be inserted in the first place.
If finish_dispatch() wins, the task will just be dequeued from the inserted DSQ or the property change will happen while the task is running. Now, maybe we want to allow the BPF scheduler to be lax about ops.dequeue() synchronization and let things slide (probably optionally w/ an OPS flag), but for that, falling back to global DSQ is fine, no? > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p, > struct affinity_context *ac) > { > struct scx_sched *sch = scx_root; > + struct rq *rq = task_rq(p); > + > + lockdep_assert_rq_held(rq); > > set_cpus_allowed_common(p, ac); > > if (unlikely(!sch)) > return; > > + /* > + * Affinity changes invalidate any pending dispatch decisions made > + * with the old affinity. Increment the runqueue's ops_qseq and > + * update the task's qseq to invalidate in-flight dispatches. > + */ > + if (p->scx.flags & SCX_TASK_QUEUED) { > + unsigned long opss; > + > + rq->scx.ops_qseq++; > + opss = atomic_long_read(&p->scx.ops_state); > + atomic_long_set(&p->scx.ops_state, > + (opss & SCX_OPSS_STATE_MASK) | > + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT)); > + } > + I wonder whether we should define an invalid qseq and use that instead. The queueing instance really is invalid after this and it would help catch cases where the BPF scheduler makes mistakes w/ synchronization. Also, wouldn't dequeue_task_scx() or ops_dequeue() be a better place to shoot down the enqueued instances? While the symptoms we most immediately see are through cpumask changes, the underlying problem is dequeue not shooting down existing enqueued tasks. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-04 23:31 ` Tejun Heo @ 2026-02-05 1:15 ` Tejun Heo 2026-02-05 16:40 ` Andrea Righi 1 sibling, 0 replies; 13+ messages in thread From: Tejun Heo @ 2026-02-05 1:15 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, On Wed, Feb 04, 2026 at 01:31:35PM -1000, Tejun Heo wrote: ... > I wonder whether we should define an invalid qseq and use that instead. The > queueing instance really is invalid after this and it would help catching > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the > enqueued instances? While the symptom we most immediately see are through > cpumask changes, the underlying problem is dequeue not shooting down > existing enqueued tasks. Hmmm... in fact, this is all happening already, right? Isn't all that's missing the BPF scheduler's ops.dequeue() synchronizing against scx_bpf_dsq_insert()? Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-04 23:31 ` Tejun Heo 2026-02-05 1:15 ` Tejun Heo @ 2026-02-05 16:40 ` Andrea Righi 2026-02-05 22:57 ` Tejun Heo 1 sibling, 1 reply; 13+ messages in thread From: Andrea Righi @ 2026-02-05 16:40 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Wed, Feb 04, 2026 at 01:31:35PM -1000, Tejun Heo wrote: > Hello, > > On Wed, Feb 04, 2026 at 12:06:39AM +0100, Andrea Righi wrote: > ... > > CPU0 CPU1 > > ---- ---- > > task_rq_lock(p) > > if (cpumask_test_cpu(cpu, p->cpus_ptr)) > > set_cpus_allowed_scx(p, new_mask) > > task_rq_unlock(p) > > scx_bpf_dsq_insert(p, > > SCX_DSQ_LOCAL_ON | cpu, 0) > > > > Fix this by extending the existing qseq invalidation mechanism to also > > cover CPU affinity changes, in addition to task dequeues/re-enqueues, > > occurring between dispatch decision and finalization. > > > > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped > > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be > > re-dispatched using up-to-date affinity information. > > It shouldn't be returned, right? set_cpus_allowed() dequeues and > re-enqueues. What the seq invalidation detected is dequeue racing the async > dispatch and the invalidation means that the task was dequeued while on the > async buffer (to be re-enqueued once the property change is complete). It > should just be ignored. Yeah, the only downside is that the scheduler doesn't know that the task has been re-enqueued due to a failed dispatch, but that's probably fine for now. 
> > > static struct rq *move_task_between_dsqs(struct scx_sched *sch, > > struct task_struct *p, u64 enq_flags, > > @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, > > if (dst_dsq->id == SCX_DSQ_LOCAL) { > > dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq); > > if (src_rq != dst_rq && > > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > > - dst_dsq = find_global_dsq(sch, p); > > - dst_rq = src_rq; > > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > > + /* > > + * Task affinity changed after dispatch decision: > > + * drop the dispatch, caller will handle returning > > + * the task to its original DSQ. > > + */ > > + return NULL; > ... > > @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > > } > > > > if (src_rq != dst_rq && > > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > > - dispatch_enqueue(sch, find_global_dsq(sch, p), p, > > - enq_flags | SCX_ENQ_CLEAR_OPSS); > > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) { > > + /* > > + * Task affinity changed after dispatch decision: drop the > > + * dispatch, task remains in its current state and will be > > + * dispatched again in a future cycle. > > + */ > > + atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | > > + (atomic_long_read(&p->scx.ops_state) & > > + SCX_OPSS_QSEQ_MASK)); > > I don't quite follow why we need the above slippery behavior. The qseq > invalidation, if reliable, means that there's no race window if the BPF > scheduler correctly implements ops.dequeue() (after the kernel side fixes it > of course). > > ie. The BPF scheduler is reponsible for synchronizing its > scx_bpf_dsq_insert() call against whatever it needs to do in ops.dequeue(). > If ops.dequeue() wins, the task shouldn't be inserted in the first place. 
If > ops.dequeue() loses, the qseq invalidation should kill it while on async > buffer if it wins over finish_dispatch(). If finish_dispatch() wins, the > task will just be dequeued from the inserted DSQ or the property change will > happen while the task is running. > > Now, maybe we want to allow BPF schedulre to be lax about ops.dequeue() > synchronization and let things slide (probably optionally w/ an OPS flag), > but for that, falling back to global DSQ is fine, no? I think the problem with the global DSQ fallback is that we're essentially ignoring a request from the BPF scheduler to dispatch a task to a specific CPU. Moreover, the global DSQ can potentially introduce starvation: if a task is silently dispatched to the global DSQ and the BPF scheduler keeps dispatching tasks to the local DSQs, the task waiting in the global DSQ will never be consumed. > > > @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p, > > struct affinity_context *ac) > > { > > struct scx_sched *sch = scx_root; > > + struct rq *rq = task_rq(p); > > + > > + lockdep_assert_rq_held(rq); > > > > set_cpus_allowed_common(p, ac); > > > > if (unlikely(!sch)) > > return; > > > > + /* > > + * Affinity changes invalidate any pending dispatch decisions made > > + * with the old affinity. Increment the runqueue's ops_qseq and > > + * update the task's qseq to invalidate in-flight dispatches. > > + */ > > + if (p->scx.flags & SCX_TASK_QUEUED) { > > + unsigned long opss; > > + > > + rq->scx.ops_qseq++; > > + opss = atomic_long_read(&p->scx.ops_state); > > + atomic_long_set(&p->scx.ops_state, > > + (opss & SCX_OPSS_STATE_MASK) | > > + (rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT)); > > + } > > + > > I wonder whether we should define an invalid qseq and use that instead. The > queueing instance really is invalid after this and it would help catching > cases where BPF scheduler makes mistakes w/ synchronization. 
Also, wouldn't > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the > enqueued instances? While the symptom we most immediately see are through > cpumask changes, the underlying problem is dequeue not shooting down > existing enqueued tasks. I think I like the idea of having an INVALID_QSEQ or similar, it'd also make debugging easier. I'm not sure about moving the logic to dequeue_task_scx(), more exactly, I'm not sure if there're nasty locking implications. I'll do some experiments, if it works, sure, dequeue would be a better place to cancel invalid enqueued instances. Thanks, -Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-05 16:40 ` Andrea Righi @ 2026-02-05 22:57 ` Tejun Heo 2026-02-06 8:43 ` Andrea Righi 0 siblings, 1 reply; 13+ messages in thread From: Tejun Heo @ 2026-02-05 22:57 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote: ... > > It shouldn't be returned, right? set_cpus_allowed() dequeues and > > re-enqueues. What the seq invalidation detected is dequeue racing the async > > dispatch and the invalidation means that the task was dequeued while on the > > async buffer (to be re-enqueued once the property change is complete). It > > should just be ignored. > > Yeah, the only downside is that the scheduler doesn't know that the task > has been re-enqueued due to a failed dispatch, but that's probably fine for > now. Yeah, but does that matter? Consider the following three scenarios: A. Task gets dispatched into local DSQ, CPU mask gets updated while in async buffer, the dispatch is ignored and then the task gets re-enqueued later. B. The same as A but the CPU mask update happens after the task lands in the local DSQ but before it starts executing. C. Task gets dispatched into local DSQ and starts running, CPU mask gets updated so that the task can't run on the current CPU anymore, migration task preempts the task and it gets enqueued. A and B would be indistinguishable from BPF sched's POV. C would be a bit different in that the task would transition through ops->running/stopping(). I don't see anything significantly different across the three scenarios - the task was dispatched but cpumask got updated and the scheduler needs to place it again. ...
> > Now, maybe we want to allow BPF schedulre to be lax about ops.dequeue() > > synchronization and let things slide (probably optionally w/ an OPS flag), > > but for that, falling back to global DSQ is fine, no? > > I think the problem with the global DSQ fallback is that we're essentially > ignoring a request from the BPF scheduler to dispatch a task to a specific > CPU. Moreover, the global DSQ can potentially introduce starvation: if a > task is silently dispatched to the global DSQ and the BPF scheduler keeps > dispatching tasks to the local DSQs, the task waiting in the global DSQ > will never be consumed. While starvation is possible, it's not very likely: - ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs unless they're idle. - ops.dispatch() is only called after global DSQ is drained. If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other tasks waiting, it's gonna stall whether we fall back to global DSQ or not. But, taking a step back, the sloppy fallback behavior is secondary. What really matters is once we fix ops.dequeue(), can the BPF scheduler properly synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask or migration disabled state mismatches? If so, ops.dequeue() would be the primary way to deal with these issues. Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that indicates the scheduler isn't taking property changes into account at all, but that's really secondary. Let's first focus on making ops.dequeue() working properly so that the BPF scheduler can synchronize correctly. ... > > I wonder whether we should define an invalid qseq and use that instead. The > > queueing instance really is invalid after this and it would help catching > > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't > > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the > > enqueued instances? 
While the symptom we most immediately see are through > > cpumask changes, the underlying problem is dequeue not shooting down > > existing enqueued tasks. > > I think I like the idea of having an INVALID_QSEQ or similar, it'd also > make debugging easier. > > I'm not sure about moving the logic to dequeue_task_scx(), more exactly, > I'm not sure if there're nasty locking implications. I'll do some > experiments, if it works, sure, dequeue would be a better place to cancel > invalid enqueued instances. I was confused while writing above. All of the above is already happening. When a task is dequeued, its OPSS is cleared and the task won't be eligible for dispatching anymore. The only "confused" case is where the task finishes reenqueueing before the previous dispatch attempt is finished, which the BPF scheduler should be able to handle once ops.dequeue() is fixed. Thanks. -- tejun ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes 2026-02-05 22:57 ` Tejun Heo @ 2026-02-06 8:43 ` Andrea Righi 0 siblings, 0 replies; 13+ messages in thread From: Andrea Righi @ 2026-02-06 8:43 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Christian Loehle, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Thu, Feb 05, 2026 at 12:57:04PM -1000, Tejun Heo wrote: > Hello, > > On Thu, Feb 05, 2026 at 05:40:05PM +0100, Andrea Righi wrote: > ... > > > It shouldn't be returned, right? set_cpus_allowed() dequeues and > > > re-enqueues. What the seq invalidation detected is dequeue racing the async > > > dispatch and the invalidation means that the task was dequeued while on the > > > async buffer (to be re-enqueued once the property change is complete). It > > > should just be ignored. > > > > Yeah, the only downside is that the scheduler doesn't know that the task > > has been re-enqueued due to a failed dispatch, but that's probably fine for > > now. > > Yeah, but does that matter? Consider the following three scenarios: > > A. Task gets dispatched into local DSQ, CPU mask gets updated while in async > buffer, the dispatch is ignored and then the task gets re-enqueued later. > > B. The same as A but the CPU mask update happens after the task lands in the > local DSQ but before starts executing. > > C. Task gets dispatched into local DSQ and starts running, CPU mask gets > updated so that the task can't run on the current CPU anymore, migration > task preempts the task and it gets enqueued. > > A and B woould be indistinguishible from BPF sched's POV. C would be a bit > different in that the task would transition through ops->running/stopping(). > > I don't see anything significantly different across the three scenarios - > the task was dispatched but cpumask got updated and the scheduler needs to > place it again. 
Ack, knowing that an enqueue is coming from a failed dispatch or a regular running/stopping transition probably doesn't matter. And the scheduler can probably try to infer this information, if needed. > > ... > > > Now, maybe we want to allow BPF schedulre to be lax about ops.dequeue() > > > synchronization and let things slide (probably optionally w/ an OPS flag), > > > but for that, falling back to global DSQ is fine, no? > > > > I think the problem with the global DSQ fallback is that we're essentially > > ignoring a request from the BPF scheduler to dispatch a task to a specific > > CPU. Moreover, the global DSQ can potentially introduce starvation: if a > > task is silently dispatched to the global DSQ and the BPF scheduler keeps > > dispatching tasks to the local DSQs, the task waiting in the global DSQ > > will never be consumed. > > While starvation is possible, it's not very likely: > > - ops.select_cpu/enqueue() usually don't direct dispatch to local CPUs > unless they're idle. > > - ops.dispatch() is only called after global DSQ is drained. > > If ops.select_cpu/enqueue() keeps DD'ing to local CPUs while there are other > tasks waiting, it's gonna stall whether we fall back to global DSQ or not. Well, if a scheduler is only using DD's with SCX_DSQ_LOCAL[_ON] and a per-CPU task ends up in the global DSQ the chance of starvation is not that remote. If we re-enqueue the task in the regular enqueue path, without the global DSQ fallback, the task will be dispatched again by the BPF scheduler via SCX_DSQ_LOCAL[_ON] and it's less likely to be starved. I think if we just drop the dispatch without the global DSQ fallback everything should work. > > But, taking a step back, the sloppy fallback behavior is secondary. What > really matters is once we fix ops.dequeue(), can the BPF scheduler properly > synchronize dequeue against scx_bpf_dsq_insert() to avoid triggering cpumask > or migration disabled state mismatches? 
If so, ops.dequeue() would be the > primary way to deal with these issues. Agreed, we should handle this in ops_dequeue(). > > Maybe not implementing ops.dequeue() can enable sloppy fallbacks as that > indicates the scheduler isn't taking property changes into account at all, > but that's really secondary. Let's first focus on making ops.dequeue() > working properly so that the BPF scheduler can synchronize correctly. Ack. > > ... > > > I wonder whether we should define an invalid qseq and use that instead. The > > > queueing instance really is invalid after this and it would help catching > > > cases where BPF scheduler makes mistakes w/ synchronization. Also, wouldn't > > > dequeue_task_scx() or ops_dequeue() be a better place to shoot down the > > > enqueued instances? While the symptom we most immediately see are through > > > cpumask changes, the underlying problem is dequeue not shooting down > > > existing enqueued tasks. > > > > I think I like the idea of having an INVALID_QSEQ or similar, it'd also > > make debugging easier. > > > > I'm not sure about moving the logic to dequeue_task_scx(), more exactly, > > I'm not sure if there're nasty locking implications. I'll do some > > experiments, if it works, sure, dequeue would be a better place to cancel > > invalid enqueued instances. > > I was confused while writing above. All of the above is already happening. > When a task is dequeued, it's OPSS is cleared and the task won't be eligible > for dispatching anymore. The only "confused" case is where the task finishes > reenqueueing before the previous dispatch attempt is finished, which the BPF > scheduler should be able to handle once ops.dequeue() is fixed. Yeah I think we can handle this in the dequeue path. I think I have a new working (maybe) patch. Will run some tests and send the new version later today. Thanks, -Andrea ^ permalink raw reply [flat|nested] 13+ messages in thread
Thread overview: 13+ messages 2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi 2026-02-04 13:20 ` Kuba Piecuch 2026-02-04 15:36 ` Andrea Righi 2026-02-04 16:58 ` Kuba Piecuch 2026-02-04 17:56 ` Andrea Righi 2026-02-05 17:20 ` Kuba Piecuch 2026-02-05 17:37 ` Andrea Righi 2026-02-04 15:07 ` Christian Loehle 2026-02-04 23:31 ` Tejun Heo 2026-02-05 1:15 ` Tejun Heo 2026-02-05 16:40 ` Andrea Righi 2026-02-05 22:57 ` Tejun Heo 2026-02-06 8:43 ` Andrea Righi