* [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock
@ 2026-06-18 17:00 Kuba Piecuch
2026-06-19 7:31 ` Andrea Righi
0 siblings, 1 reply; 2+ messages in thread
From: Kuba Piecuch @ 2026-06-18 17:00 UTC (permalink / raw)
To: Tejun Heo, Andrea Righi, Changwoo Min, David Vernet
Cc: linux-kernel, sched-ext, Kuba Piecuch
task_can_run_on_remote_rq() operates under the assumption that
p->migration_disabled is stable, i.e. if the kernel observed
is_migration_disabled(p) == true, then the BPF scheduler must have also
been able to see this when dispatching the task, and it's the BPF
scheduler's fault that it tried to dispatch a task with migration
disabled to a CPU other than the task's current CPU.
This assumption does not always hold. It's possible that the BPF
scheduler saw is_migration_disabled(p) == false, while the kernel
observes is_migration_disabled(p) == true in dispatch_to_local_dsq()
-> task_can_run_on_remote_rq().
The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is
disabled while a task is executing a BPF program. So, if there's a
situation where the BPF scheduler checks a task while it's not executing
a BPF program, while the kernel checks it while it is executing one,
the BPF scheduler will be killed through no fault of its own.
Consider the following scenario:
1. SCX task @p is executing on CPU A and CPU A gets preempted by a
higher-priority scheduling class. On entry to __schedule(),
p->migration_disabled == 0.
2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal
data structures, making it available for other CPUs to dispatch.
3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data
structures, checks is_migration_disabled(p) which returns false,
and dispatches @p to CPU B's local DSQ.
4. On CPU A, @p hasn't been switched out yet. Execution reaches
trace_sched_switch() which enters a BPF program, as the BPF scheduler
hooks into the sched_switch tracepoint to detect idle->fair
transitions. On entry into the BPF program, @p disables migration.
5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() ->
task_can_run_on_remote_rq() which observes
is_migration_disabled(p) == true, triggering scx_error().
This all happens while holding CPU B's rq lock, so it's not
synchronized with @p switching out.
This patch fixes this by moving the call to task_can_run_on_remote_rq()
after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we
synchronize with @p switching out, since @p holds its rq lock all
the way until it's switched out. Thus, any BPF programs that are called
between put_prev_task_scx() and the end of the context switch are
guaranteed to have finished and cannot influence p->migration_disabled.
Also add a lockdep assertion in task_can_run_on_remote_rq() which
ensures the task rq lock is held if enforce == true.
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
---
kernel/sched/ext.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6567f626b3f0..4ae7ca4e0a41 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
* no to the BPF scheduler initiated migrations while offline.
*
* The caller must ensure that @p and @rq are on different CPUs.
+ * If enforce == true, caller must hold @p's rq lock.
*/
static bool task_can_run_on_remote_rq(struct scx_sched *sch,
struct task_struct *p, struct rq *rq,
@@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
{
s32 cpu = cpu_of(rq);
+ /*
+ * To prevent races with @p still running on its old CPU while switching
+ * out, make sure we're holding @p's rq lock so as not to risk
+ * erroneously killing the BPF scheduler.
+ */
+ if (enforce)
+ lockdep_assert_rq_held(task_rq(p));
+
WARN_ON_ONCE(task_cpu(p) == cpu);
/*
@@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
return;
}
- if (src_rq != dst_rq &&
- unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
- enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
- return;
- }
-
/*
* @p is on a possibly remote @src_rq which we need to lock to move the
* task. If dequeue is in progress, it'd be locking @src_rq and waiting
@@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
/* task_rq couldn't have changed if we're still the holding cpu */
if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
!WARN_ON_ONCE(src_rq != task_rq(p))) {
+ bool fallback = false;
/*
* If @p is staying on the same rq, there's no need to go
* through the full deactivate/activate cycle. Optimize by
@@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
p->scx.holding_cpu = -1;
dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
enq_flags);
+ } else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
+ p->scx.holding_cpu = -1;
+ fallback = true;
+ dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
+ p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
} else {
move_remote_task_to_local_dsq(p, enq_flags,
src_rq, dst_rq);
@@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
}
/* if the destination CPU is idle, wake it up */
- if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+ if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
resched_curr(dst_rq);
}
--
2.55.0.rc0.786.g65d90a0328-goog
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock
2026-06-18 17:00 [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Kuba Piecuch
@ 2026-06-19 7:31 ` Andrea Righi
0 siblings, 0 replies; 2+ messages in thread
From: Andrea Righi @ 2026-06-19 7:31 UTC (permalink / raw)
To: Kuba Piecuch
Cc: Tejun Heo, Changwoo Min, David Vernet, linux-kernel, sched-ext
Hi Kuba,
On Thu, Jun 18, 2026 at 05:00:47PM +0000, Kuba Piecuch wrote:
> task_can_run_on_remote_rq() operates under the assumption that
> p->migration_disabled is stable, i.e. if the kernel observed
> is_migration_disabled(p) == true, then the BPF scheduler must have also
> been able to see this when dispatching the task, and it's the BPF
> scheduler's fault that it tried to dispatch a task with migration
> disabled to a CPU other than the task's current CPU.
>
> This assumption does not always hold. It's possible that the BPF
> scheduler saw is_migration_disabled(p) == false, while the kernel
> observes is_migration_disabled(p) == true in dispatch_to_local_dsq()
> -> task_can_run_on_remote_rq().
>
> The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is
> disabled while a task is executing a BPF program. So, if there's a
> situation where the BPF scheduler checks a task while it's not executing
> a BPF program, while the kernel checks it while it is executing one,
> the BPF scheduler will be killed through no fault of its own.
>
> Consider the following scenario:
>
> 1. SCX task @p is executing on CPU A and CPU A gets preempted by a
> higher-priority scheduling class. On entry to __schedule(),
> p->migration_disabled == 0.
>
> 2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal
> data structures, making it available for other CPUs to dispatch.
>
> 3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data
> structures, checks is_migration_disabled(p) which returns false,
> and dispatches @p to CPU B's local DSQ.
>
> 4. On CPU A, @p hasn't been switched out yet. Execution reaches
> trace_sched_switch() which enters a BPF program, as the BPF scheduler
> hooks into the sched_switch tracepoint to detect idle->fair
> transitions. On entry into the BPF program, @p disables migration.
>
> 5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() ->
> task_can_run_on_remote_rq() which observes
> is_migration_disabled(p) == true, triggering scx_error().
> This all happens while holding CPU B's rq lock, so it's not
> synchronized with @p switching out.
>
> This patch fixes this by moving the call to task_can_run_on_remote_rq()
> after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we
> synchronize with @p switching out, since @p holds its rq lock all
> the way until it's switched out. Thus, any BPF programs that are called
> between put_prev_task_scx() and the end of the context switch are
> guaranteed to have finished and cannot influence p->migration_disabled.
>
> Also add a lockdep assertion in task_can_run_on_remote_rq() which
> ensures the task rq lock is held if enforce == true.
>
> Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Looks good to me, but we should also update the "Cross-CPU Task Migration" doc
in kernel/sched/ext_internal.h to be consistent with this change (see below if
it makes sense to you).
With that:
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
kernel/sched/ext_internal.h | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index b76633c52c96a..b63d80ae21157 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1471,18 +1471,20 @@ static const char *scx_enable_state_str[] = {
* The sched_ext core uses a "lock dancing" protocol coordinated by
* p->scx.holding_cpu. When moving a task to a different rq:
*
- * 1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
- * 2. Set p->scx.holding_cpu to the current CPU
- * 3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
+ * 1. Set p->scx.holding_cpu to the current CPU
+ * 2. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
* is set, so clearing DISPATCHING first prevents the circular wait
* (safe to lock the rq we need)
- * 4. Unlock the current CPU's rq
- * 5. Lock src_rq (where the task currently lives)
- * 6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
+ * 3. Unlock the current CPU's rq
+ * 4. Lock src_rq (where the task currently lives)
+ * 5. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
* race (dequeue clears holding_cpu to -1 when it takes the task), in
* this case migration is aborted
- * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ * 6. If src_rq == dst_rq: clear holding_cpu and enqueue directly
* into dst_rq's local DSQ (no lock swap needed)
+ * 7. Otherwise, verify under src_rq that the task can be moved to dst_rq
+ * (CPU affinity, migration_disabled, etc.). If not, clear holding_cpu
+ * and enqueue the task on the fallback DSQ on src_rq.
* 8. Otherwise: call move_remote_task_to_local_dsq(), which releases
* src_rq, locks dst_rq, and performs the deactivate/activate
* migration cycle (dst_rq is held on return)
> ---
> kernel/sched/ext.c | 24 ++++++++++++++++--------
> 1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6567f626b3f0..4ae7ca4e0a41 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> * no to the BPF scheduler initiated migrations while offline.
> *
> * The caller must ensure that @p and @rq are on different CPUs.
> + * If enforce == true, caller must hold @p's rq lock.
> */
> static bool task_can_run_on_remote_rq(struct scx_sched *sch,
> struct task_struct *p, struct rq *rq,
> @@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
> {
> s32 cpu = cpu_of(rq);
>
> + /*
> + * To prevent races with @p still running on its old CPU while switching
> + * out, make sure we're holding @p's rq lock so as not to risk
> + * erroneously killing the BPF scheduler.
> + */
> + if (enforce)
> + lockdep_assert_rq_held(task_rq(p));
> +
> WARN_ON_ONCE(task_cpu(p) == cpu);
>
> /*
> @@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> return;
> }
>
> - if (src_rq != dst_rq &&
> - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> - dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
> - enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
> - return;
> - }
> -
> /*
> * @p is on a possibly remote @src_rq which we need to lock to move the
> * task. If dequeue is in progress, it'd be locking @src_rq and waiting
> @@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> /* task_rq couldn't have changed if we're still the holding cpu */
> if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
> !WARN_ON_ONCE(src_rq != task_rq(p))) {
> + bool fallback = false;
> /*
> * If @p is staying on the same rq, there's no need to go
> * through the full deactivate/activate cycle. Optimize by
> @@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> p->scx.holding_cpu = -1;
> dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
> enq_flags);
> + } else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> + p->scx.holding_cpu = -1;
> + fallback = true;
> + dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
> + p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
> } else {
> move_remote_task_to_local_dsq(p, enq_flags,
> src_rq, dst_rq);
> @@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> }
>
> /* if the destination CPU is idle, wake it up */
> - if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> + if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> resched_curr(dst_rq);
> }
>
> --
> 2.55.0.rc0.786.g65d90a0328-goog
>
^ permalink raw reply related [flat|nested] 2+ messages in thread
end of thread, other threads:[~2026-06-19 7:33 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-18 17:00 [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Kuba Piecuch
2026-06-19 7:31 ` Andrea Righi
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.