* [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock @ 2026-06-18 17:00 Kuba Piecuch 2026-06-19 7:31 ` Andrea Righi 0 siblings, 1 reply; 3+ messages in thread From: Kuba Piecuch @ 2026-06-18 17:00 UTC (permalink / raw) To: Tejun Heo, Andrea Righi, Changwoo Min, David Vernet Cc: linux-kernel, sched-ext, Kuba Piecuch task_can_run_on_remote_rq() operates under the assumption that p->migration_disabled is stable, i.e. if the kernel observed is_migration_disabled(p) == true, then the BPF scheduler must have also been able to see this when dispatching the task, and it's the BPF scheduler's fault that it tried to dispatch a task with migration disabled to a CPU other than the task's current CPU. This assumption does not always hold. It's possible that the BPF scheduler saw is_migration_disabled(p) == false, while the kernel observes is_migration_disabled(p) == true in dispatch_to_local_dsq() -> task_can_run_on_remote_rq(). The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is disabled while a task is executing a BPF program. So, if there's a situation where the BPF scheduler checks a task while it's not executing a BPF program, while the kernel checks it while it is executing one, the BPF scheduler will be killed through no fault of its own. Consider the following scenario: 1. SCX task @p is executing on CPU A and CPU A gets preempted by a higher-priority scheduling class. On entry to __schedule(), p->migration_disabled == 0. 2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal data structures, making it available for other CPUs to dispatch. 3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data structures, checks is_migration_disabled(p) which returns false, and dispatches @p to CPU B's local DSQ. 4. On CPU A, @p hasn't been switched out yet. Execution reaches trace_sched_switch() which enters a BPF program, as the BPF scheduler hooks into the sched_switch tracepoint to detect idle->fair transitions. On entry into the BPF program, @p disables migration. 5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() -> task_can_run_on_remote_rq() which observes is_migration_disabled(p) == true, triggering scx_error(). This all happens while holding CPU B's rq lock, so it's not synchronized with @p switching out. This patch fixes this by moving the call to task_can_run_on_remote_rq() after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we synchronize with @p switching out, since @p holds its rq lock all the way until it's switched out. Thus, any BPF programs that are called between put_prev_task_scx() and the end of the context switch are guaranteed to have finished and cannot influence p->migration_disabled. Also add a lockdep assertion in task_can_run_on_remote_rq() which ensures the task rq lock is held if enforce == true. Signed-off-by: Kuba Piecuch <jpiecuch@google.com> --- kernel/sched/ext.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6567f626b3f0..4ae7ca4e0a41 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags, * no to the BPF scheduler initiated migrations while offline. * * The caller must ensure that @p and @rq are on different CPUs. + * If enforce == true, caller must hold @p's rq lock. */ static bool task_can_run_on_remote_rq(struct scx_sched *sch, struct task_struct *p, struct rq *rq, @@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch, { s32 cpu = cpu_of(rq); + /* + * To prevent races with @p still running on its old CPU while switching + * out, make sure we're holding @p's rq lock so as not to risk + * erroneously killing the BPF scheduler. + */ + if (enforce) + lockdep_assert_rq_held(task_rq(p)); + WARN_ON_ONCE(task_cpu(p) == cpu); /* @@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, return; } - if (src_rq != dst_rq && - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { - dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p, - enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK); - return; - } - /* * @p is on a possibly remote @src_rq which we need to lock to move the * task. If dequeue is in progress, it'd be locking @src_rq and waiting @@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, /* task_rq couldn't have changed if we're still the holding cpu */ if (likely(p->scx.holding_cpu == raw_smp_processor_id()) && !WARN_ON_ONCE(src_rq != task_rq(p))) { + bool fallback = false; /* * If @p is staying on the same rq, there's no need to go * through the full deactivate/activate cycle. Optimize by @@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, p->scx.holding_cpu = -1; dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, enq_flags); + } else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { + p->scx.holding_cpu = -1; + fallback = true; + dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)), + p, enq_flags | SCX_ENQ_GDSQ_FALLBACK); } else { move_remote_task_to_local_dsq(p, enq_flags, src_rq, dst_rq); @@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, } /* if the destination CPU is idle, wake it up */ - if (sched_class_above(p->sched_class, dst_rq->curr->sched_class)) + if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class)) resched_curr(dst_rq); } -- 2.55.0.rc0.786.g65d90a0328-goog ^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock 2026-06-18 17:00 [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Kuba Piecuch @ 2026-06-19 7:31 ` Andrea Righi 2026-06-19 13:12 ` Kuba Piecuch 0 siblings, 1 reply; 3+ messages in thread From: Andrea Righi @ 2026-06-19 7:31 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, Changwoo Min, David Vernet, linux-kernel, sched-ext Hi Kuba, On Thu, Jun 18, 2026 at 05:00:47PM +0000, Kuba Piecuch wrote: > task_can_run_on_remote_rq() operates under the assumption that > p->migration_disabled is stable, i.e. if the kernel observed > is_migration_disabled(p) == true, then the BPF scheduler must have also > been able to see this when dispatching the task, and it's the BPF > scheduler's fault that it tried to dispatch a task with migration > disabled to a CPU other than the task's current CPU. > > This assumption does not always hold. It's possible that the BPF > scheduler saw is_migration_disabled(p) == false, while the kernel > observes is_migration_disabled(p) == true in dispatch_to_local_dsq() > -> task_can_run_on_remote_rq(). > > The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is > disabled while a task is executing a BPF program. So, if there's a > situation where the BPF scheduler checks a task while it's not executing > a BPF program, while the kernel checks it while it is executing one, > the BPF scheduler will be killed through no fault of its own. > > Consider the following scenario: > > 1. SCX task @p is executing on CPU A and CPU A gets preempted by a > higher-priority scheduling class. On entry to __schedule(), > p->migration_disabled == 0. > > 2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal > data structures, making it available for other CPUs to dispatch. > > 3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data > structures, checks is_migration_disabled(p) which returns false, > and dispatches @p to CPU B's local DSQ. > > 4. On CPU A, @p hasn't been switched out yet. Execution reaches > trace_sched_switch() which enters a BPF program, as the BPF scheduler > hooks into the sched_switch tracepoint to detect idle->fair > transitions. On entry into the BPF program, @p disables migration. > > 5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() -> > task_can_run_on_remote_rq() which observes > is_migration_disabled(p) == true, triggering scx_error(). > This all happens while holding CPU B's rq lock, so it's not > synchronized with @p switching out. > > This patch fixes this by moving the call to task_can_run_on_remote_rq() > after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we > synchronize with @p switching out, since @p holds its rq lock all > the way until it's switched out. Thus, any BPF programs that are called > between put_prev_task_scx() and the end of the context switch are > guaranteed to have finished and cannot influence p->migration_disabled. > > Also add a lockdep assertion in task_can_run_on_remote_rq() which > ensures the task rq lock is held if enforce == true. > > Signed-off-by: Kuba Piecuch <jpiecuch@google.com> Looks good to me, but we should also update the "Cross-CPU Task Migration" doc in kernel/sched/ext_internal.h to be consistent with this change (see below if it makes sense to you). With that: Reviewed-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea kernel/sched/ext_internal.h | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index b76633c52c96a..b63d80ae21157 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -1471,18 +1471,20 @@ static const char *scx_enable_state_str[] = { * The sched_ext core uses a "lock dancing" protocol coordinated by * p->scx.holding_cpu. When moving a task to a different rq: * - * 1. Verify task can be moved (CPU affinity, migration_disabled, etc.) - * 2. Set p->scx.holding_cpu to the current CPU - * 3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING + * 1. Set p->scx.holding_cpu to the current CPU + * 2. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING * is set, so clearing DISPATCHING first prevents the circular wait * (safe to lock the rq we need) - * 4. Unlock the current CPU's rq - * 5. Lock src_rq (where the task currently lives) - * 6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the + * 3. Unlock the current CPU's rq + * 4. Lock src_rq (where the task currently lives) + * 5. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the * race (dequeue clears holding_cpu to -1 when it takes the task), in * this case migration is aborted - * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly + * 6. If src_rq == dst_rq: clear holding_cpu and enqueue directly * into dst_rq's local DSQ (no lock swap needed) + * 7. Otherwise, verify under src_rq that the task can be moved to dst_rq + * (CPU affinity, migration_disabled, etc.). If not, clear holding_cpu + * and enqueue the task on the fallback DSQ on src_rq. * 8. Otherwise: call move_remote_task_to_local_dsq(), which releases * src_rq, locks dst_rq, and performs the deactivate/activate * migration cycle (dst_rq is held on return) > --- > kernel/sched/ext.c | 24 ++++++++++++++++-------- > 1 file changed, 16 insertions(+), 8 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 6567f626b3f0..4ae7ca4e0a41 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags, > * no to the BPF scheduler initiated migrations while offline. > * > * The caller must ensure that @p and @rq are on different CPUs. > + * If enforce == true, caller must hold @p's rq lock. > */ > static bool task_can_run_on_remote_rq(struct scx_sched *sch, > struct task_struct *p, struct rq *rq, > @@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch, > { > s32 cpu = cpu_of(rq); > > + /* > + * To prevent races with @p still running on its old CPU while switching > + * out, make sure we're holding @p's rq lock so as not to risk > + * erroneously killing the BPF scheduler. > + */ > + if (enforce) > + lockdep_assert_rq_held(task_rq(p)); > + > WARN_ON_ONCE(task_cpu(p) == cpu); > > /* > @@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > return; > } > > - if (src_rq != dst_rq && > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > - dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p, > - enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK); > - return; > - } > - > /* > * @p is on a possibly remote @src_rq which we need to lock to move the > * task. If dequeue is in progress, it'd be locking @src_rq and waiting > @@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > /* task_rq couldn't have changed if we're still the holding cpu */ > if (likely(p->scx.holding_cpu == raw_smp_processor_id()) && > !WARN_ON_ONCE(src_rq != task_rq(p))) { > + bool fallback = false; > /* > * If @p is staying on the same rq, there's no need to go > * through the full deactivate/activate cycle. Optimize by > @@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > p->scx.holding_cpu = -1; > dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, > enq_flags); > + } else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > + p->scx.holding_cpu = -1; > + fallback = true; > + dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)), > + p, enq_flags | SCX_ENQ_GDSQ_FALLBACK); > } else { > move_remote_task_to_local_dsq(p, enq_flags, > src_rq, dst_rq); > @@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > } > > /* if the destination CPU is idle, wake it up */ > - if (sched_class_above(p->sched_class, dst_rq->curr->sched_class)) > + if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class)) > resched_curr(dst_rq); > } > > -- > 2.55.0.rc0.786.g65d90a0328-goog > ^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock 2026-06-19 7:31 ` Andrea Righi @ 2026-06-19 13:12 ` Kuba Piecuch 0 siblings, 0 replies; 3+ messages in thread From: Kuba Piecuch @ 2026-06-19 13:12 UTC (permalink / raw) To: Andrea Righi, Kuba Piecuch Cc: Tejun Heo, Changwoo Min, David Vernet, linux-kernel, sched-ext Hi Andrea, Thank you for the review! On Fri Jun 19, 2026 at 7:31 AM UTC, Andrea Righi wrote: > Looks good to me, but we should also update the "Cross-CPU Task Migration" doc > in kernel/sched/ext_internal.h to be consistent with this change (see below if > it makes sense to you). > > With that: > > Reviewed-by: Andrea Righi <arighi@nvidia.com> > > Thanks, > -Andrea > > kernel/sched/ext_internal.h | 16 +++++++++------- > 1 file changed, 9 insertions(+), 7 deletions(-) > > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index b76633c52c96a..b63d80ae21157 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -1471,18 +1471,20 @@ static const char *scx_enable_state_str[] = { > * The sched_ext core uses a "lock dancing" protocol coordinated by > * p->scx.holding_cpu. When moving a task to a different rq: > * > - * 1. Verify task can be moved (CPU affinity, migration_disabled, etc.) > - * 2. Set p->scx.holding_cpu to the current CPU > - * 3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING > + * 1. Set p->scx.holding_cpu to the current CPU > + * 2. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING > * is set, so clearing DISPATCHING first prevents the circular wait > * (safe to lock the rq we need) > - * 4. Unlock the current CPU's rq > - * 5. Lock src_rq (where the task currently lives) > - * 6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the > + * 3. Unlock the current CPU's rq > + * 4. Lock src_rq (where the task currently lives) > + * 5. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the > * race (dequeue clears holding_cpu to -1 when it takes the task), in > * this case migration is aborted > - * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly > + * 6. If src_rq == dst_rq: clear holding_cpu and enqueue directly > * into dst_rq's local DSQ (no lock swap needed) > + * 7. Otherwise, verify under src_rq that the task can be moved to dst_rq > + * (CPU affinity, migration_disabled, etc.). If not, clear holding_cpu > + * and enqueue the task on the fallback DSQ on src_rq. > * 8. Otherwise: call move_remote_task_to_local_dsq(), which releases > * src_rq, locks dst_rq, and performs the deactivate/activate > * migration cycle (dst_rq is held on return) > Absolutely, will send a v2 shortly that includes the documentation change. Thanks, Kuba ^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-06-19 13:12 UTC | newest] Thread overview: 3+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-06-18 17:00 [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Kuba Piecuch 2026-06-19 7:31 ` Andrea Righi 2026-06-19 13:12 ` Kuba Piecuch
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox