[PATCH v2 sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kuba Piecuch <jpiecuch@google.com>
To: Tejun Heo <tj@kernel.org>, Andrea Righi <arighi@nvidia.com>,
	Changwoo Min <changwoo@igalia.com>,
	 David Vernet <void@manifault.com>
Cc: linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
	 Kuba Piecuch <jpiecuch@google.com>
Subject: [PATCH v2 sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock
Date: Fri, 19 Jun 2026 13:23:59 +0000	[thread overview]
Message-ID: <20260619132359.831936-1-jpiecuch@google.com> (raw)

task_can_run_on_remote_rq() operates under the assumption that
p->migration_disabled is stable, i.e. if the kernel observed
is_migration_disabled(p) == true, then the BPF scheduler must have also
been able to see this when dispatching the task, and it's the BPF
scheduler's fault that it tried to dispatch a task with migration
disabled to a CPU other than the task's current CPU.

This assumption does not always hold. It's possible that the BPF
scheduler saw is_migration_disabled(p) == false, while the kernel
observes is_migration_disabled(p) == true in dispatch_to_local_dsq()
-> task_can_run_on_remote_rq().

The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is
disabled while a task is executing a BPF program. So, if there's a
situation where the BPF scheduler checks a task while it's not executing
a BPF program, while the kernel checks it while it is executing one,
the BPF scheduler will be killed through no fault of its own.

Consider the following scenario:

1. SCX task @p is executing on CPU A and CPU A gets preempted by a
   higher-priority scheduling class. On entry to __schedule(),
   p->migration_disabled == 0.

2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal
   data structures, making it available for other CPUs to dispatch.

3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data
   structures, checks is_migration_disabled(p) which returns false,
   and dispatches @p to CPU B's local DSQ.

4. On CPU A, @p hasn't been switched out yet. Execution reaches
   trace_sched_switch() which enters a BPF program, as the BPF scheduler
   hooks into the sched_switch tracepoint to detect idle->fair
   transitions. On entry into the BPF program, @p disables migration.

5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() ->
   task_can_run_on_remote_rq() which observes
   is_migration_disabled(p) == true, triggering scx_error().
   This all happens while holding CPU B's rq lock, so it's not
   synchronized with @p switching out.

This patch fixes this by moving the call to task_can_run_on_remote_rq()
after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we
synchronize with @p switching out, since @p holds its rq lock all
the way until it's switched out. Thus, any BPF programs that are called
between put_prev_task_scx() and the end of the context switch are
guaranteed to have finished and cannot influence p->migration_disabled.

Also add a lockdep assertion in task_can_run_on_remote_rq() which
ensures the task rq lock is held if enforce == true.

Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
---

Changes from v1:
 - Update documentation in ext_internal.h explaining the locking dance
   around task rq migration (Andrea)
Link to v1: https://lore.kernel.org/all/20260618170047.283701-1-jpiecuch@google.com/

 kernel/sched/ext.c          | 24 ++++++++++++++++--------
 kernel/sched/ext_internal.h | 23 +++++++++++++----------
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6567f626b3f0..4ae7ca4e0a41 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
  *   no to the BPF scheduler initiated migrations while offline.
  *
  * The caller must ensure that @p and @rq are on different CPUs.
+ * If enforce == true, caller must hold @p's rq lock.
  */
 static bool task_can_run_on_remote_rq(struct scx_sched *sch,
 				      struct task_struct *p, struct rq *rq,
@@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
 {
 	s32 cpu = cpu_of(rq);
 
+	/*
+	 * To prevent races with @p still running on its old CPU while switching
+	 * out, make sure we're holding @p's rq lock so as not to risk
+	 * erroneously killing the BPF scheduler.
+	 */
+	if (enforce)
+		lockdep_assert_rq_held(task_rq(p));
+
 	WARN_ON_ONCE(task_cpu(p) == cpu);
 
 	/*
@@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		return;
 	}
 
-	if (src_rq != dst_rq &&
-	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
-				 enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
-		return;
-	}
-
 	/*
 	 * @p is on a possibly remote @src_rq which we need to lock to move the
 	 * task. If dequeue is in progress, it'd be locking @src_rq and waiting
@@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	/* task_rq couldn't have changed if we're still the holding cpu */
 	if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
 	    !WARN_ON_ONCE(src_rq != task_rq(p))) {
+		bool fallback = false;
 		/*
 		 * If @p is staying on the same rq, there's no need to go
 		 * through the full deactivate/activate cycle. Optimize by
@@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 			p->scx.holding_cpu = -1;
 			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
 					 enq_flags);
+		} else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
+			p->scx.holding_cpu = -1;
+			fallback = true;
+			dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
+					 p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
 		} else {
 			move_remote_task_to_local_dsq(p, enq_flags,
 						      src_rq, dst_rq);
@@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		}
 
 		/* if the destination CPU is idle, wake it up */
-		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+		if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
 			resched_curr(dst_rq);
 	}
 
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index b04701190b23..457df0bebcd9 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1463,21 +1463,24 @@ static const char *scx_enable_state_str[] = {
  * The sched_ext core uses a "lock dancing" protocol coordinated by
  * p->scx.holding_cpu. When moving a task to a different rq:
  *
- *   1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
- *   2. Set p->scx.holding_cpu to the current CPU
- *   3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
+ *   1. Set p->scx.holding_cpu to the current CPU
+ *   2. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
  *      is set, so clearing DISPATCHING first prevents the circular wait
  *      (safe to lock the rq we need)
- *   4. Unlock the current CPU's rq
- *   5. Lock src_rq (where the task currently lives)
- *   6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
+ *   3. Unlock the current CPU's rq
+ *   4. Lock src_rq (where the task currently lives)
+ *   5. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
  *      race (dequeue clears holding_cpu to -1 when it takes the task), in
  *      this case migration is aborted
- *   7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ *   6. If src_rq == dst_rq: clear holding_cpu and enqueue directly
  *      into dst_rq's local DSQ (no lock swap needed)
- *   8. Otherwise: call move_remote_task_to_local_dsq(), which releases
- *      src_rq, locks dst_rq, and performs the deactivate/activate
- *      migration cycle (dst_rq is held on return)
+ *   7. Otherwise, verify under src_rq lock that the task can be moved to dst_rq
+ *      (CPU affinity, migration_disabled, etc.). If not, clear holding_cpu,
+ *      leave the task on src_rq, and enqueue it on the fallback DSQ.
+ *   8. Otherwise (i.e. if the task can be moved to dst_rq), call
+ *      move_remote_task_to_local_dsq(), which releases src_rq, locks dst_rq,
+ *      and performs the deactivate/activate migration cycle
+ *      (dst_rq is held on return)
  *   9. Unlock dst_rq and re-lock the current CPU's rq to restore
  *      the lock state expected by the caller
  *
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog

next             reply	other threads:[~2026-06-19 13:24 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-19 13:23 Kuba Piecuch [this message]
2026-06-19 13:32 ` [PATCH v2 sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Andrea Righi

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:6567f626b3f dfblob:4ae7ca4e0a4 dfblob:b04701190b2
dfblob:457df0bebcd )
 OR (
bs:"[PATCH v2 sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260619132359.831936-1-jpiecuch@google.com \
    --to=jpiecuch@google.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.