All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, Changwoo Min <changwoo@igalia.com>,
	David Vernet <void@manifault.com>,
	linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev
Subject: Re: [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock
Date: Fri, 19 Jun 2026 09:31:11 +0200	[thread overview]
Message-ID: <ajTwP6tQxR_6UGsQ@gpd4> (raw)
In-Reply-To: <20260618170047.283701-1-jpiecuch@google.com>

Hi Kuba,

On Thu, Jun 18, 2026 at 05:00:47PM +0000, Kuba Piecuch wrote:
> task_can_run_on_remote_rq() operates under the assumption that
> p->migration_disabled is stable, i.e. if the kernel observed
> is_migration_disabled(p) == true, then the BPF scheduler must have also
> been able to see this when dispatching the task, and it's the BPF
> scheduler's fault that it tried to dispatch a task with migration
> disabled to a CPU other than the task's current CPU.
> 
> This assumption does not always hold. It's possible that the BPF
> scheduler saw is_migration_disabled(p) == false, while the kernel
> observes is_migration_disabled(p) == true in dispatch_to_local_dsq()
> -> task_can_run_on_remote_rq().
> 
> The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is
> disabled while a task is executing a BPF program. So, if there's a
> situation where the BPF scheduler checks a task while it's not executing
> a BPF program, while the kernel checks it while it is executing one,
> the BPF scheduler will be killed through no fault of its own.
> 
> Consider the following scenario:
> 
> 1. SCX task @p is executing on CPU A and CPU A gets preempted by a
>    higher-priority scheduling class. On entry to __schedule(),
>    p->migration_disabled == 0.
> 
> 2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal
>    data structures, making it available for other CPUs to dispatch.
> 
> 3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data
>    structures, checks is_migration_disabled(p) which returns false,
>    and dispatches @p to CPU B's local DSQ.
> 
> 4. On CPU A, @p hasn't been switched out yet. Execution reaches
>    trace_sched_switch() which enters a BPF program, as the BPF scheduler
>    hooks into the sched_switch tracepoint to detect idle->fair
>    transitions. On entry into the BPF program, @p disables migration.
> 
> 5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() ->
>    task_can_run_on_remote_rq() which observes
>    is_migration_disabled(p) == true, triggering scx_error().
>    This all happens while holding CPU B's rq lock, so it's not
>    synchronized with @p switching out.
> 
> This patch fixes this by moving the call to task_can_run_on_remote_rq()
> after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we
> synchronize with @p switching out, since @p holds its rq lock all
> the way until it's switched out. Thus, any BPF programs that are called
> between put_prev_task_scx() and the end of the context switch are
> guaranteed to have finished and cannot influence p->migration_disabled.
> 
> Also add a lockdep assertion in task_can_run_on_remote_rq() which
> ensures the task rq lock is held if enforce == true.
> 
> Signed-off-by: Kuba Piecuch <jpiecuch@google.com>

Looks good to me, but we should also update the "Cross-CPU Task Migration" doc
in kernel/sched/ext_internal.h to be consistent with this change (see below if
it makes sense to you).

With that:

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

 kernel/sched/ext_internal.h | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index b76633c52c96a..b63d80ae21157 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1471,18 +1471,20 @@ static const char *scx_enable_state_str[] = {
  * The sched_ext core uses a "lock dancing" protocol coordinated by
  * p->scx.holding_cpu. When moving a task to a different rq:
  *
- *   1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
- *   2. Set p->scx.holding_cpu to the current CPU
- *   3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
+ *   1. Set p->scx.holding_cpu to the current CPU
+ *   2. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
  *      is set, so clearing DISPATCHING first prevents the circular wait
  *      (safe to lock the rq we need)
- *   4. Unlock the current CPU's rq
- *   5. Lock src_rq (where the task currently lives)
- *   6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
+ *   3. Unlock the current CPU's rq
+ *   4. Lock src_rq (where the task currently lives)
+ *   5. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
  *      race (dequeue clears holding_cpu to -1 when it takes the task), in
  *      this case migration is aborted
- *   7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ *   6. If src_rq == dst_rq: clear holding_cpu and enqueue directly
  *      into dst_rq's local DSQ (no lock swap needed)
+ *   7. Otherwise, verify under src_rq that the task can be moved to dst_rq
+ *      (CPU affinity, migration_disabled, etc.). If not, clear holding_cpu
+ *      and enqueue the task on the fallback DSQ on src_rq.
  *   8. Otherwise: call move_remote_task_to_local_dsq(), which releases
  *      src_rq, locks dst_rq, and performs the deactivate/activate
  *      migration cycle (dst_rq is held on return)

> ---
>  kernel/sched/ext.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6567f626b3f0..4ae7ca4e0a41 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>   *   no to the BPF scheduler initiated migrations while offline.
>   *
>   * The caller must ensure that @p and @rq are on different CPUs.
> + * If enforce == true, caller must hold @p's rq lock.
>   */
>  static bool task_can_run_on_remote_rq(struct scx_sched *sch,
>  				      struct task_struct *p, struct rq *rq,
> @@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
>  {
>  	s32 cpu = cpu_of(rq);
>  
> +	/*
> +	 * To prevent races with @p still running on its old CPU while switching
> +	 * out, make sure we're holding @p's rq lock so as not to risk
> +	 * erroneously killing the BPF scheduler.
> +	 */
> +	if (enforce)
> +		lockdep_assert_rq_held(task_rq(p));
> +
>  	WARN_ON_ONCE(task_cpu(p) == cpu);
>  
>  	/*
> @@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		return;
>  	}
>  
> -	if (src_rq != dst_rq &&
> -	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> -		dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
> -				 enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
> -		return;
> -	}
> -
>  	/*
>  	 * @p is on a possibly remote @src_rq which we need to lock to move the
>  	 * task. If dequeue is in progress, it'd be locking @src_rq and waiting
> @@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  	/* task_rq couldn't have changed if we're still the holding cpu */
>  	if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
>  	    !WARN_ON_ONCE(src_rq != task_rq(p))) {
> +		bool fallback = false;
>  		/*
>  		 * If @p is staying on the same rq, there's no need to go
>  		 * through the full deactivate/activate cycle. Optimize by
> @@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  			p->scx.holding_cpu = -1;
>  			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
>  					 enq_flags);
> +		} else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> +			p->scx.holding_cpu = -1;
> +			fallback = true;
> +			dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
> +					 p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
>  		} else {
>  			move_remote_task_to_local_dsq(p, enq_flags,
>  						      src_rq, dst_rq);
> @@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		}
>  
>  		/* if the destination CPU is idle, wake it up */
> -		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> +		if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
>  			resched_curr(dst_rq);
>  	}
>  
> -- 
> 2.55.0.rc0.786.g65d90a0328-goog
> 

      reply	other threads:[~2026-06-19  7:33 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-18 17:00 [PATCH sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Kuba Piecuch
2026-06-19  7:31 ` Andrea Righi [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajTwP6tQxR_6UGsQ@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.