Sched_ext development
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, Changwoo Min <changwoo@igalia.com>,
	David Vernet <void@manifault.com>,
	linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev
Subject: Re: [PATCH v2 sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock
Date: Fri, 19 Jun 2026 15:32:36 +0200	[thread overview]
Message-ID: <ajVE9Msy9kEv1ihV@gpd4> (raw)
In-Reply-To: <20260619132359.831936-1-jpiecuch@google.com>

Hi Kuba,

On Fri, Jun 19, 2026 at 01:23:59PM +0000, Kuba Piecuch wrote:
> task_can_run_on_remote_rq() operates under the assumption that
> p->migration_disabled is stable, i.e. if the kernel observed
> is_migration_disabled(p) == true, then the BPF scheduler must have also
> been able to see this when dispatching the task, and it's the BPF
> scheduler's fault that it tried to dispatch a task with migration
> disabled to a CPU other than the task's current CPU.
> 
> This assumption does not always hold. It's possible that the BPF
> scheduler saw is_migration_disabled(p) == false, while the kernel
> observes is_migration_disabled(p) == true in dispatch_to_local_dsq()
> -> task_can_run_on_remote_rq().
> 
> The crucial thing here is that with CONFIG_PREEMPT_RCU, migration is
> disabled while a task is executing a BPF program. So, if there's a
> situation where the BPF scheduler checks a task while it's not executing
> a BPF program, while the kernel checks it while it is executing one,
> the BPF scheduler will be killed through no fault of its own.
> 
> Consider the following scenario:
> 
> 1. SCX task @p is executing on CPU A and CPU A gets preempted by a
>    higher-priority scheduling class. On entry to __schedule(),
>    p->migration_disabled == 0.
> 
> 2. In put_prev_task_scx() @p is enqueued on the BPF scheduler's internal
>    data structures, making it available for other CPUs to dispatch.
> 
> 3. CPU B enters ops.dispatch(), pops @p from the BPF scheduler's data
>    structures, checks is_migration_disabled(p) which returns false,
>    and dispatches @p to CPU B's local DSQ.
> 
> 4. On CPU A, @p hasn't been switched out yet. Execution reaches
>    trace_sched_switch() which enters a BPF program, as the BPF scheduler
>    hooks into the sched_switch tracepoint to detect idle->fair
>    transitions. On entry into the BPF program, @p disables migration.
> 
> 5. CPU B enters finish_dispatch() -> dispatch_to_local_dsq() ->
>    task_can_run_on_remote_rq() which observes
>    is_migration_disabled(p) == true, triggering scx_error().
>    This all happens while holding CPU B's rq lock, so it's not
>    synchronized with @p switching out.
> 
> This patch fixes this by moving the call to task_can_run_on_remote_rq()
> after @p's rq lock is acquired in dispatch_to_local_dsq(). This way, we
> synchronize with @p switching out, since @p holds its rq lock all
> the way until it's switched out. Thus, any BPF programs that are called
> between put_prev_task_scx() and the end of the context switch are
> guaranteed to have finished and cannot influence p->migration_disabled.
> 
> Also add a lockdep assertion in task_can_run_on_remote_rq() which
> ensures the task rq lock is held if enforce == true.
> 
> Signed-off-by: Kuba Piecuch <jpiecuch@google.com>

Looks good to me.

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> ---
> 
> Changes from v1:
>  - Update documentation in ext_internal.h explaining the locking dance
>    around task rq migration (Andrea)
> Link to v1: https://lore.kernel.org/all/20260618170047.283701-1-jpiecuch@google.com/
> 
>  kernel/sched/ext.c          | 24 ++++++++++++++++--------
>  kernel/sched/ext_internal.h | 23 +++++++++++++----------
>  2 files changed, 29 insertions(+), 18 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6567f626b3f0..4ae7ca4e0a41 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2422,6 +2422,7 @@ static void move_remote_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>   *   no to the BPF scheduler initiated migrations while offline.
>   *
>   * The caller must ensure that @p and @rq are on different CPUs.
> + * If enforce == true, caller must hold @p's rq lock.
>   */
>  static bool task_can_run_on_remote_rq(struct scx_sched *sch,
>  				      struct task_struct *p, struct rq *rq,
> @@ -2429,6 +2430,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
>  {
>  	s32 cpu = cpu_of(rq);
>  
> +	/*
> +	 * To prevent races with @p still running on its old CPU while switching
> +	 * out, make sure we're holding @p's rq lock so as not to risk
> +	 * erroneously killing the BPF scheduler.
> +	 */
> +	if (enforce)
> +		lockdep_assert_rq_held(task_rq(p));
> +
>  	WARN_ON_ONCE(task_cpu(p) == cpu);
>  
>  	/*
> @@ -2696,13 +2705,6 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		return;
>  	}
>  
> -	if (src_rq != dst_rq &&
> -	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> -		dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
> -				 enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
> -		return;
> -	}
> -
>  	/*
>  	 * @p is on a possibly remote @src_rq which we need to lock to move the
>  	 * task. If dequeue is in progress, it'd be locking @src_rq and waiting
> @@ -2729,6 +2731,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  	/* task_rq couldn't have changed if we're still the holding cpu */
>  	if (likely(p->scx.holding_cpu == raw_smp_processor_id()) &&
>  	    !WARN_ON_ONCE(src_rq != task_rq(p))) {
> +		bool fallback = false;
>  		/*
>  		 * If @p is staying on the same rq, there's no need to go
>  		 * through the full deactivate/activate cycle. Optimize by
> @@ -2738,6 +2741,11 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  			p->scx.holding_cpu = -1;
>  			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
>  					 enq_flags);
> +		} else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> +			p->scx.holding_cpu = -1;
> +			fallback = true;
> +			dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
> +					 p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
>  		} else {
>  			move_remote_task_to_local_dsq(p, enq_flags,
>  						      src_rq, dst_rq);
> @@ -2746,7 +2754,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		}
>  
>  		/* if the destination CPU is idle, wake it up */
> -		if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> +		if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
>  			resched_curr(dst_rq);
>  	}
>  
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index b04701190b23..457df0bebcd9 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -1463,21 +1463,24 @@ static const char *scx_enable_state_str[] = {
>   * The sched_ext core uses a "lock dancing" protocol coordinated by
>   * p->scx.holding_cpu. When moving a task to a different rq:
>   *
> - *   1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
> - *   2. Set p->scx.holding_cpu to the current CPU
> - *   3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
> + *   1. Set p->scx.holding_cpu to the current CPU
> + *   2. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
>   *      is set, so clearing DISPATCHING first prevents the circular wait
>   *      (safe to lock the rq we need)
> - *   4. Unlock the current CPU's rq
> - *   5. Lock src_rq (where the task currently lives)
> - *   6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
> + *   3. Unlock the current CPU's rq
> + *   4. Lock src_rq (where the task currently lives)
> + *   5. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
>   *      race (dequeue clears holding_cpu to -1 when it takes the task), in
>   *      this case migration is aborted
> - *   7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
> + *   6. If src_rq == dst_rq: clear holding_cpu and enqueue directly
>   *      into dst_rq's local DSQ (no lock swap needed)
> - *   8. Otherwise: call move_remote_task_to_local_dsq(), which releases
> - *      src_rq, locks dst_rq, and performs the deactivate/activate
> - *      migration cycle (dst_rq is held on return)
> + *   7. Otherwise, verify under src_rq lock that the task can be moved to dst_rq
> + *      (CPU affinity, migration_disabled, etc.). If not, clear holding_cpu,
> + *      leave the task on src_rq, and enqueue it on the fallback DSQ.
> + *   8. Otherwise (i.e. if the task can be moved to dst_rq), call
> + *      move_remote_task_to_local_dsq(), which releases src_rq, locks dst_rq,
> + *      and performs the deactivate/activate migration cycle
> + *      (dst_rq is held on return)
>   *   9. Unlock dst_rq and re-lock the current CPU's rq to restore
>   *      the lock state expected by the caller
>   *
> -- 
> 2.55.0.rc0.738.g0c8ab3ebcc-goog
> 

  reply	other threads:[~2026-06-19 13:32 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-19 13:23 [PATCH v2 sched_ext/for-7.2] sched_ext: check remote rq eligibility under task's rq lock Kuba Piecuch
2026-06-19 13:32 ` Andrea Righi [this message]
2026-06-24 22:09 ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajVE9Msy9kEv1ihV@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox