public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Tejun Heo <tj@kernel.org>,
	 David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>
Cc: Christian Loehle <christian.loehle@arm.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	 Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
	 <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes
Date: Wed, 04 Feb 2026 13:20:52 +0000	[thread overview]
Message-ID: <DG67IMOVU2GX.2CKP769EXJS12@google.com> (raw)
In-Reply-To: <20260203230639.1259869-1-arighi@nvidia.com>

Hi Andrea,

On Tue Feb 3, 2026 at 11:06 PM UTC, Andrea Righi wrote:
> A BPF scheduler may rely on p->cpus_ptr from ops.dispatch() to select a
> target CPU. However, task affinity can change between the dispatch
> decision and its finalization in finish_dispatch(). When this happens,
> the scheduler may attempt to dispatch a task to a CPU that is no longer
> allowed, resulting in fatal errors such as:
>
>  EXIT: runtime error (SCX_DSQ_LOCAL[_ON] target CPU 10 not allowed for stress-ng-race-[13565])
>
> This race exists because ops.dispatch() runs without holding the task's
> run queue lock, allowing a concurrent set_cpus_allowed() to update
> p->cpus_ptr while the BPF scheduler is still using it. The dispatch is
> then finalized using stale affinity information.
>
> Example timeline:
>
>   CPU0                                      CPU1
>   ----                                      ----
>                                             task_rq_lock(p)
>   if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                             set_cpus_allowed_scx(p, new_mask)
>                                             task_rq_unlock(p)
>       scx_bpf_dsq_insert(p,
>               SCX_DSQ_LOCAL_ON | cpu, 0)
>
> Fix this by extending the existing qseq invalidation mechanism to also
> cover CPU affinity changes, in addition to task dequeues/re-enqueues,
> occurring between dispatch decision and finalization.

I'm not convinced this will prevent the exact race scenario you describe.
For the qseq increment inside set_cpus_allowed_scx() to be noticed, it needs
to happen _after_ scx_bpf_dsq_insert() reads the qseq and stores it in the
dispatch buffer entry.

We can still have the cpumask change and qseq increment on CPU1 happen between
cpumask_test_cpu() and scx_bpf_dsq_insert() on CPU0, and it will not be
detected because the dispatch buffer entry will contain the incremented value.

>
> When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> re-dispatched using up-to-date affinity information.

How will the scheduler know that the dispatch was dropped? Is the scheduler
expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
on CPU1?

>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/ext.c | 58 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0ab994180f655..6128a2529a7c7 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1828,8 +1828,9 @@ static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
>   * will change. As @p's task_rq is locked, this function doesn't need to use the
>   * holding_cpu mechanism.
>   *
> - * On return, @src_dsq is unlocked and only @p's new task_rq, which is the
> - * return value, is locked.
> + * On success, @src_dsq is unlocked and only @p's new task_rq, which is the
> + * return value, is locked. On failure (affinity change invalidated the move),
> + * returns NULL with @src_dsq unlocked and task remaining in @src_dsq.
>   */
>  static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  					 struct task_struct *p, u64 enq_flags,
> @@ -1845,9 +1846,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  	if (dst_dsq->id == SCX_DSQ_LOCAL) {
>  		dst_rq = container_of(dst_dsq, struct rq, scx.local_dsq);
>  		if (src_rq != dst_rq &&
> -		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> -			dst_dsq = find_global_dsq(sch, p);
> -			dst_rq = src_rq;
> +		    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> +			/*
> +			 * Task affinity changed after dispatch decision:
> +			 * drop the dispatch, caller will handle returning
> +			 * the task to its original DSQ.
> +			 */
> +			return NULL;
>  		}
>  	} else {
>  		/* no need to migrate if destination is a non-local DSQ */
> @@ -1974,9 +1979,15 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  	}
>  
>  	if (src_rq != dst_rq &&
> -	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> -		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> -				 enq_flags | SCX_ENQ_CLEAR_OPSS);
> +	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
> +		/*
> +		 * Task affinity changed after dispatch decision: drop the
> +		 * dispatch, task remains in its current state and will be
> +		 * dispatched again in a future cycle.
> +		 */
> +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED |
> +					(atomic_long_read(&p->scx.ops_state) &
> +					 SCX_OPSS_QSEQ_MASK));
>  		return;
>  	}
>  
> @@ -2616,12 +2627,30 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>  				 struct affinity_context *ac)
>  {
>  	struct scx_sched *sch = scx_root;
> +	struct rq *rq = task_rq(p);
> +
> +	lockdep_assert_rq_held(rq);
>  
>  	set_cpus_allowed_common(p, ac);
>  
>  	if (unlikely(!sch))
>  		return;
>  
> +	/*
> +	 * Affinity changes invalidate any pending dispatch decisions made
> +	 * with the old affinity. Increment the runqueue's ops_qseq and
> +	 * update the task's qseq to invalidate in-flight dispatches.
> +	 */
> +	if (p->scx.flags & SCX_TASK_QUEUED) {
> +		unsigned long opss;
> +
> +		rq->scx.ops_qseq++;
> +		opss = atomic_long_read(&p->scx.ops_state);
> +		atomic_long_set(&p->scx.ops_state,
> +				(opss & SCX_OPSS_STATE_MASK) |
> +				(rq->scx.ops_qseq << SCX_OPSS_QSEQ_SHIFT));
> +	}

Will we ever enter this if statement? IIUC set_cpus_allowed_scx() is always
called under `scoped_guard (sched_change, p, DEQUEUE_SAVE)`, so assuming the
task is on a runqueue, set_cpus_allowed_scx() will always be preceded by
dequeue_task_scx(), which clears SCX_TASK_QUEUED.

> +
>  	/*
>  	 * The effective cpumask is stored in @p->cpus_ptr which may temporarily
>  	 * differ from the configured one in @p->cpus_mask. Always tell the bpf
> @@ -6013,14 +6042,21 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
>  
>  	/* execute move */
>  	locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
> -	dispatched = true;
> +	if (locked_rq) {
> +		dispatched = true;
> +	} else {
> +		/* Move failed: task stays in src_dsq */
> +		raw_spin_unlock(&src_dsq->lock);
> +		locked_rq = in_balance ? this_rq : NULL;
> +	}
>  out:
>  	if (in_balance) {
>  		if (this_rq != locked_rq) {
> -			raw_spin_rq_unlock(locked_rq);
> +			if (locked_rq)
> +				raw_spin_rq_unlock(locked_rq);
>  			raw_spin_rq_lock(this_rq);
>  		}
> -	} else {
> +	} else if (locked_rq) {
>  		raw_spin_rq_unlock_irqrestore(locked_rq, flags);
>  	}
>  

Thanks,
Kuba


  reply	other threads:[~2026-02-04 13:20 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-03 23:06 [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi
2026-02-04 13:20 ` Kuba Piecuch [this message]
2026-02-04 15:36   ` Andrea Righi
2026-02-04 16:58     ` Kuba Piecuch
2026-02-04 17:56       ` Andrea Righi
2026-02-05 17:20         ` Kuba Piecuch
2026-02-05 17:37           ` Andrea Righi
2026-02-04 15:07 ` Christian Loehle
2026-02-04 23:31 ` Tejun Heo
2026-02-05  1:15   ` Tejun Heo
2026-02-05 16:40   ` Andrea Righi
2026-02-05 22:57     ` Tejun Heo
2026-02-06  8:43       ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DG67IMOVU2GX.2CKP769EXJS12@google.com \
    --to=jpiecuch@google.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox