Re: sched_ext: Fix SCX_KICK_WAIT to work reliably

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
	bpf@vger.kernel.org, Wen-Fang Liu <liuwenfang@honor.com>
Subject: Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
Date: Wed, 22 Oct 2025 09:43:25 +0200	[thread overview]
Message-ID: <aPiLHWVf0Vp1qUzV@gpd4> (raw)
In-Reply-To: <20251021210354.89570-3-tj@kernel.org>

Hi Tejun,

On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote:
> SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
> a reschedule and can be used to implement operations like core scheduling.
> 
> This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
> which was always called when a CPU picks the next task to run, allowing
> SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
> pick the next task.
> 
> However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
> with switch_class()") replaced scx_next_task_picked() with the
> switch_class() callback, which is only called when switching between sched
> classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
> reliably incremented unless the previous task was SCX and the next task was
> not.
> 
> This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
> pick_task_scx()") which refactored the pick path making put_prev_task_scx()
> the natural place to track task switches for SCX_KICK_WAIT. The fix moves
> pnt_seq increment to put_prev_task_scx() and refines the semantics: If the
> current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task
> switches out. This provides sufficient guarantee for use cases like core
> scheduling while keeping the operation self-contained within SCX.
> 
> Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c          |   31 ++++++++++++++++++-------------
>  kernel/sched/ext_internal.h |    6 ++++--
>  2 files changed, 22 insertions(+), 15 deletions(-)
> 
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
>  	struct scx_sched *sch = scx_root;
>  	const struct sched_class *next_class = next->sched_class;
>  
> -	/*
> -	 * Pairs with the smp_load_acquire() issued by a CPU in
> -	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> -	 * resched.
> -	 */
> -	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
>  	if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
>  		return;
>  
> @@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
>  			      struct task_struct *next)
>  {
>  	struct scx_sched *sch = scx_root;
> +
> +	/*
> +	 * Pairs with the smp_load_acquire() issued by a CPU in
> +	 * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> +	 * resched.
> +	 */
> +	smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +
>  	update_curr_scx(rq);
>  
>  	/* see dequeue_task_scx() on why we skip when !QUEUED */
> @@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
>  		}
>  
>  		if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
> -			pseqs[cpu] = rq->scx.pnt_seq;
> -			should_wait = true;
> +			if (cur_class == &ext_sched_class) {
> +				pseqs[cpu] = rq->scx.pnt_seq;
> +				should_wait = true;
> +			} else {
> +				cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
> +			}
>  		}
>  
>  		resched_curr(rq);
> @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
>  
>  		if (cpu != cpu_of(this_rq)) {

It's probably fine anyway, but should we check for cpu_online(cpu) here?

>  			/*
> -			 * Pairs with smp_store_release() issued by this CPU in
> -			 * switch_class() on the resched path.
> +			 * Pairs with store_release in put_prev_task_scx().
>  			 *
> -			 * We busy-wait here to guarantee that no other task can
> -			 * be scheduled on our core before the target CPU has
> -			 * entered the resched path.
> +			 * We busy-wait here to guarantee that the task running
> +			 * at the time of kicking is no longer running. This can
> +			 * be used to implement e.g. core scheduling.
>  			 */
>  			while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
>  				cpu_relax();

I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
is refilled concurrently between kick_one_cpu() and this busy wait. In this
case we return, because wait_pnt_seq is incremented, but we keep running
the same task.

Should we introduce a flag (or something similar) to force the re-enqueue
of the prev task in this case?

> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -997,8 +997,10 @@ enum scx_kick_flags {
>  	SCX_KICK_PREEMPT	= 1LLU << 1,
>  
>  	/*
> -	 * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
> -	 * return after the target CPU finishes picking the next task.
> +	 * The scx_bpf_kick_cpu() call will return after the current SCX task of
> +	 * the target CPU switches out. This can be used to implement e.g. core
> +	 * scheduling. This has no effect if the current task on the target CPU
> +	 * is not on SCX.
>  	 */
>  	SCX_KICK_WAIT		= 1LLU << 2,
>  };

Thanks,
-Andrea

next prev parent reply	other threads:[~2025-10-22  7:43 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo
2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo
2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo
2025-10-22  7:43   ` Andrea Righi [this message]
2025-10-22 18:37     ` Tejun Heo
2025-10-22 19:25       ` Andrea Righi
2025-10-22  8:03   ` Peter Zijlstra
2025-10-22 18:38     ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPiLHWVf0Vp1qUzV@gpd4 \
    --to=arighi@nvidia.com \
    --cc=bpf@vger.kernel.org \
    --cc=changwoo@igalia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liuwenfang@honor.com \
    --cc=peterz@infradead.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.