* [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability
@ 2025-10-21 21:03 Tejun Heo
2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo
2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo
0 siblings, 2 replies; 8+ messages in thread
From: Tejun Heo @ 2025-10-21 21:03 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Peter Zijlstra, linux-kernel, sched-ext, bpf
SCX_KICK_WAIT is used to synchronously wait for a target CPU to complete a
reschedule, which is needed for implementing operations like core scheduling.
This broke when scx_next_task_picked() was replaced with switch_class() in
commit b999e365c298, because the sequence counter increment that SCX_KICK_WAIT
depends on was no longer reliably called.
This patchset fixes the regression by moving the sequence counter update to
put_prev_task_scx() and refining the semantics to work correctly with the
updated scheduler structure. The first patch adds a prerequisite check to
skip kicking CPUs running higher sched classes, which SCX has no control over.
Based on sched_ext/for-6.19 (2dbbdeda77a6).
0001 sched_ext: Don't kick CPUs running higher classes
0002 sched_ext: Fix SCX_KICK_WAIT to work reliably
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-fix-kick_wait
kernel/sched/ext.c | 42 ++++++++++++++++++++++++++----------------
kernel/sched/ext_internal.h | 6 ++++--
2 files changed, 30 insertions(+), 18 deletions(-)
--
tejun
^ permalink raw reply [flat|nested] 8+ messages in thread
* sched_ext: Don't kick CPUs running higher classes
2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo
@ 2025-10-21 21:03 ` Tejun Heo
2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo
1 sibling, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2025-10-21 21:03 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Peter Zijlstra, linux-kernel, sched-ext, bpf, Tejun Heo
When a sched_ext scheduler tries to kick a CPU, the CPU may be running a
higher class task. sched_ext has no control over such CPUs. A sched_ext
scheduler couldn't have expected to get access to the CPU after kicking it
anyway. Skip kicking when the target CPU is running a higher class.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5122,18 +5122,23 @@ static bool kick_one_cpu(s32 cpu, struct
{
struct rq *rq = cpu_rq(cpu);
struct scx_rq *this_scx = &this_rq->scx;
+ const struct sched_class *cur_class;
bool should_wait = false;
unsigned long flags;
raw_spin_rq_lock_irqsave(rq, flags);
+ cur_class = rq->curr->sched_class;
/*
* During CPU hotplug, a CPU may depend on kicking itself to make
- * forward progress. Allow kicking self regardless of online state.
+ * forward progress. Allow kicking self regardless of online state. If
+ * @cpu is running a higher class task, we have no control over @cpu.
+ * Skip kicking.
*/
- if (cpu_online(cpu) || cpu == cpu_of(this_rq)) {
+ if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
+ !sched_class_above(cur_class, &ext_sched_class)) {
if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
- if (rq->curr->sched_class == &ext_sched_class)
+ if (cur_class == &ext_sched_class)
rq->curr->scx.slice = 0;
cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
}
^ permalink raw reply [flat|nested] 8+ messages in thread
* sched_ext: Fix SCX_KICK_WAIT to work reliably
2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo
2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo
@ 2025-10-21 21:03 ` Tejun Heo
2025-10-22 7:43 ` Andrea Righi
2025-10-22 8:03 ` Peter Zijlstra
1 sibling, 2 replies; 8+ messages in thread
From: Tejun Heo @ 2025-10-21 21:03 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: Peter Zijlstra, linux-kernel, sched-ext, bpf, Tejun Heo,
Wen-Fang Liu
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.
This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.
However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.
This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and refines the semantics: If the
current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task
switches out. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 31 ++++++++++++++++++-------------
kernel/sched/ext_internal.h | 6 ++++--
2 files changed, 22 insertions(+), 15 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
struct scx_sched *sch = scx_root;
const struct sched_class *next_class = next->sched_class;
- /*
- * Pairs with the smp_load_acquire() issued by a CPU in
- * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
- * resched.
- */
- smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
return;
@@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
struct task_struct *next)
{
struct scx_sched *sch = scx_root;
+
+ /*
+ * Pairs with the smp_load_acquire() issued by a CPU in
+ * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+ * resched.
+ */
+ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
update_curr_scx(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
@@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
}
if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
- pseqs[cpu] = rq->scx.pnt_seq;
- should_wait = true;
+ if (cur_class == &ext_sched_class) {
+ pseqs[cpu] = rq->scx.pnt_seq;
+ should_wait = true;
+ } else {
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+ }
}
resched_curr(rq);
@@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
if (cpu != cpu_of(this_rq)) {
/*
- * Pairs with smp_store_release() issued by this CPU in
- * switch_class() on the resched path.
+ * Pairs with store_release in put_prev_task_scx().
*
- * We busy-wait here to guarantee that no other task can
- * be scheduled on our core before the target CPU has
- * entered the resched path.
+ * We busy-wait here to guarantee that the task running
+ * at the time of kicking is no longer running. This can
+ * be used to implement e.g. core scheduling.
*/
while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
cpu_relax();
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -997,8 +997,10 @@ enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 1,
/*
- * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
- * return after the target CPU finishes picking the next task.
+ * The scx_bpf_kick_cpu() call will return after the current SCX task of
+ * the target CPU switches out. This can be used to implement e.g. core
+ * scheduling. This has no effect if the current task on the target CPU
+ * is not on SCX.
*/
SCX_KICK_WAIT = 1LLU << 2,
};
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo
@ 2025-10-22 7:43 ` Andrea Righi
2025-10-22 18:37 ` Tejun Heo
2025-10-22 8:03 ` Peter Zijlstra
1 sibling, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-10-22 7:43 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Peter Zijlstra, linux-kernel,
sched-ext, bpf, Wen-Fang Liu
Hi Tejun,
On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote:
> SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
> a reschedule and can be used to implement operations like core scheduling.
>
> This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
> which was always called when a CPU picks the next task to run, allowing
> SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
> pick the next task.
>
> However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
> with switch_class()") replaced scx_next_task_picked() with the
> switch_class() callback, which is only called when switching between sched
> classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
> reliably incremented unless the previous task was SCX and the next task was
> not.
>
> This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
> pick_task_scx()") which refactored the pick path making put_prev_task_scx()
> the natural place to track task switches for SCX_KICK_WAIT. The fix moves
> pnt_seq increment to put_prev_task_scx() and refines the semantics: If the
> current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task
> switches out. This provides sufficient guarantee for use cases like core
> scheduling while keeping the operation self-contained within SCX.
>
> Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 31 ++++++++++++++++++-------------
> kernel/sched/ext_internal.h | 6 ++++--
> 2 files changed, 22 insertions(+), 15 deletions(-)
>
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
> struct scx_sched *sch = scx_root;
> const struct sched_class *next_class = next->sched_class;
>
> - /*
> - * Pairs with the smp_load_acquire() issued by a CPU in
> - * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> - * resched.
> - */
> - smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
> return;
>
> @@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
> struct task_struct *next)
> {
> struct scx_sched *sch = scx_root;
> +
> + /*
> + * Pairs with the smp_load_acquire() issued by a CPU in
> + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
> + * resched.
> + */
> + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
> +
> update_curr_scx(rq);
>
> /* see dequeue_task_scx() on why we skip when !QUEUED */
> @@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
> }
>
> if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
> - pseqs[cpu] = rq->scx.pnt_seq;
> - should_wait = true;
> + if (cur_class == &ext_sched_class) {
> + pseqs[cpu] = rq->scx.pnt_seq;
> + should_wait = true;
> + } else {
> + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
> + }
> }
>
> resched_curr(rq);
> @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
>
> if (cpu != cpu_of(this_rq)) {
It's probably fine anyway, but should we check for cpu_online(cpu) here?
> /*
> - * Pairs with smp_store_release() issued by this CPU in
> - * switch_class() on the resched path.
> + * Pairs with store_release in put_prev_task_scx().
> *
> - * We busy-wait here to guarantee that no other task can
> - * be scheduled on our core before the target CPU has
> - * entered the resched path.
> + * We busy-wait here to guarantee that the task running
> + * at the time of kicking is no longer running. This can
> + * be used to implement e.g. core scheduling.
> */
> while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> cpu_relax();
I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
is refilled concurrently between kick_one_cpu() and this busy wait. In this
case we return, because wait_pnt_seq is incremented, but we keep running
the same task.
Should we introduce a flag (or something similar) to force the re-enqueue
of the prev task in this case?
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -997,8 +997,10 @@ enum scx_kick_flags {
> SCX_KICK_PREEMPT = 1LLU << 1,
>
> /*
> - * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
> - * return after the target CPU finishes picking the next task.
> + * The scx_bpf_kick_cpu() call will return after the current SCX task of
> + * the target CPU switches out. This can be used to implement e.g. core
> + * scheduling. This has no effect if the current task on the target CPU
> + * is not on SCX.
> */
> SCX_KICK_WAIT = 1LLU << 2,
> };
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo
2025-10-22 7:43 ` Andrea Righi
@ 2025-10-22 8:03 ` Peter Zijlstra
2025-10-22 18:38 ` Tejun Heo
1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2025-10-22 8:03 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
bpf, Wen-Fang Liu
On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote:
> @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
>
> if (cpu != cpu_of(this_rq)) {
> /*
> - * Pairs with smp_store_release() issued by this CPU in
> - * switch_class() on the resched path.
> + * Pairs with store_release in put_prev_task_scx().
> *
> - * We busy-wait here to guarantee that no other task can
> - * be scheduled on our core before the target CPU has
> - * entered the resched path.
> + * We busy-wait here to guarantee that the task running
> + * at the time of kicking is no longer running. This can
> + * be used to implement e.g. core scheduling.
> */
> while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> cpu_relax();
You could consider using:
smp_cond_load_acquire(wait_pnt_seq, VAL !+ pseqs[cpu]);
that's the fancy way of doing a spin wait and allows architectures to
optimize (mostly arm64 at this point).
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
2025-10-22 7:43 ` Andrea Righi
@ 2025-10-22 18:37 ` Tejun Heo
2025-10-22 19:25 ` Andrea Righi
0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-10-22 18:37 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Peter Zijlstra, linux-kernel,
sched-ext, bpf, Wen-Fang Liu
Hello, Andrea.
On Wed, Oct 22, 2025 at 09:43:25AM +0200, Andrea Righi wrote:
> > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
> >
> > if (cpu != cpu_of(this_rq)) {
>
> It's probably fine anyway, but should we check for cpu_online(cpu) here?
This block gets activated iff kick_one_cpu() returns true and that is gated
by the CPU being online && the current task being on SCX. For the CPU to go
offline, that task has to go off CPU and thus increment the sequence
counter.
> > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> > cpu_relax();
>
> I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
> is refilled concurrently between kick_one_cpu() and this busy wait. In this
> case we return, because wait_pnt_seq is incremented, but we keep running
> the same task.
>
> Should we introduce a flag (or something similar) to force the re-enqueue
> of the prev task in this case?
Ah, right, that's a hole. There's another hole. The BPF scheduler can choose
to run the same task and put_prev_task_scx() won't be called. I think we
need to bump the seq count on entry to pick_task_scx() too. That should
solve both problems. All that we're guaranteeing is that we wait until the
task enters scheduling path. If a higher class task gets picked,
put_prev_task_scx() will be called. Otherwise, we break the wait when
pick_task_scx() is entered.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
2025-10-22 8:03 ` Peter Zijlstra
@ 2025-10-22 18:38 ` Tejun Heo
0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2025-10-22 18:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext,
bpf, Wen-Fang Liu
On Wed, Oct 22, 2025 at 10:03:46AM +0200, Peter Zijlstra wrote:
> > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> > cpu_relax();
>
> You could consider using:
>
> smp_cond_load_acquire(wait_pnt_seq, VAL !+ pseqs[cpu]);
>
> that's the fancy way of doing a spin wait and allows architectures to
> optimize (mostly arm64 at this point).
Will do. Thanks.
--
tejun
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably
2025-10-22 18:37 ` Tejun Heo
@ 2025-10-22 19:25 ` Andrea Righi
0 siblings, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2025-10-22 19:25 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Peter Zijlstra, linux-kernel,
sched-ext, bpf, Wen-Fang Liu
On Wed, Oct 22, 2025 at 08:37:50AM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Wed, Oct 22, 2025 at 09:43:25AM +0200, Andrea Righi wrote:
> > > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
> > >
> > > if (cpu != cpu_of(this_rq)) {
> >
> > It's probably fine anyway, but should we check for cpu_online(cpu) here?
>
> This block gets activated iff kick_one_cpu() returns true and that is gated
> by the CPU being online && the current task being on SCX. For the CPU to go
> offline, that task has to go off CPU and thus increment the sequence
> counter.
I was thinking if the CPU goes offline after kick_one_cpu() returns and
before reaching this loop, but even in this case we're not accessing
anything unsafe, so we should be fine.
>
> > > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
> > > cpu_relax();
> >
> > I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice
> > is refilled concurrently between kick_one_cpu() and this busy wait. In this
> > case we return, because wait_pnt_seq is incremented, but we keep running
> > the same task.
> >
> > Should we introduce a flag (or something similar) to force the re-enqueue
> > of the prev task in this case?
>
> Ah, right, that's a hole. There's another hole. The BPF scheduler can choose
> to run the same task and put_prev_task_scx() won't be called. I think we
> need to bump the seq count on entry to pick_task_scx() too. That should
> solve both problems. All that we're guaranteeing is that we wait until the
> task enters scheduling path. If a higher class task gets picked,
> put_prev_task_scx() will be called. Otherwise, we break the wait when
> pick_task_scx() is entered.
Yeah, that sounds reasonable to me.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-10-22 19:26 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo
2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo
2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo
2025-10-22 7:43 ` Andrea Righi
2025-10-22 18:37 ` Tejun Heo
2025-10-22 19:25 ` Andrea Righi
2025-10-22 8:03 ` Peter Zijlstra
2025-10-22 18:38 ` Tejun Heo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox