* [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability @ 2025-10-21 21:03 Tejun Heo 2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo 2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo 0 siblings, 2 replies; 8+ messages in thread From: Tejun Heo @ 2025-10-21 21:03 UTC (permalink / raw) To: David Vernet, Andrea Righi, Changwoo Min Cc: Peter Zijlstra, linux-kernel, sched-ext, bpf SCX_KICK_WAIT is used to synchronously wait for a target CPU to complete a reschedule, which is needed for implementing operations like core scheduling. This broke when scx_next_task_picked() was replaced with switch_class() in commit b999e365c298, because the sequence counter increment that SCX_KICK_WAIT depends on was no longer reliably called. This patchset fixes the regression by moving the sequence counter update to put_prev_task_scx() and refining the semantics to work correctly with the updated scheduler structure. The first patch adds a prerequisite check to skip kicking CPUs running higher sched classes, which SCX has no control over. Based on sched_ext/for-6.19 (2dbbdeda77a6). 0001 sched_ext: Don't kick CPUs running higher classes 0002 sched_ext: Fix SCX_KICK_WAIT to work reliably Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-fix-kick_wait kernel/sched/ext.c | 42 ++++++++++++++++++++++++++---------------- kernel/sched/ext_internal.h | 6 ++++-- 2 files changed, 30 insertions(+), 18 deletions(-) -- tejun ^ permalink raw reply [flat|nested] 8+ messages in thread
* sched_ext: Don't kick CPUs running higher classes 2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo @ 2025-10-21 21:03 ` Tejun Heo 2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo 1 sibling, 0 replies; 8+ messages in thread From: Tejun Heo @ 2025-10-21 21:03 UTC (permalink / raw) To: David Vernet, Andrea Righi, Changwoo Min Cc: Peter Zijlstra, linux-kernel, sched-ext, bpf, Tejun Heo When a sched_ext scheduler tries to kick a CPU, the CPU may be running a higher class task. sched_ext has no control over such CPUs. A sched_ext scheduler couldn't have expected to get access to the CPU after kicking it anyway. Skip kicking when the target CPU is running a higher class. Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/sched/ext.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -5122,18 +5122,23 @@ static bool kick_one_cpu(s32 cpu, struct { struct rq *rq = cpu_rq(cpu); struct scx_rq *this_scx = &this_rq->scx; + const struct sched_class *cur_class; bool should_wait = false; unsigned long flags; raw_spin_rq_lock_irqsave(rq, flags); + cur_class = rq->curr->sched_class; /* * During CPU hotplug, a CPU may depend on kicking itself to make - * forward progress. Allow kicking self regardless of online state. + * forward progress. Allow kicking self regardless of online state. If + * @cpu is running a higher class task, we have no control over @cpu. + * Skip kicking. */ - if (cpu_online(cpu) || cpu == cpu_of(this_rq)) { + if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) && + !sched_class_above(cur_class, &ext_sched_class)) { if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) { - if (rq->curr->sched_class == &ext_sched_class) + if (cur_class == &ext_sched_class) rq->curr->scx.slice = 0; cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); } ^ permalink raw reply [flat|nested] 8+ messages in thread
* sched_ext: Fix SCX_KICK_WAIT to work reliably 2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo 2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo @ 2025-10-21 21:03 ` Tejun Heo 2025-10-22 7:43 ` Andrea Righi 2025-10-22 8:03 ` Peter Zijlstra 1 sibling, 2 replies; 8+ messages in thread From: Tejun Heo @ 2025-10-21 21:03 UTC (permalink / raw) To: David Vernet, Andrea Righi, Changwoo Min Cc: Peter Zijlstra, linux-kernel, sched-ext, bpf, Tejun Heo, Wen-Fang Liu SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete a reschedule and can be used to implement operations like core scheduling. This used to be implemented by scx_next_task_picked() incrementing pnt_seq, which was always called when a CPU picks the next task to run, allowing SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and pick the next task. However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked() with switch_class()") replaced scx_next_task_picked() with the switch_class() callback, which is only called when switching between sched classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be reliably incremented unless the previous task was SCX and the next task was not. This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into pick_task_scx()") which refactored the pick path making put_prev_task_scx() the natural place to track task switches for SCX_KICK_WAIT. The fix moves pnt_seq increment to put_prev_task_scx() and refines the semantics: If the current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task switches out. This provides sufficient guarantee for use cases like core scheduling while keeping the operation self-contained within SCX. Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/sched/ext.c | 31 ++++++++++++++++++------------- kernel/sched/ext_internal.h | 6 ++++-- 2 files changed, 22 insertions(+), 15 deletions(-) --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq, struct scx_sched *sch = scx_root; const struct sched_class *next_class = next->sched_class; - /* - * Pairs with the smp_load_acquire() issued by a CPU in - * kick_cpus_irq_workfn() who is waiting for this CPU to perform a - * resched. - */ - smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT)) return; @@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq struct task_struct *next) { struct scx_sched *sch = scx_root; + + /* + * Pairs with the smp_load_acquire() issued by a CPU in + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a + * resched. + */ + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); + update_curr_scx(rq); /* see dequeue_task_scx() on why we skip when !QUEUED */ @@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct } if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) { - pseqs[cpu] = rq->scx.pnt_seq; - should_wait = true; + if (cur_class == &ext_sched_class) { + pseqs[cpu] = rq->scx.pnt_seq; + should_wait = true; + } else { + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); + } } resched_curr(rq); @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct if (cpu != cpu_of(this_rq)) { /* - * Pairs with smp_store_release() issued by this CPU in - * switch_class() on the resched path. + * Pairs with store_release in put_prev_task_scx(). * - * We busy-wait here to guarantee that no other task can - * be scheduled on our core before the target CPU has - * entered the resched path. + * We busy-wait here to guarantee that the task running + * at the time of kicking is no longer running. This can + * be used to implement e.g. core scheduling. */ while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) cpu_relax(); --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -997,8 +997,10 @@ enum scx_kick_flags { SCX_KICK_PREEMPT = 1LLU << 1, /* - * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will - * return after the target CPU finishes picking the next task. + * The scx_bpf_kick_cpu() call will return after the current SCX task of + * the target CPU switches out. This can be used to implement e.g. core + * scheduling. This has no effect if the current task on the target CPU + * is not on SCX. */ SCX_KICK_WAIT = 1LLU << 2, }; ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably 2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo @ 2025-10-22 7:43 ` Andrea Righi 2025-10-22 18:37 ` Tejun Heo 2025-10-22 8:03 ` Peter Zijlstra 1 sibling, 1 reply; 8+ messages in thread From: Andrea Righi @ 2025-10-22 7:43 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Peter Zijlstra, linux-kernel, sched-ext, bpf, Wen-Fang Liu Hi Tejun, On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote: > SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete > a reschedule and can be used to implement operations like core scheduling. > > This used to be implemented by scx_next_task_picked() incrementing pnt_seq, > which was always called when a CPU picks the next task to run, allowing > SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and > pick the next task. > > However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked() > with switch_class()") replaced scx_next_task_picked() with the > switch_class() callback, which is only called when switching between sched > classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be > reliably incremented unless the previous task was SCX and the next task was > not. > > This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into > pick_task_scx()") which refactored the pick path making put_prev_task_scx() > the natural place to track task switches for SCX_KICK_WAIT. The fix moves > pnt_seq increment to put_prev_task_scx() and refines the semantics: If the > current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task > switches out. This provides sufficient guarantee for use cases like core > scheduling while keeping the operation self-contained within SCX. > > Reported-by: Wen-Fang Liu <liuwenfang@honor.com> > Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com > Cc: Peter Zijlstra <peterz@infradead.org> > Signed-off-by: Tejun Heo <tj@kernel.org> > --- > kernel/sched/ext.c | 31 ++++++++++++++++++------------- > kernel/sched/ext_internal.h | 6 ++++-- > 2 files changed, 22 insertions(+), 15 deletions(-) > > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq, > struct scx_sched *sch = scx_root; > const struct sched_class *next_class = next->sched_class; > > - /* > - * Pairs with the smp_load_acquire() issued by a CPU in > - * kick_cpus_irq_workfn() who is waiting for this CPU to perform a > - * resched. > - */ > - smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); > if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT)) > return; > > @@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq > struct task_struct *next) > { > struct scx_sched *sch = scx_root; > + > + /* > + * Pairs with the smp_load_acquire() issued by a CPU in > + * kick_cpus_irq_workfn() who is waiting for this CPU to perform a > + * resched. > + */ > + smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1); > + > update_curr_scx(rq); > > /* see dequeue_task_scx() on why we skip when !QUEUED */ > @@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct > } > > if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) { > - pseqs[cpu] = rq->scx.pnt_seq; > - should_wait = true; > + if (cur_class == &ext_sched_class) { > + pseqs[cpu] = rq->scx.pnt_seq; > + should_wait = true; > + } else { > + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); > + } > } > > resched_curr(rq); > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct > > if (cpu != cpu_of(this_rq)) { It's probably fine anyway, but should we check for cpu_online(cpu) here? > /* > - * Pairs with smp_store_release() issued by this CPU in > - * switch_class() on the resched path. > + * Pairs with store_release in put_prev_task_scx(). > * > - * We busy-wait here to guarantee that no other task can > - * be scheduled on our core before the target CPU has > - * entered the resched path. > + * We busy-wait here to guarantee that the task running > + * at the time of kicking is no longer running. This can > + * be used to implement e.g. core scheduling. > */ > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) > cpu_relax(); I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice is refilled concurrently between kick_one_cpu() and this busy wait. In this case we return, because wait_pnt_seq is incremented, but we keep running the same task. Should we introduce a flag (or something similar) to force the re-enqueue of the prev task in this case? > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -997,8 +997,10 @@ enum scx_kick_flags { > SCX_KICK_PREEMPT = 1LLU << 1, > > /* > - * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will > - * return after the target CPU finishes picking the next task. > + * The scx_bpf_kick_cpu() call will return after the current SCX task of > + * the target CPU switches out. This can be used to implement e.g. core > + * scheduling. This has no effect if the current task on the target CPU > + * is not on SCX. > */ > SCX_KICK_WAIT = 1LLU << 2, > }; Thanks, -Andrea ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably 2025-10-22 7:43 ` Andrea Righi @ 2025-10-22 18:37 ` Tejun Heo 2025-10-22 19:25 ` Andrea Righi 0 siblings, 1 reply; 8+ messages in thread From: Tejun Heo @ 2025-10-22 18:37 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Peter Zijlstra, linux-kernel, sched-ext, bpf, Wen-Fang Liu Hello, Andrea. On Wed, Oct 22, 2025 at 09:43:25AM +0200, Andrea Righi wrote: > > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct > > > > if (cpu != cpu_of(this_rq)) { > > It's probably fine anyway, but should we check for cpu_online(cpu) here? This block gets activated iff kick_one_cpu() returns true and that is gated by the CPU being online && the current task being on SCX. For the CPU to go offline, that task has to go off CPU and thus increment the sequence counter. > > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) > > cpu_relax(); > > I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice > is refilled concurrently between kick_one_cpu() and this busy wait. In this > case we return, because wait_pnt_seq is incremented, but we keep running > the same task. > > Should we introduce a flag (or something similar) to force the re-enqueue > of the prev task in this case? Ah, right, that's a hole. There's another hole. The BPF scheduler can choose to run the same task and put_prev_task_scx() won't be called. I think we need to bump the seq count on entry to pick_task_scx() too. That should solve both problems. All that we're guaranteeing is that we wait until the task enters scheduling path. If a higher class task gets picked, put_prev_task_scx() will be called. Otherwise, we break the wait when pick_task_scx() is entered. Thanks. -- tejun ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably 2025-10-22 18:37 ` Tejun Heo @ 2025-10-22 19:25 ` Andrea Righi 0 siblings, 0 replies; 8+ messages in thread From: Andrea Righi @ 2025-10-22 19:25 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Peter Zijlstra, linux-kernel, sched-ext, bpf, Wen-Fang Liu On Wed, Oct 22, 2025 at 08:37:50AM -1000, Tejun Heo wrote: > Hello, Andrea. > > On Wed, Oct 22, 2025 at 09:43:25AM +0200, Andrea Righi wrote: > > > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct > > > > > > if (cpu != cpu_of(this_rq)) { > > > > It's probably fine anyway, but should we check for cpu_online(cpu) here? > > This block gets activated iff kick_one_cpu() returns true and that is gated > by the CPU being online && the current task being on SCX. For the CPU to go > offline, that task has to go off CPU and thus increment the sequence > counter. I was thinking if the CPU goes offline after kick_one_cpu() returns and before reaching this loop, but even in this case we're not accessing anything unsafe, so we should be fine. > > > > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) > > > cpu_relax(); > > > > I'm wondering if we can break the semantic if cpu_rq(cpu)->curr->scx.slice > > is refilled concurrently between kick_one_cpu() and this busy wait. In this > > case we return, because wait_pnt_seq is incremented, but we keep running > > the same task. > > > > Should we introduce a flag (or something similar) to force the re-enqueue > > of the prev task in this case? > > Ah, right, that's a hole. There's another hole. The BPF scheduler can choose > to run the same task and put_prev_task_scx() won't be called. I think we > need to bump the seq count on entry to pick_task_scx() too. That should > solve both problems. All that we're guaranteeing is that we wait until the > task enters scheduling path. If a higher class task gets picked, > put_prev_task_scx() will be called. Otherwise, we break the wait when > pick_task_scx() is entered. Yeah, that sounds reasonable to me. Thanks, -Andrea ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably 2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo 2025-10-22 7:43 ` Andrea Righi @ 2025-10-22 8:03 ` Peter Zijlstra 2025-10-22 18:38 ` Tejun Heo 1 sibling, 1 reply; 8+ messages in thread From: Peter Zijlstra @ 2025-10-22 8:03 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext, bpf, Wen-Fang Liu On Tue, Oct 21, 2025 at 11:03:54AM -1000, Tejun Heo wrote: > @@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct > > if (cpu != cpu_of(this_rq)) { > /* > - * Pairs with smp_store_release() issued by this CPU in > - * switch_class() on the resched path. > + * Pairs with store_release in put_prev_task_scx(). > * > - * We busy-wait here to guarantee that no other task can > - * be scheduled on our core before the target CPU has > - * entered the resched path. > + * We busy-wait here to guarantee that the task running > + * at the time of kicking is no longer running. This can > + * be used to implement e.g. core scheduling. > */ > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) > cpu_relax(); You could consider using: smp_cond_load_acquire(wait_pnt_seq, VAL !+ pseqs[cpu]); that's the fancy way of doing a spin wait and allows architectures to optimize (mostly arm64 at this point). ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: sched_ext: Fix SCX_KICK_WAIT to work reliably 2025-10-22 8:03 ` Peter Zijlstra @ 2025-10-22 18:38 ` Tejun Heo 0 siblings, 0 replies; 8+ messages in thread From: Tejun Heo @ 2025-10-22 18:38 UTC (permalink / raw) To: Peter Zijlstra Cc: David Vernet, Andrea Righi, Changwoo Min, linux-kernel, sched-ext, bpf, Wen-Fang Liu On Wed, Oct 22, 2025 at 10:03:46AM +0200, Peter Zijlstra wrote: > > while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu]) > > cpu_relax(); > > You could consider using: > > smp_cond_load_acquire(wait_pnt_seq, VAL !+ pseqs[cpu]); > > that's the fancy way of doing a spin wait and allows architectures to > optimize (mostly arm64 at this point). Will do. Thanks. -- tejun ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-10-22 19:26 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo 2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo 2025-10-21 21:03 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Tejun Heo 2025-10-22 7:43 ` Andrea Righi 2025-10-22 18:37 ` Tejun Heo 2025-10-22 19:25 ` Andrea Righi 2025-10-22 8:03 ` Peter Zijlstra 2025-10-22 18:38 ` Tejun Heo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox