From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
bpf@vger.kernel.org, Tejun Heo <tj@kernel.org>,
Wen-Fang Liu <liuwenfang@honor.com>
Subject: sched_ext: Fix SCX_KICK_WAIT to work reliably
Date: Tue, 21 Oct 2025 11:03:54 -1000 [thread overview]
Message-ID: <20251021210354.89570-3-tj@kernel.org> (raw)
In-Reply-To: <20251021210354.89570-1-tj@kernel.org>
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.
This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.
However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.
This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and refines the semantics: If the
current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task
switches out. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 31 ++++++++++++++++++-------------
kernel/sched/ext_internal.h | 6 ++++--
2 files changed, 22 insertions(+), 15 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
struct scx_sched *sch = scx_root;
const struct sched_class *next_class = next->sched_class;
- /*
- * Pairs with the smp_load_acquire() issued by a CPU in
- * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
- * resched.
- */
- smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
return;
@@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
struct task_struct *next)
{
struct scx_sched *sch = scx_root;
+
+ /*
+ * Pairs with the smp_load_acquire() issued by a CPU in
+ * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+ * resched.
+ */
+ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
update_curr_scx(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
@@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
}
if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
- pseqs[cpu] = rq->scx.pnt_seq;
- should_wait = true;
+ if (cur_class == &ext_sched_class) {
+ pseqs[cpu] = rq->scx.pnt_seq;
+ should_wait = true;
+ } else {
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+ }
}
resched_curr(rq);
@@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
if (cpu != cpu_of(this_rq)) {
/*
- * Pairs with smp_store_release() issued by this CPU in
- * switch_class() on the resched path.
+ * Pairs with store_release in put_prev_task_scx().
*
- * We busy-wait here to guarantee that no other task can
- * be scheduled on our core before the target CPU has
- * entered the resched path.
+ * We busy-wait here to guarantee that the task running
+ * at the time of kicking is no longer running. This can
+ * be used to implement e.g. core scheduling.
*/
while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
cpu_relax();
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -997,8 +997,10 @@ enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 1,
/*
- * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
- * return after the target CPU finishes picking the next task.
+ * The scx_bpf_kick_cpu() call will return after the current SCX task of
+ * the target CPU switches out. This can be used to implement e.g. core
+ * scheduling. This has no effect if the current task on the target CPU
+ * is not on SCX.
*/
SCX_KICK_WAIT = 1LLU << 2,
};
next prev parent reply other threads:[~2025-10-21 21:03 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo
2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo
2025-10-21 21:03 ` Tejun Heo [this message]
2025-10-22 7:43 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Andrea Righi
2025-10-22 18:37 ` Tejun Heo
2025-10-22 19:25 ` Andrea Righi
2025-10-22 8:03 ` Peter Zijlstra
2025-10-22 18:38 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251021210354.89570-3-tj@kernel.org \
--to=tj@kernel.org \
--cc=arighi@nvidia.com \
--cc=bpf@vger.kernel.org \
--cc=changwoo@igalia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=liuwenfang@honor.com \
--cc=peterz@infradead.org \
--cc=sched-ext@lists.linux.dev \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox