From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
bpf@vger.kernel.org, Tejun Heo <tj@kernel.org>,
Wen-Fang Liu <liuwenfang@honor.com>
Subject: sched_ext: Fix SCX_KICK_WAIT to work reliably
Date: Tue, 21 Oct 2025 11:03:54 -1000 [thread overview]
Message-ID: <20251021210354.89570-3-tj@kernel.org> (raw)
In-Reply-To: <20251021210354.89570-1-tj@kernel.org>
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.
This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.
However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.
This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and refines the semantics: If the
current task on the target CPU is SCX, SCX_KICK_WAIT waits until that task
switches out. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 31 ++++++++++++++++++-------------
kernel/sched/ext_internal.h | 6 ++++--
2 files changed, 22 insertions(+), 15 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2260,12 +2260,6 @@ static void switch_class(struct rq *rq,
struct scx_sched *sch = scx_root;
const struct sched_class *next_class = next->sched_class;
- /*
- * Pairs with the smp_load_acquire() issued by a CPU in
- * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
- * resched.
- */
- smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
if (!(sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT))
return;
@@ -2305,6 +2299,14 @@ static void put_prev_task_scx(struct rq
struct task_struct *next)
{
struct scx_sched *sch = scx_root;
+
+ /*
+ * Pairs with the smp_load_acquire() issued by a CPU in
+ * kick_cpus_irq_workfn() who is waiting for this CPU to perform a
+ * resched.
+ */
+ smp_store_release(&rq->scx.pnt_seq, rq->scx.pnt_seq + 1);
+
update_curr_scx(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
@@ -5144,8 +5146,12 @@ static bool kick_one_cpu(s32 cpu, struct
}
if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
- pseqs[cpu] = rq->scx.pnt_seq;
- should_wait = true;
+ if (cur_class == &ext_sched_class) {
+ pseqs[cpu] = rq->scx.pnt_seq;
+ should_wait = true;
+ } else {
+ cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+ }
}
resched_curr(rq);
@@ -5208,12 +5214,11 @@ static void kick_cpus_irq_workfn(struct
if (cpu != cpu_of(this_rq)) {
/*
- * Pairs with smp_store_release() issued by this CPU in
- * switch_class() on the resched path.
+ * Pairs with store_release in put_prev_task_scx().
*
- * We busy-wait here to guarantee that no other task can
- * be scheduled on our core before the target CPU has
- * entered the resched path.
+ * We busy-wait here to guarantee that the task running
+ * at the time of kicking is no longer running. This can
+ * be used to implement e.g. core scheduling.
*/
while (smp_load_acquire(wait_pnt_seq) == pseqs[cpu])
cpu_relax();
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -997,8 +997,10 @@ enum scx_kick_flags {
SCX_KICK_PREEMPT = 1LLU << 1,
/*
- * Wait for the CPU to be rescheduled. The scx_bpf_kick_cpu() call will
- * return after the target CPU finishes picking the next task.
+ * The scx_bpf_kick_cpu() call will return after the current SCX task of
+ * the target CPU switches out. This can be used to implement e.g. core
+ * scheduling. This has no effect if the current task on the target CPU
+ * is not on SCX.
*/
SCX_KICK_WAIT = 1LLU << 2,
};
next prev parent reply other threads:[~2025-10-21 21:03 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-21 21:03 [PATCHSET sched_ext/for-6.19] sched_ext: Fix SCX_KICK_WAIT reliability Tejun Heo
2025-10-21 21:03 ` sched_ext: Don't kick CPUs running higher classes Tejun Heo
2025-10-21 21:03 ` Tejun Heo [this message]
2025-10-22 7:43 ` sched_ext: Fix SCX_KICK_WAIT to work reliably Andrea Righi
2025-10-22 18:37 ` Tejun Heo
2025-10-22 19:25 ` Andrea Righi
2025-10-22 8:03 ` Peter Zijlstra
2025-10-22 18:38 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251021210354.89570-3-tj@kernel.org \
--to=tj@kernel.org \
--cc=arighi@nvidia.com \
--cc=bpf@vger.kernel.org \
--cc=changwoo@igalia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=liuwenfang@honor.com \
--cc=peterz@infradead.org \
--cc=sched-ext@lists.linux.dev \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.