From: Kuba Piecuch <jpiecuch@google.com>
To: Tejun Heo <tj@kernel.org>, Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>,
David Vernet <void@manifault.com>
Cc: <linux-kernel@vger.kernel.org>, <sched-ext@lists.linux.dev>
Subject: SCX_ENQ_IMMED potentially leaving dispatched tasks lingering on local DSQs
Date: Wed, 22 Apr 2026 13:21:27 +0000
Message-ID: <DHZPHUFXB4N3.2RY28MUEWBNYK@google.com>
Hi folks,

I recently saw that scx_qmap got rid of the sched_switch tracepoint hook,
claiming that SCX_OPS_ALWAYS_ENQ_IMMED is sufficient to keep tasks from
lingering on local DSQs.

This prompted me to think about some possible edge cases, and I think we
can end up with lingering tasks on the local DSQ in the following scenario:

Initial conditions: rq->curr == rq->idle &&
                    rq->next_class == &idle_sched_class

1. We enter schedule() for whatever reason, e.g. BPF scheduler kick from
   another CPU.
2. In __pick_next_task(), all sched classes above SCX fail to pick a task.
   We still have rq->next_class == &idle_sched_class.
3. We enter do_pick_task_scx(). rq_modified_begin() does nothing because
   sched_class_above(rq->next_class, &ext_sched_class) is false.
4. ops.dispatch() dispatches two tasks. The first one goes to the local DSQ,
   and the second one goes to a remote CPU's local DSQ. The first task is
   dispatched without interference.
5. During dispatch of the second task, while the local CPU's rq lock is
   dropped during insertion into the remote CPU's local DSQ, an RT task wakes
   up on the local CPU. Since rq->next_class is still idle, wakeup_preempt()
   calls wakeup_preempt_idle() which calls resched_curr(rq). This effectively
   does nothing since need_resched is cleared in __schedule() after pick.
   rq->next_class is set to &rt_sched_class.
6. At the end of balance_one(), we don't trigger a reenqueue because the local
   DSQ has only one task.
7. do_pick_task_scx() notices rq_modified_above(rq, &ext_sched_class) and
   returns RETRY_TASK.
8. The RT task ends up being picked and runs. SCX is not notified of the
   switch because we're switching from the idle task to an RT task.

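As a rough illustration, steps 3-8 above can be replayed in a small user-space
C model. This is only a sketch and not kernel code: every toy_* name is made up
for the example, sched classes are reduced to an ordered enum, and two rules
are assumptions read off the steps above (the end-of-balance reenqueue fires
only with more than one queued task, and SCX hears about a switch only when an
SCX task was running).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for sched classes; a larger value means higher priority. */
enum toy_class { TOY_IDLE, TOY_EXT, TOY_RT };

/* Hypothetical, heavily simplified view of one CPU's runqueue. */
struct toy_rq {
	enum toy_class curr;       /* class of the currently running task */
	enum toy_class next_class; /* highest class noted since the last pick */
	int local_dsq_len;         /* tasks sitting on this CPU's local DSQ */
	bool reenq_requested;      /* did anything ask SCX to reenqueue? */
};

/*
 * Replays steps 3-8 for a CPU whose current task has class @initial and
 * returns the number of tasks left lingering on the local DSQ.
 */
static int toy_lingering_tasks(enum toy_class initial)
{
	struct toy_rq rq = { .curr = initial, .next_class = initial };

	/* Step 3: the "begin" marker only acts when next_class is above EXT. */
	if (rq.next_class > TOY_EXT)
		rq.next_class = TOY_EXT;

	/* Step 4: the first dispatched task lands on the local DSQ. */
	rq.local_dsq_len++;

	/* Step 5: an RT task wakes up while the rq lock is dropped. */
	if (TOY_RT > rq.next_class)
		rq.next_class = TOY_RT;

	/* Step 6 (assumed rule): end-of-balance reenqueue needs > 1 task. */
	if (rq.local_dsq_len > 1)
		rq.reenq_requested = true;

	/* Step 7: next_class is above EXT, so the SCX pick gives up. */
	assert(rq.next_class > TOY_EXT);

	/* Step 8 (assumed rule): SCX is told only if an SCX task was running. */
	if (rq.curr == TOY_EXT)
		rq.reenq_requested = true;
	rq.curr = TOY_RT;

	return rq.reenq_requested ? 0 : rq.local_dsq_len;
}
```

Calling toy_lingering_tasks(TOY_IDLE) models the idle starting point described
above; passing TOY_EXT instead models a CPU that was running an SCX task, where
the switch to RT would give SCX a chance to reenqueue.
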
If my understanding is correct and I didn't miss anything important, then
at no point does SCX reenqueue the first task, even though it should.

This particular scenario may not apply to scx_qmap, but I think it proves that
it's possible to have dispatched tasks lingering on the local DSQ even with
SCX_OPS_ALWAYS_ENQ_IMMED.

I was thinking we could fix this by adding a nr_immed check right before
returning RETRY_TASK:

diff --git i/kernel/sched/ext.c w/kernel/sched/ext.c
index d66fea57ee69..480627fdc203 100644
--- i/kernel/sched/ext.c
+++ w/kernel/sched/ext.c
@@ -3079,8 +3079,11 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
 	 * If @force_scx is true, always try to pick a SCHED_EXT task,
 	 * regardless of any higher-priority sched classes activity.
 	 */
-	if (!force_scx && rq_modified_above(rq, &ext_sched_class))
+	if (!force_scx && rq_modified_above(rq, &ext_sched_class)) {
+		if (rq->scx.nr_immed)
+			schedule_reenq_local(rq, 0);
 		return RETRY_TASK;
+	}
 
 	keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
 	if (unlikely(keep_prev &&

...but I think this only fixes the case where the RT task wakes up on the CPU
that is doing the dispatch. The other case is one where the RT task wakes up
on the remote CPU (the one the second task was dispatched to) after insertion
of the second task, assuming the remote CPU is initially idle.

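To make the two cases concrete, here is a minimal user-space sketch of a single
pick pass with the check applied. It is a toy under stated assumptions, not the
kernel logic: toy_patched_pick() and its parameters are hypothetical, nr_immed
is assumed to count tasks sitting on the CPU's local DSQ, and the remote case
is assumed to be missed because an RT task that is already runnable gets picked
before the SCX pick path (and hence the new check) is ever reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for sched classes; a larger value means higher priority. */
enum toy_class { TOY_IDLE, TOY_EXT, TOY_RT };

/*
 * Toy version of one pick pass with the proposed check applied.
 * @nr_immed: tasks already on this CPU's local DSQ (assumed meaning)
 * @rt_runnable_at_entry: an RT task woke up before this pass started
 * @rt_wakes_during_dispatch: an RT task wakes while the rq lock is dropped
 * Returns true if the patched check schedules a reenqueue.
 */
static bool toy_patched_pick(int nr_immed, bool rt_runnable_at_entry,
			     bool rt_wakes_during_dispatch)
{
	enum toy_class next_class = rt_runnable_at_entry ? TOY_RT : TOY_IDLE;

	/* Assumption: classes above SCX pick first, so SCX is skipped here. */
	if (rt_runnable_at_entry)
		return false;

	/* SCX pick path: the dispatch puts one task on the local DSQ. */
	nr_immed++;
	if (rt_wakes_during_dispatch)
		next_class = TOY_RT;

	/* Patched check: request a reenqueue before bailing with RETRY_TASK. */
	return next_class > TOY_EXT && nr_immed > 0;
}
```

In this model, toy_patched_pick(0, false, true) stands for the dispatching CPU
and toy_patched_pick(1, true, false) for the remote CPU that received the
second task and then saw an RT wakeup before it got to pick.
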
To fix both cases, one potential solution that comes to mind is bumping
rq->next_class to &ext_sched_class when inserting a task into rq->scx.local_dsq.
Perhaps we should call wakeup_preempt() in dispatch_to_local_dsq()?

Let me know what you think!

Thanks,
Kuba