From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Christian Loehle <christian.loehle@arm.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add missing calls to quiescent(), runnable()
Date: Wed, 8 Apr 2026 16:54:29 +0200 [thread overview]
Message-ID: <adZsJZDFNgtod-Bd@gpd4> (raw)
In-Reply-To: <DHNU5YSWABWC.2B8Q38OKJGVDR@google.com>
Hi Kuba,
On Wed, Apr 08, 2026 at 02:17:03PM +0000, Kuba Piecuch wrote:
> On Wed Apr 8, 2026 at 1:49 PM UTC, Andrea Righi wrote:
> > On Wed, Apr 08, 2026 at 12:40:09PM +0000, Kuba Piecuch wrote:
> >> Hi Andrea,
> >>
> >> On Wed Apr 8, 2026 at 11:28 AM UTC, Andrea Righi wrote:
> >> ...
> >> >
> >> > Looks good, but I noticed another issue, should we also change the condition up
> >> > above as following?
> >> >
> >> > Documentation/scheduler/sched-ext.rst | 2 +-
> >> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> >> > index 29d36e248f58b..99df4cc982375 100644
> >> > --- a/Documentation/scheduler/sched-ext.rst
> >> > +++ b/Documentation/scheduler/sched-ext.rst
> >> > @@ -423,7 +423,7 @@ by a sched_ext scheduler:
> >> > ops.runnable(); /* Task becomes ready to run */
> >> >
> >> > while (task_is_runnable(task)) {
> >> > - if (task is not in a DSQ && task->scx.slice == 0) {
> >> > + if (task is not in a DSQ || task->scx.slice == 0) {
> >> > ops.enqueue(); /* Task can be added to a DSQ */
> >> >
> >> > /* Task property change (i.e., affinity, nice, etc.)? */
> >> >
> >> > Because we trigger ops.enqueue() when the task expired its time slice or it
> >> > becomes runnable and has not been added to a DSQ.
> >> >
> >> > This also represents correctly the sched_change() scenario: a task being
> >> > re-enqueued after sched_change() still has its time slice > 0, but we need to
> >> > call ops.enqueue() for it.
> >>
> >> I agree that the condition should be changed, but I'm not sure that this is
> >> what it should look like.
> >>
> >> Is the "task is not in a DSQ" part of the condition there to handle direct
> >> dispatch? Apart from direct dispatch from ops.select_cpu(), I wasn't able to
> >> come up with a situation where we would reach this condition with the task
> >> present on some DSQ.
> >
> > The intent is to represent the direct dispatch from ops.select_cpu(), since in
> > that case ops.enqueue() is skipped.
> >
> > Honestly I think if we change the && to || in that condition, everything should
> > be pretty accurate.
>
> In the case of direct dispatch from ops.select_cpu() we don't invoke
> ops.dispatch() and ops.dequeue() before ops.running(), right? The current
> pseudocode calls them unconditionally.
We can move ops.dispatch() and ops.dequeue() inside the
if (task is not in a DSQ || task->scx.slice == 0) block.
>
> Another inaccuracy not related to direct dispatch: property changes can occur
> while a task is running, while the psedocode only allows for property changes
> while a task is queued.
Sure... but again, modelling all the possible scenarios would make the
pseudocode completely unreadable.
IMHO it'd be better to give an overview of the most common use cases here and
clarify in the description that the diagram doesn't cover all the possible
scenarios. This one is a special use case that, personally, I wouldn't cover in
the pseudocode.
>
> There's also preemption by a higher sched class, which is not covered in the
> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
> task_is_runnable() to return false if there's a higher-priority sched class
> with runnable tasks on the CPU, though that would be in conflict with the
> actual implementation of task_is_runnable() in include/linux/sched.h.
Ditto.
>
> >
> >>
> >> A more general comment about the pseudocode: I think it can be useful to
> >> introduce someone new to the general flow of the callbacks in sched_ext,
> >> but the documentation should be clear that this is a simplified view that
> >> makes assumptions about the behavior of the BPF scheduler itself (flags like
> >> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
> >> the overall system (Can sched_ext be preempted by a higher-priority sched
> >> class? Can scheduling properties of a task be changed while it's running?)
> >> Without stating these assumptions clearly, we risk leaving the reader falsely
> >> believing they have a complete understanding.
> >
> > Of course this schema is not a complete representation of the entire sched_ext
> > state machine, if we put everything it'd become too big and complex. I think we
> > should just cover the most common use cases here. Maybe we can clarify this in
> > the description before this diagram.
>
> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
> and attach an appropriate disclaimer to the pseudocode.
If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
the pseudocode becomes "fairly" accurate. At least more accurate than what we
have right now. It won't be perfect, but it can help newer sched_ext devs having
an overview the task lifecycle without going too much into implementation
details.
So, to recap, what do you think about this?
ops.init_task(); /* A new task is created */
ops.enable(); /* Enable BPF scheduling for the task */
while (task in SCHED_EXT) {
if (task can migrate)
ops.select_cpu(); /* Called on wakeup (optimization) */
ops.runnable(); /* Task becomes ready to run */
while (task_is_runnable(task)) {
if (task is not in a DSQ || task->scx.slice == 0) {
ops.enqueue(); /* Task can be added to a DSQ */
/* Task property change (i.e., affinity, nice, etc.)? */
if (sched_change(task)) {
ops.dequeue(); /* Exiting BPF scheduler custody */
ops.quiescent();
/* Property change callback, e.g. ops.set_weight() */
ops.runnable();
continue;
}
/* Any usable CPU becomes available */
ops.dispatch(); /* Task is moved to a local DSQ */
ops.dequeue(); /* Exiting BPF scheduler custody */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task_is_runnable(task) && task->scx.slice > 0) {
ops.tick(); /* Called every 1/HZ seconds */
if (task->scx.slice == 0)
ops.dispatch(); /* task->scx.slice can be refilled */
}
ops.stopping(); /* Task stops running (time slice expires or wait) */
}
ops.quiescent(); /* Task releases its assigned CPU (wait) */
}
ops.disable(); /* Disable BPF scheduling for the task */
ops.exit_task(); /* Task is destroyed */
Thanks,
-Andrea
prev parent reply other threads:[~2026-04-08 14:54 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-06 11:47 [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add ops.dequeue() to task lifecycle Andrea Righi
2026-04-06 14:49 ` Emil Tsalapatis
2026-04-06 19:08 ` Andrea Righi
2026-04-06 18:09 ` Tejun Heo
2026-04-07 9:54 ` Kuba Piecuch
2026-04-07 16:31 ` Andrea Righi
2026-04-08 9:18 ` [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add missing calls to quiescent(), runnable() Kuba Piecuch
2026-04-08 11:28 ` Andrea Righi
2026-04-08 12:40 ` Kuba Piecuch
2026-04-08 13:49 ` Andrea Righi
2026-04-08 14:17 ` Kuba Piecuch
2026-04-08 14:54 ` Andrea Righi [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=adZsJZDFNgtod-Bd@gpd4 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=christian.loehle@arm.com \
--cc=emil@etsalapatis.com \
--cc=jpiecuch@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox