From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Christian Loehle <christian.loehle@arm.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add missing calls to quiescent(), runnable()
Date: Thu, 9 Apr 2026 11:38:01 +0200 [thread overview]
Message-ID: <addzeX8Le_533Lgo@gpd4> (raw)
In-Reply-To: <DHOHR34ZLGL6.2UIL2GRFLR53X@google.com>
On Thu, Apr 09, 2026 at 08:46:03AM +0000, Kuba Piecuch wrote:
> On Wed Apr 8, 2026 at 2:54 PM UTC, Andrea Righi wrote:
> ...
> >>
> >> Another inaccuracy not related to direct dispatch: property changes can occur
> >> while a task is running, while the psedocode only allows for property changes
> >> while a task is queued.
> >
> > Sure... but again, modelling all the possible scenarios would make the
> > pseudocode completely unreadable.
>
> I'm not arguing we should cover all scenarios.
>
> I'm ok with omitting scenarios whose existence depends on a configuration flag
> or presence/absence of a callback, because:
>
> a) Using the right configuration, one can actually write a scheduler where the
> pseudocode is an accurate representation of the task lifecycle;
>
> b) The assumptions about the configuration can be clearly stated next to the
> pseudocode.
>
> I'm less ok with omitting specific scenarios that can't be simply "turned off"
> because they are triggered by the scheduled tasks themselves. A task's property
> being changed while it's running is one example of such a scenario -- one can't
> just prevent it from happening by setting a configuration flag, and sched_ext
> schedulers implementing dequeue/quiescent/runnable/enqueue should be aware of
> it.
>
> What I especially don't like is giving the reader a partial picture that looks
> like a complete one, as is the case with property changes here. We're letting
> the reader know that it can happen, but the pseudocode makes it look like it
> can only happen while a task is queued and not while it's running, giving the
> reader a false impression that they can assume property changes apply only to
> queued tasks.
I agree on that, but I think the goal of this pseudocode is to find a reasonable
compromise between readability and accuracy. If such comprosmise doesn't exist
or if we're concerned that it'd introduce more confusion than benefits for the
users, then we can also consider removing it.
>
> >
> > IMHO it'd be better to give an overview of the most common use cases here and
> > clarify in the description that the diagram doesn't cover all the possible
> > scenarios. This one is a special use case that, personally, I wouldn't cover in
> > the pseudocode.
> >
> >>
> >> There's also preemption by a higher sched class, which is not covered in the
> >> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
> >> task_is_runnable() to return false if there's a higher-priority sched class
> >> with runnable tasks on the CPU, though that would be in conflict with the
> >> actual implementation of task_is_runnable() in include/linux/sched.h.
> >
> > Ditto.
> >
> >>
> >> >
> >> >>
> >> >> A more general comment about the pseudocode: I think it can be useful to
> >> >> introduce someone new to the general flow of the callbacks in sched_ext,
> >> >> but the documentation should be clear that this is a simplified view that
> >> >> makes assumptions about the behavior of the BPF scheduler itself (flags like
> >> >> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
> >> >> the overall system (Can sched_ext be preempted by a higher-priority sched
> >> >> class? Can scheduling properties of a task be changed while it's running?)
> >> >> Without stating these assumptions clearly, we risk leaving the reader falsely
> >> >> believing they have a complete understanding.
> >> >
> >> > Of course this schema is not a complete representation of the entire sched_ext
> >> > state machine, if we put everything it'd become too big and complex. I think we
> >> > should just cover the most common use cases here. Maybe we can clarify this in
> >> > the description before this diagram.
> >>
> >> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
> >> and attach an appropriate disclaimer to the pseudocode.
> >
> > If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
> > the pseudocode becomes "fairly" accurate. At least more accurate than what we
> > have right now. It won't be perfect, but it can help newer sched_ext devs having
> > an overview the task lifecycle without going too much into implementation
> > details.
> >
> > So, to recap, what do you think about this?
> >
> > ops.init_task(); /* A new task is created */
> > ops.enable(); /* Enable BPF scheduling for the task */
> >
> > while (task in SCHED_EXT) {
> > if (task can migrate)
> > ops.select_cpu(); /* Called on wakeup (optimization) */
> >
> > ops.runnable(); /* Task becomes ready to run */
> >
> > while (task_is_runnable(task)) {
> > if (task is not in a DSQ || task->scx.slice == 0) {
> > ops.enqueue(); /* Task can be added to a DSQ */
> >
> > /* Task property change (i.e., affinity, nice, etc.)? */
> > if (sched_change(task)) {
> > ops.dequeue(); /* Exiting BPF scheduler custody */
> > ops.quiescent();
> >
> > /* Property change callback, e.g. ops.set_weight() */
> >
> > ops.runnable();
> > continue;
> > }
> >
> > /* Any usable CPU becomes available */
> >
> > ops.dispatch(); /* Task is moved to a local DSQ */
> > ops.dequeue(); /* Exiting BPF scheduler custody */
> > }
> >
> > ops.running(); /* Task starts running on its assigned CPU */
> >
> > while (task_is_runnable(task) && task->scx.slice > 0) {
> > ops.tick(); /* Called every 1/HZ seconds */
> >
> > if (task->scx.slice == 0)
> > ops.dispatch(); /* task->scx.slice can be refilled */
> > }
> >
> > ops.stopping(); /* Task stops running (time slice expires or wait) */
> > }
> >
> > ops.quiescent(); /* Task releases its assigned CPU (wait) */
> > }
> >
> > ops.disable(); /* Disable BPF scheduling for the task */
> > ops.exit_task(); /* Task is destroyed */
>
> I don't love it (and I probably never will), but I agree it's the best so far.
> I'll send a v2 with the updated pseudocode and I'll put a bit of a disclaimer
> before it.
I also don't love it, but with these changes it'd better (or rather a bit more
accurate) than what we have right now...
Maybe we can add a "Special cases" section below the task lifecycle to better
explain all the exceptions and non-covered scenarios? Some of them are covered
in the "Scheduling Cycle" section, so we could also point to them.
Thanks,
-Andrea
next prev parent reply other threads:[~2026-04-09 9:38 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-06 11:47 [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add ops.dequeue() to task lifecycle Andrea Righi
2026-04-06 14:49 ` Emil Tsalapatis
2026-04-06 19:08 ` Andrea Righi
2026-04-06 18:09 ` Tejun Heo
2026-04-07 9:54 ` Kuba Piecuch
2026-04-07 16:31 ` Andrea Righi
2026-04-08 9:18 ` [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add missing calls to quiescent(), runnable() Kuba Piecuch
2026-04-08 11:28 ` Andrea Righi
2026-04-08 12:40 ` Kuba Piecuch
2026-04-08 13:49 ` Andrea Righi
2026-04-08 14:17 ` Kuba Piecuch
2026-04-08 14:54 ` Andrea Righi
2026-04-09 8:46 ` Kuba Piecuch
2026-04-09 9:38 ` Andrea Righi [this message]
2026-04-09 9:46 ` Christian Loehle
2026-04-09 13:30 ` Kuba Piecuch
2026-04-09 14:12 ` Andrea Righi
2026-04-09 13:51 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=addzeX8Le_533Lgo@gpd4 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=christian.loehle@arm.com \
--cc=emil@etsalapatis.com \
--cc=jpiecuch@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox