From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Christian Loehle <christian.loehle@arm.com>,
Daniel Hodges <hodgesd@meta.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
Emil Tsalapatis <emil@etsalapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Sat, 31 Jan 2026 18:24:06 +0100
Message-ID: <aX46tvnTjLZy0pCW@gpd4>
In-Reply-To: <DG2XDI382JPD.6GDH6BO96EXY@google.com>
Hi Kuba,
On Sat, Jan 31, 2026 at 04:45:59PM +0000, Kuba Piecuch wrote:
...
> >> The BPF scheduler is naturally going to have some internal per-task state.
> >> That state may be expensive to compute from scratch, so we don't want to
> >> completely discard it when the BPF scheduler loses ownership of the task.
> >>
> >> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> >> "Hey, some scheduling properties of the task are about to change, so you
> >> probably should invalidate whatever state you have for that task which depends
> >> on these properties."
> >
> > Correct. And it's also a way to notify that the task has left the BPF
> > scheduler, so if the task is stored in any internal queue it can/should be
> > removed.
>
> Right, unless the task has already been dispatched, in which case it's just
> an invalidation notification.
Right, but if the task has already been dispatched I don't think we should
trigger ops.dequeue(SCHED_CHANGE), because it is no longer in the BPF
scheduler's custody (that's not how it's implemented right now, I'm just
trying to define the proper semantics based on the latest discussions).
> >> That way, the BPF scheduler will know to recompute the invalidated state on
> >> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> >> BPF scheduler knows that none of the task's fundamental scheduling properties
> >> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> >> the state. Of course, the potential for savings depends on the particular
> >> scheduler's policy.
> >>
> >> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> >> a task is running: for subsequent calls, the BPF scheduler had already been
> >> notified to invalidate its state, so there's no use in notifying it again.
> >
> > Actually I think the proper behavior would be to trigger
> > ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> > scheduler. While running, tasks are outside the BPF scheduler ownership, so
> > ops.dequeue() shouldn't be triggered at all.
> >
>
> I don't think this is what the current implementation does, right?
Right, sorry, I wasn't clear. I'm just trying to define the behavior that
makes more sense (see below).
> >> However, I feel like there's a hidden assumption here that the BPF scheduler
> >> doesn't recompute its state for the task before the next ops.enqueue().
> >
> > And that should be the proper behavior. BPF scheduler should recompute a
> > task state only when the task is re-enqueued after a property change.
> >
>
> That would make sense if ops.enqueue() was called immediately after a property
> change when a task is running, but I believe that's currently not the case,
> see my attempt at tracing the enqueue-dequeue cycle on property change in my
> first reply.
Yeah, that's right.
I have a new patch set where I've implemented the following semantics
(which should also match Tejun's requirements).
With the new semantics:
- for running tasks: property changes do NOT trigger ops.dequeue(SCHED_CHANGE)
- once a task leaves BPF custody (dispatched to a local DSQ), the BPF
  scheduler no longer manages it
- property changes on running tasks don't affect the BPF scheduler

Key principle: ops.dequeue() is only called when a task leaves the BPF
scheduler's custody. A running task has already left BPF custody, so
property changes don't trigger ops.dequeue().
Therefore, ops.dequeue(SCHED_CHANGE) gets called only when:
- task is in BPF data structures (QUEUED state), or
- task is on a non-local DSQ (still in BPF custody)

In this case (BPF scheduler custody), if a property change happens,
ops.dequeue(SCHED_CHANGE) is called to notify the BPF scheduler.
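
The custody rule above can be sketched as a small user-space C model. This
is not kernel code: the enum, its values and the helper names are made up
for illustration, and the real patch tracks this state differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of where a task currently lives; not kernel types. */
enum task_location {
	TASK_IN_BPF_QUEUE,	/* in the BPF scheduler's data structures (QUEUED) */
	TASK_ON_NONLOCAL_DSQ,	/* on a non-local DSQ, still in BPF custody */
	TASK_ON_LOCAL_DSQ,	/* dispatched to a local DSQ, custody already left */
	TASK_RUNNING,		/* running on a CPU, custody already left */
};

/* True while the BPF scheduler is still responsible for the task. */
static bool task_in_bpf_custody(enum task_location loc)
{
	switch (loc) {
	case TASK_IN_BPF_QUEUE:
	case TASK_ON_NONLOCAL_DSQ:
		return true;
	case TASK_ON_LOCAL_DSQ:
	case TASK_RUNNING:
		return false;
	}
	assert(!"unknown task location");
	return false;
}

/*
 * Under the proposed semantics, a property change results in
 * ops.dequeue(SCHED_CHANGE) only for tasks that are in BPF custody.
 */
static bool property_change_triggers_dequeue(enum task_location loc)
{
	return task_in_bpf_custody(loc);
}
```

In this model a task on a local DSQ is treated the same as a running task:
in both cases custody has already been handed back, so no notification is
expected.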
Then, if you want to react immediately to property changes on running
tasks, we have the dedicated callbacks:
- ops.set_cpumask(): CPU affinity changes
- ops.set_weight(): priority/nice changes
- ops.cgroup_*(): cgroup changes
In conclusion, we don't need ops.dequeue(SCHED_CHANGE) for running tasks:
the dedicated callbacks (ops.set_cpumask(), ops.set_weight(), ...) already
provide comprehensive coverage for property changes on all tasks,
regardless of whether they're running or in BPF custody. And with the new
semantics, ops.dequeue(SCHED_CHANGE) only notifies about property changes
while a task is actively managed by the BPF scheduler (in QUEUED state or
on a non-local DSQ).
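
To illustrate how a scheduler could combine the two mechanisms, here is a
rough user-space C model, not real BPF code. struct task_ctx, the model_*()
helpers and the cached "key" are all hypothetical; they only mimic the order
in which ops.set_weight(), ops.dequeue(SCHED_CHANGE) and ops.enqueue() would
be seen under the semantics described above. The assumption is that the
dedicated callback invalidates the cached state, and the state is rebuilt
lazily on the next enqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task state a BPF scheduler might cache. */
struct task_ctx {
	unsigned int weight;		/* last weight seen via set_weight */
	unsigned long cached_key;	/* made-up, expensive-to-compute key */
	bool key_valid;			/* false once a property change hit */
	bool in_custody;		/* true while queued in the scheduler */
	unsigned int recomputes;	/* how many times the key was rebuilt */
};

/* Models ops.set_weight(): delivered for any task, running or queued. */
static void model_set_weight(struct task_ctx *t, unsigned int weight)
{
	t->weight = weight;
	t->key_valid = false;
}

/* Models ops.dequeue(SCHED_CHANGE): only expected while in custody. */
static void model_dequeue_sched_change(struct task_ctx *t)
{
	assert(t->in_custody);
	t->in_custody = false;	/* drop the task from internal queues */
}

/* Models dispatching to a local DSQ: the task leaves BPF custody. */
static void model_dispatch_local(struct task_ctx *t)
{
	assert(t->in_custody);
	t->in_custody = false;
}

/* Models ops.enqueue(): rebuild the key only if it was invalidated. */
static void model_enqueue(struct task_ctx *t)
{
	if (!t->key_valid) {
		t->cached_key = 1000000UL / (t->weight ? t->weight : 1);
		t->key_valid = true;
		t->recomputes++;
	}
	t->in_custody = true;
}
```

With this split, a weight change on a running task only goes through
model_set_weight(), while a weight change on a queued task goes through
model_dequeue_sched_change() as well; in both cases the key is recomputed
once, on the following model_enqueue().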
Do you think it's reasonable enough / do you see any flaws?
Thanks,
-Andrea