From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Christian Loehle <christian.loehle@arm.com>,
Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
<linux-kernel@vger.kernel.org>,
Emil Tsalapatis <emil@etsalapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Sat, 31 Jan 2026 16:45:59 +0000 [thread overview]
Message-ID: <DG2XDI382JPD.6GDH6BO96EXY@google.com> (raw)
In-Reply-To: <aX2nFSL5QilKsGsm@gpd4>
Hi Andrea,
On Sat Jan 31, 2026 at 6:54 AM UTC, Andrea Righi wrote:
>> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> >> for a task at most once between it being dispatched and taken off the CPU,
>> >> even if its properties are changed multiple times while it's on CPU.
>> >> Is that intentional? I don't see it documented.
>> >>
>> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and
>> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> >>
>> >> When a property of p is changed while it runs on the CPU,
>> >> the sequence of calls is:
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>> >> (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>> >> set_next_task_scx(p).
>> >>
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> >>
>> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> >> dequeue_task_scx().
>> >>
>> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> >>
>> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> >> this is not a core-sched pick, but it won't do much because the ops_state is
>> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> >> dispatch_dequeue(p) which the removes the task from the local DSQ it was just
>> >> inserted into.
>> >>
>> >>
>> >> So, we end up in a state where any subsequent property change while the task is
>> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> >> unset in p->scx.flags.
>> >>
>> >> I really hope I didn't mess anything up when tracing the code, but of course
>> >> I'm happy to be corrected.
>> >
>> > Correct. And the enqueue/dequeue balancing is preserved here. In the
>> > scenario you describe, subsequent property changes while the task remains
>> > running go through ENQUEUE_RESTORE, which intentionally skips
>> > ops.enqueue(). Since no new enqueue cycle is started, there is no
>> > corresponding ops.dequeue() to deliver either.
>> >
>> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
>> > scheduler state established by the last ops.enqueue(), not with every
>> > individual property change. Multiple property changes while the task stays
>> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
>>
>> Ok, I think I understand the logic behind this, here's how I understand it:
>>
>> The BPF scheduler is naturally going to have some internal per-task state.
>> That state may be expensive to compute from scratch, so we don't want to
>> completely discard it when the BPF scheduler loses ownership of the task.
>>
>> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
>> "Hey, some scheduling properties of the task are about to change, so you
>> probably should invalidate whatever state you have for that task which depends
>> on these properties."
>
> Correct. And it's also a way to notify that the task has left the BPF
> scheduler, so if the task is stored in any internal queue it can/should be
> removed.
Right, unless the task has already been dispatched, in which case it's just
an invalidation notification.
>
>>
>> That way, the BPF scheduler will know to recompute the invalidated state on
>> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
>> BPF scheduler knows that none of the task's fundamental scheduling properties
>> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
>> the state. Of course, the potential for savings depends on the particular
>> scheduler's policy.
>>
>> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
>> a task is running: for subsequent calls, the BPF scheduler had already been
>> notified to invalidate its state, so there's no use in notifying it again.
>
> Actually I think the proper behavior would be to trigger
> ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> scheduler. While running, tasks are outside the BPF scheduler ownership, so
> ops.dequeue() shouldn't be triggered at all.
>
I don't think this is what the current implementation does, right?
>>
>> However, I feel like there's a hidden assumption here that the BPF scheduler
>> doesn't recompute its state for the task before the next ops.enqueue().
>
> And that should be the proper behavior. BPF scheduler should recompute a
> task state only when the task is re-enqueued after a property change.
>
That would make sense if ops.enqueue() was called immediately after a property
change when a task is running, but I believe that's currently not the case,
see my attempt at tracing the enqueue-dequeue cycle on property change in my
first reply.
>> What if the scheduler wanted to immediately react to the priority of a task
>> being decreased by preempting it? You might say "hook into
>> ops.set_weight()", but then doesn't that obviate the need for
>> ops.dequeue(SCHED_CHANGE)?
>
> If a scheduler wants to implement preemption on property change, it can do
> so in ops.enqueue(): after a property change, the task is re-enqueued,
> triggering ops.enqueue(), at which point the BPF scheduler can decide
> whether and how to preempt currently running tasks.
>
> If a property change does not result in an ops.enqueue() call, it means the
> task is not runnable yet (or does not intend to run), so attempting to
> trigger a preemption at that point would be pointless.
>
IIUC a dequeue-enqueue cycle on a running task during property change doesn't
result in a call to ops.enqueue(), so if the BPF scheduler recomputed its state
only in ops.enqueue(), then it wouldn't be able to react immediately.
>>
>> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
>> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
>> a dedicated ops callback, but I wasn't able to find any such properties which
>> would be relevant to SCX.
>>
>> Another thought on the design: currently, the exact meaning of
>> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
>> scheduler:
>>
>> * When it's owned, it combines two notifications: BPF scheduler losing
>> ownership AND that it should invalidate task state.
>> * When it's not owned, it only serves as an "invalidate" notification,
>> the ownership status doesn't change.
>
> When it's not owned I think ops.dequeue() shouldn't be triggered at all.
>
>>
>> Wouldn't it be more elegant to have another callback, say
>> ops.property_change(), which would only serve as the "invalidate" notification,
>> and leave ops.dequeue() only for tracking ownership?
>> That would mean calling ops.dequeue() followed by ops.property_change() when
>> changing properties of a task owned by the BPF scheduler, as opposed to a
>> single call to ops.dequeue(SCHED_CHANGE).
>
> We could provide an ops.property_change(), but honestly I don't see any
> practical usage of this callback.
>
Neither do I, I just made it up for the sake of argument :-)
Thanks,
Kuba
next prev parent reply other threads:[~2026-01-31 16:46 UTC|newest]
Thread overview: 82+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-26 8:41 [PATCHSET v3 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-01-26 8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38 ` Emil Tsalapatis
2026-01-27 16:41 ` Kuba Piecuch
2026-01-30 7:34 ` Andrea Righi
2026-01-30 13:14 ` Kuba Piecuch
2026-01-31 6:54 ` Andrea Righi
2026-01-31 16:45 ` Kuba Piecuch [this message]
2026-01-31 17:24 ` Andrea Righi
2026-01-28 21:21 ` Tejun Heo
2026-01-30 11:54 ` Kuba Piecuch
2026-01-31 9:02 ` Andrea Righi
2026-01-31 17:53 ` Kuba Piecuch
2026-01-31 20:26 ` Andrea Righi
2026-02-02 15:19 ` Tejun Heo
2026-02-02 15:30 ` Andrea Righi
2026-02-01 17:43 ` Tejun Heo
2026-02-02 15:52 ` Andrea Righi
2026-02-02 16:23 ` Kuba Piecuch
2026-01-26 8:41 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-01-27 16:53 ` Emil Tsalapatis
-- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20 ` Tejun Heo
2026-02-11 16:06 ` Andrea Righi
2026-02-11 19:47 ` Tejun Heo
2026-02-11 22:34 ` Andrea Righi
2026-02-11 22:37 ` Tejun Heo
2026-02-11 22:48 ` Andrea Righi
2026-02-12 10:16 ` Andrea Righi
2026-02-12 14:32 ` Christian Loehle
2026-02-12 15:45 ` Andrea Righi
2026-02-12 17:07 ` Tejun Heo
2026-02-12 18:14 ` Andrea Righi
2026-02-12 18:35 ` Tejun Heo
2026-02-12 22:30 ` Andrea Righi
2026-02-14 10:16 ` Andrea Righi
2026-02-14 17:56 ` Tejun Heo
2026-02-14 19:32 ` Andrea Righi
2026-02-10 23:54 ` Tejun Heo
2026-02-11 16:07 ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35 ` Emil Tsalapatis
2026-02-07 9:26 ` Andrea Righi
2026-02-09 17:28 ` Tejun Heo
2026-02-09 19:06 ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29 ` Kuba Piecuch
2026-02-05 21:32 ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14 ` Tejun Heo
2026-02-05 9:26 ` Andrea Righi
2026-02-01 9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01 9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47 ` Christian Loehle
2026-02-02 7:45 ` Andrea Righi
2026-02-02 9:26 ` Andrea Righi
2026-02-02 10:02 ` Christian Loehle
2026-02-02 15:32 ` Andrea Righi
2026-02-02 10:09 ` Christian Loehle
2026-02-02 13:59 ` Kuba Piecuch
2026-02-04 9:36 ` Andrea Righi
2026-02-04 9:51 ` Kuba Piecuch
2026-02-02 11:56 ` Kuba Piecuch
2026-02-04 10:11 ` Andrea Righi
2026-02-04 10:33 ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54 ` Christian Loehle
2026-01-21 12:57 ` Andrea Righi
2026-01-22 9:28 ` Kuba Piecuch
2026-01-23 13:32 ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28 3:20 ` Emil Tsalapatis
2025-12-29 16:36 ` Andrea Righi
2025-12-29 18:35 ` Emil Tsalapatis
2025-12-28 17:19 ` Tejun Heo
2025-12-28 23:28 ` Tejun Heo
2025-12-28 23:38 ` Tejun Heo
2025-12-29 17:07 ` Andrea Righi
2025-12-29 18:55 ` Emil Tsalapatis
2025-12-28 23:42 ` Tejun Heo
2025-12-29 17:17 ` Andrea Righi
2025-12-29 0:06 ` Tejun Heo
2025-12-29 18:56 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DG2XDI382JPD.6GDH6BO96EXY@google.com \
--to=jpiecuch@google.com \
--cc=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=christian.loehle@arm.com \
--cc=emil@etsalapatis.com \
--cc=hodgesd@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox