Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	 Changwoo Min <changwoo@igalia.com>,
	Christian Loehle <christian.loehle@arm.com>,
	 Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
	 <linux-kernel@vger.kernel.org>,
	Emil Tsalapatis <emil@etsalapatis.com>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Sat, 31 Jan 2026 16:45:59 +0000	[thread overview]
Message-ID: <DG2XDI382JPD.6GDH6BO96EXY@google.com> (raw)
In-Reply-To: <aX2nFSL5QilKsGsm@gpd4>

Hi Andrea,

On Sat Jan 31, 2026 at 6:54 AM UTC, Andrea Righi wrote:
>> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> >> for a task at most once between it being dispatched and taken off the CPU,
>> >> even if its properties are changed multiple times while it's on CPU.
>> >> Is that intentional? I don't see it documented.
>> >> 
>> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and
>> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> >> 
>> >> When a property of p is changed while it runs on the CPU,
>> >> the sequence of calls is:
>> >>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>> >>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>> >>   set_next_task_scx(p).
>> >> 
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> >> 
>> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> >> dequeue_task_scx().
>> >> 
>> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> >> 
>> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> >> this is not a core-sched pick, but it won't do much because the ops_state is
>> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> >> dispatch_dequeue(p) which the removes the task from the local DSQ it was just
>> >> inserted into.
>> >> 
>> >> 
>> >> So, we end up in a state where any subsequent property change while the task is
>> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> >> unset in p->scx.flags.
>> >> 
>> >> I really hope I didn't mess anything up when tracing the code, but of course
>> >> I'm happy to be corrected.
>> >
>> > Correct. And the enqueue/dequeue balancing is preserved here. In the
>> > scenario you describe, subsequent property changes while the task remains
>> > running go through ENQUEUE_RESTORE, which intentionally skips
>> > ops.enqueue(). Since no new enqueue cycle is started, there is no
>> > corresponding ops.dequeue() to deliver either.
>> >
>> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
>> > scheduler state established by the last ops.enqueue(), not with every
>> > individual property change. Multiple property changes while the task stays
>> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
>> 
>> Ok, I think I understand the logic behind this, here's how I understand it:
>> 
>> The BPF scheduler is naturally going to have some internal per-task state.
>> That state may be expensive to compute from scratch, so we don't want to
>> completely discard it when the BPF scheduler loses ownership of the task.
>> 
>> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
>> "Hey, some scheduling properties of the task are about to change, so you
>> probably should invalidate whatever state you have for that task which depends
>> on these properties."
>
> Correct. And it's also a way to notify that the task has left the BPF
> scheduler, so if the task is stored in any internal queue it can/should be
> removed.

Right, unless the task has already been dispatched, in which case it's just
an invalidation notification.

>
>> 
>> That way, the BPF scheduler will know to recompute the invalidated state on
>> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
>> BPF scheduler knows that none of the task's fundamental scheduling properties
>> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
>> the state. Of course, the potential for savings depends on the particular
>> scheduler's policy.
>> 
>> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
>> a task is running: for subsequent calls, the BPF scheduler had already been
>> notified to invalidate its state, so there's no use in notifying it again.
>
> Actually I think the proper behavior would be to trigger
> ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> scheduler. While running, tasks are outside the BPF scheduler ownership, so
> ops.dequeue() shouldn't be triggered at all.
>

I don't think this is what the current implementation does, right?

>> 
>> However, I feel like there's a hidden assumption here that the BPF scheduler
>> doesn't recompute its state for the task before the next ops.enqueue().
>
> And that should be the proper behavior. BPF scheduler should recompute a
> task state only when the task is re-enqueued after a property change.
>

That would make sense if ops.enqueue() was called immediately after a property
change when a task is running, but I believe that's currently not the case,
see my attempt at tracing the enqueue-dequeue cycle on property change in my
first reply.

>> What if the scheduler wanted to immediately react to the priority of a task
>> being decreased by preempting it? You might say "hook into
>> ops.set_weight()", but then doesn't that obviate the need for
>> ops.dequeue(SCHED_CHANGE)?
>
> If a scheduler wants to implement preemption on property change, it can do
> so in ops.enqueue(): after a property change, the task is re-enqueued,
> triggering ops.enqueue(), at which point the BPF scheduler can decide
> whether and how to preempt currently running tasks.
>
> If a property change does not result in an ops.enqueue() call, it means the
> task is not runnable yet (or does not intend to run), so attempting to
> trigger a preemption at that point would be pointless.
>

IIUC a dequeue-enqueue cycle on a running task during property change doesn't
result in a call to ops.enqueue(), so if the BPF scheduler recomputed its state
only in ops.enqueue(), then it wouldn't be able to react immediately.

>> 
>> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
>> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
>> a dedicated ops callback, but I wasn't able to find any such properties which
>> would be relevant to SCX.
>> 
>> Another thought on the design: currently, the exact meaning of
>> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
>> scheduler:
>> 
>> * When it's owned, it combines two notifications: BPF scheduler losing
>>   ownership AND that it should invalidate task state.
>> * When it's not owned, it only serves as an "invalidate" notification,
>>   the ownership status doesn't change.
>
> When it's not owned I think ops.dequeue() shouldn't be triggered at all.
>
>> 
>> Wouldn't it be more elegant to have another callback, say
>> ops.property_change(), which would only serve as the "invalidate" notification,
>> and leave ops.dequeue() only for tracking ownership?
>> That would mean calling ops.dequeue() followed by ops.property_change() when
>> changing properties of a task owned by the BPF scheduler, as opposed to a
>> single call to ops.dequeue(SCHED_CHANGE).
>
> We could provide an ops.property_change(), but honestly I don't see any
> practical usage of this callback.
>

Neither do I, I just made it up for the sake of argument :-)

Thanks,
Kuba

next prev parent reply	other threads:[~2026-01-31 16:46 UTC|newest]

Thread overview: 82+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch [this message]
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-26  8:41 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-01-27 16:53   ` Emil Tsalapatis
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DG2XDI382JPD.6GDH6BO96EXY@google.com \
    --to=jpiecuch@google.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.