From: "Emil Tsalapatis" <emil@etsalapatis.com>
To: "Andrea Righi" <arighi@nvidia.com>, "Tejun Heo" <tj@kernel.org>,
"David Vernet" <void@manifault.com>,
"Changwoo Min" <changwoo@igalia.com>
Cc: "Daniel Hodges" <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Sat, 27 Dec 2025 22:20:06 -0500
Message-ID: <DF9IXXXOIXP7.LILBQBLDYVM7@etsalapatis.com>
In-Reply-To: <20251219224450.2537941-2-arighi@nvidia.com>
On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> Properly implement ops.dequeue() to ensure every ops.enqueue() is
> balanced by a corresponding ops.dequeue() call, regardless of whether
> the task is on a BPF data structure or already dispatched to a DSQ.
>
> A task is considered enqueued when it is owned by the BPF scheduler.
> This ownership persists until the task is either dispatched (moved to a
> local DSQ for execution) or removed from the BPF scheduler, such as when
> it blocks waiting for an event or when its properties (for example, CPU
> affinity or priority) are updated.
>
> When the task enters the BPF scheduler, ops.enqueue() is invoked; when
> it leaves BPF scheduler ownership, ops.dequeue() is invoked.
>
> This allows BPF schedulers to reliably track task ownership and maintain
> accurate accounting.
>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
Hi Andrea,

This change looks reasonable to me. Some comments inline:
> Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
> include/linux/sched/ext.h | 1 +
> kernel/sched/ext.c | 27 ++++++++++++++++++++++++++-
> 3 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..3ed4be53f97da 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
>
> * Queue the task on the BPF side.
>
> + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> + is owned by the BPF scheduler. Ownership is retained until the task is
> + either dispatched (moved to a local DSQ for execution) or dequeued
> + (removed from the scheduler due to a blocking event, or to modify a
> + property, like CPU affinity, priority, etc.). When the task leaves the
> + BPF scheduler, ``ops.dequeue()`` is invoked.
> +
Can we say "leaves the scx class" instead? On direct dispatch we
technically never insert the task into the BPF scheduler.
> + **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> + regardless of whether the task is still on a BPF data structure or
> + already dispatched to a DSQ (global, local, or user DSQ).
> +
> + This guarantees that every ``ops.enqueue()`` will eventually be followed
> + by an ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> + track task ownership and maintain accurate accounting, such as per-DSQ
> + queued runtime sums.
> +
> + BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> + don't need to track these transitions. The sched_ext core will safely
> + handle all dequeue operations regardless.
> +
> 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> empty, it then looks at the global DSQ. If there still isn't a task to
> run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +339,8 @@ by a sched_ext scheduler:
> /* Any usable CPU becomes available */
>
> ops.dispatch(); /* Task is moved to a local DSQ */
> +
> + ops.dequeue(); /* Exiting BPF scheduler */
> }
> ops.running(); /* Task starts running on its assigned CPU */
> while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..334c3692a9c62 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> /* scx_entity.flags */
> enum scx_ent_flags {
> SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* ops.enqueue() was called */
Can we rename this flag? For direct dispatch the task never actually
gets enqueued into the BPF scheduler. Something like "DEQ_ON_DISPATCH"
would show the purpose of the flag more clearly.
> SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 94164f2dec6dc..985d75d374385 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>
> + /* Mark that ops.enqueue() is being called for this task */
> + p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +
Can we avoid setting this flag when we have no .dequeue() method?
Otherwise it stays set forever AFAICT, even after the task has been
sent to the runqueues.
> ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> WARN_ON_ONCE(*ddsp_taskp);
> *ddsp_taskp = p;
> @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
> switch (opss & SCX_OPSS_STATE_MASK) {
> case SCX_OPSS_NONE:
> + /*
> + * Task is not currently being enqueued or queued on the BPF
> + * scheduler. Check if ops.enqueue() was called for this task.
> + */
> + if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> + SCX_HAS_OP(sch, dequeue)) {
> + /*
> + * ops.enqueue() was called and the task was dispatched.
> + * Call ops.dequeue() to notify the BPF scheduler that
> + * the task is leaving.
> + */
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> + p, deq_flags);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
> break;
> case SCX_OPSS_QUEUEING:
> /*
> @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> */
> BUG();
> case SCX_OPSS_QUEUED:
> - if (SCX_HAS_OP(sch, dequeue))
> + /*
> + * Task is owned by the BPF scheduler. Call ops.dequeue()
> + * to notify the BPF scheduler that the task is being
> + * removed.
> + */
> + if (SCX_HAS_OP(sch, dequeue)) {
Edge case, but if we have a .dequeue() method but not an .enqueue() we
still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an
extra condition to be consistent with the SCX_OPSS_NONE case above?
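Something along these lines (untested sketch against this patch):

```
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue)) {
+		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
+		    SCX_HAS_OP(sch, dequeue)) {
 			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
 					 p, deq_flags);
```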
> SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> p, deq_flags);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
>
> if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> SCX_OPSS_NONE))