From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Tejun Heo <tj@kernel.org>,
David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>
Cc: Kuba Piecuch <jpiecuch@google.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Christian Loehle <christian.loehle@arm.com>,
Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Thu, 05 Feb 2026 19:29:42 +0000
Message-ID: <DG79ZKS630EW.IUD2PSCREXDF@google.com>
In-Reply-To: <20260205153304.1996142-2-arighi@nvidia.com>
Hi Andrea,
On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.
Strictly speaking, a task in the BPF scheduler's custody doesn't have to be
queued in a user-created DSQ. It could just reside in some custom data structure.
>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
> - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> where tasks go directly to execution.
> - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.
Shouldn't it be "directly dispatched to terminal DSQs"?
>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> New ops.dequeue() semantics:
> - ops.dequeue() is invoked exactly once when the task leaves the BPF
> scheduler's custody, in one of the following cases:
> a) regular dispatch: a task dispatched to a user DSQ is moved to a
> terminal DSQ (ops.dequeue() called without any special flags set),
I don't think the task has to be on a user DSQ. How about just "a task in BPF
scheduler's custody is dispatched to a terminal DSQ from ops.dispatch()"?
> b) core scheduling dispatch: core-sched picks task before dispatch,
> ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
> c) property change: task properties modified before dispatch,
> ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.
>
> This allows BPF schedulers to:
> - reliably track task ownership and lifecycle,
> - maintain accurate accounting of managed tasks,
> - update internal state when tasks change properties.
>
...
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..ccd1fad3b3b92 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
>
> * Queue the task on the BPF side.
>
> + **Task State Tracking and ops.dequeue() Semantics**
> +
> + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> + enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> + * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
> + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> + is done with the task - it either goes straight to a CPU's local run
> + queue or to the global DSQ as a fallback. The task never enters (or
> + exits) BPF custody, and ``ops.dequeue()`` will not be called.
> +
> + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> + BPF scheduler's custody. When the task later leaves BPF custody
> + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> + sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> +
> + * **Queued on BPF side**: The task is in BPF data structures and in BPF
> + custody, ``ops.dequeue()`` will be called when it leaves.
> +
> + The key principle: **ops.dequeue() is called when a task leaves the BPF
> + scheduler's custody**.
> +
> + This works also with the ``ops.select_cpu()`` direct dispatch
> + optimization: even though it skips ``ops.enqueue()`` invocation, if the
> + task is dispatched to a user-created DSQ, it enters BPF custody and will
> + get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
> + the BPF scheduler is done with it immediately. This provides the
> + performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
> + maintaining correct state tracking.
> +
> + The dequeue can happen for different reasons, distinguished by flags:
> +
> + 1. **Regular dispatch workflow**: when the task is dispatched from a
> + user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
> + ``ops.dequeue()`` is triggered without any special flags.
There's no requirement for the task to be on a user-created DSQ.
> +
> + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> + core scheduling picks a task for execution while it's still in BPF
> + custody, ``ops.dequeue()`` is called with the
> + ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> + 3. **Scheduling property change**: when a task property changes (via
> + operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> + priority changes, CPU migrations, etc.) while the task is still in
> + BPF custody, ``ops.dequeue()`` is called with the
> + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> + **Important**: Once a task has left BPF custody (dispatched to a
> + terminal DSQ), property changes will not trigger ``ops.dequeue()``,
> + since the task is no longer being managed by the BPF scheduler.
> +
> 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> empty, it then looks at the global DSQ. If there still isn't a task to
> run, ``ops.dispatch()`` is invoked which can use the following two
...
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..35a88942810b4 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> /* scx_entity.flags */
> enum scx_ent_flags {
> SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> + SCX_TASK_NEED_DEQ = 1 << 1, /* task needs ops.dequeue() */
I think this could use a comment that connects this flag to the concept of
BPF custody, so how about something like "task is in BPF custody, needs
ops.dequeue() when leaving it"?
> SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bb8fa927e9e9..9ebca357196b4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
...
> @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> dsq_mod_nr(dsq, 1);
> p->scx.dsq = dsq;
>
> + /*
> + * Handle ops.dequeue() and custody tracking.
> + *
> + * Builtin DSQs (local, global, bypass) are terminal: the BPF
> + * scheduler is done with the task. If it was in BPF custody, call
> + * ops.dequeue() and clear the flag.
> + *
> + * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> + * ops.dequeue() will be called when it leaves.
> + */
> + if (SCX_HAS_OP(sch, dequeue)) {
> + if (is_terminal_dsq(dsq->id)) {
> + if (p->scx.flags & SCX_TASK_NEED_DEQ)
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> + rq, p, 0);
> + p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> + } else {
> + p->scx.flags |= SCX_TASK_NEED_DEQ;
> + }
> + }
> +
This is the only place where I see SCX_TASK_NEED_DEQ being set, which means
it won't be set if the enqueued task is queued on the BPF scheduler's internal
data structures rather than dispatched to a user-created DSQ. I don't think
that's the behavior we're aiming for.
> @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
> switch (opss & SCX_OPSS_STATE_MASK) {
> case SCX_OPSS_NONE:
> + /*
> + * Task is not in BPF data structures (either dispatched to
> + * a DSQ or running). Only call ops.dequeue() if the task
> + * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> + * is set).
> + *
> + * If the task has already been dispatched to a terminal
> + * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> + * scheduler's custody and the flag will be clear, so we
> + * skip ops.dequeue().
> + *
> + * If this is a property change (not sleep/core-sched) and
> + * the task is still in BPF custody, set the
> + * %SCX_DEQ_SCHED_CHANGE flag.
> + */
> + if (SCX_HAS_OP(sch, dequeue) &&
> + (p->scx.flags & SCX_TASK_NEED_DEQ))
> + call_task_dequeue(sch, rq, p, deq_flags);
> break;
> case SCX_OPSS_QUEUEING:
> /*
> @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> */
> BUG();
> case SCX_OPSS_QUEUED:
> + /*
> + * Task is still on the BPF scheduler (not dispatched yet).
> + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> + * only for property changes, not for core-sched picks or
> + * sleep.
> + */
The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in
call_task_dequeue(), not here.
> if (SCX_HAS_OP(sch, dequeue))
> - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> - p, deq_flags);
> + call_task_dequeue(sch, rq, p, deq_flags);
How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in
call_task_dequeue()?
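The suggested invariant — a task found in SCX_OPSS_QUEUED is by definition
still in BPF custody — could look roughly like this (untested sketch against
the patch, shown here only for the QUEUED case):

```c
	case SCX_OPSS_QUEUED:
		/* A queued task must still be in BPF custody. */
		WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ));
		if (SCX_HAS_OP(sch, dequeue))
			call_task_dequeue(sch, rq, p, deq_flags);
```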
Thanks,
Kuba