From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Tejun Heo <tj@kernel.org>,
David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>
Cc: Kuba Piecuch <jpiecuch@google.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Christian Loehle <christian.loehle@arm.com>,
Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Thu, 05 Feb 2026 19:29:42 +0000
Message-ID: <DG79ZKS630EW.IUD2PSCREXDF@google.com>
In-Reply-To: <20260205153304.1996142-2-arighi@nvidia.com>
Hi Andrea,
On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.
Strictly speaking, a task in the BPF scheduler's custody doesn't have to be
queued in a user-created DSQ. It could just as well reside in a custom data
structure maintained by the BPF scheduler.
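
A minimal userspace sketch of that definition, not kernel code: the
`task_location` enum and `in_bpf_custody()` are hypothetical names made up
for illustration. It only shows that custody covers both user-created DSQs
and scheduler-private data structures, while terminal DSQs are outside it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum task_location {
	LOC_NONE,         /* not enqueued anywhere */
	LOC_BPF_INTERNAL, /* held in a scheduler-private structure (e.g. a BPF map) */
	LOC_USER_DSQ,     /* inserted into a user-created DSQ */
	LOC_TERMINAL_DSQ, /* local DSQ or SCX_DSQ_GLOBAL */
};

/* A task is in BPF custody whenever the BPF scheduler still has to act on it. */
static bool in_bpf_custody(enum task_location loc)
{
	return loc == LOC_BPF_INTERNAL || loc == LOC_USER_DSQ;
}
```
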
>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
> - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> where tasks go directly to execution.
> - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.
Shouldn't it be "directly dispatched to terminal DSQs"?
>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> New ops.dequeue() semantics:
> - ops.dequeue() is invoked exactly once when the task leaves the BPF
> scheduler's custody, in one of the following cases:
> a) regular dispatch: a task dispatched to a user DSQ is moved to a
> terminal DSQ (ops.dequeue() called without any special flags set),
I don't think the task has to be on a user DSQ. How about just "a task in BPF
scheduler's custody is dispatched to a terminal DSQ from ops.dispatch()"?
> b) core scheduling dispatch: core-sched picks task before dispatch,
> ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
> c) property change: task properties modified before dispatch,
> ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.
>
> This allows BPF schedulers to:
> - reliably track task ownership and lifecycle,
> - maintain accurate accounting of managed tasks,
> - update internal state when tasks change properties.
>
...
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..ccd1fad3b3b92 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
>
> * Queue the task on the BPF side.
>
> + **Task State Tracking and ops.dequeue() Semantics**
> +
> + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> + enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> + * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
> + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> + is done with the task - it either goes straight to a CPU's local run
> + queue or to the global DSQ as a fallback. The task never enters (or
> + exits) BPF custody, and ``ops.dequeue()`` will not be called.
> +
> + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> + BPF scheduler's custody. When the task later leaves BPF custody
> + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> + sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> +
> + * **Queued on BPF side**: The task is in BPF data structures and in BPF
> + custody, ``ops.dequeue()`` will be called when it leaves.
> +
> + The key principle: **ops.dequeue() is called when a task leaves the BPF
> + scheduler's custody**.
> +
> + This works also with the ``ops.select_cpu()`` direct dispatch
> + optimization: even though it skips ``ops.enqueue()`` invocation, if the
> + task is dispatched to a user-created DSQ, it enters BPF custody and will
> + get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
> + the BPF scheduler is done with it immediately. This provides the
> + performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
> + maintaining correct state tracking.
> +
> + The dequeue can happen for different reasons, distinguished by flags:
> +
> + 1. **Regular dispatch workflow**: when the task is dispatched from a
> + user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
> + ``ops.dequeue()`` is triggered without any special flags.
There's no requirement for the task to be on a user-created DSQ.
> +
> + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> + core scheduling picks a task for execution while it's still in BPF
> + custody, ``ops.dequeue()`` is called with the
> + ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> + 3. **Scheduling property change**: when a task property changes (via
> + operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> + priority changes, CPU migrations, etc.) while the task is still in
> + BPF custody, ``ops.dequeue()`` is called with the
> + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> + **Important**: Once a task has left BPF custody (dispatched to a
> + terminal DSQ), property changes will not trigger ``ops.dequeue()``,
> + since the task is no longer being managed by the BPF scheduler.
> +
> 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> empty, it then looks at the global DSQ. If there still isn't a task to
> run, ``ops.dispatch()`` is invoked which can use the following two
...
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..35a88942810b4 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> /* scx_entity.flags */
> enum scx_ent_flags {
> SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> + SCX_TASK_NEED_DEQ = 1 << 1, /* task needs ops.dequeue() */
I think this could use a comment that connects this flag to the concept of
BPF custody, so how about something like "task is in BPF custody, needs
ops.dequeue() when leaving it"?
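
For illustration, the enum could then read roughly as below. This is an
excerpt limited to the members visible in the quoted hunk (the real enum has
more), and the comment wording is only the suggestion above, not an agreed
change.

```c
#include <assert.h>

/* Excerpt of scx_ent_flags; only the SCX_TASK_NEED_DEQ comment differs. */
enum scx_ent_flags {
	SCX_TASK_QUEUED            = 1 << 0, /* on ext runqueue */
	SCX_TASK_NEED_DEQ          = 1 << 1, /* task is in BPF custody, needs
					      * ops.dequeue() when leaving it */
	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
	SCX_TASK_DEQD_FOR_SLEEP    = 1 << 3, /* last dequeue was for SLEEP */
};
```
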
> SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bb8fa927e9e9..9ebca357196b4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
...
> @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> dsq_mod_nr(dsq, 1);
> p->scx.dsq = dsq;
>
> + /*
> + * Handle ops.dequeue() and custody tracking.
> + *
> + * Builtin DSQs (local, global, bypass) are terminal: the BPF
> + * scheduler is done with the task. If it was in BPF custody, call
> + * ops.dequeue() and clear the flag.
> + *
> + * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> + * ops.dequeue() will be called when it leaves.
> + */
> + if (SCX_HAS_OP(sch, dequeue)) {
> + if (is_terminal_dsq(dsq->id)) {
> + if (p->scx.flags & SCX_TASK_NEED_DEQ)
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> + rq, p, 0);
> + p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> + } else {
> + p->scx.flags |= SCX_TASK_NEED_DEQ;
> + }
> + }
> +
This is the only place where I see SCX_TASK_NEED_DEQ being set, which means
it won't be set if the enqueued task is queued on the BPF scheduler's internal
data structures rather than dispatched to a user-created DSQ. I don't think
that's the behavior we're aiming for.
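
The gap can be reproduced with a small userspace model. This is a sketch
under the assumption that the hunk above really is the only place the flag
is set; `model_task`, `model_dispatch_enqueue()` and the other helpers are
hypothetical stand-ins, and only the dispatch path is modeled (not
ops_dequeue()).

```c
#include <assert.h>
#include <stdbool.h>

#define NEED_DEQ (1u << 1) /* stands in for SCX_TASK_NEED_DEQ */

struct model_task {
	unsigned int flags;
	int dequeue_calls; /* how many times ops.dequeue() would have run */
};

/* Mirrors the flag handling in the quoted dispatch_enqueue() hunk. */
static void model_dispatch_enqueue(struct model_task *p, bool terminal_dsq)
{
	if (terminal_dsq) {
		if (p->flags & NEED_DEQ)
			p->dequeue_calls++;
		p->flags &= ~NEED_DEQ;
	} else {
		p->flags |= NEED_DEQ;
	}
}

/*
 * ops.enqueue() keeps the task in a scheduler-private structure: no DSQ
 * insertion happens, so dispatch_enqueue() never runs and the flag stays
 * clear.
 */
static void model_enqueue_internal(struct model_task *p)
{
	(void)p;
}

/* Task goes user DSQ -> terminal DSQ. */
static int dequeues_via_user_dsq(void)
{
	struct model_task p = {0};

	model_dispatch_enqueue(&p, false);
	model_dispatch_enqueue(&p, true);
	return p.dequeue_calls;
}

/* Task goes BPF-internal queue -> terminal DSQ (from ops.dispatch()). */
static int dequeues_via_internal_queue(void)
{
	struct model_task p = {0};

	model_enqueue_internal(&p);
	model_dispatch_enqueue(&p, true);
	return p.dequeue_calls;
}
```

In this model the user-DSQ path sees one ops.dequeue() call, while the
internal-queue path sees none even though the task was in BPF custody the
whole time.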
> @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
> switch (opss & SCX_OPSS_STATE_MASK) {
> case SCX_OPSS_NONE:
> + /*
> + * Task is not in BPF data structures (either dispatched to
> + * a DSQ or running). Only call ops.dequeue() if the task
> + * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> + * is set).
> + *
> + * If the task has already been dispatched to a terminal
> + * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> + * scheduler's custody and the flag will be clear, so we
> + * skip ops.dequeue().
> + *
> + * If this is a property change (not sleep/core-sched) and
> + * the task is still in BPF custody, set the
> + * %SCX_DEQ_SCHED_CHANGE flag.
> + */
> + if (SCX_HAS_OP(sch, dequeue) &&
> + (p->scx.flags & SCX_TASK_NEED_DEQ))
> + call_task_dequeue(sch, rq, p, deq_flags);
> break;
> case SCX_OPSS_QUEUEING:
> /*
> @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> */
> BUG();
> case SCX_OPSS_QUEUED:
> + /*
> + * Task is still on the BPF scheduler (not dispatched yet).
> + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> + * only for property changes, not for core-sched picks or
> + * sleep.
> + */
The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in
call_task_dequeue(), not here.
> if (SCX_HAS_OP(sch, dequeue))
> - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> - p, deq_flags);
> + call_task_dequeue(sch, rq, p, deq_flags);
How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in
call_task_dequeue()?
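
A userspace sketch of the check, for illustration only. The body of
call_task_dequeue() is not shown in the quoted hunks, so it is not modeled
here; `check_need_deq()` and `warnings_emitted` are hypothetical names, and
the function-local static flag merely imitates the "warn once" behavior of
the kernel's WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NEED_DEQ (1u << 1) /* stands in for SCX_TASK_NEED_DEQ */

static int warnings_emitted;

/*
 * Returns true if the invariant is violated, i.e. ops.dequeue() is about to
 * be called for a task that is not marked as being in BPF custody. Like
 * WARN_ON_ONCE(), it reports only the first violation but keeps returning
 * the condition.
 */
static bool check_need_deq(unsigned int task_flags)
{
	static bool warned;
	bool violated = !(task_flags & NEED_DEQ);

	if (violated && !warned) {
		warned = true;
		warnings_emitted++;
		fprintf(stderr, "ops.dequeue() without NEED_DEQ set\n");
	}
	return violated;
}
```
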
Thanks,
Kuba