* [PATCHSET v6] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-05 15:32 Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 15:32 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
0 siblings, 2 replies; 81+ messages in thread
From: Andrea Righi @ 2026-02-05 15:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
sched-ext, linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently not invoked
consistently, which results in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g. sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
To identify property-change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
Changes in v6:
- Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DEQ
- Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
DSQs (local, global, bypass)
- Centralize ops.dequeue() logic in dispatch_enqueue()
- Remove "Property Change Notifications for Running Tasks" section from
the documentation
- The kselftest now validates the right behavior both from ops.enqueue()
and ops.select_cpu()
- Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com
Changes in v5:
- Introduce the concept of "terminal DSQ" (when a task is dispatched to a
terminal DSQ, the task leaves the BPF scheduler's custody)
- Consider SCX_DSQ_GLOBAL as a terminal DSQ
- Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for direct dispatches to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
custody, further dequeue events are not reported.
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
Documentation/scheduler/sched-ext.rst | 53 ++++
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 130 ++++++++-
kernel/sched/ext_internal.h | 7 +
tools/sched_ext/include/scx/enum_defs.autogen.h | 1 +
tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 334 ++++++++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 222 ++++++++++++++++
10 files changed, 739 insertions(+), 13 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-05 15:32 [PATCHSET v6] sched_ext: Fix ops.dequeue() semantics Andrea Righi
@ 2026-02-05 15:32 ` Andrea Righi
  2026-02-05 19:29   ` Kuba Piecuch
  2026-02-05 15:32 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  1 sibling, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-05 15:32 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in "BPF
scheduler's custody" when it has been queued in user-created DSQs and
the BPF scheduler is responsible for its lifecycle. Custody ends when
the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
selected by core scheduling, or removed due to a property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are not in its custody. Terminal DSQs include:
- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
  where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
  BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks dispatched to
terminal DSQs, as the BPF scheduler no longer retains custody of them.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
- ops.dequeue() is invoked exactly once when the task leaves the BPF
  scheduler's custody, in one of the following cases:
  a) regular dispatch: a task dispatched to a user DSQ is moved to a
     terminal DSQ (ops.dequeue() called without any special flags set),
  b) core scheduling dispatch: core-sched picks task before dispatch,
     ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
  c) property change: task properties modified before dispatch,
     ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.

This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.
Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst           |  53 ++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 130 ++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 7 files changed, 182 insertions(+), 13 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..ccd1fad3b3b92 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
+   enter the "BPF scheduler's custody" depending on where it's dispatched:
+
+   * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
+
+   * **Queued on BPF side**: The task is in BPF data structures and in BPF
+     custody, ``ops.dequeue()`` will be called when it leaves.
+
+   The key principle: **ops.dequeue() is called when a task leaves the BPF
+   scheduler's custody**.
+
+   This works also with the ``ops.select_cpu()`` direct dispatch
+   optimization: even though it skips ``ops.enqueue()`` invocation, if the
+   task is dispatched to a user-created DSQ, it enters BPF custody and will
+   get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
+   the BPF scheduler is done with it immediately. This provides the
+   performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
+   maintaining correct state tracking.
+
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch workflow**: when the task is dispatched from a
+      user-created DSQ to a terminal DSQ (leaving BPF custody for
+      execution), ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (dispatched to a
+   terminal DSQ), property changes will not trigger ``ops.dequeue()``,
+   since the task is no longer being managed by the BPF scheduler.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ.
    If there still isn't a task to run, ``ops.dispatch()`` is invoked
    which can use the following two
@@ -319,6 +370,8 @@ by a sched_ext scheduler:
 
         /* Any usable CPU becomes available */
         ops.dispatch();	/* Task is moved to a local DSQ */
+
+        ops.dequeue();	/* Exiting BPF scheduler */
     }
 
     ops.running();	/* Task starts running on its assigned CPU */
     while (task->scx.slice > 0 && task is runnable)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..35a88942810b4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_NEED_DEQ	= 1 << 1, /* task needs ops.dequeue() */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bb8fa927e9e9..9ebca357196b4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
+ * scheduler is considered "done" with the task.
+ *
+ * Builtin DSQs include:
+ * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
+ *   where tasks go directly to execution,
+ * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
+ * - Bypass DSQ: used during bypass mode.
+ *
+ * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
 		resched_curr(rq);
 }
 
-static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+			     struct scx_dispatch_q *dsq,
 			     struct task_struct *p, u64 enq_flags)
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 	dsq_mod_nr(dsq, 1);
 	p->scx.dsq = dsq;
 
+	/*
+	 * Handle ops.dequeue() and custody tracking.
+	 *
+	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
+	 * scheduler is done with the task. If it was in BPF custody, call
+	 * ops.dequeue() and clear the flag.
+	 *
+	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
+	 * ops.dequeue() will be called when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (is_terminal_dsq(dsq->id)) {
+			if (p->scx.flags & SCX_TASK_NEED_DEQ)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
+						 rq, p, 0);
+			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+		} else {
+			p->scx.flags |= SCX_TASK_NEED_DEQ;
+		}
+	}
+
 	/*
 	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
 	 * direct dispatch path, but we clear them here because the direct
@@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 		return;
 	}
 
-	dispatch_enqueue(sch, dsq, p,
+	dispatch_enqueue(sch, rq, dsq, p,
 			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
@@ -1413,7 +1456,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	direct_dispatch(sch, p, enq_flags);
 	return;
 
 local_norefill:
-	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
 	return;
 
 local:
 	dsq = &rq->scx.local_dsq;
@@ -1433,7 +1476,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
-	dispatch_enqueue(sch, dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
@@ -1511,6 +1554,18 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
 }
 
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+			      struct task_struct *p, u64 deq_flags)
+{
+	u64 flags = deq_flags;
+
+	if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+		flags |= SCX_DEQ_SCHED_CHANGE;
+
+	SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+	p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+}
+
 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 {
 	struct scx_sched *sch = scx_root;
@@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not in BPF data structures (either dispatched to
+		 * a DSQ or running). Only call ops.dequeue() if the task
+		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
+		 * is set).
+		 *
+		 * If the task has already been dispatched to a terminal
+		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
+		 * scheduler's custody and the flag will be clear, so we
+		 * skip ops.dequeue().
+		 *
+		 * If this is a property change (not sleep/core-sched) and
+		 * the task is still in BPF custody, set the
+		 * %SCX_DEQ_SCHED_CHANGE flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    (p->scx.flags & SCX_TASK_NEED_DEQ))
+			call_task_dequeue(sch, rq, p, deq_flags);
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+		 * only for property changes, not for core-sched picks or
+		 * sleep.
+		 */
 		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+			call_task_dequeue(sch, rq, p, deq_flags);
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1631,6 +1709,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1639,6 +1718,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+	 * Call ops.dequeue() if the task was in BPF custody.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -1879,7 +1967,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dispatch_dequeue_locked(p, src_dsq);
 		raw_spin_unlock(&src_dsq->lock);
 
-		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
+		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
 	}
 
 	return dst_rq;
@@ -1969,14 +2057,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 * If dispatching to @rq that @p is already on, no lock dancing needed.
 	 */
 	if (rq == src_rq && rq == dst_rq) {
-		dispatch_enqueue(sch, dst_dsq, p,
+		dispatch_enqueue(sch, rq, dst_dsq, p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
 
 	if (src_rq != dst_rq &&
	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
+		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
@@ -2014,7 +2102,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 */
 	if (src_rq == dst_rq) {
 		p->scx.holding_cpu = -1;
-		dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
+		dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
 				 enq_flags);
 	} else {
 		move_remote_task_to_local_dsq(p, enq_flags,
@@ -2113,7 +2201,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	if (dsq->id == SCX_DSQ_LOCAL)
 		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	else
-		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
@@ -2414,7 +2502,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 	 * DSQ.
 	 */
 	if (p->scx.slice && !scx_rq_bypassing(rq)) {
-		dispatch_enqueue(sch, &rq->scx.local_dsq, p,
+		dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
 				 SCX_ENQ_HEAD);
 		goto switch_class;
 	}
@@ -2898,6 +2986,14 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2929,6 +3025,14 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
 }
 
 static void scx_exit_task(struct task_struct *p)
@@ -3919,7 +4023,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
 	 * between bypass DSQs.
 	 */
 	dispatch_dequeue_locked(p, donor_dsq);
-	dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+	dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
 
 	/*
 	 * $donee might have been idle and need to be woken up. No need
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-05 19:29   ` Kuba Piecuch
  2026-02-05 21:32     ` Andrea Righi
  0 siblings, 1 reply; 81+ messages in thread
From: Kuba Piecuch @ 2026-02-05 19:29 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Hi Andrea,

On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.

Strictly speaking, a task in BPF scheduler custody doesn't have to be
queued in a user-created DSQ. It could just reside on some custom data
structure.

>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
> - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
>   where tasks go directly to execution.
> - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
>   BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.

Shouldn't it be "directly dispatched to terminal DSQs"?

>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> New ops.dequeue() semantics:
> - ops.dequeue() is invoked exactly once when the task leaves the BPF
>   scheduler's custody, in one of the following cases:
>   a) regular dispatch: a task dispatched to a user DSQ is moved to a
>      terminal DSQ (ops.dequeue() called without any special flags set),

I don't think the task has to be on a user DSQ. How about just "a task in
BPF scheduler's custody is dispatched to a terminal DSQ from
ops.dispatch()"?

>   b) core scheduling dispatch: core-sched picks task before dispatch,
>      ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
>   c) property change: task properties modified before dispatch,
>      ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.
>
> This allows BPF schedulers to:
> - reliably track task ownership and lifecycle,
> - maintain accurate accounting of managed tasks,
> - update internal state when tasks change properties.
>

...

> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..ccd1fad3b3b92 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
>
>    * Queue the task on the BPF side.
>
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> +   * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
> +     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> +     is done with the task - it either goes straight to a CPU's local run
> +     queue or to the global DSQ as a fallback. The task never enters (or
> +     exits) BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> +     BPF scheduler's custody. When the task later leaves BPF custody
> +     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> +     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> +
> +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> +     custody, ``ops.dequeue()`` will be called when it leaves.
> +
> +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> +   scheduler's custody**.
> +
> +   This works also with the ``ops.select_cpu()`` direct dispatch
> +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> +   task is dispatched to a user-created DSQ, it enters BPF custody and will
> +   get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
> +   the BPF scheduler is done with it immediately. This provides the
> +   performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
> +   maintaining correct state tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch workflow**: when the task is dispatched from a
> +      user-created DSQ to a terminal DSQ (leaving BPF custody for
> +      execution), ``ops.dequeue()`` is triggered without any special flags.

There's no requirement for the task to be on a user-created DSQ.

> +
> +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> +      core scheduling picks a task for execution while it's still in BPF
> +      custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> +   3. **Scheduling property change**: when a task property changes (via
> +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> +      priority changes, CPU migrations, etc.) while the task is still in
> +      BPF custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> +   **Important**: Once a task has left BPF custody (dispatched to a
> +   terminal DSQ), property changes will not trigger ``ops.dequeue()``,
> +   since the task is no longer being managed by the BPF scheduler.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two

...

> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..35a88942810b4 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_NEED_DEQ	= 1 << 1, /* task needs ops.dequeue() */

I think this could use a comment that connects this flag to the concept of
BPF custody, so how about something like "task is in BPF custody, needs
ops.dequeue() when leaving it"?

>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bb8fa927e9e9..9ebca357196b4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c

...
> @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>  	dsq_mod_nr(dsq, 1);
>  	p->scx.dsq = dsq;
>
> +	/*
> +	 * Handle ops.dequeue() and custody tracking.
> +	 *
> +	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
> +	 * scheduler is done with the task. If it was in BPF custody, call
> +	 * ops.dequeue() and clear the flag.
> +	 *
> +	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> +	 * ops.dequeue() will be called when it leaves.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (is_terminal_dsq(dsq->id)) {
> +			if (p->scx.flags & SCX_TASK_NEED_DEQ)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> +						 rq, p, 0);
> +			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +		} else {
> +			p->scx.flags |= SCX_TASK_NEED_DEQ;
> +		}
> +	}
> +

This is the only place where I see SCX_TASK_NEED_DEQ being set, which means
it won't be set if the enqueued task is queued on the BPF scheduler's
internal data structures rather than dispatched to a user-created DSQ.
I don't think that's the behavior we're aiming for.

> @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> +		 * is set).
> +		 *
> +		 * If the task has already been dispatched to a terminal
> +		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> +		 * scheduler's custody and the flag will be clear, so we
> +		 * skip ops.dequeue().
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> + */ > + if (SCX_HAS_OP(sch, dequeue) && > + (p->scx.flags & SCX_TASK_NEED_DEQ)) > + call_task_dequeue(sch, rq, p, deq_flags); > break; > case SCX_OPSS_QUEUEING: > /* > @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > */ > BUG(); > case SCX_OPSS_QUEUED: > + /* > + * Task is still on the BPF scheduler (not dispatched yet). > + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE > + * only for property changes, not for core-sched picks or > + * sleep. > + */ The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in call_task_dequeue(), not here. > if (SCX_HAS_OP(sch, dequeue)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > - p, deq_flags); > + call_task_dequeue(sch, rq, p, deq_flags); How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in call_task_dequeue()? Thanks, Kuba ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-05 19:29 ` Kuba Piecuch @ 2026-02-05 21:32 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-05 21:32 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel Hi Kuba, On Thu, Feb 05, 2026 at 07:29:42PM +0000, Kuba Piecuch wrote: > Hi Andrea, > > On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote: > > Currently, ops.dequeue() is only invoked when the sched_ext core knows > > that a task resides in BPF-managed data structures, which causes it to > > miss scheduling property change events. In addition, ops.dequeue() > > callbacks are completely skipped when tasks are dispatched to non-local > > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably > > track task state. > > > > Fix this by guaranteeing that each task entering the BPF scheduler's > > custody triggers exactly one ops.dequeue() call when it leaves that > > custody, whether the exit is due to a dispatch (regular or via a core > > scheduling pick) or to a scheduling property change (e.g. > > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA > > balancing, etc.). > > > > BPF scheduler custody concept: a task is considered to be in "BPF > > scheduler's custody" when it has been queued in user-created DSQs and > > the BPF scheduler is responsible for its lifecycle. Custody ends when > > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL), > > selected by core scheduling, or removed due to a property change. > > Strictly speaking, a task in BPF scheduler custody doesn't have to be queued > in a user-created DSQ. It could just reside on some custom data structure. Yeah... we definitely need to consider internal BPF queues. > > > > > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler > > entirely and are not in its custody. 
Terminal DSQs include: > > - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues > > where tasks go directly to execution. > > - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the > > BPF scheduler is considered "done" with the task. > > > > As a result, ops.dequeue() is not invoked for tasks dispatched to > > terminal DSQs, as the BPF scheduler no longer retains custody of them. > > Shouldn't it be "directly dispatched to terminal DSQs"? Ack. > > > > > To identify dequeues triggered by scheduling property changes, introduce > > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, > > the dequeue was caused by a scheduling property change. > > > > New ops.dequeue() semantics: > > - ops.dequeue() is invoked exactly once when the task leaves the BPF > > scheduler's custody, in one of the following cases: > > a) regular dispatch: a task dispatched to a user DSQ is moved to a > > terminal DSQ (ops.dequeue() called without any special flags set), > > I don't think the task has to be on a user DSQ. How about just "a task in BPF > scheduler's custody is dispatched to a terminal DSQ from ops.dispatch()"? Right. > > > b) core scheduling dispatch: core-sched picks task before dispatch, > > ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set, > > c) property change: task properties modified before dispatch, > > ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set. > > > > This allows BPF schedulers to: > > - reliably track task ownership and lifecycle, > > - maintain accurate accounting of managed tasks, > > - update internal state when tasks change properties. > > > ... > > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst > > index 404fe6126a769..ccd1fad3b3b92 100644 > > --- a/Documentation/scheduler/sched-ext.rst > > +++ b/Documentation/scheduler/sched-ext.rst > > @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed. 
> > > > * Queue the task on the BPF side. > > > > + **Task State Tracking and ops.dequeue() Semantics** > > + > > + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may > > + enter the "BPF scheduler's custody" depending on where it's dispatched: > > + > > + * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``, > > + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler > > + is done with the task - it either goes straight to a CPU's local run > > + queue or to the global DSQ as a fallback. The task never enters (or > > + exits) BPF custody, and ``ops.dequeue()`` will not be called. > > + > > + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the > > + BPF scheduler's custody. When the task later leaves BPF custody > > + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for > > + sleep/property changes), ``ops.dequeue()`` will be called exactly once. > > + > > + * **Queued on BPF side**: The task is in BPF data structures and in BPF > > + custody, ``ops.dequeue()`` will be called when it leaves. > > + > > + The key principle: **ops.dequeue() is called when a task leaves the BPF > > + scheduler's custody**. > > + > > + This works also with the ``ops.select_cpu()`` direct dispatch > > + optimization: even though it skips ``ops.enqueue()`` invocation, if the > > + task is dispatched to a user-created DSQ, it enters BPF custody and will > > + get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ, > > + the BPF scheduler is done with it immediately. This provides the > > + performance benefit of avoiding the ``ops.enqueue()`` roundtrip while > > + maintaining correct state tracking. > > + > > + The dequeue can happen for different reasons, distinguished by flags: > > + > > + 1. **Regular dispatch workflow**: when the task is dispatched from a > > + user-created DSQ to a terminal DSQ (leaving BPF custody for execution), > > + ``ops.dequeue()`` is triggered without any special flags. 
> > There's no requirement for the task do be on a user-created DSQ. Ditto. > > > + > > + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and > > + core scheduling picks a task for execution while it's still in BPF > > + custody, ``ops.dequeue()`` is called with the > > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. > > + > > + 3. **Scheduling property change**: when a task property changes (via > > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, > > + priority changes, CPU migrations, etc.) while the task is still in > > + BPF custody, ``ops.dequeue()`` is called with the > > + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. > > + > > + **Important**: Once a task has left BPF custody (dispatched to a > > + terminal DSQ), property changes will not trigger ``ops.dequeue()``, > > + since the task is no longer being managed by the BPF scheduler. > > + > > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If > > empty, it then looks at the global DSQ. If there still isn't a task to > > run, ``ops.dispatch()`` is invoked which can use the following two > ... > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > index bcb962d5ee7d8..35a88942810b4 100644 > > --- a/include/linux/sched/ext.h > > +++ b/include/linux/sched/ext.h > > @@ -84,6 +84,7 @@ struct scx_dispatch_q { > > /* scx_entity.flags */ > > enum scx_ent_flags { > > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > > + SCX_TASK_NEED_DEQ = 1 << 1, /* task needs ops.dequeue() */ > > I think this could use a comment that connects this flag to the concept of > BPF custody, so how about something like "task is in BPF custody, needs > ops.dequeue() when leaving it"? Ack. 
> > > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ > > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index 0bb8fa927e9e9..9ebca357196b4 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > ... > > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, > > dsq_mod_nr(dsq, 1); > > p->scx.dsq = dsq; > > > > + /* > > + * Handle ops.dequeue() and custody tracking. > > + * > > + * Builtin DSQs (local, global, bypass) are terminal: the BPF > > + * scheduler is done with the task. If it was in BPF custody, call > > + * ops.dequeue() and clear the flag. > > + * > > + * User DSQs: Task is in BPF scheduler's custody. Set the flag so > > + * ops.dequeue() will be called when it leaves. > > + */ > > + if (SCX_HAS_OP(sch, dequeue)) { > > + if (is_terminal_dsq(dsq->id)) { > > + if (p->scx.flags & SCX_TASK_NEED_DEQ) > > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, > > + rq, p, 0); > > + p->scx.flags &= ~SCX_TASK_NEED_DEQ; > > + } else { > > + p->scx.flags |= SCX_TASK_NEED_DEQ; > > + } > > + } > > + > > This is the only place where I see SCX_TASK_NEED_DEQ being set, which means > it won't be set if the enqueued task is queued on the BPF scheduler's internal > data structures rather than dispatched to a user-created DSQ. I don't think > that's the behavior we're aiming for. Right, I'll implement the right behavior (calling ops.dequeue()) for tasks stored in internal BPF queues. > > > @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > > > > switch (opss & SCX_OPSS_STATE_MASK) { > > case SCX_OPSS_NONE: > > + /* > > + * Task is not in BPF data structures (either dispatched to > > + * a DSQ or running). Only call ops.dequeue() if the task > > + * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ > > + * is set). 
> > + * > > + * If the task has already been dispatched to a terminal > > + * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF > > + * scheduler's custody and the flag will be clear, so we > > + * skip ops.dequeue(). > > + * > > + * If this is a property change (not sleep/core-sched) and > > + * the task is still in BPF custody, set the > > + * %SCX_DEQ_SCHED_CHANGE flag. > > + */ > > + if (SCX_HAS_OP(sch, dequeue) && > > + (p->scx.flags & SCX_TASK_NEED_DEQ)) > > + call_task_dequeue(sch, rq, p, deq_flags); > > break; > > case SCX_OPSS_QUEUEING: > > /* > > @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > > */ > > BUG(); > > case SCX_OPSS_QUEUED: > > + /* > > + * Task is still on the BPF scheduler (not dispatched yet). > > + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE > > + * only for property changes, not for core-sched picks or > > + * sleep. > > + */ > > The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in > call_task_dequeue(), not here. Ack. > > > if (SCX_HAS_OP(sch, dequeue)) > > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > > - p, deq_flags); > > + call_task_dequeue(sch, rq, p, deq_flags); > > How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in > call_task_dequeue()? Ack. Thanks for the review! -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() semantics 2026-02-05 15:32 [PATCHSET v6] sched_ext: Fix ops.dequeue() semantics Andrea Righi 2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi @ 2026-02-05 15:32 ` Andrea Righi 1 sibling, 0 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-05 15:32 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel Add a new kselftest to validate that the new ops.dequeue() semantics work correctly for all task lifecycle scenarios, including the distinction between terminal DSQs (where BPF scheduler is done with the task) and user DSQs (where BPF scheduler manages the task lifecycle), regardless of which callback performs the dispatch. The test validates 6 scenarios: - from ops.enqueue(): - scenario 0 (local DSQ): tasks dispatched to local DSQs bypass BPF scheduler entirely, they never enter BPF custody, so no ops.dequeue() should be called, - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also bypass BPF scheduler, like local DSQs, no ops.dequeue() should be called, - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full enqueue/dequeue lifecycle tracking and state machine validation (expects 1:1 enqueue/dequeue pairing). - from ops.select_cpu(): - scenario 3 (local DSQ): identical behavior to scenario 0, - scenario 4 (global DSQ): identical behavior to scenario 1, - scenario 5 (user DSQ): identical behavior to scenario 2. This verifies that: - terminal DSQ dispatch (local, global) don't trigger ops.dequeue(), - user DSQ dispatch has exact 1:1 ops.enqueue()/dequeue() pairing, - dispatch dequeues have no flags (normal workflow), - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set, - no duplicate enqueues or invalid state transitions are happening, - ops.enqueue() and ops.select_cpu() dispatch paths behave identically. 
Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- tools/testing/selftests/sched_ext/Makefile | 1 + .../testing/selftests/sched_ext/dequeue.bpf.c | 334 ++++++++++++++++++ tools/testing/selftests/sched_ext/dequeue.c | 222 ++++++++++++ 3 files changed, 557 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c create mode 100644 tools/testing/selftests/sched_ext/dequeue.c diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 5fe45f9c5f8fd..764e91edabf93 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs auto-test-targets := \ create_dsq \ + dequeue \ enq_last_no_enq_fails \ ddsp_bogus_dsq_fail \ ddsp_vtimelocal_fail \ diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c new file mode 100644 index 0000000000000..9b1950737a014 --- /dev/null +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c @@ -0,0 +1,334 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * A scheduler that validates ops.dequeue() is called correctly: + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF + * scheduler entirely: no ops.dequeue() should be called + * - Tasks dispatched to user DSQs enter BPF custody: ops.dequeue() must be + * called when they leave custody + * - Every ops.enqueue() for non-terminal DSQs is followed by exactly one + * ops.dequeue() (validate 1:1 pairing and state machine) + * + * Copyright (c) 2026 NVIDIA Corporation. 
+ */ + +#include <scx/common.bpf.h> + +#define SHARED_DSQ 0 + +char _license[] SEC("license") = "GPL"; + +UEI_DEFINE(uei); + +/* + * Counters to track the lifecycle of tasks: + * - enqueue_cnt: Number of times ops.enqueue() was called + * - dequeue_cnt: Number of times ops.dequeue() was called (any type) + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag) + * - change_dequeue_cnt: Number of property change dequeues + */ +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt; + +/* + * Test scenarios: + * 0) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 1) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 2) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler, + * dequeue callbacks expected) + * 3) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF + * scheduler, no dequeue callbacks) + * 4) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses + * BPF scheduler, no dequeue callbacks) + * 5) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler, + * dequeue callbacks expected) + */ +u32 test_scenario; + +/* + * Per-task state to track lifecycle and validate workflow semantics. 
+ * State transitions: + * NONE -> ENQUEUED (on enqueue) + * ENQUEUED -> DISPATCHED (on dispatch dequeue) + * DISPATCHED -> NONE (on property change dequeue or re-enqueue) + * ENQUEUED -> NONE (on property change dequeue before dispatch) + */ +enum task_state { + TASK_NONE = 0, + TASK_ENQUEUED, + TASK_DISPATCHED, +}; + +struct task_ctx { + enum task_state state; /* Current state in the workflow */ + u64 enqueue_seq; /* Sequence number for debugging */ +}; + +struct { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct task_ctx); +} task_ctx_stor SEC(".maps"); + +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p) +{ + return bpf_task_storage_get(&task_ctx_stor, p, 0, 0); +} + +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p, + s32 prev_cpu, u64 wake_flags) +{ + struct task_ctx *tctx; + + tctx = try_lookup_task_ctx(p); + if (!tctx) + return prev_cpu; + + switch (test_scenario) { + case 3: + /* + * Scenario 3: Direct dispatch to local DSQ from select_cpu. + * + * Task bypasses BPF scheduler entirely: no enqueue + * tracking, no dequeue callbacks. Behavior should be + * identical to scenario 0. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0); + return prev_cpu; + + case 4: + /* + * Scenario 4: Direct dispatch to global DSQ from select_cpu. + * + * Like scenario 3, task bypasses BPF scheduler entirely. + * Behavior should be identical to scenario 1. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0); + return prev_cpu; + + case 5: + /* + * Scenario 5: Dispatch to shared user DSQ from select_cpu. + * + * Task enters BPF scheduler management: track + * enqueue/dequeue lifecycle and validate state transitions. + * Behavior should be identical to scenario 2. + */ + __sync_fetch_and_add(&enqueue_cnt, 1); + + /* + * Validate state transition: enqueue is only valid from + * NONE or DISPATCHED states. 
Getting enqueue while in + * ENQUEUED state indicates a missing dequeue. + */ + if (tctx->state == TASK_ENQUEUED) + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + + /* Transition to ENQUEUED state */ + tctx->state = TASK_ENQUEUED; + tctx->enqueue_seq++; + + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0); + return prev_cpu; + + default: + /* For scenarios 0-2, bounce to ops.enqueue() */ + return prev_cpu; + } +} + +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags) +{ + struct task_ctx *tctx; + + tctx = try_lookup_task_ctx(p); + if (!tctx) + return; + + switch (test_scenario) { + case 0: + /* + * Scenario 0: Direct dispatch to the local DSQ. + * + * Task bypasses BPF scheduler entirely: no enqueue + * tracking, no dequeue callbacks. Don't increment counters + * or validate state since the task never enters BPF + * scheduler management. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags); + break; + + case 1: + /* + * Scenario 1: Direct dispatch to the global DSQ. + * + * Like scenario 0, task bypasses BPF scheduler entirely. + * SCX_DSQ_GLOBAL is a terminal DSQ, tasks dispatched to it + * leave BPF custody immediately, so no dequeue callbacks + * should be triggered. + */ + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + break; + + case 2: + /* + * Scenario 2: Dispatch to shared user DSQ. + * + * Task enters BPF scheduler management: track + * enqueue/dequeue lifecycle and validate state + * transitions. + */ + __sync_fetch_and_add(&enqueue_cnt, 1); + + /* + * Validate state transition: enqueue is only valid from + * NONE or DISPATCHED states. Getting enqueue while in + * ENQUEUED state indicates a missing dequeue. 
+ */ + if (tctx->state == TASK_ENQUEUED) + scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + + /* Transition to ENQUEUED state */ + tctx->state = TASK_ENQUEUED; + tctx->enqueue_seq++; + + scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags); + break; + default: + /* For scenarios 3-5 dispatch to the global DSQ */ + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + } +} + +void BPF_STRUCT_OPS(dequeue_dequeue, struct task_struct *p, u64 deq_flags) +{ + struct task_ctx *tctx; + + __sync_fetch_and_add(&dequeue_cnt, 1); + + tctx = try_lookup_task_ctx(p); + if (!tctx) + return; + + /* + * For scenarios 0, 1, 3, and 4 (terminal DSQs: local and global), + * ops.dequeue() should never be called because tasks bypass the + * BPF scheduler entirely. If we get here, it's a kernel bug. We + * don't track enqueues for these scenarios, so tctx->enqueue_seq + * will be 0. + */ + if (test_scenario == 0 || test_scenario == 3) { + scx_bpf_error("%d (%s): dequeue called for local DSQ scenario - kernel bug!", + p->pid, p->comm); + return; + } + if (test_scenario == 1 || test_scenario == 4) { + scx_bpf_error("%d (%s): dequeue called for global DSQ scenario - kernel bug!", + p->pid, p->comm); + return; + } + + /* + * Validate state: dequeue should only happen from ENQUEUED or + * DISPATCHED states. Getting dequeue from NONE indicates a bug. + */ + if (tctx->state == TASK_NONE) { + scx_bpf_error("%d (%s): dequeue from NONE state seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + return; + } + + if (deq_flags & SCX_DEQ_SCHED_CHANGE) { + /* + * Property change interrupting the workflow. Valid from + * both ENQUEUED and DISPATCHED states. Transitions task + * back to NONE state. 
+ */ + __sync_fetch_and_add(&change_dequeue_cnt, 1); + + /* Validate state transition */ + if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_DISPATCHED) + scx_bpf_error("%d (%s): invalid property change dequeue state=%d seq=%llu", + p->pid, p->comm, tctx->state, tctx->enqueue_seq); + + /* Transition back to NONE: task outside scheduler control */ + tctx->state = TASK_NONE; + } else { + /* + * Regular dispatch dequeue: normal workflow step. Valid + * only from ENQUEUED state (after enqueue, before dispatch + * dequeue). Transitions to DISPATCHED state. + */ + __sync_fetch_and_add(&dispatch_dequeue_cnt, 1); + + /* + * Dispatch dequeue should not have %SCX_DEQ_SCHED_CHANGE + * flag. + */ + if (deq_flags & SCX_DEQ_SCHED_CHANGE) + scx_bpf_error("%d (%s): SCX_DEQ_SCHED_CHANGE in dispatch dequeue seq=%llu", + p->pid, p->comm, tctx->enqueue_seq); + + /* + * Must be in ENQUEUED state. + */ + if (tctx->state != TASK_ENQUEUED) + scx_bpf_error("%d (%s): dispatch dequeue from state %d seq=%llu", + p->pid, p->comm, tctx->state, tctx->enqueue_seq); + + /* + * Transition to DISPATCHED: normal cycle completed + * dispatch. 
+ */ + tctx->state = TASK_DISPATCHED; + } +} + +void BPF_STRUCT_OPS(dequeue_dispatch, s32 cpu, struct task_struct *prev) +{ + scx_bpf_dsq_move_to_local(SHARED_DSQ); +} + +s32 BPF_STRUCT_OPS(dequeue_init_task, struct task_struct *p, + struct scx_init_task_args *args) +{ + struct task_ctx *tctx; + + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!tctx) + return -ENOMEM; + + return 0; +} + +s32 BPF_STRUCT_OPS_SLEEPABLE(dequeue_init) +{ + s32 ret; + + ret = scx_bpf_create_dsq(SHARED_DSQ, -1); + if (ret) + return ret; + + return 0; +} + +void BPF_STRUCT_OPS(dequeue_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops dequeue_ops = { + .select_cpu = (void *)dequeue_select_cpu, + .enqueue = (void *)dequeue_enqueue, + .dequeue = (void *)dequeue_dequeue, + .dispatch = (void *)dequeue_dispatch, + .init_task = (void *)dequeue_init_task, + .init = (void *)dequeue_init, + .exit = (void *)dequeue_exit, + .name = "dequeue_test", +}; diff --git a/tools/testing/selftests/sched_ext/dequeue.c b/tools/testing/selftests/sched_ext/dequeue.c new file mode 100644 index 0000000000000..02655b9525ffe --- /dev/null +++ b/tools/testing/selftests/sched_ext/dequeue.c @@ -0,0 +1,222 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2025 NVIDIA Corporation. + */ +#define _GNU_SOURCE +#include <stdio.h> +#include <unistd.h> +#include <signal.h> +#include <bpf/bpf.h> +#include <scx/common.h> +#include <sys/wait.h> +#include <sched.h> +#include <pthread.h> +#include "scx_test.h" +#include "dequeue.bpf.skel.h" + +#define NUM_WORKERS 8 + +/* + * Worker function that creates enqueue/dequeue events. It alternates + * between CPU work, sleeping, and affinity changes to trigger dequeues. 
+ */ +static void worker_fn(int id) +{ + cpu_set_t cpuset; + int i; + volatile int sum = 0; + + for (i = 0; i < 1000; i++) { + int j; + + /* Do some work to trigger scheduling events */ + for (j = 0; j < 10000; j++) + sum += j; + + /* Change affinity to trigger dequeue */ + if (i % 10 == 0) { + CPU_ZERO(&cpuset); + /* Rotate through the first 4 CPUs */ + CPU_SET(i % 4, &cpuset); + sched_setaffinity(0, sizeof(cpuset), &cpuset); + } + + /* Do additional work */ + for (j = 0; j < 10000; j++) + sum += j; + + /* Sleep to trigger dequeue */ + usleep(1000 + (id * 100)); + } + + exit(0); +} + +static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario, + const char *scenario_name) +{ + struct bpf_link *link; + pid_t pids[NUM_WORKERS]; + int i, status; + u64 enq_start, deq_start, dispatch_deq_start, change_deq_start; + u64 enq_delta, deq_delta, dispatch_deq_delta, change_deq_delta; + + /* Set the test scenario */ + skel->bss->test_scenario = scenario; + + /* Record starting counts */ + enq_start = skel->bss->enqueue_cnt; + deq_start = skel->bss->dequeue_cnt; + dispatch_deq_start = skel->bss->dispatch_dequeue_cnt; + change_deq_start = skel->bss->change_dequeue_cnt; + + link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops); + SCX_FAIL_IF(!link, "Failed to attach struct_ops for scenario %s", scenario_name); + + /* Fork worker processes to generate enqueue/dequeue events */ + for (i = 0; i < NUM_WORKERS; i++) { + pids[i] = fork(); + SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i); + + if (pids[i] == 0) { + worker_fn(i); + /* Should not reach here */ + exit(1); + } + } + + /* Wait for all workers to complete */ + for (i = 0; i < NUM_WORKERS; i++) { + SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i], + "Failed to wait for worker %d", i); + SCX_FAIL_IF(status != 0, "Worker %d exited with status %d", i, status); + } + + bpf_link__destroy(link); + + SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG)); + + /* Calculate deltas */ + enq_delta = 
skel->bss->enqueue_cnt - enq_start; + deq_delta = skel->bss->dequeue_cnt - deq_start; + dispatch_deq_delta = skel->bss->dispatch_dequeue_cnt - dispatch_deq_start; + change_deq_delta = skel->bss->change_dequeue_cnt - change_deq_start; + + printf("%s:\n", scenario_name); + printf(" enqueues: %lu\n", (unsigned long)enq_delta); + printf(" dequeues: %lu (dispatch: %lu, property_change: %lu)\n", + (unsigned long)deq_delta, + (unsigned long)dispatch_deq_delta, + (unsigned long)change_deq_delta); + + /* + * Validate enqueue/dequeue lifecycle tracking. + * + * For scenarios 0, 1, 3, 4 (local and global DSQs from both + * ops.enqueue() and ops.select_cpu()), both enqueues and dequeues + * should be 0 because tasks bypass the BPF scheduler entirely: tasks + * never enter BPF scheduler's custody. + * + * For scenarios 2 and 5 (user DSQ from both ops.enqueue() and + * ops.select_cpu()), we expect both enqueues and dequeues. + * + * The BPF code does strict state machine validation with + * scx_bpf_error() to ensure the workflow semantics are correct. If + * we reach this point without errors, the semantics are validated + * correctly. 
+ */ + if (scenario == 0 || scenario == 1 || scenario == 3 || scenario == 4) { + /* Terminal DSQs: tasks bypass BPF scheduler completely */ + SCX_EQ(enq_delta, 0); + SCX_EQ(deq_delta, 0); + SCX_EQ(dispatch_deq_delta, 0); + SCX_EQ(change_deq_delta, 0); + } else { + /* User DSQ: tasks enter BPF scheduler's custody */ + SCX_GT(enq_delta, 0); + SCX_GT(deq_delta, 0); + /* Validate 1:1 enqueue/dequeue pairing */ + SCX_EQ(enq_delta, deq_delta); + } + + return SCX_TEST_PASS; +} + +static enum scx_test_status setup(void **ctx) +{ + struct dequeue *skel; + + skel = dequeue__open(); + SCX_FAIL_IF(!skel, "Failed to open skel"); + SCX_ENUM_INIT(skel); + SCX_FAIL_IF(dequeue__load(skel), "Failed to load skel"); + + *ctx = skel; + + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct dequeue *skel = ctx; + enum scx_test_status status; + + status = run_scenario(skel, 0, "Scenario 0: Local DSQ from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 1, "Scenario 1: Global DSQ from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 2, "Scenario 2: User DSQ from ops.enqueue()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 3, "Scenario 3: Local DSQ from ops.select_cpu()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 4, "Scenario 4: Global DSQ from ops.select_cpu()"); + if (status != SCX_TEST_PASS) + return status; + + status = run_scenario(skel, 5, "Scenario 5: User DSQ from ops.select_cpu()"); + if (status != SCX_TEST_PASS) + return status; + + printf("\n=== Summary ===\n"); + printf("Total enqueues: %lu\n", (unsigned long)skel->bss->enqueue_cnt); + printf("Total dequeues: %lu\n", (unsigned long)skel->bss->dequeue_cnt); + printf(" Dispatch dequeues: %lu (no flag, normal workflow)\n", + (unsigned long)skel->bss->dispatch_dequeue_cnt); + printf(" Property change dequeues: %lu 
(SCX_DEQ_SCHED_CHANGE flag)\n", + (unsigned long)skel->bss->change_dequeue_cnt); + printf("\nAll scenarios passed - no state machine violations detected\n"); + printf("-> Validated: Local DSQ dispatch bypasses BPF scheduler (both paths)\n"); + printf("-> Validated: Global DSQ dispatch bypasses BPF scheduler (both paths)\n"); + printf("-> Validated: User DSQ dispatch triggers dequeue callbacks (both paths)\n"); + printf("-> Validated: ops.enqueue() and ops.select_cpu() behave identically\n"); + printf("-> Validated: Dispatch dequeues have no flags (normal workflow)\n"); + printf("-> Validated: Property change dequeues have SCX_DEQ_SCHED_CHANGE flag\n"); + printf("-> Validated: No duplicate enqueues or invalid state transitions\n"); + + return SCX_TEST_PASS; +} + +static void cleanup(void *ctx) +{ + struct dequeue *skel = ctx; + + dequeue__destroy(skel); +} + +struct scx_test dequeue_test = { + .name = "dequeue", + .description = "Verify ops.dequeue() semantics", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; + +REGISTER_SCX_TEST(&dequeue_test) -- 2.53.0 ^ permalink raw reply related [flat|nested] 81+ messages in thread
* [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-10 21:26 Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-10 21:26 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
sched-ext, linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g., sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
To identify property change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
Changes in v8:
- Rename SCX_TASK_NEED_DEQ -> SCX_TASK_IN_CUSTODY and set/clear this flag
also when ops.dequeue() is not implemented (can be used for other
purposes in the future)
- Clarify ops.select_cpu() behavior: dispatch to terminal DSQs doesn't
trigger ops.dequeue(), dispatch to user DSQs triggers ops.dequeue(),
store to BPF-internal data structure is discouraged
- Link to v7:
https://lore.kernel.org/all/20260206135742.2339918-1-arighi@nvidia.com
Changes in v7:
- Handle tasks stored to BPF internal data structures (trigger
ops.dequeue())
- Add a kselftest scenario with a BPF queue to verify ops.dequeue()
behavior with tasks stored in internal BPF data structures
- Link to v6:
https://lore.kernel.org/all/20260205153304.1996142-1-arighi@nvidia.com
Changes in v6:
- Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DSQ
- Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
DSQs (local, global, bypass)
- Centralize ops.dequeue() logic in dispatch_enqueue()
- Remove "Property Change Notifications for Running Tasks" section from
the documentation
- The kselftest now validates the right behavior both from ops.enqueue()
and ops.select_cpu()
- Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com
Changes in v5:
- Introduce the concept of "terminal DSQ" (when a task is dispatched to a
terminal DSQ, the task leaves the BPF scheduler's custody)
- Consider SCX_DSQ_GLOBAL as a terminal DSQ
- Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for direct dispatches to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
custody, further dequeue events are not reported.
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
 Documentation/scheduler/sched-ext.rst           |  78 ++++-
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 155 ++++++++--
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 368 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 265 +++++++++++++++++
 10 files changed, 855 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
@ 2026-02-10 21:26 ` Andrea Righi
  2026-02-10 23:20   ` Tejun Heo
  2026-02-10 23:54   ` Tejun Heo
  0 siblings, 2 replies; 81+ messages in thread
From: Andrea Righi @ 2026-02-10 21:26 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
    sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.,
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change. Tasks directly dispatched to terminal DSQs
bypass the BPF scheduler entirely and are never in its custody.

Terminal DSQs include:
 - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
   where tasks go directly to execution.
 - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
   BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: a task dispatched to a user DSQ or stored in
      internal BPF data structures is moved to a terminal DSQ
      (ops.dequeue() called without any special flags set),
   b) core scheduling dispatch: core-sched picks the task before
      dispatch (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag
      set),
   c) property change: task properties are modified before dispatch
      (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.
Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         |  78 ++++++++-
 include/linux/sched/ext.h                     |   1 +
 kernel/sched/ext.c                            | 155 ++++++++++++++++--
 kernel/sched/ext_internal.h                   |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h   |   1 +
 7 files changed, 221 insertions(+), 24 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..21c65e504da7c 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -229,16 +229,23 @@ The following briefly shows how a waking task is scheduled and executed.
    scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
    using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
 
-   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
-   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
-   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
-   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
-   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
-   ``ops.enqueue()`` callback to be skipped.
-
    Note that the scheduler core will ignore an invalid CPU selection, for
    example, if it's outside the allowed cpumask of the task.
 
+   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+   by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.
+
+   If the task is inserted into ``SCX_DSQ_LOCAL`` from
+   ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
+   is returned from ``ops.select_cpu()``. Additionally, inserting directly
+   from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
+   be skipped.
+
+   Any other attempt to store a task in BPF-internal data structures from
+   ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
+   invoked. This is discouraged, as it can introduce racy or inconsistent
+   state.
+
 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
    task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
    can make one of the following decisions:
@@ -252,6 +259,61 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   A task is in the "BPF scheduler's custody" when the BPF scheduler is
+   responsible for managing its lifecycle. A task enters custody when it is
+   dispatched to a user DSQ or stored in the BPF scheduler's internal data
+   structures. Custody is entered only from ``ops.enqueue()`` for those
+   operations. The only exception is dispatching to a user DSQ from
+   ``ops.select_cpu()``: although the task is not yet technically in BPF
+   scheduler custody at that point, the dispatch has the same semantic
+   effect as dispatching from ``ops.enqueue()`` for custody-related
+   semantics.
+
+   Once ``ops.enqueue()`` is called, the task may or may not enter custody
+   depending on what the scheduler does:
+
+   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly
+     once.
+
+   * **Stored in BPF data structures** (e.g., internal BPF queues): the
+     task is in BPF custody. ``ops.dequeue()`` will be called when it
+     leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
+     on property change / sleep).
+
+   When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
+      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
+      execution), ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (e.g. after being
+   dispatched to a terminal DSQ), property changes will not trigger
+   ``ops.dequeue()``, since the task is no longer managed by the BPF
+   scheduler.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ.
    If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +381,8 @@ by a sched_ext scheduler:
 
             /* Any usable CPU becomes available */
             ops.dispatch(); /* Task is moved to a local DSQ */
+
+            ops.dequeue(); /* Exiting BPF scheduler */
         }
         ops.running();  /* Task starts running on its assigned CPU */
         while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..4601e5ecb43c0 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_IN_CUSTODY	= 1 << 1, /* in custody, needs ops.dequeue() when leaving */
 
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bb8fa927e9e9..5f7c9088f90a9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
+ * scheduler is considered "done" with the task.
+ *
+ * Builtin DSQs include:
+ * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
+ *   where tasks go directly to execution,
+ * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
+ * - Bypass DSQ: used during bypass mode.
+ *
+ * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
 		resched_curr(rq);
 }
 
-static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+			     struct scx_dispatch_q *dsq,
 			     struct task_struct *p, u64 enq_flags)
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1103,6 +1125,23 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
 	dsq_mod_nr(dsq, 1);
 	p->scx.dsq = dsq;
 
+	/*
+	 * Handle ops.dequeue() and custody tracking.
+	 *
+	 * Terminal DSQs: the BPF scheduler is done with the task. If it
+	 * was in BPF custody, call ops.dequeue() and clear the flag.
+	 *
+	 * Non-terminal DSQs: task is in BPF scheduler's custody.
+	 */
+	if (is_terminal_dsq(dsq->id)) {
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    (p->scx.flags & SCX_TASK_IN_CUSTODY))
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+	} else {
+		p->scx.flags |= SCX_TASK_IN_CUSTODY;
+	}
+
 	/*
 	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
 	 * direct dispatch path, but we clear them here because the direct
@@ -1323,7 +1362,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 		return;
 	}
 
-	dispatch_enqueue(sch, dsq, p,
+	dispatch_enqueue(sch, rq, dsq, p,
 			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
@@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 * dequeue may be waiting. The store_release matches their load_acquire.
 	 */
 	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+
+	/*
+	 * Task is now in BPF scheduler's custody.
+	 * Set %SCX_TASK_IN_CUSTODY
+	 * so ops.dequeue() is called when it leaves custody.
+	 */
+	p->scx.flags |= SCX_TASK_IN_CUSTODY;
 	return;
 
 direct:
 	direct_dispatch(sch, p, enq_flags);
 	return;
 
 local_norefill:
-	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
 	return;
 
 local:
 	dsq = &rq->scx.local_dsq;
@@ -1433,7 +1478,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
-	dispatch_enqueue(sch, dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
@@ -1511,6 +1556,27 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
 }
 
+/*
+ * Call ops.dequeue() for a task leaving BPF custody.
+ */
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+			      struct task_struct *p, u64 deq_flags,
+			      bool is_sched_change)
+{
+	if (SCX_HAS_OP(sch, dequeue)) {
+		/*
+		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
+		 * property change (not sleep or core-sched pick).
+		 */
+		if (is_sched_change &&
+		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+			deq_flags |= SCX_DEQ_SCHED_CHANGE;
+
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
+	}
+	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+}
+
 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 {
 	struct scx_sched *sch = scx_root;
@@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * If the task is still in BPF scheduler's custody
+		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
+		 */
+		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+			call_task_dequeue(sch, rq, p, deq_flags, true);
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
		 * QUEUEING is started and finished while holding @p's rq lock.
		 * As we're holding the rq lock now, we shouldn't see QUEUEING.
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is in BPF scheduler's custody (not dispatched yet).
+		 * Call ops.dequeue() to notify that it's leaving custody.
+		 */
+		WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
+		call_task_dequeue(sch, rq, p, deq_flags, true);
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+	 * Call ops.dequeue() if the task was in BPF custody.
+	 */
+	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
+		if (SCX_HAS_OP(sch, dequeue))
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
 	       !WARN_ON_ONCE(src_rq != task_rq(p));
 }
 
-static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
-				struct scx_dispatch_q *dsq, struct rq *src_rq)
+static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
+				struct task_struct *p,
+				struct scx_dispatch_q *dsq, struct rq *src_rq)
 {
 	raw_spin_rq_unlock(this_rq);
 
 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+		/*
+		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+		 * Call ops.dequeue() if the task was in BPF custody.
+		 */
+		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+			call_task_dequeue(sch, src_rq, p, 0, false);
 		move_remote_task_to_local_dsq(p, 0, src_rq, this_rq);
 		return true;
 	} else {
@@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 					      src_dsq, dst_rq);
 		raw_spin_unlock(&src_dsq->lock);
 	} else {
+		/*
+		 * Moving to a local DSQ, dispatch_enqueue() is not
+		 * used, so call ops.dequeue() here if the task was
+		 * in BPF scheduler's custody.
+		 */
+		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+			call_task_dequeue(sch, src_rq, p, 0, false);
 		raw_spin_unlock(&src_dsq->lock);
 		move_remote_task_to_local_dsq(p, enq_flags,
 					      src_rq, dst_rq);
@@ -1879,7 +1979,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dispatch_dequeue_locked(p, src_dsq);
 		raw_spin_unlock(&src_dsq->lock);
 
-		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
+		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
 	}
 
 	return dst_rq;
@@ -1922,7 +2022,7 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 		}
 
 		if (task_can_run_on_remote_rq(sch, p, rq, false)) {
-			if (likely(consume_remote_task(rq, p, dsq, task_rq)))
+			if (likely(consume_remote_task(sch, rq, p, dsq, task_rq)))
 				return true;
 			goto retry;
 		}
@@ -1969,14 +2069,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 * If dispatching to @rq that @p is already on, no lock dancing needed.
 	 */
 	if (rq == src_rq && rq == dst_rq) {
-		dispatch_enqueue(sch, dst_dsq, p,
+		dispatch_enqueue(sch, rq, dst_dsq, p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
 
 	if (src_rq != dst_rq &&
 	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
+		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
@@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 */
 	if (src_rq == dst_rq) {
 		p->scx.holding_cpu = -1;
-		dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
+		dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
 				 enq_flags);
 	} else {
+		/*
+		 * Moving to a local DSQ, dispatch_enqueue() is not
+		 * used, so call ops.dequeue() here if the task was
+		 * in BPF scheduler's custody.
+		 */
+		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+			call_task_dequeue(sch, src_rq, p, 0, false);
 		move_remote_task_to_local_dsq(p, enq_flags,
 					      src_rq, dst_rq);
 
 		/* task has been moved to dst_rq, which is now locked */
@@ -2113,7 +2220,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	if (dsq->id == SCX_DSQ_LOCAL)
 		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	else
-		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
@@ -2414,7 +2521,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 * DSQ.
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
-			dispatch_enqueue(sch, &rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
 					 SCX_ENQ_HEAD);
 			goto switch_class;
 		}
@@ -2898,6 +3005,13 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2929,6 +3043,13 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
 }
 
 static void scx_exit_task(struct task_struct *p)
@@ -3919,7 +4040,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
 	 * between bypass DSQs.
 	 */
 	dispatch_dequeue_locked(p, donor_dsq);
-	dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+	dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
 
 	/*
 	 * $donee might have been idle and need to be woken up. No need
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.53.0
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-10 23:20   ` Tejun Heo
  2026-02-11 16:06     ` Andrea Righi
  0 siblings, 1 reply; 81+ messages in thread
From: Tejun Heo @ 2026-02-10 23:20 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
    Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> +/**
> + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> + * @dsq_id: DSQ ID to check
> + *
> + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> + * scheduler is considered "done" with the task.
> + *
> + * Builtin DSQs include:
> + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> + *   where tasks go directly to execution,
> + * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> + * - Bypass DSQ: used during bypass mode.
> + *
> + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> + * trigger ops.dequeue() when they are later consumed.
> + */
> +static inline bool is_terminal_dsq(u64 dsq_id)
> +{
> +	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
> +}

Please use () to clarify ordering between & and &&. It's just visually
confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
@dsq_id and then it can just do:

	return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;

because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
id, and the above code positively identifies what's terminal.

> -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> +			     struct scx_dispatch_q *dsq,
>  			     struct task_struct *p, u64 enq_flags)

While minor, this patch would be easier to read if the @rq addition were
done in a separate patch.
> +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> +			      struct task_struct *p, u64 deq_flags,
> +			      bool is_sched_change)

Isn't @is_sched_change a bit of a misnomer given that it needs to exclude
SCX_DEQ_CORE_SCHED_EXEC? I wonder whether it'd be easier if @deq_flags
handling is separated out. This part is ops_dequeue() specific, right?
Everyone else statically knows what DEQ flags to use. That might make
ops_dequeue() calculate flags unnecessarily but ops_dequeue() is not
particularly hot, so I don't think that'd matter.

> +{
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		/*
> +		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
> +		 * property change (not sleep or core-sched pick).
> +		 */
> +		if (is_sched_change &&
> +		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +			deq_flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
> +	}
> +	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;

Let's move flag clearing to the call sites. It's a bit confusing w/ the
function name.

> static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> {
> 	struct scx_sched *sch = scx_root;
> @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
> 	switch (opss & SCX_OPSS_STATE_MASK) {
> 	case SCX_OPSS_NONE:
> +		/*
> +		 * If the task is still in BPF scheduler's custody
> +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> +		 */
> +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> +			call_task_dequeue(sch, rq, p, deq_flags, true);

Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
responsible for clearing IN_CUSTODY too?
> @@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> 					 struct scx_dispatch_q *src_dsq,
> 					 struct rq *dst_rq)
> {
> +	struct scx_sched *sch = scx_root;
> 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>
> 	/* @dsq is locked and @p is on @dst_rq */
> @@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>
> 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>
> +	/*
> +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> +	 * Call ops.dequeue() if the task was in BPF custody.
> +	 */
> +	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
> +		if (SCX_HAS_OP(sch, dequeue))
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> +		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> +	}

I think a better place to put this would be inside local_dsq_post_enq() so
that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
path. This would mean breaking out local and global cases in
dispatch_enqueue(). ie. at the end of dispatch_enqueue():

	if (is_local) {
		local_dsq_post_enq(...);
	} else {
		if (dsq->id == SCX_DSQ_GLOBAL)
			global_dsq_post_enq(...); /* or open code with comment */
		raw_spin_unlock(&dsq->lock);
	}

> @@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
> 	       !WARN_ON_ONCE(src_rq != task_rq(p));
> }
>
> -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> -				struct scx_dispatch_q *dsq, struct rq *src_rq)
> +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> +				struct task_struct *p,
> +				struct scx_dispatch_q *dsq, struct rq *src_rq)
> {
> 	raw_spin_rq_unlock(this_rq);
>
> 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> +		/*
> +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> +		 * Call ops.dequeue() if the task was in BPF custody.
> +		 */
> +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> +			call_task_dequeue(sch, src_rq, p, 0, false);

and this shouldn't be necessary.
move_remote_task_to_local_dsq() deactivates and reactivates the task. The deactivation invokes ops_dequeue() but that should suppress dequeue invocation as that's internal transfer (this is discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should trigger dequeue invocation, right? > @@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, > src_dsq, dst_rq); > raw_spin_unlock(&src_dsq->lock); > } else { > + /* > + * Moving to a local DSQ, dispatch_enqueue() is not > + * used, so call ops.dequeue() here if the task was > + * in BPF scheduler's custody. > + */ > + if (p->scx.flags & SCX_TASK_IN_CUSTODY) > + call_task_dequeue(sch, src_rq, p, 0, false); and then this becomes unnecessary too. > @@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > */ > if (src_rq == dst_rq) { > p->scx.holding_cpu = -1; > - dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p, > + dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, > enq_flags); > } else { > + /* > + * Moving to a local DSQ, dispatch_enqueue() is not > + * used, so call ops.dequeue() here if the task was > + * in BPF scheduler's custody. > + */ > + if (p->scx.flags & SCX_TASK_IN_CUSTODY) > + call_task_dequeue(sch, src_rq, p, 0, false); ditto. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 23:20       ` Tejun Heo
@ 2026-02-11 16:06         ` Andrea Righi
  2026-02-11 19:47           ` Tejun Heo
  0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-11 16:06 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hi Tejun,

On Tue, Feb 10, 2026 at 01:20:11PM -1000, Tejun Heo wrote:
> On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> > +/**
> > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> > + * @dsq_id: DSQ ID to check
> > + *
> > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> > + * scheduler is considered "done" with the task.
> > + *
> > + * Builtin DSQs include:
> > + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > + *    where tasks go directly to execution,
> > + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> > + *  - Bypass DSQ: used during bypass mode.
> > + *
> > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> > + * trigger ops.dequeue() when they are later consumed.
> > + */
> > +static inline bool is_terminal_dsq(u64 dsq_id)
> > +{
> > +	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
> > +}
> 
> Please use () to clarify ordering between & and &&. It's just visually
> confusing. I wonder whether it'd be cleaner to make it take @dsq instead
> of @dsq_id and then it can just do:
> 
> 	return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> 
> because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> id, and the above code positively identifies what's terminal.

Ok, but we also need to include SCX_DSQ_BYPASS, in that case maybe
checking SCX_DSQ_FLAG_BUILTIN is more generic?

> > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > +			     struct scx_dispatch_q *dsq,
> > 			     struct task_struct *p, u64 enq_flags)
> 
> While minor, this patch would be easier to read if the @rq addition were
> done in a separate patch.

Ack. I'll split that out.

> > +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> > +			      struct task_struct *p, u64 deq_flags,
> > +			      bool is_sched_change)
> 
> Isn't @is_sched_change a bit of a misnomer given that it needs to exclude
> SCX_DEQ_CORE_SCHED_EXEC? I wonder whether it'd be easier if @deq_flags
> handling is separated out. This part is ops_dequeue() specific, right?
> Everyone else statically knows what DEQ flags to use. That might make
> ops_dequeue() calculate flags unnecessarily but ops_dequeue() is not
> particularly hot, so I don't think that'd matter.

Ack, I'll handle deq_flags in ops_dequeue() and simplify
call_task_dequeue() accordingly.

> > +{
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		/*
> > +		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
> > +		 * property change (not sleep or core-sched pick).
> > +		 */
> > +		if (is_sched_change &&
> > +		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +			deq_flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
> > +	}
> > +	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> 
> Let's move flag clearing to the call sites. It's a bit confusing w/ the
> function name.

Ack.

> > static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > {
> > 	struct scx_sched *sch = scx_root;
> > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > 
> > 	switch (opss & SCX_OPSS_STATE_MASK) {
> > 	case SCX_OPSS_NONE:
> > +		/*
> > +		 * If the task is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> 
> Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> responsible for clearing IN_CUSTODY too?

The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
to a remote CPU's local DSQ, we only set ops_state to NONE, so a
concurrent dequeue can proceed, but we only clear IN_CUSTODY when we later
enqueue or move the task. So we can see NONE + IN_CUSTODY here and need to
handle it. And we can't clear IN_CUSTODY at the same time we set NONE
there, because we don't hold the task's rq lock yet and we can't trigger
ops.dequeue().

> > @@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> > 					 struct scx_dispatch_q *src_dsq,
> > 					 struct rq *dst_rq)
> > {
> > +	struct scx_sched *sch = scx_root;
> > 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> > 
> > 	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> > 
> > 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> > 
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +	 * Call ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
> > +		if (SCX_HAS_OP(sch, dequeue))
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > +	}
> 
> I think a better place to put this would be inside local_dsq_post_enq()
> so that dispatch_enqueue() and move_local_task_to_local_dsq() can share
> the path. This would mean breaking out local and global cases in
> dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> 
> 	if (is_local) {
> 		local_dsq_post_enq(...);
> 	} else {
> 		if (dsq->id == SCX_DSQ_GLOBAL)
> 			global_dsq_post_enq(...);	/* or open code with comment */
> 		raw_spin_unlock(&dsq->lock);
> 	}

Agreed, I'll move this into local_dsq_post_enq() and introduce
a global_dsq_post_enq().

> > @@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
> > 		!WARN_ON_ONCE(src_rq != task_rq(p));
> > }
> > 
> > -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> > -				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > +				struct task_struct *p,
> > +				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > {
> > 	raw_spin_rq_unlock(this_rq);
> > 
> > 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > +		/*
> > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +		 * Call ops.dequeue() if the task was in BPF custody.
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and this shouldn't be necessary. move_remote_task_to_local_dsq()
> deactivates and reactivates the task. The deactivation invokes
> ops_dequeue() but that should suppress dequeue invocation as that's an
> internal transfer (this is discernible from p->on_rq being set to
> TASK_ON_RQ_MIGRATING) and when it gets enqueued on the target CPU,
> dispatch_enqueue() on the local DSQ should trigger dequeue invocation,
> right?

Should we trigger ops.dequeue() when the task is dequeued inside
move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
deactivate_task() there) instead of suppressing it and invoking on the
target in local_dsq_post_enq()?

That way the BPF scheduler sees dequeue on the source and then enqueue on
the target, we avoid special-casing SCX_TASK_IN_CUSTODY in
do_enqueue_task(), and the "when to call dequeue" logic stays consistent
between ops_dequeue() and the terminal local/global post_enq paths.

Does it make sense or would you rather suppress it and only invoke on the
target when the task lands on the local DSQ?

> > @@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> > 					      src_dsq, dst_rq);
> > 		raw_spin_unlock(&src_dsq->lock);
> > 	} else {
> > +		/*
> > +		 * Moving to a local DSQ, dispatch_enqueue() is not
> > +		 * used, so call ops.dequeue() here if the task was
> > +		 * in BPF scheduler's custody.
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and then this becomes unnecessary too.

Ack + same comment about consume_remote_task().

> > @@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> > 	 */
> > 	if (src_rq == dst_rq) {
> > 		p->scx.holding_cpu = -1;
> > -		dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> > +		dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
> > 				 enq_flags);
> > 	} else {
> > +		/*
> > +		 * Moving to a local DSQ, dispatch_enqueue() is not
> > +		 * used, so call ops.dequeue() here if the task was
> > +		 * in BPF scheduler's custody.
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, src_rq, p, 0, false);
> 
> ditto.

Ack + same as above.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 16:06         ` Andrea Righi
@ 2026-02-11 19:47           ` Tejun Heo
  2026-02-11 22:34             ` Andrea Righi
  0 siblings, 1 reply; 81+ messages in thread
From: Tejun Heo @ 2026-02-11 19:47 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Wed, Feb 11, 2026 at 05:06:20PM +0100, Andrea Righi wrote:
...
> > Please use () to clarify ordering between & and &&. It's just visually
> > confusing. I wonder whether it'd be cleaner to make it take @dsq
> > instead of @dsq_id and then it can just do:
> > 
> > 	return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> > 
> > because SCX_DSQ_LOCAL_ON is only used as the designator not as actual
> > DSQ id, and the above code positively identifies what's terminal.
> 
> Ok, but we also need to include SCX_DSQ_BYPASS, in that case maybe
> checking SCX_DSQ_FLAG_BUILTIN is more generic?

Ah, forgot about that. Hmm... we can do:

	switch (dsq->id) {
	case SCX_DSQ_LOCAL:
	case SCX_DSQ_GLOBAL:
	case SCX_DSQ_BYPASS:
		return true;
	default:
		return false;
	}

I just feel iffy about not being specific. Easier to make mistakes in the
future and more difficult to notice after doing so, but I think this point
is kinda moot. If we break up LOCAL and GLOBAL/BYPASS handling into
separate paths in dispatch_enqueue(), we won't need this function anyway.

> > > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > > 
> > > 	switch (opss & SCX_OPSS_STATE_MASK) {
> > > 	case SCX_OPSS_NONE:
> > > +		/*
> > > +		 * If the task is still in BPF scheduler's custody
> > > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > > +		 */
> > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> > 
> > Hmm... why is this path necessary? Shouldn't the one that cleared OPSS
> > be responsible for clearing IN_CUSTODY too?
> 
> The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
> dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
> to a remote CPU's local DSQ, we only set ops_state to NONE, so a
> concurrent dequeue can proceed, but we only clear IN_CUSTODY when we
> later enqueue or move the task. So we can see NONE + IN_CUSTODY here and
> need to handle it. And we can't clear IN_CUSTODY at the same time we set
> NONE there, because we don't hold the task's rq lock yet and we can't
> trigger ops.dequeue().

I see. Can you please add a comment with the above?

...
> > I think a better place to put this would be inside local_dsq_post_enq()
> > so that dispatch_enqueue() and move_local_task_to_local_dsq() can share
> > the path. This would mean breaking out local and global cases in
> > dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> > 
> > 	if (is_local) {
> > 		local_dsq_post_enq(...);
> > 	} else {
> > 		if (dsq->id == SCX_DSQ_GLOBAL)
> > 			global_dsq_post_enq(...);	/* or open code with comment */
> > 		raw_spin_unlock(&dsq->lock);
> > 	}
> 
> Agreed, I'll move this into local_dsq_post_enq() and introduce
> a global_dsq_post_enq().

Yeah, and as you pointed out, BYPASS.

> > > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > > +				struct task_struct *p,
> > > +				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > > {
> > > 	raw_spin_rq_unlock(this_rq);
> > > 
> > > 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > > +		/*
> > > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > > +		 * Call ops.dequeue() if the task was in BPF custody.
> > > +		 */
> > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > +			call_task_dequeue(sch, src_rq, p, 0, false);
> > 
> > and this shouldn't be necessary. move_remote_task_to_local_dsq()
> > deactivates and reactivates the task. The deactivation invokes
> > ops_dequeue() but that should suppress dequeue invocation as that's an
> > internal transfer (this is discernible from p->on_rq being set to
> > TASK_ON_RQ_MIGRATING) and when it gets enqueued on the target CPU,
> > dispatch_enqueue() on the local DSQ should trigger dequeue invocation,
> > right?
> 
> Should we trigger ops.dequeue() when the task is dequeued inside
> move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered
> by deactivate_task() there) instead of suppressing it and invoking on the
> target in local_dsq_post_enq()?
> 
> That way the BPF scheduler sees dequeue on the source and then enqueue on
> the target, we avoid special-casing SCX_TASK_IN_CUSTODY in
> do_enqueue_task(), and the "when to call dequeue" logic stays consistent
> between ops_dequeue() and the terminal local/global post_enq paths.
> 
> Does it make sense or would you rather suppress it and only invoke on the
> target when the task lands on the local DSQ?

The end result is about the same because whenever we migrate we're sending
it to the local DSQ of the destination CPU, so whether we generate the
event on deactivation of the source CPU or activation on the destination
doesn't make a *whole* lot of difference. However, conceptually,
migrations are internal events. There isn't anything actionable for the
BPF scheduler. The reason why ops.dequeue() should be emitted is not
because the task is changing CPUs (which caused the deactivation) but the
fact that it ends up in a local DSQ afterwards. I think it'll be cleaner
both conceptually and code-wise to emit ops.dequeue() only from
dispatch_enqueue() and dequeue paths.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 19:47           ` Tejun Heo
@ 2026-02-11 22:34             ` Andrea Righi
  2026-02-11 22:37               ` Tejun Heo
  0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-11 22:34 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed, Feb 11, 2026 at 09:47:57AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 11, 2026 at 05:06:20PM +0100, Andrea Righi wrote:
> ...
> > > Please use () to clarify ordering between & and &&. It's just
> > > visually confusing. I wonder whether it'd be cleaner to make it take
> > > @dsq instead of @dsq_id and then it can just do:
> > > 
> > > 	return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> > > 
> > > because SCX_DSQ_LOCAL_ON is only used as the designator not as
> > > actual DSQ id, and the above code positively identifies what's
> > > terminal.
> > 
> > Ok, but we also need to include SCX_DSQ_BYPASS, in that case maybe
> > checking SCX_DSQ_FLAG_BUILTIN is more generic?
> 
> Ah, forgot about that. Hmm... we can do:
> 
> 	switch (dsq->id) {
> 	case SCX_DSQ_LOCAL:
> 	case SCX_DSQ_GLOBAL:
> 	case SCX_DSQ_BYPASS:
> 		return true;
> 	default:
> 		return false;
> 	}
> 
> I just feel iffy about not being specific. Easier to make mistakes in
> the future and more difficult to notice after doing so, but I think this
> point is kinda moot. If we break up LOCAL and GLOBAL/BYPASS handling
> into separate paths in dispatch_enqueue(), we won't need this function
> anyway.

Ack, makes sense.

> > > > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > > > 
> > > > 	switch (opss & SCX_OPSS_STATE_MASK) {
> > > > 	case SCX_OPSS_NONE:
> > > > +		/*
> > > > +		 * If the task is still in BPF scheduler's custody
> > > > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > > > +		 */
> > > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> > > 
> > > Hmm... why is this path necessary? Shouldn't the one that cleared
> > > OPSS be responsible for clearing IN_CUSTODY too?
> > 
> > The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
> > dispatch_to_local_dsq(), when we're moving a task that was in
> > DISPATCHING to a remote CPU's local DSQ, we only set ops_state to
> > NONE, so a concurrent dequeue can proceed, but we only clear
> > IN_CUSTODY when we later enqueue or move the task. So we can see NONE
> > + IN_CUSTODY here and need to handle it. And we can't clear IN_CUSTODY
> > at the same time we set NONE there, because we don't hold the task's
> > rq lock yet and we can't trigger ops.dequeue().
> 
> I see. Can you please add a comment with the above?

Ok.

> ...
> > > I think a better place to put this would be inside
> > > local_dsq_post_enq() so that dispatch_enqueue() and
> > > move_local_task_to_local_dsq() can share the path. This would mean
> > > breaking out local and global cases in dispatch_enqueue(). ie. at
> > > the end of dispatch_enqueue():
> > > 
> > > 	if (is_local) {
> > > 		local_dsq_post_enq(...);
> > > 	} else {
> > > 		if (dsq->id == SCX_DSQ_GLOBAL)
> > > 			global_dsq_post_enq(...);	/* or open code with comment */
> > > 		raw_spin_unlock(&dsq->lock);
> > > 	}
> > 
> > Agreed, I'll move this into local_dsq_post_enq() and introduce
> > a global_dsq_post_enq().
> 
> Yeah, and as you pointed out, BYPASS.

Ok.

> > > > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > > > +				struct task_struct *p,
> > > > +				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > > > {
> > > > 	raw_spin_rq_unlock(this_rq);
> > > > 
> > > > 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > > > +		/*
> > > > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > > > +		 * Call ops.dequeue() if the task was in BPF custody.
> > > > +		 */
> > > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > > +			call_task_dequeue(sch, src_rq, p, 0, false);
> > > 
> > > and this shouldn't be necessary. move_remote_task_to_local_dsq()
> > > deactivates and reactivates the task. The deactivation invokes
> > > ops_dequeue() but that should suppress dequeue invocation as that's
> > > an internal transfer (this is discernible from p->on_rq being set to
> > > TASK_ON_RQ_MIGRATING) and when it gets enqueued on the target CPU,
> > > dispatch_enqueue() on the local DSQ should trigger dequeue
> > > invocation, right?
> > 
> > Should we trigger ops.dequeue() when the task is dequeued inside
> > move_remote_task_to_local_dsq() (in ops_dequeue() on the path
> > triggered by deactivate_task() there) instead of suppressing it and
> > invoking on the target in local_dsq_post_enq()?
> > 
> > That way the BPF scheduler sees dequeue on the source and then enqueue
> > on the target, we avoid special-casing SCX_TASK_IN_CUSTODY in
> > do_enqueue_task(), and the "when to call dequeue" logic stays
> > consistent between ops_dequeue() and the terminal local/global
> > post_enq paths.
> > 
> > Does it make sense or would you rather suppress it and only invoke on
> > the target when the task lands on the local DSQ?
> 
> The end result is about the same because whenever we migrate we're
> sending it to the local DSQ of the destination CPU, so whether we
> generate the event on deactivation of the source CPU or activation on
> the destination doesn't make a *whole* lot of difference. However,
> conceptually, migrations are internal events. There isn't anything
> actionable for the BPF scheduler. The reason why ops.dequeue() should be
> emitted is not because the task is changing CPUs (which caused the
> deactivation) but the fact that it ends up in a local DSQ afterwards. I
> think it'll be cleaner both conceptually and code-wise to emit
> ops.dequeue() only from dispatch_enqueue() and dequeue paths.

Does this include core scheduler migrations or just SCX-initiated
migrations (move_remote_task_to_local_dsq())?

Because with core scheduler migrations we trigger ops.enqueue(), so we
should also trigger ops.dequeue(). Or we need to send the task straight
to local to prevent calling ops.enqueue().

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 22:34             ` Andrea Righi
@ 2026-02-11 22:37               ` Tejun Heo
  2026-02-11 22:48                 ` Andrea Righi
  2026-02-12 10:16                 ` Andrea Righi
  0 siblings, 2 replies; 81+ messages in thread
From: Tejun Heo @ 2026-02-11 22:37 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> > The end result is about the same because whenever we migrate we're
> > sending it to the local DSQ of the destination CPU, so whether we
> > generate the event on deactivation of the source CPU or activation on
> > the destination doesn't make a *whole* lot of difference. However,
> > conceptually, migrations are internal events. There isn't anything
> > actionable for the BPF scheduler. The reason why ops.dequeue() should
> > be emitted is not because the task is changing CPUs (which caused the
> > deactivation) but the fact that it ends up in a local DSQ afterwards.
> > I think it'll be cleaner both conceptually and code-wise to emit
> > ops.dequeue() only from dispatch_enqueue() and dequeue paths.
> 
> Does this include core scheduler migrations or just SCX-initiated
> migrations (move_remote_task_to_local_dsq())?
> 
> Because with core scheduler migrations we trigger ops.enqueue(), so we
> should also trigger ops.dequeue(). Or we need to send the task straight
> to local to prevent calling ops.enqueue().

I'm a bit lost. Can you elaborate on core scheduler migrations triggering
ops.enqueue()?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 22:37               ` Tejun Heo
@ 2026-02-11 22:48                 ` Andrea Righi
  0 siblings, 0 replies; 81+ messages in thread
From: Andrea Righi @ 2026-02-11 22:48 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> > > The end result is about the same because whenever we migrate we're
> > > sending it to the local DSQ of the destination CPU, so whether we
> > > generate the event on deactivation of the source CPU or activation
> > > on the destination doesn't make a *whole* lot of difference.
> > > However, conceptually, migrations are internal events. There isn't
> > > anything actionable for the BPF scheduler. The reason why
> > > ops.dequeue() should be emitted is not because the task is changing
> > > CPUs (which caused the deactivation) but the fact that it ends up in
> > > a local DSQ afterwards. I think it'll be cleaner both conceptually
> > > and code-wise to emit ops.dequeue() only from dispatch_enqueue() and
> > > dequeue paths.
> > 
> > Does this include core scheduler migrations or just SCX-initiated
> > migrations (move_remote_task_to_local_dsq())?
> > 
> > Because with core scheduler migrations we trigger ops.enqueue(), so we
> > should also trigger ops.dequeue(). Or we need to send the task
> > straight to local to prevent calling ops.enqueue().
> 
> I'm a bit lost. Can you elaborate on core scheduler migrations
> triggering ops.enqueue()?

Nevermind, just ignore that comment, we clearly want to trigger
ops.dequeue/enqueue() in that case, it's the whole point of
SCX_DEQ_SCHED_CHANGE.

I should probably go to bed and get some sleep. :)

-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 22:37               ` Tejun Heo
  2026-02-11 22:48                 ` Andrea Righi
@ 2026-02-12 10:16                 ` Andrea Righi
  2026-02-12 14:32                   ` Christian Loehle
  1 sibling, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-12 10:16 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> > > The end result is about the same because whenever we migrate we're
> > > sending it to the local DSQ of the destination CPU, so whether we
> > > generate the event on deactivation of the source CPU or activation
> > > on the destination doesn't make a *whole* lot of difference.
> > > However, conceptually, migrations are internal events. There isn't
> > > anything actionable for the BPF scheduler. The reason why
> > > ops.dequeue() should be emitted is not because the task is changing
> > > CPUs (which caused the deactivation) but the fact that it ends up in
> > > a local DSQ afterwards. I think it'll be cleaner both conceptually
> > > and code-wise to emit ops.dequeue() only from dispatch_enqueue() and
> > > dequeue paths.
> > 
> > Does this include core scheduler migrations or just SCX-initiated
> > migrations (move_remote_task_to_local_dsq())?
> > 
> > Because with core scheduler migrations we trigger ops.enqueue(), so we
> > should also trigger ops.dequeue(). Or we need to send the task
> > straight to local to prevent calling ops.enqueue().
> 
> I'm a bit lost. Can you elaborate on core scheduler migrations
> triggering ops.enqueue()?

Alright, let me re-elaborate more on this with a (slightly) fresher brain.

We have two main classes of migrations:

 1) Internal SCX-initiated migrations: e.g.,
    dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
    consume_remote_task() -> move_remote_task_to_local_dsq(), these
    are completely internal to SCX and shouldn't trigger
    ops.dequeue/enqueue()

 2) Core scheduler migrations
    - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
      affine_move_task -> move_queued_task migrates it -> we trigger
      ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue()
      on the target.

    - Core scheduling (CONFIG_SCHED_CORE): two different cases:
      - Migration (task moved between runqueues via
        move_queued_task_locked() to satisfy core cookie)

    - NUMA balancing: migrate_task_to() can move an SCX task to another
      CPU

    - CPU hotplug: on CPU down, runnable tasks are pushed off via
      __balance_push_cpu_stop() -> __migrate_task()

If we want to skip ops.dequeue() only for internal SCX migrations (and
maybe also for NUMA and hotplug?), then only checking
task_on_rq_migrating(p) is not enough, because that's true for every
migration listed above and we'd skip all of them.

So, we need a way to mark "this migration is internal to SCX", like a new
SCX_TASK_MIGRATING_INTERNAL flag?

The alternative is to always trigger ops.dequeue/enqueue() on every
migration (no flag): even for internal SCX migrations the BPF scheduler
could use it to track task movements, though there's nothing it can do.
That way we don't need the additional flag.

Does one of these directions fit better with what you have in mind?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 10:16                 ` Andrea Righi
@ 2026-02-12 14:32                   ` Christian Loehle
  2026-02-12 15:45                     ` Andrea Righi
  0 siblings, 1 reply; 81+ messages in thread
From: Christian Loehle @ 2026-02-12 14:32 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Daniel Hodges, sched-ext, linux-kernel

On 2/12/26 10:16, Andrea Righi wrote:
> On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
>>>> The end result is about the same because whenever we migrate we're
>>>> sending it to the local DSQ of the destination CPU, so whether we
>>>> generate the event on deactivation of the source CPU or activation
>>>> on the destination doesn't make a *whole* lot of difference.
>>>> However, conceptually, migrations are internal events. There isn't
>>>> anything actionable for the BPF scheduler. The reason why
>>>> ops.dequeue() should be emitted is not because the task is changing
>>>> CPUs (which caused the deactivation) but the fact that it ends up in
>>>> a local DSQ afterwards. I think it'll be cleaner both conceptually
>>>> and code-wise to emit ops.dequeue() only from dispatch_enqueue() and
>>>> dequeue paths.
>>>
>>> Does this include core scheduler migrations or just SCX-initiated
>>> migrations (move_remote_task_to_local_dsq())?
>>>
>>> Because with core scheduler migrations we trigger ops.enqueue(), so we
>>> should also trigger ops.dequeue(). Or we need to send the task
>>> straight to local to prevent calling ops.enqueue().
>>
>> I'm a bit lost. Can you elaborate on core scheduler migrations
>> triggering ops.enqueue()?
> 
> Alright, let me re-elaborate more on this with a (slightly) fresher
> brain.
> 
> We have two main classes of migrations:
> 
>  1) Internal SCX-initiated migrations: e.g.,
>     dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
>     consume_remote_task() -> move_remote_task_to_local_dsq(), these
>     are completely internal to SCX and shouldn't trigger
>     ops.dequeue/enqueue()
> 
>  2) Core scheduler migrations
>     - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
>       affine_move_task -> move_queued_task migrates it -> we trigger
>       ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue()
>       on the target.
> 
>     - Core scheduling (CONFIG_SCHED_CORE): two different cases:
>       - Migration (task moved between runqueues via
>         move_queued_task_locked() to satisfy core cookie)
> 
>     - NUMA balancing: migrate_task_to() can move an SCX task to another
>       CPU
> 
>     - CPU hotplug: on CPU down, runnable tasks are pushed off via
>       __balance_push_cpu_stop() -> __migrate_task()
> 
> If we want to skip ops.dequeue() only for internal SCX migrations (and
> maybe also for NUMA and hotplug?), then only checking
> task_on_rq_migrating(p) is not enough, because that's true for every
> migration listed above and we'd skip all of them.
> 
> So, we need a way to mark "this migration is internal to SCX", like a
> new SCX_TASK_MIGRATING_INTERNAL flag?
> 
> The alternative is to always trigger ops.dequeue/enqueue() on every
> migration (no flag): even for internal SCX migrations the BPF scheduler
> could use it to track task movements, though there's nothing it can do.
> That way we don't need the additional flag.
> 
> Does one of these directions fit better with what you have in mind?

IIUC one example might sway your opinion (or not):
Note that not receiving an ops.dequeue() for tasks leaving one LOCAL_DSQ
(and maybe being enqueued at another) prevents e.g. accurate PELT load
tracking on the BPF side. Regular utilization tracking works through
ops.running() and ops.stopping() but I don't think load can be
implemented accurately.

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-12 14:32 ` Christian Loehle @ 2026-02-12 15:45 ` Andrea Righi 2026-02-12 17:07 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-12 15:45 UTC (permalink / raw) To: Christian Loehle Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Thu, Feb 12, 2026 at 02:32:02PM +0000, Christian Loehle wrote: > On 2/12/26 10:16, Andrea Righi wrote: > > On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote: > >> Hello, > >> > >> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote: > >>>> The end result is about the same because whenever we migrate we're sending > >>>> it to the local DSQ of the destination CPU, so whether we generate the event > >>>> on deactivation of the source CPU or activation on the destination doesn't > >>>> make *whole* lot of difference. However, conceptually, migrations are > >>>> internal events. There isn't anything actionable for the BPF scheduler. The > >>>> reason why ops.dequeue() should be emitted is not because the task is > >>>> changing CPUs (which caused the deactivation) but the fact that it ends up > >>>> in a local DSQ afterwards. I think it'll be cleaner both conceptually and > >>>> code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue > >>>> paths. > >>> > >>> Does this include core scheduler migrations or just SCX-initiated > >>> migrations (move_remote_task_to_local_dsq())? > >>> > >>> Because with core scheduler migrations we trigger ops.enqueue(), so we > >>> should also trigger ops.dequeue(). Or we need to send the task straight to > >>> local to prevent calling ops.enqueue(). > >> > >> I'm a bit lost. Can you elaborate on core scheduler migrations triggering > >> ops.enqueue()? > > > > Alright, let me re-elaborate more on this with a (slightly) fresher brain. 
> > > > We have two main classes of migrations: > > > > 1) Internal SCX-initiated migrations: e.g., > > dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or > > consume_remote_task() -> move_remote_task_to_local_dsq(), these > > are completely internal to SCX and shouldn't trigger > > ops.dequeue/enqueue() > > > > 2) Core scheduler migrations > > - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc. > > affine_move_task -> move_queued_task migrates it -> we trigger > > ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue() on > > the target. > > > > - Core scheduling (CONFIG_SCHED_CORE): two different cases: > > - Migration (task moved between runqueues via move_queued_task_locked() > > to satisfy core cookie) > > > > - NUMA balancing: migrate_task_to() can move an SCX task to another CPU > > > > - CPU hotplug: on CPU down, runnable tasks are pushed off via > > __balance_push_cpu_stop() -> __migrate_task() > > > > If we want to skip ops.dequeue() only for internal SCX migrations (and > > maybe also for NUMA and hotplug?), then only checking > > task_on_rq_migrating(p) is not enough, because that's true for every > > migration listed above and we'd skip all of them. > > > > So, we need a way to mark "this migration is internal to SCX", like a new > > SCX_TASK_MIGRATING_INTERNAL flag? > > > > The alternative is to always trigger ops.dequeue/enqueue() on every > > migration (no flag): even for internal SCX migrations the BPF scheduler > > could use it to track task movements, though there's nothing it can do. > > That way we don't need the additional flag. > > > > Does one of these directions fit better with what you have in mind? > IIUC one example might sway your opinion (or not): > Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ > (and maybe being enqueued at another) prevents e.g. accurate PELT load > tracking on the BPF side. 
> Regular utilization tracking works through ops.running() and > ops.stopping(), but I don't think load can be implemented accurately. It makes sense to me and I think it's actually a valid reason to prefer the "always trigger" way. We have DSQs and potentially BPF can have its own queues, but to implement accurate PELT (runnable contribution to a runqueue, possibly with decay), we'd also need to know exactly when a task leaves one runqueue and joins another. Essentially we could get the full task lifecycle in BPF: - runnable lifecycle: - ops.dequeue(): task leaves runqueue, source CPU = scx_bpf_task_cpu(p), - ops.enqueue(): task wants to run, curr CPU = scx_bpf_task_cpu(p), - running lifecycle: - ops.running(p): task starts running on scx_bpf_task_cpu(p), - ops.stopping(p): task stops running on scx_bpf_task_cpu(p). A potential concern could be about introducing more overhead, but I don't think it matters much, especially since schedulers that don't implement ops.dequeue() effectively pay no cost for these events. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-12 15:45 ` Andrea Righi @ 2026-02-12 17:07 ` Tejun Heo 2026-02-12 18:14 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-12 17:07 UTC (permalink / raw) To: Andrea Righi Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, On Thu, Feb 12, 2026 at 04:45:43PM +0100, Andrea Righi wrote: > > > So, we need a way to mark "this migration is internal to SCX", like a new > > > SCX_TASK_MIGRATING_INTERNAL flag? Yeah, I think this is what we should do. That's the only ops.dequeue() without matching ops.enqueue(), right? ... > > IIUC one example might sway your opinion (or not): > > Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ > > (and maybe being enqueued at another) prevents e.g. accurate PELT load > > tracking on the BPF side. > > Regular utilization tracking works through ops.running() and > > ops.stopping() but load I don't think load can be implemented accurately. > > It makes sense to me and I think it's actually valid reason to prefer the > "always trigger" way. I don't think this is a valid argument. PELT is done that way because the association of the task and the CPU is meaningful for in-kernel schedulers. The queues are actually per-CPU. For SCX scheds, the relationship is not known to the kernel. Only the BPF scheduler itself knows, if it wants to attribute per-task load to a specific CPU, which CPU it should be attributed to. What's the point of following in-kernel association for PELT if the task was going to be hot migrated to another CPU on execution? Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-12 17:07 ` Tejun Heo @ 2026-02-12 18:14 ` Andrea Righi 2026-02-12 18:35 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-12 18:14 UTC (permalink / raw) To: Tejun Heo Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Thu, Feb 12, 2026 at 07:07:05AM -1000, Tejun Heo wrote: > Hello, > > On Thu, Feb 12, 2026 at 04:45:43PM +0100, Andrea Righi wrote: > > > > So, we need a way to mark "this migration is internal to SCX", like a new > > > > SCX_TASK_MIGRATING_INTERNAL flag? > > Yeah, I think this is what we should do. That's the only ops.dequeue() > without matching ops.enqueue(), right? Correct. > > ... > > > IIUC one example might sway your opinion (or not): > > > Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ > > > (and maybe being enqueued at another) prevents e.g. accurate PELT load > > > tracking on the BPF side. > > > Regular utilization tracking works through ops.running() and > > > ops.stopping() but load I don't think load can be implemented accurately. > > > > It makes sense to me and I think it's actually valid reason to prefer the > > "always trigger" way. > > I don't think this is a valid argument. PELT is done that way because the > association of the task and the CPU is meaningful for in-kernel schedulers. > The queues are actually per-CPU. For SCX scheds, the relationship is not > known to the kernel. Only the BPF scheduler itself knows, if it wants to > attribute per-task load to a specific CPU, which CPU it should be attributed > to. What's the point of following in-kernel association for PELT if the task > was going to be hot migrated to another CPU on execution? I see, let me elaborate more on this to make sure we're on the same page. 
In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU: it can put the task on an arbitrary DSQ or even in some internal BPF data structures. The task is still associated with a runqueue, but only to satisfy a kernel requirement; for sched_ext that association isn't meaningful, because the task isn't really "on" that CPU (in fact, ops.dispatch() can do the "last minute" migration). Therefore, keeping accurate per-CPU information from the kernel's perspective doesn't buy us much, given that the BPF scheduler can keep tasks in its own queues or structures. Accurate PELT is still doable: the BPF scheduler can track where it puts each task in its own state, updating runnable load when it places the task in a DSQ / data structure and when the task leaves (dequeue). And it can use ops.running() / ops.stopping() for utilization. And with proper ops.dequeue() semantics, PELT can be driven by the BPF scheduler's own placement and the scx callbacks, not by the specific rq a task is on. If all of the above makes sense for everyone, I agree that we don't need to notify all the internal migrations. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-12 18:14 ` Andrea Righi @ 2026-02-12 18:35 ` Tejun Heo 2026-02-12 22:30 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-12 18:35 UTC (permalink / raw) To: Andrea Righi Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, Andrea. On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote: ... > In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU: > it can put the task on an arbitrary DSQ or even in some internal BPF data > structures. The task is still associated with a runqueue, but only to > satisfy a kernel requirement, for sched_ext that association isn't > meaningful, because the task isn't really "on" that CPU (in fact in > ops.dispatch() can do the "last minute" migration). Yes. > Therefore, keeping accurate per-CPU information from the kernel's > perspective doesn't buy us much, given that the BPF scheduler can keep > tasks in its own queues or structures. > > Accurate PELT is still doable: the BPF scheduler can track where it puts > each task in its own state, updates runnable load when it places the task > in a DSQ / data structure and when the task leaves (dequeue). And it can > use ops.running() / ops.stopping() for utilization. And the BPF sched might choose to do load aggregation at a different level too - e.g. maybe per-CPU load metric doesn't make sense given the machine and scheduler and only per-LLC level aggregation would be meaningful, which would be true for multiple of the current SCX schedulers given the per-LLC DSQ usage. > And with a proper ops.dequeue() semantics, PELT can be driven by the BPF > scheduler's own placement and the scx callbacks, not by the specific rq a > task is on. > > If all of the above makes sense for everyone, I agree that we don't need to > notify all the internal migrations. Yeah, I think we're on the same page. 
BTW, I wonder whether we could use p->scx.sticky_cpu to detect internal migrations. It's only used for internal migrations, so maybe it can be used for detection. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-12 18:35 ` Tejun Heo @ 2026-02-12 22:30 ` Andrea Righi 2026-02-14 10:16 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-12 22:30 UTC (permalink / raw) To: Tejun Heo Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Thu, Feb 12, 2026 at 08:35:55AM -1000, Tejun Heo wrote: > Hello, Andrea. > > On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote: > ... > > In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU: > > it can put the task on an arbitrary DSQ or even in some internal BPF data > > structures. The task is still associated with a runqueue, but only to > > satisfy a kernel requirement, for sched_ext that association isn't > > meaningful, because the task isn't really "on" that CPU (in fact in > > ops.dispatch() can do the "last minute" migration). > > Yes. > > > Therefore, keeping accurate per-CPU information from the kernel's > > perspective doesn't buy us much, given that the BPF scheduler can keep > > tasks in its own queues or structures. > > > > Accurate PELT is still doable: the BPF scheduler can track where it puts > > each task in its own state, updates runnable load when it places the task > > in a DSQ / data structure and when the task leaves (dequeue). And it can > > use ops.running() / ops.stopping() for utilization. > > And the BPF sched might choose to do load aggregation at a differnt level > too - e.g. maybe per-CPU load metric doesn't make sense given the machine > and scheduler and only per-LLC level aggregation would be meaningful, which > would be true for multiple of the current SCX schedulers given the per-LLC > DSQ usage. > > > And with a proper ops.dequeue() semantics, PELT can be driven by the BPF > > scheduler's own placement and the scx callbacks, not by the specific rq a > > task is on. 
> > > > If all of the above makes sense for everyone, I agree that we don't need to > > notify all the internal migrations. > > Yeah, I think we're on the same page. BTW, I wonder whether we could use > p->scx.sticky_cpu to detect internal migrations. It's only used for internal > migrations, so maybe it can be used for detection. Perfect. And yes, I think if we set p->scx.sticky_cpu before deactivate_task() in move_remote_task_to_local_dsq(), then in ops_dequeue() we should be able to catch the internal migrations checking task_on_rq_migrating(p) && p->scx.sticky_cpu >= 0. I'll run some tests with that. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-12 22:30 ` Andrea Righi @ 2026-02-14 10:16 ` Andrea Righi 2026-02-14 17:56 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-14 10:16 UTC (permalink / raw) To: Tejun Heo Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Thu, Feb 12, 2026 at 11:30:14PM +0100, Andrea Righi wrote: > On Thu, Feb 12, 2026 at 08:35:55AM -1000, Tejun Heo wrote: > > Hello, Andrea. > > > > On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote: > > ... > > > In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU: > > > it can put the task on an arbitrary DSQ or even in some internal BPF data > > > structures. The task is still associated with a runqueue, but only to > > > satisfy a kernel requirement, for sched_ext that association isn't > > > meaningful, because the task isn't really "on" that CPU (in fact in > > > ops.dispatch() can do the "last minute" migration). > > > > Yes. > > > > > Therefore, keeping accurate per-CPU information from the kernel's > > > perspective doesn't buy us much, given that the BPF scheduler can keep > > > tasks in its own queues or structures. > > > > > > Accurate PELT is still doable: the BPF scheduler can track where it puts > > > each task in its own state, updates runnable load when it places the task > > > in a DSQ / data structure and when the task leaves (dequeue). And it can > > > use ops.running() / ops.stopping() for utilization. > > > > And the BPF sched might choose to do load aggregation at a differnt level > > too - e.g. maybe per-CPU load metric doesn't make sense given the machine > > and scheduler and only per-LLC level aggregation would be meaningful, which > > would be true for multiple of the current SCX schedulers given the per-LLC > > DSQ usage. 
> > > > > And with a proper ops.dequeue() semantics, PELT can be driven by the BPF > > > scheduler's own placement and the scx callbacks, not by the specific rq a > > > task is on. > > > > > > If all of the above makes sense for everyone, I agree that we don't need to > > > notify all the internal migrations. > > > > Yeah, I think we're on the same page. BTW, I wonder whether we could use > > p->scx.sticky_cpu to detect internal migrations. It's only used for internal > > migrations, so maybe it can be used for detection. > > Perfect. And yes, I think if we set p->scx.sticky_cpu before > deactivate_task() in move_remote_task_to_local_dsq(), then in ops_dequeue() > we should be able to catch the internal migrations checking > task_on_rq_migrating(p) && p->scx.sticky_cpu >= 0. > > I'll run some tests with that. I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu. In particular, I don't see how to handle this scenario using only p->scx.sticky_cpu: a task starts an internal migration, a sched_change occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0. So I'm back to the idea of introducing an SCX_TASK_MIGRATING_INTERNAL flag... -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-14 10:16 ` Andrea Righi @ 2026-02-14 17:56 ` Tejun Heo 2026-02-14 19:32 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-14 17:56 UTC (permalink / raw) To: Andrea Righi Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, Andrea. On Sat, Feb 14, 2026 at 11:16:34AM +0100, Andrea Righi wrote: > I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu. > > In particular, I don't see how to handle this scenario using only > p->scx.sticky_cpu: a task starts an internal migration, a sched_change > occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0. Oh, that shouldn't happen, so move_remote_task_to_local_dsq() does the following: deactivate_task(src_rq, p, 0); set_task_cpu(p, cpu_of(dst_rq)); p->scx.sticky_cpu = cpu_of(dst_rq); raw_spin_rq_unlock(src_rq); raw_spin_rq_lock(dst_rq); ... activate_task(dst_rq, p, 0); It *looks* like something can get in while the locks are switched; however, the above deactivate_task() does WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING) and task_rq_lock() does the following: for (;;) { raw_spin_lock_irqsave(&p->pi_lock, rf->flags); rq = task_rq(p); raw_spin_rq_lock(rq); /* * move_queued_task() task_rq_lock() * * ACQUIRE (rq->lock) * [S] ->on_rq = MIGRATING [L] rq = task_rq() * WMB (__set_task_cpu()) ACQUIRE (rq->lock); * [S] ->cpu = new_cpu [L] task_rq() * [L] ->on_rq * RELEASE (rq->lock) * * If we observe the old CPU in task_rq_lock(), the acquire of * the old rq->lock will fully serialize against the stores. * * If we observe the new CPU in task_rq_lock(), the address * dependency headed by '[L] rq = task_rq()' and the acquire * will pair with the WMB to ensure we then also see migrating. 
*/ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { rq_pin_lock(rq, rf); return rq; } raw_spin_rq_unlock(rq); raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); while (unlikely(task_on_rq_migrating(p))) cpu_relax(); } ie. TASK_ON_RQ_MIGRATING works like a separate lock that protects the task while it's switching the RQs, so any operations that use task_rq_lock() which includes any property changes can't get inbetween. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-14 17:56 ` Tejun Heo @ 2026-02-14 19:32 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-14 19:32 UTC (permalink / raw) To: Tejun Heo Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Sat, Feb 14, 2026 at 07:56:12AM -1000, Tejun Heo wrote: > Hello, Andrea. > > On Sat, Feb 14, 2026 at 11:16:34AM +0100, Andrea Righi wrote: > > I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu. > > > > In particular, I don't see how to handle this scenario using only > > p->scx.sticky_cpu: a task starts an internal migration, a sched_change > > occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0. > > Oh, that shouldn't happen, so move_remote_task_to_local_dsq() does the > following: > > deactivate_task(src_rq, p, 0); > set_task_cpu(p, cpu_of(dst_rq)); > p->scx.sticky_cpu = cpu_of(dst_rq); > > raw_spin_rq_unlock(src_rq); > raw_spin_rq_lock(dst_rq); > ... > activate_task(dst_rq, p, 0); > > It *looks* like something get can get while the locks are switched; however, > the above deactivate_task() does WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING) > and task_rq_lock() does the following: > > for (;;) { > raw_spin_lock_irqsave(&p->pi_lock, rf->flags); > rq = task_rq(p); > raw_spin_rq_lock(rq); > /* > * move_queued_task() task_rq_lock() > * > * ACQUIRE (rq->lock) > * [S] ->on_rq = MIGRATING [L] rq = task_rq() > * WMB (__set_task_cpu()) ACQUIRE (rq->lock); > * [S] ->cpu = new_cpu [L] task_rq() > * [L] ->on_rq > * RELEASE (rq->lock) > * > * If we observe the old CPU in task_rq_lock(), the acquire of > * the old rq->lock will fully serialize against the stores. > * > * If we observe the new CPU in task_rq_lock(), the address > * dependency headed by '[L] rq = task_rq()' and the acquire > * will pair with the WMB to ensure we then also see migrating. 
> */ > if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) { > rq_pin_lock(rq, rf); > return rq; > } > raw_spin_rq_unlock(rq); > raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags); > > while (unlikely(task_on_rq_migrating(p))) > cpu_relax(); > } > > ie. TASK_ON_RQ_MIGRATING works like a separate lock that protects the task > while it's switching the RQs, so any operations that use task_rq_lock() > which includes any property changes can't get inbetween. Yeah, that makes sense, so the scenario I thought was happening can't actually happen. I guess I'm either missing some ops.dequeue() events or there's a race somewhere, because I can see tasks being enqueued without a corresponding ops.dequeue(). I'll add some debugging and keep investigating. Thanks! -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi 2026-02-10 23:20 ` Tejun Heo @ 2026-02-10 23:54 ` Tejun Heo 2026-02-11 16:07 ` Andrea Righi 1 sibling, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-10 23:54 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel One more comment. On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote: > @@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > * dequeue may be waiting. The store_release matches their load_acquire. > */ > atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); > + > + /* > + * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY > + * so ops.dequeue() is called when it leaves custody. > + */ > + p->scx.flags |= SCX_TASK_IN_CUSTODY; As this is protected by task's rq lock, doing it here is okay but can you move this above atomic_long_set_release()? That's conceptually more straightforward as that set_release() is supposed to be the "I'm done with this task" point. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-10 23:54 ` Tejun Heo @ 2026-02-11 16:07 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-11 16:07 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel On Tue, Feb 10, 2026 at 01:54:39PM -1000, Tejun Heo wrote: > One more comment. > > On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote: > > @@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > > * dequeue may be waiting. The store_release matches their load_acquire. > > */ > > atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); > > + > > + /* > > + * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY > > + * so ops.dequeue() is called when it leaves custody. > > + */ > > + p->scx.flags |= SCX_TASK_IN_CUSTODY; > > As this is protected by task's rq lock, doing it here is okay but can you > move this above atomic_long_set_release()? That's conceptually more > straightforward as that set_release() is supposed to be the "I'm done with > this task" point. Agreed, it definitely looks more correct to move this before the atomic_long_set_release(). Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCHSET v7] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-06 13:54 Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-06 13:54 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
sched-ext, linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue(), by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g. sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
To identify property change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
Changes in v7:
- Handle tasks stored to BPF internal data structures (trigger
ops.dequeue())
- Add a new kselftest scenario to verify ops.dequeue() behavior with tasks
stored to internal BPF data structures
- Link to v6:
https://lore.kernel.org/all/20260205153304.1996142-1-arighi@nvidia.com
Changes in v6:
- Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DSQ
- Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
DSQs (local, global, bypass)
- Centralize ops.dequeue() logic in dispatch_enqueue()
- Remove "Property Change Notifications for Running Tasks" section from
the documentation
- The kselftest now validates the right behavior both from ops.enqueue()
and ops.select_cpu()
- Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com
Changes in v5:
- Introduce the concept of "terminal DSQ" (when a task is dispatched to a
terminal DSQ, the task leaves the BPF scheduler's custody)
- Consider SCX_DSQ_GLOBAL as a terminal DSQ
- Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for direct dispatches to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
custody, further dequeue events are not reported.
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
Documentation/scheduler/sched-ext.rst | 58 ++++
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 157 ++++++++-
kernel/sched/ext_internal.h | 7 +
tools/sched_ext/include/scx/enum_defs.autogen.h | 1 +
tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 403 ++++++++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 258 +++++++++++++++
10 files changed, 875 insertions(+), 14 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply [flat|nested] 81+ messages in thread* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-06 13:54 [PATCHSET v7] " Andrea Righi @ 2026-02-06 13:54 ` Andrea Righi 2026-02-06 20:35 ` Emil Tsalapatis 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-06 13:54 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change events. In addition, ops.dequeue() callbacks are completely skipped when tasks are dispatched to non-local DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably track task state. Fix this by guaranteeing that each task entering the BPF scheduler's custody triggers exactly one ops.dequeue() call when it leaves that custody, whether the exit is due to a dispatch (regular or via a core scheduling pick) or to a scheduling property change (e.g. sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, etc.). BPF scheduler custody concept: a task is considered to be in the BPF scheduler's custody when the scheduler is responsible for managing its lifecycle. This includes tasks dispatched to user-created DSQs or stored in the BPF scheduler's internal data structures. Custody ends when the task is dispatched to a terminal DSQ (such as the local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a property change. Tasks directly dispatched to terminal DSQs bypass the BPF scheduler entirely and are never in its custody. Terminal DSQs include: - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues where tasks go directly to execution. - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the BPF scheduler is considered "done" with the task. 
As a result, ops.dequeue() is not invoked for tasks directly dispatched to terminal DSQs. To identify dequeues triggered by scheduling property changes, introduce the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, the dequeue was caused by a scheduling property change. New ops.dequeue() semantics: - ops.dequeue() is invoked exactly once when the task leaves the BPF scheduler's custody, in one of the following cases: a) regular dispatch: a task dispatched to a user DSQ or stored in internal BPF data structures is moved to a terminal DSQ (ops.dequeue() called without any special flags set), b) core scheduling dispatch: core-sched picks task before dispatch (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set), c) property change: task properties modified before dispatch, (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set). This allows BPF schedulers to: - reliably track task ownership and lifecycle, - maintain accurate accounting of managed tasks, - update internal state when tasks change properties. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- Documentation/scheduler/sched-ext.rst | 58 +++++++ include/linux/sched/ext.h | 1 + kernel/sched/ext.c | 157 ++++++++++++++++-- kernel/sched/ext_internal.h | 7 + .../sched_ext/include/scx/enum_defs.autogen.h | 1 + .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + tools/sched_ext/include/scx/enums.autogen.h | 1 + 7 files changed, 213 insertions(+), 14 deletions(-) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index 404fe6126a769..fe8c59b0c1477 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -252,6 +252,62 @@ The following briefly shows how a waking task is scheduled and executed. * Queue the task on the BPF side. 
+ **Task State Tracking and ops.dequeue() Semantics** + + A task is in the "BPF scheduler's custody" when the BPF scheduler is + responsible for managing its lifecycle. That includes tasks dispatched + to user-created DSQs or stored in the BPF scheduler's internal data + structures. Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, + the task may or may not enter custody depending on what the scheduler + does: + + * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``, + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler + is done with the task - it either goes straight to a CPU's local run + queue or to the global DSQ as a fallback. The task never enters (or + exits) BPF custody, and ``ops.dequeue()`` will not be called. + + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the + BPF scheduler's custody. When the task later leaves BPF custody + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for + sleep/property changes), ``ops.dequeue()`` will be called exactly once. + + * **Queued on BPF side** (e.g., internal queues, no DSQ): The task is in + BPF custody. ``ops.dequeue()`` will be called when it leaves (e.g. + when ``ops.dispatch()`` moves it to a terminal DSQ, or on property + change / sleep). + + **NOTE**: this concept is valid also with the ``ops.select_cpu()`` + direct dispatch optimization. Even though it skips ``ops.enqueue()`` + invocation, if the task is dispatched to a user-created DSQ or internal + BPF structure, it enters BPF custody and will get ``ops.dequeue()`` when + it leaves. If dispatched to a terminal DSQ, the BPF scheduler is done + with it immediately. This provides the performance benefit of avoiding + the ``ops.enqueue()`` roundtrip while maintaining correct state + tracking. + + The dequeue can happen for different reasons, distinguished by flags: + + 1. 
**Regular dispatch**: when a task in BPF custody is dispatched to a + terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for + execution), ``ops.dequeue()`` is triggered without any special flags. + + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and + core scheduling picks a task for execution while it's still in BPF + custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. + + 3. **Scheduling property change**: when a task property changes (via + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, + priority changes, CPU migrations, etc.) while the task is still in + BPF custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. + + **Important**: Once a task has left BPF custody (e.g. after being + dispatched to a terminal DSQ), property changes will not trigger + ``ops.dequeue()``, since the task is no longer being managed by the BPF + scheduler. + 3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. 
If there still isn't a task to run, ``ops.dispatch()`` is invoked which can use the following two @@ -319,6 +375,8 @@ by a sched_ext scheduler: /* Any usable CPU becomes available */ ops.dispatch(); /* Task is moved to a local DSQ */ + + ops.dequeue(); /* Exiting BPF scheduler */ } ops.running(); /* Task starts running on its assigned CPU */ while (task->scx.slice > 0 && task is runnable) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index bcb962d5ee7d8..c48f818eee9b8 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -84,6 +84,7 @@ struct scx_dispatch_q { /* scx_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + SCX_TASK_NEED_DEQ = 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */ SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 0bb8fa927e9e9..d17fd9141adf4 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p) #endif } +/** + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes + * @dsq_id: DSQ ID to check + * + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF + * scheduler is considered "done" with the task. + * + * Builtin DSQs include: + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues + * where tasks go directly to execution, + * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue, + * - Bypass DSQ: used during bypass mode. + * + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not + * trigger ops.dequeue() when they are later consumed. 
+ */ +static inline bool is_terminal_dsq(u64 dsq_id) +{ + return dsq_id & SCX_DSQ_FLAG_BUILTIN; +} + /** * touch_core_sched_dispatch - Update core-sched timestamp on dispatch * @rq: rq to read clock from, must be locked @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p resched_curr(rq); } -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, + struct scx_dispatch_q *dsq, struct task_struct *p, u64 enq_flags) { bool is_local = dsq->id == SCX_DSQ_LOCAL; @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, dsq_mod_nr(dsq, 1); p->scx.dsq = dsq; + /* + * Handle ops.dequeue() and custody tracking. + * + * Builtin DSQs (local, global, bypass) are terminal: the BPF + * scheduler is done with the task. If it was in BPF custody, call + * ops.dequeue() and clear the flag. + * + * User DSQs: Task is in BPF scheduler's custody. Set the flag so + * ops.dequeue() will be called when it leaves. + */ + if (SCX_HAS_OP(sch, dequeue)) { + if (is_terminal_dsq(dsq->id)) { + if (p->scx.flags & SCX_TASK_NEED_DEQ) + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, + rq, p, 0); + p->scx.flags &= ~SCX_TASK_NEED_DEQ; + } else { + p->scx.flags |= SCX_TASK_NEED_DEQ; + } + } + /* * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the * direct dispatch path, but we clear them here because the direct @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, return; } - dispatch_enqueue(sch, dsq, p, + dispatch_enqueue(sch, rq, dsq, p, p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); } @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, * dequeue may be waiting. The store_release matches their load_acquire. 
*/ atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); + + /* + * Task is now in BPF scheduler's custody (queued on BPF internal + * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called + * when it leaves custody (e.g. dispatched to a terminal DSQ or on + * property change). + */ + if (SCX_HAS_OP(sch, dequeue)) + p->scx.flags |= SCX_TASK_NEED_DEQ; return; direct: direct_dispatch(sch, p, enq_flags); return; local_norefill: - dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags); + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags); return; local: dsq = &rq->scx.local_dsq; @@ -1433,7 +1485,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, */ touch_core_sched(rq, p); refill_task_slice_dfl(sch, p); - dispatch_enqueue(sch, dsq, p, enq_flags); + dispatch_enqueue(sch, rq, dsq, p, enq_flags); } static bool task_runnable(const struct task_struct *p) @@ -1511,6 +1563,22 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags __scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1); } +/* + * Call ops.dequeue() for a task leaving BPF custody. Adds %SCX_DEQ_SCHED_CHANGE + * when the dequeue is due to a property change (not sleep or core-sched pick). + */ +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, + struct task_struct *p, u64 deq_flags) +{ + u64 flags = deq_flags; + + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) + flags |= SCX_DEQ_SCHED_CHANGE; + + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); + p->scx.flags &= ~SCX_TASK_NEED_DEQ; +} + static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) { struct scx_sched *sch = scx_root; @@ -1524,6 +1592,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) switch (opss & SCX_OPSS_STATE_MASK) { case SCX_OPSS_NONE: + /* + * Task is not in BPF data structures (either dispatched to + * a DSQ or running). 
Only call ops.dequeue() if the task + * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ + * is set). + * + * If the task has already been dispatched to a terminal + * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF + * scheduler's custody and the flag will be clear, so we + * skip ops.dequeue(). + * + * If this is a property change (not sleep/core-sched) and + * the task is still in BPF custody, set the + * %SCX_DEQ_SCHED_CHANGE flag. + */ + if (SCX_HAS_OP(sch, dequeue) && + (p->scx.flags & SCX_TASK_NEED_DEQ)) + call_task_dequeue(sch, rq, p, deq_flags); break; case SCX_OPSS_QUEUEING: /* @@ -1532,9 +1618,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) */ BUG(); case SCX_OPSS_QUEUED: - if (SCX_HAS_OP(sch, dequeue)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, - p, deq_flags); + /* + * Task is still on the BPF scheduler (not dispatched yet). + * Call ops.dequeue() to notify it is leaving BPF custody. + */ + if (SCX_HAS_OP(sch, dequeue)) { + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)); + call_task_dequeue(sch, rq, p, deq_flags); + } if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -1631,6 +1722,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, struct scx_dispatch_q *src_dsq, struct rq *dst_rq) { + struct scx_sched *sch = scx_root; struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq; /* @dsq is locked and @p is on @dst_rq */ @@ -1639,6 +1731,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(p->scx.holding_cpu >= 0); + /* + * Task is moving from a non-local DSQ to a local (terminal) DSQ. + * Call ops.dequeue() if the task was in BPF custody. 
+ */ + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) { + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0); + p->scx.flags &= ~SCX_TASK_NEED_DEQ; + } + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) list_add(&p->scx.dsq_list.node, &dst_dsq->list); else @@ -1879,7 +1980,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, dispatch_dequeue_locked(p, src_dsq); raw_spin_unlock(&src_dsq->lock); - dispatch_enqueue(sch, dst_dsq, p, enq_flags); + dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags); } return dst_rq; @@ -1969,14 +2070,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, * If dispatching to @rq that @p is already on, no lock dancing needed. */ if (rq == src_rq && rq == dst_rq) { - dispatch_enqueue(sch, dst_dsq, p, + dispatch_enqueue(sch, rq, dst_dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); return; } if (src_rq != dst_rq && unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { - dispatch_enqueue(sch, find_global_dsq(sch, p), p, + dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p, enq_flags | SCX_ENQ_CLEAR_OPSS); return; } @@ -2014,9 +2115,21 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, */ if (src_rq == dst_rq) { p->scx.holding_cpu = -1; - dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p, + dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, enq_flags); } else { + /* + * Moving to a remote local DSQ. dispatch_enqueue() is + * not used (we go through deactivate/activate), so + * call ops.dequeue() here if the task was in BPF + * custody. 
+ */ + if (SCX_HAS_OP(sch, dequeue) && + (p->scx.flags & SCX_TASK_NEED_DEQ)) { + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, + src_rq, p, 0); + p->scx.flags &= ~SCX_TASK_NEED_DEQ; + } move_remote_task_to_local_dsq(p, enq_flags, src_rq, dst_rq); /* task has been moved to dst_rq, which is now locked */ @@ -2113,7 +2226,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, if (dsq->id == SCX_DSQ_LOCAL) dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); else - dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); + dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); } static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq) @@ -2414,7 +2527,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, * DSQ. */ if (p->scx.slice && !scx_rq_bypassing(rq)) { - dispatch_enqueue(sch, &rq->scx.local_dsq, p, + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD); goto switch_class; } @@ -2898,6 +3011,14 @@ static void scx_enable_task(struct task_struct *p) lockdep_assert_rq_held(rq); + /* + * Verify the task is not in BPF scheduler's custody. If flag + * transitions are consistent, the flag should always be clear + * here. + */ + if (SCX_HAS_OP(sch, dequeue)) + WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ); + /* * Set the weight before calling ops.enable() so that the scheduler * doesn't see a stale value if they inspect the task struct. @@ -2929,6 +3050,14 @@ static void scx_disable_task(struct task_struct *p) if (SCX_HAS_OP(sch, disable)) SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); scx_set_task_state(p, SCX_TASK_READY); + + /* + * Verify the task is not in BPF scheduler's custody. If flag + * transitions are consistent, the flag should always be clear + * here. 
+ */ + if (SCX_HAS_OP(sch, dequeue)) + WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ); } static void scx_exit_task(struct task_struct *p) @@ -3919,7 +4048,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq, * between bypass DSQs. */ dispatch_dequeue_locked(p, donor_dsq); - dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED); + dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED); /* * $donee might have been idle and need to be woken up. No need diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 386c677e4c9a0..befa9a5d6e53f 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -982,6 +982,13 @@ enum scx_deq_flags { * it hasn't been dispatched yet. Dequeue from the BPF side. */ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, + + /* + * The task is being dequeued due to a property change (e.g., + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), + * etc.). + */ + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, }; enum scx_pick_idle_cpu_flags { diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h index c2c33df9292c2..dcc945304760f 100644 --- a/tools/sched_ext/include/scx/enum_defs.autogen.h +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h @@ -21,6 +21,7 @@ #define HAVE_SCX_CPU_PREEMPT_UNKNOWN #define HAVE_SCX_DEQ_SLEEP #define HAVE_SCX_DEQ_CORE_SCHED_EXEC +#define HAVE_SCX_DEQ_SCHED_CHANGE #define HAVE_SCX_DSQ_FLAG_BUILTIN #define HAVE_SCX_DSQ_FLAG_LOCAL_ON #define HAVE_SCX_DSQ_INVALID diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h index 2f8002bcc19ad..5da50f9376844 100644 --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ +const volatile u64 __SCX_DEQ_SCHED_CHANGE 
__weak; +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h index fedec938584be..fc9a7a4d9dea5 100644 --- a/tools/sched_ext/include/scx/enums.autogen.h +++ b/tools/sched_ext/include/scx/enums.autogen.h @@ -46,4 +46,5 @@ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ } while (0) -- 2.53.0 ^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi @ 2026-02-06 20:35 ` Emil Tsalapatis 2026-02-07 9:26 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Emil Tsalapatis @ 2026-02-06 20:35 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote: > Currently, ops.dequeue() is only invoked when the sched_ext core knows > that a task resides in BPF-managed data structures, which causes it to > miss scheduling property change events. In addition, ops.dequeue() > callbacks are completely skipped when tasks are dispatched to non-local > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably > track task state. > > Fix this by guaranteeing that each task entering the BPF scheduler's > custody triggers exactly one ops.dequeue() call when it leaves that > custody, whether the exit is due to a dispatch (regular or via a core > scheduling pick) or to a scheduling property change (e.g. > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA > balancing, etc.). > > BPF scheduler custody concept: a task is considered to be in the BPF > scheduler's custody when the scheduler is responsible for managing its > lifecycle. This includes tasks dispatched to user-created DSQs or stored > in the BPF scheduler's internal data structures. Custody ends when the > task is dispatched to a terminal DSQ (such as the local DSQ or > %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a > property change. > > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler > entirely and are never in its custody. Terminal DSQs include: > - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues > where tasks go directly to execution. 
> - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the > BPF scheduler is considered "done" with the task. > > As a result, ops.dequeue() is not invoked for tasks directly dispatched > to terminal DSQs. > > To identify dequeues triggered by scheduling property changes, introduce > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, > the dequeue was caused by a scheduling property change. > > New ops.dequeue() semantics: > - ops.dequeue() is invoked exactly once when the task leaves the BPF > scheduler's custody, in one of the following cases: > a) regular dispatch: a task dispatched to a user DSQ or stored in > internal BPF data structures is moved to a terminal DSQ > (ops.dequeue() called without any special flags set), > b) core scheduling dispatch: core-sched picks task before dispatch > (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set), > c) property change: task properties modified before dispatch, > (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set). > > This allows BPF schedulers to: > - reliably track task ownership and lifecycle, > - maintain accurate accounting of managed tasks, > - update internal state when tasks change properties. 
> > Cc: Tejun Heo <tj@kernel.org> > Cc: Emil Tsalapatis <emil@etsalapatis.com> > Cc: Kuba Piecuch <jpiecuch@google.com> > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- Hi Andrea, > Documentation/scheduler/sched-ext.rst | 58 +++++++ > include/linux/sched/ext.h | 1 + > kernel/sched/ext.c | 157 ++++++++++++++++-- > kernel/sched/ext_internal.h | 7 + > .../sched_ext/include/scx/enum_defs.autogen.h | 1 + > .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + > tools/sched_ext/include/scx/enums.autogen.h | 1 + > 7 files changed, 213 insertions(+), 14 deletions(-) > > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst > index 404fe6126a769..fe8c59b0c1477 100644 > --- a/Documentation/scheduler/sched-ext.rst > +++ b/Documentation/scheduler/sched-ext.rst > @@ -252,6 +252,62 @@ The following briefly shows how a waking task is scheduled and executed. > > * Queue the task on the BPF side. > > + **Task State Tracking and ops.dequeue() Semantics** > + > + A task is in the "BPF scheduler's custody" when the BPF scheduler is > + responsible for managing its lifecycle. That includes tasks dispatched > + to user-created DSQs or stored in the BPF scheduler's internal data > + structures. Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, > + the task may or may not enter custody depending on what the scheduler > + does: > + > + * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``, > + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler > + is done with the task - it either goes straight to a CPU's local run > + queue or to the global DSQ as a fallback. The task never enters (or > + exits) BPF custody, and ``ops.dequeue()`` will not be called. > + > + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the > + BPF scheduler's custody. 
When the task later leaves BPF custody > + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for > + sleep/property changes), ``ops.dequeue()`` will be called exactly once. > + > + * **Queued on BPF side** (e.g., internal queues, no DSQ): The task is in > + BPF custody. ``ops.dequeue()`` will be called when it leaves (e.g. > + when ``ops.dispatch()`` moves it to a terminal DSQ, or on property > + change / sleep). > + > + **NOTE**: this concept is valid also with the ``ops.select_cpu()`` > + direct dispatch optimization. Even though it skips ``ops.enqueue()`` > + invocation, if the task is dispatched to a user-created DSQ or internal > + BPF structure, it enters BPF custody and will get ``ops.dequeue()`` when > + it leaves. If dispatched to a terminal DSQ, the BPF scheduler is done > + with it immediately. This provides the performance benefit of avoiding > + the ``ops.enqueue()`` roundtrip while maintaining correct state > + tracking. > + > + The dequeue can happen for different reasons, distinguished by flags: > + > + 1. **Regular dispatch**: when a task in BPF custody is dispatched to a > + terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for > + execution), ``ops.dequeue()`` is triggered without any special flags. > + > + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and > + core scheduling picks a task for execution while it's still in BPF > + custody, ``ops.dequeue()`` is called with the > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. > + > + 3. **Scheduling property change**: when a task property changes (via > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, > + priority changes, CPU migrations, etc.) while the task is still in > + BPF custody, ``ops.dequeue()`` is called with the > + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. > + > + **Important**: Once a task has left BPF custody (e.g. 
after being > + dispatched to a terminal DSQ), property changes will not trigger > + ``ops.dequeue()``, since the task is no longer being managed by the BPF > + scheduler. > + > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If > empty, it then looks at the global DSQ. If there still isn't a task to > run, ``ops.dispatch()`` is invoked which can use the following two > @@ -319,6 +375,8 @@ by a sched_ext scheduler: > /* Any usable CPU becomes available */ > > ops.dispatch(); /* Task is moved to a local DSQ */ > + > + ops.dequeue(); /* Exiting BPF scheduler */ > } > ops.running(); /* Task starts running on its assigned CPU */ > while (task->scx.slice > 0 && task is runnable) > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index bcb962d5ee7d8..c48f818eee9b8 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -84,6 +84,7 @@ struct scx_dispatch_q { > /* scx_entity.flags */ > enum scx_ent_flags { > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > + SCX_TASK_NEED_DEQ = 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */ Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be in BPF custody vs the core scx scheduler (terminal DSQs) this is a more general property that can be useful to check in the future. An example: We can now assert that a task's BPF state is consistent with its actual kernel state when using BPF-based data structures to manage tasks. 
> SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 0bb8fa927e9e9..d17fd9141adf4 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p) > #endif > } > > +/** > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes > + * @dsq_id: DSQ ID to check > + * > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF > + * scheduler is considered "done" with the task. > + * > + * Builtin DSQs include: > + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues > + * where tasks go directly to execution, > + * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue, > + * - Bypass DSQ: used during bypass mode. > + * > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not > + * trigger ops.dequeue() when they are later consumed. > + */ > +static inline bool is_terminal_dsq(u64 dsq_id) > +{ > + return dsq_id & SCX_DSQ_FLAG_BUILTIN; > +} > + > /** > * touch_core_sched_dispatch - Update core-sched timestamp on dispatch > * @rq: rq to read clock from, must be locked > @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p > resched_curr(rq); > } > > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, > + struct scx_dispatch_q *dsq, > struct task_struct *p, u64 enq_flags) > { > bool is_local = dsq->id == SCX_DSQ_LOCAL; > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, > dsq_mod_nr(dsq, 1); > p->scx.dsq = dsq; > > + /* > + * Handle ops.dequeue() and custody tracking. 
> + * > + * Builtin DSQs (local, global, bypass) are terminal: the BPF > + * scheduler is done with the task. If it was in BPF custody, call > + * ops.dequeue() and clear the flag. > + * > + * User DSQs: Task is in BPF scheduler's custody. Set the flag so > + * ops.dequeue() will be called when it leaves. > + */ > + if (SCX_HAS_OP(sch, dequeue)) { > + if (is_terminal_dsq(dsq->id)) { > + if (p->scx.flags & SCX_TASK_NEED_DEQ) > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, > + rq, p, 0); > + p->scx.flags &= ~SCX_TASK_NEED_DEQ; > + } else { > + p->scx.flags |= SCX_TASK_NEED_DEQ; > + } > + } > + > /* > * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the > * direct dispatch path, but we clear them here because the direct > @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, > return; > } > > - dispatch_enqueue(sch, dsq, p, > + dispatch_enqueue(sch, rq, dsq, p, > p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); > } > > @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > * dequeue may be waiting. The store_release matches their load_acquire. > */ > atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); > + > + /* > + * Task is now in BPF scheduler's custody (queued on BPF internal > + * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called > + * when it leaves custody (e.g. dispatched to a terminal DSQ or on > + * property change). > + */ > + if (SCX_HAS_OP(sch, dequeue)) Related to the rename: Can we remove the guards and track the flag regardless of whether ops.dequeue() is present? There is no reason not to track whether a task is in BPF or the core, and it is a property that's independent of whether we implement ops.dequeue(). This also simplifies the code since we now just guard the actual ops.dequeue() call. 
> + p->scx.flags |= SCX_TASK_NEED_DEQ; > return; > > direct: > direct_dispatch(sch, p, enq_flags); > return; > local_norefill: > - dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags); > + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags); > return; > local: > dsq = &rq->scx.local_dsq; > @@ -1433,7 +1485,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > */ > touch_core_sched(rq, p); > refill_task_slice_dfl(sch, p); > - dispatch_enqueue(sch, dsq, p, enq_flags); > + dispatch_enqueue(sch, rq, dsq, p, enq_flags); > } > > static bool task_runnable(const struct task_struct *p) > @@ -1511,6 +1563,22 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags > __scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1); > } > > +/* > + * Call ops.dequeue() for a task leaving BPF custody. Adds %SCX_DEQ_SCHED_CHANGE > + * when the dequeue is due to a property change (not sleep or core-sched pick). > + */ > +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq, > + struct task_struct *p, u64 deq_flags) > +{ > + u64 flags = deq_flags; > + > + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) > + flags |= SCX_DEQ_SCHED_CHANGE; > + > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); > + p->scx.flags &= ~SCX_TASK_NEED_DEQ; > +} > + > static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > { > struct scx_sched *sch = scx_root; > @@ -1524,6 +1592,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > > switch (opss & SCX_OPSS_STATE_MASK) { > case SCX_OPSS_NONE: > + /* > + * Task is not in BPF data structures (either dispatched to > + * a DSQ or running). Only call ops.dequeue() if the task > + * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ > + * is set). 
> + * > + * If the task has already been dispatched to a terminal > + * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF > + * scheduler's custody and the flag will be clear, so we > + * skip ops.dequeue(). > + * > + * If this is a property change (not sleep/core-sched) and > + * the task is still in BPF custody, set the > + * %SCX_DEQ_SCHED_CHANGE flag. > + */ > + if (SCX_HAS_OP(sch, dequeue) && > + (p->scx.flags & SCX_TASK_NEED_DEQ)) > + call_task_dequeue(sch, rq, p, deq_flags); > break; > case SCX_OPSS_QUEUEING: > /* > @@ -1532,9 +1618,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > */ > BUG(); > case SCX_OPSS_QUEUED: > - if (SCX_HAS_OP(sch, dequeue)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > - p, deq_flags); > + /* > + * Task is still on the BPF scheduler (not dispatched yet). > + * Call ops.dequeue() to notify it is leaving BPF custody. > + */ > + if (SCX_HAS_OP(sch, dequeue)) { > + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)); > + call_task_dequeue(sch, rq, p, deq_flags); > + } > > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, > SCX_OPSS_NONE)) > @@ -1631,6 +1722,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, > struct scx_dispatch_q *src_dsq, > struct rq *dst_rq) > { > + struct scx_sched *sch = scx_root; > struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq; > > /* @dsq is locked and @p is on @dst_rq */ > @@ -1639,6 +1731,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, > > WARN_ON_ONCE(p->scx.holding_cpu >= 0); > > + /* > + * Task is moving from a non-local DSQ to a local (terminal) DSQ. > + * Call ops.dequeue() if the task was in BPF custody. 
> + */ > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) { > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0); > + p->scx.flags &= ~SCX_TASK_NEED_DEQ; > + } > + > if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) > list_add(&p->scx.dsq_list.node, &dst_dsq->list); > else > @@ -1879,7 +1980,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch, > dispatch_dequeue_locked(p, src_dsq); > raw_spin_unlock(&src_dsq->lock); > > - dispatch_enqueue(sch, dst_dsq, p, enq_flags); > + dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags); > } > > return dst_rq; > @@ -1969,14 +2070,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > * If dispatching to @rq that @p is already on, no lock dancing needed. > */ > if (rq == src_rq && rq == dst_rq) { > - dispatch_enqueue(sch, dst_dsq, p, > + dispatch_enqueue(sch, rq, dst_dsq, p, > enq_flags | SCX_ENQ_CLEAR_OPSS); > return; > } > > if (src_rq != dst_rq && > unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) { > - dispatch_enqueue(sch, find_global_dsq(sch, p), p, > + dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p, > enq_flags | SCX_ENQ_CLEAR_OPSS); > return; > } > @@ -2014,9 +2115,21 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq, > */ > if (src_rq == dst_rq) { > p->scx.holding_cpu = -1; > - dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p, > + dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p, > enq_flags); > } else { > + /* > + * Moving to a remote local DSQ. dispatch_enqueue() is > + * not used (we go through deactivate/activate), so > + * call ops.dequeue() here if the task was in BPF > + * custody. 
> + */ > + if (SCX_HAS_OP(sch, dequeue) && > + (p->scx.flags & SCX_TASK_NEED_DEQ)) { > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, > + src_rq, p, 0); > + p->scx.flags &= ~SCX_TASK_NEED_DEQ; > + } > move_remote_task_to_local_dsq(p, enq_flags, > src_rq, dst_rq); > /* task has been moved to dst_rq, which is now locked */ > @@ -2113,7 +2226,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, > if (dsq->id == SCX_DSQ_LOCAL) > dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags); > else > - dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); > + dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS); > } > > static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq) > @@ -2414,7 +2527,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, > * DSQ. > */ > if (p->scx.slice && !scx_rq_bypassing(rq)) { > - dispatch_enqueue(sch, &rq->scx.local_dsq, p, > + dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, > SCX_ENQ_HEAD); > goto switch_class; > } > @@ -2898,6 +3011,14 @@ static void scx_enable_task(struct task_struct *p) > > lockdep_assert_rq_held(rq); > > + /* > + * Verify the task is not in BPF scheduler's custody. If flag > + * transitions are consistent, the flag should always be clear > + * here. > + */ > + if (SCX_HAS_OP(sch, dequeue)) > + WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ); > + > /* > * Set the weight before calling ops.enable() so that the scheduler > * doesn't see a stale value if they inspect the task struct. > @@ -2929,6 +3050,14 @@ static void scx_disable_task(struct task_struct *p) > if (SCX_HAS_OP(sch, disable)) > SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); > scx_set_task_state(p, SCX_TASK_READY); > + > + /* > + * Verify the task is not in BPF scheduler's custody. If flag > + * transitions are consistent, the flag should always be clear > + * here. 
> + */ > + if (SCX_HAS_OP(sch, dequeue)) > + WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ); > } > > static void scx_exit_task(struct task_struct *p) > @@ -3919,7 +4048,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq, > * between bypass DSQs. > */ > dispatch_dequeue_locked(p, donor_dsq); > - dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED); > + dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED); > > /* > * $donee might have been idle and need to be woken up. No need > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 386c677e4c9a0..befa9a5d6e53f 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -982,6 +982,13 @@ enum scx_deq_flags { > * it hasn't been dispatched yet. Dequeue from the BPF side. > */ > SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, > + > + /* > + * The task is being dequeued due to a property change (e.g., > + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), > + * etc.). > + */ > + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, > }; > > enum scx_pick_idle_cpu_flags { > diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h > index c2c33df9292c2..dcc945304760f 100644 > --- a/tools/sched_ext/include/scx/enum_defs.autogen.h > +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h > @@ -21,6 +21,7 @@ > #define HAVE_SCX_CPU_PREEMPT_UNKNOWN > #define HAVE_SCX_DEQ_SLEEP > #define HAVE_SCX_DEQ_CORE_SCHED_EXEC > +#define HAVE_SCX_DEQ_SCHED_CHANGE > #define HAVE_SCX_DSQ_FLAG_BUILTIN > #define HAVE_SCX_DSQ_FLAG_LOCAL_ON > #define HAVE_SCX_DSQ_INVALID > diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h > index 2f8002bcc19ad..5da50f9376844 100644 > --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h > +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h > @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; > const volatile u64 
__SCX_ENQ_DSQ_PRIQ __weak; > #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ > > +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; > +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE > diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h > index fedec938584be..fc9a7a4d9dea5 100644 > --- a/tools/sched_ext/include/scx/enums.autogen.h > +++ b/tools/sched_ext/include/scx/enums.autogen.h > @@ -46,4 +46,5 @@ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ > + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ > } while (0) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-06 20:35 ` Emil Tsalapatis @ 2026-02-07 9:26 ` Andrea Righi 2026-02-09 17:28 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-07 9:26 UTC (permalink / raw) To: Emil Tsalapatis Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel Hi Emil, On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote: > On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote: ... > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > index bcb962d5ee7d8..c48f818eee9b8 100644 > > --- a/include/linux/sched/ext.h > > +++ b/include/linux/sched/ext.h > > @@ -84,6 +84,7 @@ struct scx_dispatch_q { > > /* scx_entity.flags */ > > enum scx_ent_flags { > > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > > + SCX_TASK_NEED_DEQ = 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */ > > Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be > in BPF custody vs the core scx scheduler (terminal DSQs) this is a more > general property that can be useful to check in the future. An example: > We can now assert that a task's BPF state is consistent with its actual > kernel state when using BPF-based data structures to manage tasks. Ack. I like SCX_TASK_IN_BPF and I also like the idea of reusing the flag for other purposes. It can be helpful for debugging as well.
> > > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ > > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index 0bb8fa927e9e9..d17fd9141adf4 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p) > > #endif > > } > > > > +/** > > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes > > + * @dsq_id: DSQ ID to check > > + * > > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF > > + * scheduler is considered "done" with the task. > > + * > > + * Builtin DSQs include: > > + * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues > > + * where tasks go directly to execution, > > + * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue, > > + * - Bypass DSQ: used during bypass mode. > > + * > > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not > > + * trigger ops.dequeue() when they are later consumed. > > + */ > > +static inline bool is_terminal_dsq(u64 dsq_id) > > +{ > > + return dsq_id & SCX_DSQ_FLAG_BUILTIN; > > +} > > + > > /** > > * touch_core_sched_dispatch - Update core-sched timestamp on dispatch > > * @rq: rq to read clock from, must be locked > > @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p > > resched_curr(rq); > > } > > > > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, > > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq, > > + struct scx_dispatch_q *dsq, > > struct task_struct *p, u64 enq_flags) > > { > > bool is_local = dsq->id == SCX_DSQ_LOCAL; > > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, > > dsq_mod_nr(dsq, 1); > > p->scx.dsq = dsq; > > > > + /* > > + * Handle ops.dequeue() and custody tracking. 
> > + * > > + * Builtin DSQs (local, global, bypass) are terminal: the BPF > > + * scheduler is done with the task. If it was in BPF custody, call > > + * ops.dequeue() and clear the flag. > > + * > > + * User DSQs: Task is in BPF scheduler's custody. Set the flag so > > + * ops.dequeue() will be called when it leaves. > > + */ > > + if (SCX_HAS_OP(sch, dequeue)) { > > + if (is_terminal_dsq(dsq->id)) { > > + if (p->scx.flags & SCX_TASK_NEED_DEQ) > > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, > > + rq, p, 0); > > + p->scx.flags &= ~SCX_TASK_NEED_DEQ; > > + } else { > > + p->scx.flags |= SCX_TASK_NEED_DEQ; > > + } > > + } > > + > > /* > > * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the > > * direct dispatch path, but we clear them here because the direct > > @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, > > return; > > } > > > > - dispatch_enqueue(sch, dsq, p, > > + dispatch_enqueue(sch, rq, dsq, p, > > p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS); > > } > > > > @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > > * dequeue may be waiting. The store_release matches their load_acquire. > > */ > > atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq); > > + > > + /* > > + * Task is now in BPF scheduler's custody (queued on BPF internal > > + * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called > > + * when it leaves custody (e.g. dispatched to a terminal DSQ or on > > + * property change). > > + */ > > + if (SCX_HAS_OP(sch, dequeue)) > > Related to the rename: Can we remove the guards and track the flag > regardless of whether ops.dequeue() is present? > > There is no reason not to track whether a task is in BPF or the core, > and it is a property that's independent of whether we implement ops.dequeue(). > This also simplifies the code since we now just guard the actual ops.dequeue() > call. 
I was concerned about introducing overhead, with the guard we can save a few memory writes to p->scx.flags. But I don't have numbers and probably the overhead is negligible. Also, if we have a working ops.dequeue(), I guess more schedulers will start implementing an ops.dequeue() callback, so the guard itself may actually become the extra overhead. So, I guess we can remove the guard and just set/clear the flag even without an ops.dequeue() callback... Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-07 9:26 ` Andrea Righi @ 2026-02-09 17:28 ` Tejun Heo 2026-02-09 19:06 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-09 17:28 UTC (permalink / raw) To: Andrea Righi Cc: Emil Tsalapatis, David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel On Sat, Feb 07, 2026 at 10:26:17AM +0100, Andrea Righi wrote: > Hi Emil, > > On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote: > > On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote: > ... > > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > > index bcb962d5ee7d8..c48f818eee9b8 100644 > > > --- a/include/linux/sched/ext.h > > > +++ b/include/linux/sched/ext.h > > > @@ -84,6 +84,7 @@ struct scx_dispatch_q { > > > /* scx_entity.flags */ > > > enum scx_ent_flags { > > > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > > > + SCX_TASK_NEED_DEQ = 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */ > > > > Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be > > in BPF custody vs the core scx scheduler (terminal DSQs) this is a more > > general property that can be useful to check in the future. An example: > > We can now assert that a task's BPF state is consistent with its actual > > kernel state when using BPF-based data structures to manage tasks. > > Ack. I like SCX_TASK_IN_BPF and I also like the idea of resuing the flag > for other purposes. It can be helpful for debugging as well. One problem with the name is that when a task is in the BPF scheduler's custody, it can be still be on the kernel side in a DSQ or can be on the BPF side on a BPF data structure. This is currently distinguished by SCX_OPSS state (queued on the ops side or not). We do say things like "the task is in BPF" to note that the task is not on a DSQ but in BPF proper, so I think SCX_TASK_IN_BPF can become confusing. 
I don't know what the right name is. When we write it out, we say "in BPF sched's custody" where "BPF sched" means the whole SCX scheduler. Maybe just SCX_TASK_IN_CUSTODY? Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-09 17:28 ` Tejun Heo @ 2026-02-09 19:06 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-09 19:06 UTC (permalink / raw) To: Tejun Heo Cc: Emil Tsalapatis, David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel On Mon, Feb 09, 2026 at 07:28:50AM -1000, Tejun Heo wrote: > On Sat, Feb 07, 2026 at 10:26:17AM +0100, Andrea Righi wrote: > > Hi Emil, > > > > On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote: > > > On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote: > > ... > > > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > > > index bcb962d5ee7d8..c48f818eee9b8 100644 > > > > --- a/include/linux/sched/ext.h > > > > +++ b/include/linux/sched/ext.h > > > > @@ -84,6 +84,7 @@ struct scx_dispatch_q { > > > > /* scx_entity.flags */ > > > > enum scx_ent_flags { > > > > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > > > > + SCX_TASK_NEED_DEQ = 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */ > > > > > > Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be > > > in BPF custody vs the core scx scheduler (terminal DSQs) this is a more > > > general property that can be useful to check in the future. An example: > > > We can now assert that a task's BPF state is consistent with its actual > > > kernel state when using BPF-based data structures to manage tasks. > > > > Ack. I like SCX_TASK_IN_BPF and I also like the idea of resuing the flag > > for other purposes. It can be helpful for debugging as well. > > One problem with the name is that when a task is in the BPF scheduler's > custody, it can be still be on the kernel side in a DSQ or can be on the BPF > side on a BPF data structure. This is currently distinguished by SCX_OPSS > state (queued on the ops side or not). 
We do say things like "the task is in > BPF" to note that the task is not on a DSQ but in BPF proper, so I think > SCX_TASK_IN_BPF can become confusing. > > I don't know what the right name is. When we write it out, we say "in BPF > sched's custody" where "BPF sched" means the whole SCX scheduler. Maybe just > SCX_TASK_IN_CUSTODY? Yeah, I agree that the "task in BPF" concept is a bit too overloaded. I think SCX_TASK_IN_CUSTODY is clear enough and it doesn't overlap with the "in BPF" concept. I'll rename the flag to SCX_TASK_IN_CUSTODY. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCHSET v5] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-04 16:05 Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-04 16:05 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
sched-ext, linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g. sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
To identify property change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
Changes in v5:
- Introduce the concept of "terminal DSQ" (when a task is dispatched to a
terminal DSQ, the task leaves the BPF scheduler's custody)
- Consider SCX_DSQ_GLOBAL as a terminal DSQ
- Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for direct dispatches to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
custody, further dequeue events are not reported.
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
Documentation/scheduler/sched-ext.rst | 74 +++++++
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 186 +++++++++++++++-
kernel/sched/ext_internal.h | 7 +
tools/sched_ext/include/scx/enum_defs.autogen.h | 1 +
tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 269 ++++++++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 207 ++++++++++++++++++
10 files changed, 746 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply [flat|nested] 81+ messages in thread* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-04 16:05 [PATCHSET v5] " Andrea Righi @ 2026-02-04 16:05 ` Andrea Righi 2026-02-04 22:14 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-04 16:05 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change events. In addition, ops.dequeue() callbacks are completely skipped when tasks are dispatched to non-local DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably track task state. Fix this by guaranteeing that each task entering the BPF scheduler's custody triggers exactly one ops.dequeue() call when it leaves that custody, whether the exit is due to a dispatch (regular or via a core scheduling pick) or to a scheduling property change (e.g. sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, etc.). BPF scheduler custody concept: a task is considered to be in "BPF scheduler's custody" when it has been queued in user-created DSQs and the BPF scheduler is responsible for its lifecycle. Custody ends when the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a property change. Tasks directly dispatched to terminal DSQs bypass the BPF scheduler entirely and are not in its custody. Terminal DSQs include: - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues where tasks go directly to execution. - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the BPF scheduler is considered "done" with the task. 
As a result, ops.dequeue() is not invoked for tasks dispatched to terminal DSQs, as the BPF scheduler no longer retains custody of them. To identify dequeues triggered by scheduling property changes, introduce the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, the dequeue was caused by a scheduling property change. New ops.dequeue() semantics: - ops.dequeue() is invoked exactly once when the task leaves the BPF scheduler's custody, in one of the following cases: a) regular dispatch: a task dispatched to a user DSQ is moved to a terminal DSQ (ops.dequeue() called without any special flags set), b) core scheduling dispatch: core-sched picks task before dispatch, ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set, c) property change: task properties modified before dispatch, ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set. This allows BPF schedulers to: - reliably track task ownership and lifecycle, - maintain accurate accounting of managed tasks, - update internal state when tasks change properties. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- Documentation/scheduler/sched-ext.rst | 74 +++++++ include/linux/sched/ext.h | 1 + kernel/sched/ext.c | 186 +++++++++++++++++- kernel/sched/ext_internal.h | 7 + .../sched_ext/include/scx/enum_defs.autogen.h | 1 + .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + tools/sched_ext/include/scx/enums.autogen.h | 1 + 7 files changed, 269 insertions(+), 3 deletions(-) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index 404fe6126a769..1457f2aefa93e 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -252,6 +252,78 @@ The following briefly shows how a waking task is scheduled and executed. * Queue the task on the BPF side. 
+ **Task State Tracking and ops.dequeue() Semantics** + + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may + enter the "BPF scheduler's custody" depending on where it's dispatched: + + * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``, + ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler + is done with the task - it either goes straight to a CPU's local run + queue or to the global DSQ as a fallback. The task never enters (or + exits) BPF custody, and ``ops.dequeue()`` will not be called. + + * **Dispatch to user-created DSQs** (custom DSQs): the task enters the + BPF scheduler's custody. When the task later leaves BPF custody + (dispatched to a terminal DSQ, picked by core-sched, or dequeued for + sleep/property changes), ``ops.dequeue()`` will be called exactly once. + + * **Queued on BPF side**: The task is in BPF data structures and in BPF + custody, ``ops.dequeue()`` will be called when it leaves. + + The key principle: **ops.dequeue() is called when a task leaves the BPF + scheduler's custody**. + + This works also with the ``ops.select_cpu()`` direct dispatch + optimization: even though it skips ``ops.enqueue()`` invocation, if the + task is dispatched to a user-created DSQ, it enters BPF custody and will + get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ, + the BPF scheduler is done with it immediately. This provides the + performance benefit of avoiding the ``ops.enqueue()`` roundtrip while + maintaining correct state tracking. + + The dequeue can happen for different reasons, distinguished by flags: + + 1. **Regular dispatch workflow**: when the task is dispatched from a + user-created DSQ to a terminal DSQ (leaving BPF custody for execution), + ``ops.dequeue()`` is triggered without any special flags. + + 2. 
**Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and + core scheduling picks a task for execution while it's still in BPF + custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. + + 3. **Scheduling property change**: when a task property changes (via + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, + priority changes, CPU migrations, etc.) while the task is still in + BPF custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. + + **Important**: Once a task has left BPF custody (dispatched to a + terminal DSQ), property changes will not trigger ``ops.dequeue()``, + since the task is no longer being managed by the BPF scheduler. + + **Property Change Notifications for Running Tasks**: + + For tasks that have left BPF custody (running or on terminal DSQs), + property changes can be intercepted through the dedicated callbacks: + + * ``ops.set_cpumask()``: Called when a task's CPU affinity changes + (e.g., via ``sched_setaffinity()``). This callback is invoked for + all tasks regardless of their state or BPF custody. + + * ``ops.set_weight()``: Called when a task's scheduling weight/priority + changes (e.g., via ``sched_setscheduler()`` or ``set_user_nice()``). + This callback is also invoked for all tasks. + + These callbacks provide complete coverage for property changes, + complementing ``ops.dequeue()`` which only applies to tasks in BPF + custody. + + BPF schedulers can choose not to implement ``ops.dequeue()`` if they + don't need to track these transitions. The sched_ext core will safely + handle all dequeue operations regardless. + 3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. 
If there still isn't a task to run, ``ops.dispatch()`` is invoked which can use the following two @@ -319,6 +391,8 @@ by a sched_ext scheduler: /* Any usable CPU becomes available */ ops.dispatch(); /* Task is moved to a local DSQ */ + + ops.dequeue(); /* Exiting BPF scheduler */ } ops.running(); /* Task starts running on its assigned CPU */ while (task->scx.slice > 0 && task is runnable) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index bcb962d5ee7d8..8d7c13e75efec 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -84,6 +84,7 @@ struct scx_dispatch_q { /* scx_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* in BPF scheduler's custody */ SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index afe28c04d5aa7..34ba6870d2abf 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -924,6 +924,26 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p) #endif } +/** + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes + * @dsq_id: DSQ ID to check + * + * Returns true if @dsq_id is a terminal DSQ where the BPF scheduler is + * considered "done" with the task. Terminal DSQs include: + * - Local DSQs (SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON): per-CPU queues where + * tasks go directly to execution + * - Global DSQ (SCX_DSQ_GLOBAL): the built-in fallback queue + * + * Tasks dispatched to terminal DSQs exit BPF scheduler custody and do not + * trigger ops.dequeue() when they are later consumed. 
+ */ +static inline bool is_terminal_dsq(u64 dsq_id) +{ + return dsq_id == SCX_DSQ_LOCAL || + (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON || + dsq_id == SCX_DSQ_GLOBAL; +} + /** * touch_core_sched_dispatch - Update core-sched timestamp on dispatch * @rq: rq to read clock from, must be locked @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq, dsq_mod_nr(dsq, 1); p->scx.dsq = dsq; + /* + * Mark task as in BPF scheduler's custody if being queued to a + * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are + * terminal: tasks on them have left BPF custody. + * + * Don't touch the flag if already set (e.g., by + * mark_direct_dispatch() or direct_dispatch()/finish_dispatch() + * for user DSQs). + */ + if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN)) + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + /* * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the * direct dispatch path, but we clear them here because the direct @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch, p->scx.ddsp_dsq_id = dsq_id; p->scx.ddsp_enq_flags = enq_flags; + + /* + * Mark the task as entering BPF scheduler's custody if it's being + * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This + * handles the case where ops.select_cpu() directly dispatches - even + * though ops.enqueue() won't be called, the task enters BPF custody + * if dispatched to a user DSQ and should get ops.dequeue() when it + * leaves. + * + * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag + * is clear since the BPF scheduler is done with the task. 
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_terminal_dsq(dsq_id))
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		else
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+	}
 }

 static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
@@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 	p->scx.ddsp_enq_flags |= enq_flags;

+	/*
+	 * The task is about to be dispatched, handle ops.dequeue() based
+	 * on where the task is going.
+	 *
+	 * Key principle: ops.dequeue() is called when a task leaves the
+	 * BPF scheduler's custody. A task is in BPF custody if it's on a
+	 * user-created DSQ or in BPF data structures. Once dispatched to a
+	 * terminal DSQ (local DSQ or SCX_DSQ_GLOBAL), the BPF scheduler is
+	 * done with it.
+	 *
+	 * Direct dispatch to terminal DSQs: task never enters (or exits)
+	 * BPF scheduler's custody. If it was in custody, call ops.dequeue()
+	 * to notify the BPF scheduler. Clear the flag so future property
+	 * changes also won't trigger ops.dequeue().
+	 *
+	 * Direct dispatch to user DSQs: task enters BPF scheduler's custody.
+	 * Mark the task as in BPF custody so that when it's later dispatched
+	 * to a terminal DSQ or dequeued for property changes, ops.dequeue()
+	 * will be called.
+	 *
+	 * This also handles the ops.select_cpu() direct dispatch: the
+	 * shortcut skips ops.enqueue() but the task still enters BPF custody
+	 * if dispatched to a user DSQ, and thus needs ops.dequeue() when it
+	 * leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_terminal_dsq(dsq->id)) {
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		} else {
+			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
+	}
+
 	/*
 	 * We are in the enqueue path with @rq locked and pinned, and thus can't
 	 * double lock a remote rq and enqueue to its local DSQ. For
@@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not in BPF data structures (either dispatched to
+		 * a DSQ or running). Only call ops.dequeue() if the task
+		 * is still in BPF scheduler's custody
+		 * (%SCX_TASK_OPS_ENQUEUED is set).
+		 *
+		 * If the task has already been dispatched to a terminal
+		 * DSQ (local DSQ or SCX_DSQ_GLOBAL), it has left the BPF
+		 * scheduler's custody and the flag will be clear, so we
+		 * skip ops.dequeue().
+		 *
+		 * If this is a property change (not sleep/core-sched) and
+		 * the task is still in BPF custody, set the
+		 * %SCX_DEQ_SCHED_CHANGE flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+			u64 flags = deq_flags;
+
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+		 * only for property changes, not for core-sched picks or
+		 * sleep.
+		 *
+		 * Clear the flag after calling ops.dequeue(): the task is
+		 * leaving BPF scheduler's custody.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
+			u64 flags = deq_flags;
+
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}

 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;

 	/* @dsq is locked and @p is on @dst_rq */
@@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);

+	/*
+	 * Task is moving from a non-local DSQ to a local DSQ. Call
+	 * ops.dequeue() if the task was in BPF custody.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));

+	/*
+	 * Handle ops.dequeue() based on destination DSQ.
+	 *
+	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
+	 * scheduler is done with the task. Call ops.dequeue() if it was in
+	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
+	 *
+	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
+	 * Mark it so ops.dequeue() will be called when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_terminal_dsq(dsq_id)) {
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		} else {
+			/*
+			 * Locking: we're holding the @rq lock (the
+			 * dispatch CPU's rq), but not necessarily
+			 * task_rq(p), since @p may be from a remote CPU.
+			 *
+			 * This is safe because SCX_OPSS_DISPATCHING state
+			 * prevents racing dequeues, any concurrent
+			 * ops_dequeue() will wait for this state to clear.
+			 */
+			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
+	}
+
 	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);

 	if (dsq->id == SCX_DSQ_LOCAL)
@@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
 	lockdep_assert_rq_held(rq);

+	/*
+	 * Clear enqueue/dequeue tracking flags when enabling the task.
+	 * This ensures a clean state when the task enters SCX. Only needed
+	 * if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Clear enqueue/dequeue tracking flags when disabling the task.
+	 * Only needed if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
 }

 static void scx_exit_task(struct task_struct *p)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
 };

 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@
 const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
--
2.53.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-04 22:14 ` Tejun Heo
  2026-02-05 9:26 ` Andrea Righi
  0 siblings, 1 reply; 81+ messages in thread
From: Tejun Heo @ 2026-02-04 22:14 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.
>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
> - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
>   where tasks go directly to execution.
> - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
>   BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.
>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> ...
> + **Property Change Notifications for Running Tasks**:
> +
> + For tasks that have left BPF custody (running or on terminal DSQs),
> + property changes can be intercepted through the dedicated callbacks:

I'm not sure this section is necessary. The way it's phrased makes it sound
like schedulers would use DEQ_SCHED_CHANGE to process property changes but
that's not the case. Relevant property changes will be notified in whatever
ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
necessarily mean there will be an associated property change event either.
e.g. We don't do anything re. on sched_setnuma().

> @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> 	dsq_mod_nr(dsq, 1);
> 	p->scx.dsq = dsq;
>
> +	/*
> +	 * Mark task as in BPF scheduler's custody if being queued to a
> +	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> +	 * terminal: tasks on them have left BPF custody.
> +	 *
> +	 * Don't touch the flag if already set (e.g., by
> +	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> +	 * for user DSQs).
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;

given that this is tied to dequeue, maybe a more direct name would be less
confusing? e.g. something like SCX_TASK_NEED_DEQ?

> @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
>
> 	p->scx.ddsp_dsq_id = dsq_id;
> 	p->scx.ddsp_enq_flags = enq_flags;
> +
> +	/*
> +	 * Mark the task as entering BPF scheduler's custody if it's being
> +	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> +	 * handles the case where ops.select_cpu() directly dispatches - even
> +	 * though ops.enqueue() won't be called, the task enters BPF custody
> +	 * if dispatched to a user DSQ and should get ops.dequeue() when it
> +	 * leaves.
> +	 *
> +	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> +	 * is clear since the BPF scheduler is done with the task.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (!is_terminal_dsq(dsq_id))
> +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		else
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +	}

Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
AND dispatch_enqueue(). The flag should be clear when off SCX. The only
places where it could be set is from the enqueue path - when a task is
direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
captured in do_enqueue_task(), no?

> static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
...
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (!is_terminal_dsq(dsq->id)) {
> +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		} else {
> +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
> +	}

And when would direct_dispatch() need to call ops.dequeue()?
direct_dispatch() is only used from do_enqueue_task() and there can only be
one direct dispatch attempt on any given enqueue event. A task being
enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
to either a terminal or non-terminal DSQ. If terminal, there's nothing to
do. If non-terminal, the flag would need to be set. Am I missing something?

> @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
...
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {

nit: () around & expression.

> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
> 		break;
> 	case SCX_OPSS_QUEUEING:
> 		/*
> @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> 		 */
> 		BUG();
> 	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> +		 * only for property changes, not for core-sched picks or
> +		 * sleep.
> +		 *
> +		 * Clear the flag after calling ops.dequeue(): the task is
> +		 * leaving BPF scheduler's custody.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;

I wonder whether this and the above block can be factored somehow.

> @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> 					 struct scx_dispatch_q *src_dsq,
> 					 struct rq *dst_rq)
> {
> +	struct scx_sched *sch = scx_root;
> 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>
> 	/* @dsq is locked and @p is on @dst_rq */
> @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>
> 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>
> +	/*
> +	 * Task is moving from a non-local DSQ to a local DSQ. Call
> +	 * ops.dequeue() if the task was in BPF custody.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +	}
> +
> 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> 	else
> @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>
> 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
>
> +	/*
> +	 * Handle ops.dequeue() based on destination DSQ.
> +	 *
> +	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> +	 * scheduler is done with the task. Call ops.dequeue() if it was in
> +	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> +	 *
> +	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
> +	 * Mark it so ops.dequeue() will be called when it leaves.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (!is_terminal_dsq(dsq_id)) {
> +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		} else {

Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
}". Continuing from earlier, I don't understand why we'd need to set
OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
can't think of conditions where we'd need to set OPS_ENQUEUED from
ops.dispatch().

> +			/*
> +			 * Locking: we're holding the @rq lock (the
> +			 * dispatch CPU's rq), but not necessarily
> +			 * task_rq(p), since @p may be from a remote CPU.
> +			 *
> +			 * This is safe because SCX_OPSS_DISPATCHING state
> +			 * prevents racing dequeues, any concurrent
> +			 * ops_dequeue() will wait for this state to clear.
> +			 */
> +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
> +	}

I'm not sure finish_dispatch() is the right place to do this. e.g.
scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
the above wouldn't cover it. Wouldn't it make more sense to do this in
dispatch_enqueue()?

> @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
>
> 	lockdep_assert_rq_held(rq);
>
> +	/*
> +	 * Clear enqueue/dequeue tracking flags when enabling the task.
> +	 * This ensures a clean state when the task enters SCX. Only needed
> +	 * if ops.dequeue() is implemented.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +
> 	/*
> 	 * Set the weight before calling ops.enable() so that the scheduler
> 	 * doesn't see a stale value if they inspect the task struct.
> @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> 	if (SCX_HAS_OP(sch, disable))
> 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> 	scx_set_task_state(p, SCX_TASK_READY);
> +
> +	/*
> +	 * Clear enqueue/dequeue tracking flags when disabling the task.
> +	 * Only needed if ops.dequeue() is implemented.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;

If we make the flag transitions consistent, we shouldn't need these, right?
We can add WARN_ON_ONCE() at the head of enqueue maybe.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 22:14 ` Tejun Heo
@ 2026-02-05 9:26 ` Andrea Righi
  0 siblings, 0 replies; 81+ messages in thread
From: Andrea Righi @ 2026-02-05 9:26 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hi Tejun,

On Wed, Feb 04, 2026 at 12:14:40PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> >
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> >
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
> >
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> > - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> >   where tasks go directly to execution.
> > - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> >   BPF scheduler is considered "done" with the task.
> >
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
> >
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> >
> ...
> > + **Property Change Notifications for Running Tasks**:
> > +
> > + For tasks that have left BPF custody (running or on terminal DSQs),
> > + property changes can be intercepted through the dedicated callbacks:
>
> I'm not sure this section is necessary. The way it's phrased makes it sound
> like schedulers would use DEQ_SCHED_CHANGE to process property changes but
> that's not the case. Relevant property changes will be notified in whatever
> ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
> necessarily mean there will be an associated property change event either.
> e.g. We don't do anything re. on sched_setnuma().

Agreed, this section is a bit misleading, DEQ_SCHED_CHANGE is an
informational flag indicating the ops.dequeue() wasn't due to dispatch,
schedulers shouldn't use it to process property changes. I'll remove it.

>
> > @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > 	dsq_mod_nr(dsq, 1);
> > 	p->scx.dsq = dsq;
> >
> > +	/*
> > +	 * Mark task as in BPF scheduler's custody if being queued to a
> > +	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> > +	 * terminal: tasks on them have left BPF custody.
> > +	 *
> > +	 * Don't touch the flag if already set (e.g., by
> > +	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> > +	 * for user DSQs).
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> > +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>
> given that this is tied to dequeue, maybe a more direct name would be less
> confusing? e.g. something like SCX_TASK_NEED_DEQ?

Ack.

>
> > @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
> >
> > 	p->scx.ddsp_dsq_id = dsq_id;
> > 	p->scx.ddsp_enq_flags = enq_flags;
> > +
> > +	/*
> > +	 * Mark the task as entering BPF scheduler's custody if it's being
> > +	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> > +	 * handles the case where ops.select_cpu() directly dispatches - even
> > +	 * though ops.enqueue() won't be called, the task enters BPF custody
> > +	 * if dispatched to a user DSQ and should get ops.dequeue() when it
> > +	 * leaves.
> > +	 *
> > +	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> > +	 * is clear since the BPF scheduler is done with the task.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id))
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		else
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
>
> Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
> AND dispatch_enqueue(). The flag should be clear when off SCX. The only
> places where it could be set is from the enqueue path - when a task is
> direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
> captured in do_enqueue_task(), no?

You're right. I was incorrectly assuming we needed this in
mark_direct_dispatch() to catch direct dispatches to user DSQs from
ops.select_cpu(), but that's not true. All paths go through
do_enqueue_task() which funnels to dispatch_enqueue(), so we can handle it
all in one place.

>
> > static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> > @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> ...
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq->id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
>
> And when would direct_dispatch() need to call ops.dequeue()?
> direct_dispatch() is only used from do_enqueue_task() and there can only be
> one direct dispatch attempt on any given enqueue event. A task being
> enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
> to either a terminal or non-terminal DSQ. If terminal, there's nothing to
> do. If non-terminal, the flag would need to be set. Am I missing something?

Nah, you're right, direct_dispatch() doesn't need to call ops.dequeue() or
manage the flag. I'll remove all the flag management from direct_dispatch()
and centralize it in dispatch_enqueue().

>
> > @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> ...
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
>
> nit: () around & expression.
>
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > 		break;
> > 	case SCX_OPSS_QUEUEING:
> > 		/*
> > @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > 		 */
> > 		BUG();
> > 	case SCX_OPSS_QUEUED:
> > -		if (SCX_HAS_OP(sch, dequeue))
> > -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > -					 p, deq_flags);
> > +		/*
> > +		 * Task is still on the BPF scheduler (not dispatched yet).
> > +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > +		 * only for property changes, not for core-sched picks or
> > +		 * sleep.
> > +		 *
> > +		 * Clear the flag after calling ops.dequeue(): the task is
> > +		 * leaving BPF scheduler's custody.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue)) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>
> I wonder whether this and the above block can be factored somehow.

Ack, we can add a helper for this.

>
> > @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> > 					 struct scx_dispatch_q *src_dsq,
> > 					 struct rq *dst_rq)
> > {
> > +	struct scx_sched *sch = scx_root;
> > 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >
> > 	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >
> > 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local DSQ. Call
> > +	 * ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> > +
> > 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> > 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> > 	else
> > @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >
> > 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
> >
> > +	/*
> > +	 * Handle ops.dequeue() based on destination DSQ.
> > +	 *
> > +	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> > +	 * scheduler is done with the task. Call ops.dequeue() if it was in
> > +	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> > +	 *
> > +	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
> > +	 * Mark it so ops.dequeue() will be called when it leaves.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
>
> Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
> }". Continuing from earlier, I don't understand why we'd need to set
> OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
> can't think of conditions where we'd need to set OPS_ENQUEUED from
> ops.dispatch().

Right, a task that reaches ops.dispatch() is already in QUEUED state, if
it's in a user DSQ the flag is already set from when it was enqueued, so
there's no need to set the flag in finish_dispatch().

>
> > +			/*
> > +			 * Locking: we're holding the @rq lock (the
> > +			 * dispatch CPU's rq), but not necessarily
> > +			 * task_rq(p), since @p may be from a remote CPU.
> > +			 *
> > +			 * This is safe because SCX_OPSS_DISPATCHING state
> > +			 * prevents racing dequeues, any concurrent
> > +			 * ops_dequeue() will wait for this state to clear.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
>
> I'm not sure finish_dispatch() is the right place to do this. e.g.
> scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
> the above wouldn't cover it. Wouldn't it make more sense to do this in
> dispatch_enqueue()?

Agreed.

>
> > @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
> >
> > 	lockdep_assert_rq_held(rq);
> >
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when enabling the task.
> > +	 * This ensures a clean state when the task enters SCX. Only needed
> > +	 * if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +
> > 	/*
> > 	 * Set the weight before calling ops.enable() so that the scheduler
> > 	 * doesn't see a stale value if they inspect the task struct.
> > @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> > 	if (SCX_HAS_OP(sch, disable))
> > 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> > 	scx_set_task_state(p, SCX_TASK_READY);
> > +
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when disabling the task.
> > +	 * Only needed if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>
> If we make the flag transitions consistent, we shouldn't need these, right?
> We can add WARN_ON_ONCE() at the head of enqueue maybe.

Correct.

Thanks for the review! I'll post a new version.
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-01 9:08 Andrea Righi
2026-02-01 9:08 ` [PATCH 1/2] " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-01 9:08 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
sched-ext, linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change), the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g. sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
To identify property-change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
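The exactly-once guarantee described above can be pictured as a tiny state
machine: entering a user DSQ sets a "needs dequeue" flag on the task, and the
first exit from custody consumes it. A minimal userspace C sketch of that
model (the names TASK_NEED_DEQ, enqueue_to() and leave() are invented for
illustration; this is not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_NEED_DEQ 0x1	/* stands in for the series' per-task flag */

struct task {
	unsigned int flags;
	int ndeq;		/* number of times "ops.dequeue()" fired */
};

/* Queue a task: only user DSQs put it into BPF scheduler custody. */
static void enqueue_to(struct task *p, bool user_dsq)
{
	if (user_dsq)
		p->flags |= TASK_NEED_DEQ;
}

/* Task leaves (dispatch, core-sched pick, or property change): fire the
 * dequeue notification at most once per custody period. */
static void leave(struct task *p)
{
	if (p->flags & TASK_NEED_DEQ) {
		p->ndeq++;
		p->flags &= ~TASK_NEED_DEQ;
	}
}
```

Under this model a task direct-dispatched to a local DSQ never sets the flag
and so never sees a dequeue notification, while a task queued to a user DSQ
sees exactly one regardless of which exit path fires first.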
= Open issues =
Even though a few refinements are still pending, I'm sending a new
patchset, so we can comment more effectively based on the latest
agreed-upon semantics.
Open issues that still need agreement:
- Should we trigger ops.dequeue() for tasks dispatched to SCX_DSQ_GLOBAL?
(do we treat SCX_DSQ_GLOBAL as a local DSQ or a user DSQ?). In the
current implementation, SCX_DSQ_GLOBAL is treated like a built-in user
DSQ, so ops.dequeue() is invoked for tasks dispatched to SCX_DSQ_GLOBAL.
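The local/global distinction in this question falls out of the DSQ id
encoding: built-in DSQ ids carry a high flag bit. A userspace sketch of the
two candidate readings of is_terminal_dsq() (constant values below mirror my
reading of the sched_ext headers and should be treated as illustrative, not
authoritative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative copy of the DSQ id layout (see enum scx_dsq_id_flags). */
#define SCX_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define SCX_DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define SCX_DSQ_GLOBAL		(SCX_DSQ_FLAG_BUILTIN | 1)
#define SCX_DSQ_LOCAL		(SCX_DSQ_FLAG_BUILTIN | 2)
#define SCX_DSQ_LOCAL_ON	(SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON)

/* Reading 1: every builtin DSQ is terminal, so one bit test suffices;
 * user-created DSQs have plain small ids without the high bit. */
static bool is_terminal_dsq(uint64_t dsq_id)
{
	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
}

/* Reading 2 (the v4 behavior above): SCX_DSQ_GLOBAL is treated like a
 * user DSQ and has to be carved out explicitly. */
static bool is_terminal_dsq_v4(uint64_t dsq_id)
{
	return (dsq_id & SCX_DSQ_FLAG_BUILTIN) && dsq_id != SCX_DSQ_GLOBAL;
}
```

The single-bit test is the simpler invariant, which is part of why the open
question matters: carving out the global DSQ costs an extra comparison on
every dispatch and makes the custody rule harder to state.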
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for tasks directly dispatched to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
custody, further dequeue events are not reported.
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
Documentation/scheduler/sched-ext.rst | 76 ++++++++
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 168 ++++++++++++++++-
kernel/sched/ext_internal.h | 7 +
tools/sched_ext/include/scx/enum_defs.autogen.h | 1 +
tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 234 ++++++++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 198 ++++++++++++++++++++
10 files changed, 686 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply [flat|nested] 81+ messages in thread* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-01 9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi @ 2026-02-01 9:08 ` Andrea Righi 2026-02-01 22:47 ` Christian Loehle 2026-02-02 11:56 ` Kuba Piecuch 0 siblings, 2 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-01 9:08 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change events. In addition, ops.dequeue() callbacks are completely skipped when tasks are dispatched to non-local DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably track task state. Fix this by guaranteeing that each task entering the BPF scheduler's custody triggers exactly one ops.dequeue() call when it leaves that custody, whether the exit is due to a dispatch (regular or via a core scheduling pick) or to a scheduling property change (e.g. sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, etc.). BPF scheduler custody concept: a task is considered to be in "BPF scheduler's custody" when it has been queued in BPF-managed data structures and the BPF scheduler is responsible for its lifecycle. Custody ends when the task is dispatched to a local DSQ, selected by core scheduling, or removed due to a property change. Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its custody. As a result, ops.dequeue() is not invoked for these tasks. To identify dequeues triggered by scheduling property changes, introduce the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, the dequeue was caused by a scheduling property change. 
New ops.dequeue() semantics: - ops.dequeue() is invoked exactly once when the task leaves the BPF scheduler's custody, in one of the following cases: a) regular dispatch: task was dispatched to a non-local DSQ (global or user DSQ), ops.dequeue() called without any special flags set b) core scheduling dispatch: core-sched picks task before dispatch, dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set c) property change: task properties modified before dispatch, dequeue called with %SCX_DEQ_SCHED_CHANGE flag set This allows BPF schedulers to: - reliably track task ownership and lifecycle, - maintain accurate accounting of managed tasks, - update internal state when tasks change properties. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- Documentation/scheduler/sched-ext.rst | 76 ++++++++ include/linux/sched/ext.h | 1 + kernel/sched/ext.c | 168 +++++++++++++++++- kernel/sched/ext_internal.h | 7 + .../sched_ext/include/scx/enum_defs.autogen.h | 1 + .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + tools/sched_ext/include/scx/enums.autogen.h | 1 + 7 files changed, 253 insertions(+), 3 deletions(-) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index 404fe6126a769..6d9e82e6ca9d4 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed. * Queue the task on the BPF side. + **Task State Tracking and ops.dequeue() Semantics** + + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may + enter the "BPF scheduler's custody" depending on where it's dispatched: + + * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or + ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler + entirely and goes straight to the CPU's local run queue. 
The task + never enters BPF custody, and ``ops.dequeue()`` will not be called. + + * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs): + the task enters the BPF scheduler's custody. When the task later + leaves BPF custody (dispatched to a local DSQ, picked by core-sched, + or dequeued for sleep/property changes), ``ops.dequeue()`` will be + called exactly once. + + * **Queued on BPF side**: The task is in BPF data structures and in BPF + custody, ``ops.dequeue()`` will be called when it leaves. + + The key principle: **ops.dequeue() is called when a task leaves the BPF + scheduler's custody**. A task is in BPF custody if it's on a non-local + DSQ or in BPF data structures. Once dispatched to a local DSQ or after + ops.dequeue() is called, the task is out of BPF custody and the BPF + scheduler no longer needs to track it. + + This works correctly with the ``ops.select_cpu()`` direct dispatch + optimization: even though it skips ``ops.enqueue()`` invocation, if the + task is dispatched to a non-local DSQ, it enters BPF custody and will + get ``ops.dequeue()`` when it leaves. This provides the performance + benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining + correct state tracking. + + The dequeue can happen for different reasons, distinguished by flags: + + 1. **Regular dispatch workflow**: when the task is dispatched from a + non-local DSQ to a local DSQ (leaving BPF custody for execution), + ``ops.dequeue()`` is triggered without any special flags. + + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and + core scheduling picks a task for execution while it's still in BPF + custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. + + 3. **Scheduling property change**: when a task property changes (via + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, + priority changes, CPU migrations, etc.) 
while the task is still in + BPF custody, ``ops.dequeue()`` is called with the + ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. + + **Important**: Once a task has left BPF custody (dispatched to local + DSQ), property changes will not trigger ``ops.dequeue()``, since the + task is no longer being managed by the BPF scheduler. + + **Property Change Notifications for Running Tasks**: + + For tasks that have left BPF custody (running or on local DSQs), + property changes can be intercepted through the dedicated callbacks: + + * ``ops.set_cpumask()``: Called when a task's CPU affinity changes + (e.g., via ``sched_setaffinity()``). This callback is invoked for + all tasks regardless of their state or BPF custody. + + * ``ops.set_weight()``: Called when a task's scheduling weight/priority + changes (e.g., via ``sched_setscheduler()`` or ``set_user_nice()``). + This callback is also invoked for all tasks. + + These callbacks provide complete coverage for property changes, + complementing ``ops.dequeue()`` which only applies to tasks in BPF + custody. + + BPF schedulers can choose not to implement ``ops.dequeue()`` if they + don't need to track these transitions. The sched_ext core will safely + handle all dequeue operations regardless. + 3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. 
If there still isn't a task to run, ``ops.dispatch()`` is invoked which can use the following two @@ -319,6 +393,8 @@ by a sched_ext scheduler: /* Any usable CPU becomes available */ ops.dispatch(); /* Task is moved to a local DSQ */ + + ops.dequeue(); /* Exiting BPF scheduler */ } ops.running(); /* Task starts running on its assigned CPU */ while (task->scx.slice > 0 && task is runnable) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index bcb962d5ee7d8..0d003d2845393 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -84,6 +84,7 @@ struct scx_dispatch_q { /* scx_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* under ext scheduler's custody */ SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index afe28c04d5aa7..6d6f1253039d8 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -924,6 +924,19 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p) #endif } +/** + * is_local_dsq - Check if a DSQ ID represents a local DSQ + * @dsq_id: DSQ ID to check + * + * Returns true if @dsq_id is a local DSQ, false otherwise. Local DSQs are + * per-CPU queues where tasks go directly to execution. + */ +static inline bool is_local_dsq(u64 dsq_id) +{ + return dsq_id == SCX_DSQ_LOCAL || + (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON; +} + /** * touch_core_sched_dispatch - Update core-sched timestamp on dispatch * @rq: rq to read clock from, must be locked @@ -1274,6 +1287,24 @@ static void mark_direct_dispatch(struct scx_sched *sch, p->scx.ddsp_dsq_id = dsq_id; p->scx.ddsp_enq_flags = enq_flags; + + /* + * Mark the task as entering BPF scheduler's custody if it's being + * dispatched to a non-local DSQ. 
This handles the case where + * ops.select_cpu() directly dispatches to a non-local DSQ - even + * though ops.enqueue() won't be called, the task enters BPF + * custody and should get ops.dequeue() when it leaves. + * + * For local DSQs, clear the flag, since the task bypasses the BPF + * scheduler entirely. This also clears any flag that was set by + * do_enqueue_task() before we knew the dispatch destination. + */ + if (SCX_HAS_OP(sch, dequeue)) { + if (!is_local_dsq(dsq_id)) + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + else + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + } } static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, @@ -1287,6 +1318,40 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, p->scx.ddsp_enq_flags |= enq_flags; + /* + * The task is about to be dispatched, handle ops.dequeue() based + * on where the task is going. + * + * Key principle: ops.dequeue() is called when a task leaves the + * BPF scheduler's custody. A task is in BPF custody if it's on a + * non-local DSQ or in BPF data structures. Once dispatched to a + * local DSQ, it's out of BPF custody. + * + * Direct dispatch to local DSQs: task never enters BPF scheduler's + * custody, it goes straight to the CPU. Don't call ops.dequeue() + * and clear the flag so future property changes also won't trigger + * it. + * + * Direct dispatch to non-local DSQs: task enters BPF scheduler's + * custody. Mark the task as in BPF custody so that when it's later + * dispatched to a local DSQ or dequeued for property changes, + * ops.dequeue() will be called. + * + * This also handles the ops.select_cpu() direct dispatch to + * non-local DSQs: the shortcut skips ops.enqueue() invocation but + * the task still enters BPF custody if dispatched to a non-local + * DSQ, and thus needs ops.dequeue() when it leaves. 
+ */ + if (SCX_HAS_OP(sch, dequeue)) { + if (!is_local_dsq(dsq->id)) { + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + } else { + if (p->scx.flags & SCX_TASK_OPS_ENQUEUED) + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + } + } + /* * We are in the enqueue path with @rq locked and pinned, and thus can't * double lock a remote rq and enqueue to its local DSQ. For @@ -1391,6 +1456,21 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); + /* + * Mark that ops.enqueue() is being called for this task. This + * indicates the task is entering the BPF scheduler's data + * structures (QUEUED state). + * + * However, if the task was already marked as in BPF custody by + * mark_direct_dispatch() (ops.select_cpu() direct dispatch to + * non-local DSQ), don't clear that - keep the flag set so + * ops.dequeue() will be called when appropriate. + * + * Only track this flag if ops.dequeue() is implemented. + */ + if (SCX_HAS_OP(sch, dequeue)) + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) switch (opss & SCX_OPSS_STATE_MASK) { case SCX_OPSS_NONE: + /* + * Task is not in BPF data structures (either dispatched to + * a DSQ or running). Only call ops.dequeue() if the task + * is still in BPF scheduler's custody + * (%SCX_TASK_OPS_ENQUEUED is set). + * + * If the task has already been dispatched to a local DSQ + * (left BPF custody), the flag will be clear and we skip + * ops.dequeue() + * + * If this is a property change (not sleep/core-sched) and + * the task is still in BPF custody, set the + * %SCX_DEQ_SCHED_CHANGE flag. 
+ */ + if (SCX_HAS_OP(sch, dequeue) && + p->scx.flags & SCX_TASK_OPS_ENQUEUED) { + u64 flags = deq_flags; + + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) + flags |= SCX_DEQ_SCHED_CHANGE; + + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + } break; case SCX_OPSS_QUEUEING: /* @@ -1531,9 +1635,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) */ BUG(); case SCX_OPSS_QUEUED: - if (SCX_HAS_OP(sch, dequeue)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, - p, deq_flags); + /* + * Task is still on the BPF scheduler (not dispatched yet). + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE + * only for property changes, not for core-sched picks or + * sleep. + * + * Clear the flag after calling ops.dequeue(): the task is + * leaving BPF scheduler's custody. + */ + if (SCX_HAS_OP(sch, dequeue)) { + u64 flags = deq_flags; + + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) + flags |= SCX_DEQ_SCHED_CHANGE; + + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + } if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -1630,6 +1749,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, struct scx_dispatch_q *src_dsq, struct rq *dst_rq) { + struct scx_sched *sch = scx_root; struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq; /* @dsq is locked and @p is on @dst_rq */ @@ -1638,6 +1758,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(p->scx.holding_cpu >= 0); + /* + * Task is moving from a non-local DSQ to a local DSQ. Call + * ops.dequeue() if the task was in BPF custody. 
+ */ + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0); + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + } + if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT)) list_add(&p->scx.dsq_list.node, &dst_dsq->list); else @@ -2107,6 +2236,24 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); + /* + * Direct dispatch to local DSQs: call ops.dequeue() if task was in + * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag. + * + * Dispatch to non-local DSQs: task is in BPF scheduler's custody. + * Mark it so ops.dequeue() will be called when it leaves. + */ + if (SCX_HAS_OP(sch, dequeue)) { + if (!is_local_dsq(dsq_id)) { + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + } else { + if (p->scx.flags & SCX_TASK_OPS_ENQUEUED) + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0); + + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + } + } + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); if (dsq->id == SCX_DSQ_LOCAL) @@ -2894,6 +3041,14 @@ static void scx_enable_task(struct task_struct *p) lockdep_assert_rq_held(rq); + /* + * Clear enqueue/dequeue tracking flags when enabling the task. + * This ensures a clean state when the task enters SCX. Only needed + * if ops.dequeue() is implemented. + */ + if (SCX_HAS_OP(sch, dequeue)) + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; + /* * Set the weight before calling ops.enable() so that the scheduler * doesn't see a stale value if they inspect the task struct. @@ -2925,6 +3080,13 @@ static void scx_disable_task(struct task_struct *p) if (SCX_HAS_OP(sch, disable)) SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p); scx_set_task_state(p, SCX_TASK_READY); + + /* + * Clear enqueue/dequeue tracking flags when disabling the task. + * Only needed if ops.dequeue() is implemented. 
+ */ + if (SCX_HAS_OP(sch, dequeue)) + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; } static void scx_exit_task(struct task_struct *p) diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 386c677e4c9a0..befa9a5d6e53f 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -982,6 +982,13 @@ enum scx_deq_flags { * it hasn't been dispatched yet. Dequeue from the BPF side. */ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, + + /* + * The task is being dequeued due to a property change (e.g., + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), + * etc.). + */ + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, }; enum scx_pick_idle_cpu_flags { diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h index c2c33df9292c2..dcc945304760f 100644 --- a/tools/sched_ext/include/scx/enum_defs.autogen.h +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h @@ -21,6 +21,7 @@ #define HAVE_SCX_CPU_PREEMPT_UNKNOWN #define HAVE_SCX_DEQ_SLEEP #define HAVE_SCX_DEQ_CORE_SCHED_EXEC +#define HAVE_SCX_DEQ_SCHED_CHANGE #define HAVE_SCX_DSQ_FLAG_BUILTIN #define HAVE_SCX_DSQ_FLAG_LOCAL_ON #define HAVE_SCX_DSQ_INVALID diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h index 2f8002bcc19ad..5da50f9376844 100644 --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h index fedec938584be..fc9a7a4d9dea5 100644 --- a/tools/sched_ext/include/scx/enums.autogen.h +++ b/tools/sched_ext/include/scx/enums.autogen.h @@ -46,4 +46,5 @@ SCX_ENUM_SET(skel, 
scx_enq_flags, SCX_ENQ_LAST); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ } while (0) -- 2.52.0 ^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-01 9:08 ` [PATCH 1/2] " Andrea Righi @ 2026-02-01 22:47 ` Christian Loehle 2026-02-02 7:45 ` Andrea Righi 2026-02-02 11:56 ` Kuba Piecuch 1 sibling, 1 reply; 81+ messages in thread From: Christian Loehle @ 2026-02-01 22:47 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On 2/1/26 09:08, Andrea Righi wrote: > Currently, ops.dequeue() is only invoked when the sched_ext core knows > that a task resides in BPF-managed data structures, which causes it to > miss scheduling property change events. In addition, ops.dequeue() > callbacks are completely skipped when tasks are dispatched to non-local > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably > track task state. > > Fix this by guaranteeing that each task entering the BPF scheduler's > custody triggers exactly one ops.dequeue() call when it leaves that > custody, whether the exit is due to a dispatch (regular or via a core > scheduling pick) or to a scheduling property change (e.g. > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA > balancing, etc.). > > BPF scheduler custody concept: a task is considered to be in "BPF > scheduler's custody" when it has been queued in BPF-managed data > structures and the BPF scheduler is responsible for its lifecycle. > Custody ends when the task is dispatched to a local DSQ, selected by > core scheduling, or removed due to a property change. > > Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or > %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its > custody. As a result, ops.dequeue() is not invoked for these tasks. > > To identify dequeues triggered by scheduling property changes, introduce > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, > the dequeue was caused by a scheduling property change. 
> > New ops.dequeue() semantics: > - ops.dequeue() is invoked exactly once when the task leaves the BPF > scheduler's custody, in one of the following cases: > a) regular dispatch: task was dispatched to a non-local DSQ (global > or user DSQ), ops.dequeue() called without any special flags set > b) core scheduling dispatch: core-sched picks task before dispatch, > dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set > c) property change: task properties modified before dispatch, > dequeue called with %SCX_DEQ_SCHED_CHANGE flag set > > This allows BPF schedulers to: > - reliably track task ownership and lifecycle, > - maintain accurate accounting of managed tasks, > - update internal state when tasks change properties. > So I have finally gotten around updating scx_storm to the new semantics, see: https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics I don't think the new ops.dequeue() are enough to make inserts to local-on from anywhere safe, because it's still racing with dequeue from another CPU? 
Furthermore I can reproduce the following with this patch applied quite easily with something like hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm [ 44.356878] sched_ext: BPF scheduler "simple" enabled [ 59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space) [ 85.366747] sched_ext: BPF scheduler "storm" enabled [ 85.371324] ------------[ cut here ]------------ [ 85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111 [ 85.373392] Modules linked in: qrtr [ 85.380088] ------------[ cut here ]------------ [ 85.380719] ------------[ cut here ]------------ [ 85.380722] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#10: kworker/u48:1/82 [ 85.380728] Modules linked in: qrtr 8021q garp mrp stp llc binfmt_misc sm3_ce r8169 cdns3_pci_wrap nf_tables nfnetlink fuse dm_mod ipv6 [ 85.380745] CPU: 10 UID: 0 PID: 82 Comm: kworker/u48:1 Tainted: G S 6.19.0-rc7-cix-build+ #256 PREEMPT [ 85.380749] Tainted: [S]=CPU_OUT_OF_SPEC [ 85.380750] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. 
Radxa Orion O6/Radxa Orion O6, BIOS 1.1.0-1 2025-12-25T02:55:53+00:00 [ 85.380754] Workqueue: 0x0 (events_unbound) [ 85.380760] Sched_ext: storm (enabled+all), task: runnable_at=+0ms [ 85.380762] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 85.380764] pc : update_locked_rq+0x64/0x6c [ 85.380767] lr : update_locked_rq+0x60/0x6c [ 85.380769] sp : ffff8000803a3bd0 [ 85.380770] x29: ffff8000803a3bd0 x28: fffffdffbf622dc0 x27: ffff0000911e5040 [ 85.380773] x26: 0000000000000000 x25: ffffd204426cad80 x24: ffffd20442ba5bb8 [ 85.380776] x23: c00000000000000a x22: 0000000000000000 x21: ffffd20442ba4830 [ 85.380778] x20: ffff00009af0b000 x19: ffff0001fef2ed80 x18: 0000000000000000 [ 85.380781] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaaadd996940 [ 85.380783] x14: 0000000000000000 x13: 00000000000a0000 x12: 0000000000000000 [ 85.380786] x11: 0000000000000040 x10: ffffd204402e7ca0 x9 : ffffd2044324b000 [ 85.380788] x8 : ffff0000810e0000 x7 : 0000d00202cc2dc0 x6 : 0000000000000050 [ 85.380790] x5 : ffffd204426b5648 x4 : fffffdffbf622dc0 x3 : ffff0000810e0000 [ 85.380793] x2 : 0000000000000002 x1 : ffff2dfdbc960000 x0 : 0000000000000000 [ 85.380795] Call trace: [ 85.380796] update_locked_rq+0x64/0x6c (P) [ 85.380799] flush_dispatch_buf+0x2a8/0x2dc [ 85.380801] pick_task_scx+0x2b0/0x6d4 [ 85.380804] __schedule+0x62c/0x1060 [ 85.380811] schedule+0x48/0x15c [ 85.380813] worker_thread+0xdc/0x358 [ 85.380824] kthread+0x134/0x1fc [ 85.380831] ret_from_fork+0x10/0x20 [ 85.380839] irq event stamp: 34386 [ 85.380840] hardirqs last enabled at (34385): [<ffffd20441511408>] _raw_spin_unlock_irq+0x30/0x6c [ 85.380850] hardirqs last disabled at (34386): [<ffffd20441507100>] __schedule+0x510/0x1060 [ 85.380852] softirqs last enabled at (34014): [<ffffd204400c7280>] handle_softirqs+0x514/0x52c [ 85.380865] softirqs last disabled at (34007): [<ffffd204400105c4>] __do_softirq+0x14/0x20 [ 85.380867] ---[ end trace 0000000000000000 ]--- [ 85.380969] ------------[ 
cut here ]------------ [ 85.380970] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#10: kworker/u48:1/82 [ 85.380974] Modules linked in: qrtr 8021q garp mrp stp llc binfmt_misc sm3_ce r8169 cdns3_pci_wrap nf_tables nfnetlink fuse dm_mod ipv6 [ 85.380984] CPU: 10 UID: 0 PID: 82 Comm: kworker/u48:1 Tainted: G S W 6.19.0-rc7-cix-build+ #256 PREEMPT [ 85.380987] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN [ 85.380988] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.1.0-1 2025-12-25T02:55:53+00:00 [ 85.380990] Workqueue: 0x0 (events_unbound) [ 85.380993] Sched_ext: storm (enabled+all), task: runnable_at=+0ms [ 85.380994] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 85.380996] pc : update_locked_rq+0x64/0x6c [ 85.380997] lr : update_locked_rq+0x60/0x6c [ 85.380999] sp : ffff8000803a3bd0 [ 85.381000] x29: ffff8000803a3bd0 x28: fffffdffbf622dc0 x27: ffff00009151b580 [ 85.381002] x26: 0000000000000000 x25: ffffd204426cad80 x24: ffffd20442ba5bb8 [ 85.381005] x23: c00000000000000a x22: 0000000000000000 x21: ffffd20442ba4830 [ 85.381007] x20: ffff00009af0b000 x19: ffff0001fef52d80 x18: 0000000000000000 [ 85.381009] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaaae6917960 [ 85.381012] x14: 0000000000000000 x13: 00000000000a0000 x12: 0000000000000000 [ 85.381014] x11: 0000000000000040 x10: ffffd204402e7ca0 x9 : ffffd2044324b000 [ 85.381016] x8 : ffff0000810e0000 x7 : 0000d00202cc2dc0 x6 : 0000000000000050 [ 85.381019] x5 : ffffd204426b5648 x4 : fffffdffbf622dc0 x3 : ffff0000810e0000 [ 85.381021] x2 : 0000000000000002 x1 : ffff2dfdbc960000 x0 : 0000000000000000 [ 85.381023] Call trace: [ 85.381024] update_locked_rq+0x64/0x6c (P) [ 85.381026] flush_dispatch_buf+0x2a8/0x2dc [ 85.381028] pick_task_scx+0x2b0/0x6d4 [ 85.381030] __schedule+0x62c/0x1060 [ 85.381032] schedule+0x48/0x15c [ 85.381034] worker_thread+0xdc/0x358 [ 85.381036] kthread+0x134/0x1fc [ 85.381039] ret_from_fork+0x10/0x20 [ 
85.381041] irq event stamp: 34394 [ 85.381042] hardirqs last enabled at (34393): [<ffffd20441511408>] _raw_spin_unlock_irq+0x30/0x6c [ 85.381044] hardirqs last disabled at (34394): [<ffffd20441507100>] __schedule+0x510/0x1060 [ 85.381046] softirqs last enabled at (34014): [<ffffd204400c7280>] handle_softirqs+0x514/0x52c [ 85.381049] softirqs last disabled at (34007): [<ffffd204400105c4>] __do_softirq+0x14/0x20 [ 85.381050] ---[ end trace 0000000000000000 ]--- [ 85.381199] ------------[ cut here ]------------ [ 85.381201] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#10: kworker/u48:1/82 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-01 22:47 ` Christian Loehle @ 2026-02-02 7:45 ` Andrea Righi 2026-02-02 9:26 ` Andrea Righi ` (2 more replies) 0 siblings, 3 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-02 7:45 UTC (permalink / raw) To: Christian Loehle Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hi Christian, On Sun, Feb 01, 2026 at 10:47:22PM +0000, Christian Loehle wrote: > On 2/1/26 09:08, Andrea Righi wrote: > > Currently, ops.dequeue() is only invoked when the sched_ext core knows > > that a task resides in BPF-managed data structures, which causes it to > > miss scheduling property change events. In addition, ops.dequeue() > > callbacks are completely skipped when tasks are dispatched to non-local > > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably > > track task state. > > > > Fix this by guaranteeing that each task entering the BPF scheduler's > > custody triggers exactly one ops.dequeue() call when it leaves that > > custody, whether the exit is due to a dispatch (regular or via a core > > scheduling pick) or to a scheduling property change (e.g. > > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA > > balancing, etc.). > > > > BPF scheduler custody concept: a task is considered to be in "BPF > > scheduler's custody" when it has been queued in BPF-managed data > > structures and the BPF scheduler is responsible for its lifecycle. > > Custody ends when the task is dispatched to a local DSQ, selected by > > core scheduling, or removed due to a property change. > > > > Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or > > %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its > > custody. As a result, ops.dequeue() is not invoked for these tasks. 
> > > > To identify dequeues triggered by scheduling property changes, introduce > > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, > > the dequeue was caused by a scheduling property change. > > > > New ops.dequeue() semantics: > > - ops.dequeue() is invoked exactly once when the task leaves the BPF > > scheduler's custody, in one of the following cases: > > a) regular dispatch: task was dispatched to a non-local DSQ (global > > or user DSQ), ops.dequeue() called without any special flags set > > b) core scheduling dispatch: core-sched picks task before dispatch, > > dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set > > c) property change: task properties modified before dispatch, > > dequeue called with %SCX_DEQ_SCHED_CHANGE flag set > > > > This allows BPF schedulers to: > > - reliably track task ownership and lifecycle, > > - maintain accurate accounting of managed tasks, > > - update internal state when tasks change properties. > > > > So I have finally gotten around updating scx_storm to the new semantics, > see: > https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics > > I don't think the new ops.dequeue() are enough to make inserts to local-on > from anywhere safe, because it's still racing with dequeue from another CPU? Yeah, with this patch set BPF schedulers get proper ops.dequeue() callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from ops.dispatch(). When task properties change between scx_bpf_dsq_insert() and the actual dispatch, task_can_run_on_remote_rq() can still trigger a fatal scx_error(). The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the property change, so it can't prevent already-queued dispatches from failing. The race window is between ops.dispatch() returning and dispatch_to_local_dsq() executing. We can address this in a separate patch set. One thing at a time. 
:) > > Furthermore I can reproduce the following with this patch applied quite easily > with something like > > hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm > > [ 44.356878] sched_ext: BPF scheduler "simple" enabled > [ 59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space) > [ 85.366747] sched_ext: BPF scheduler "storm" enabled > [ 85.371324] ------------[ cut here ]------------ > [ 85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111 Ah yes! I think I see it, can you try this on top? Thanks, -Andrea kernel/sched/ext.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 6d6f1253039d8..d8fed4a49195d 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, p->scx.flags |= SCX_TASK_OPS_ENQUEUED; } else { if (p->scx.flags & SCX_TASK_OPS_ENQUEUED) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0); + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; } ^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  7:45       ` Andrea Righi
@ 2026-02-02  9:26         ` Andrea Righi
  2026-02-02 10:02           ` Christian Loehle
  2026-02-02 10:09           ` Christian Loehle
  2026-02-02 13:59           ` Kuba Piecuch
  2 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-02  9:26 UTC (permalink / raw)
To: Christian Loehle
Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
...
> > So I have finally gotten around updating scx_storm to the new semantics,
> > see:
> > https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
> > 
> > I don't think the new ops.dequeue() are enough to make inserts to local-on
> > from anywhere safe, because it's still racing with dequeue from another CPU?
> 
> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> ops.dispatch().
> 
> When task properties change between scx_bpf_dsq_insert() and the actual
> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> scx_error().
> 
> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the
> property change, so it can't prevent already-queued dispatches from
> failing. The race window is between ops.dispatch() returning and
> dispatch_to_local_dsq() executing.
> 
> We can address this in a separate patch set. One thing at a time. :)

Thinking more on this, the problem is that we're passing enforce=true to
task_can_run_on_remote_rq(), triggering a critical failure - scx_error().
There's a logic in task_can_run_on_remote_rq() to fallback to the global
DSQ, that doesn't happen if we pass enforce=true, due to scx_error().

However, instead of the global DSQ fallback, I was wondering if it'd be
better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
target local DSQ isn't valid anymore when the dispatch is finalized.

In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
affinity / migration disabled changes) and the BPF scheduler can handle
that in another ops.enqueue().

What do you think?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
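The fallback proposed above can be sketched as a tiny userspace model. Everything here is hypothetical: `MODEL_ENQ_REENQ` and `finalize_local_on()` are illustrative stand-ins, not kernel symbols, and the real SCX_ENQ_REENQ decision would live in the sched_ext core at dispatch-finalization time.

```c
/*
 * Hypothetical model of the proposed SCX_DSQ_LOCAL_ON fallback: if the
 * target CPU chosen at ops.dispatch() time is no longer valid when the
 * dispatch is finalized (concurrent affinity change, migration disabled),
 * re-enqueue the task instead of raising a fatal scx_error().
 *
 * Names and flag values are made up for illustration only.
 */
#include <stdbool.h>

#define MODEL_ENQ_REENQ (1ULL << 0)	/* stand-in for SCX_ENQ_REENQ */

/*
 * Returns the enqueue flags the core would pass to a follow-up
 * ops.enqueue(), or 0 when the dispatch to the local DSQ can proceed.
 */
unsigned long long finalize_local_on(bool cpu_still_valid)
{
	return cpu_still_valid ? 0 : MODEL_ENQ_REENQ;
}
```

The point of the design is visible in the model: an invalid target CPU becomes a recoverable re-enqueue event that the BPF scheduler can observe, rather than a fatal error.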
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  9:26         ` Andrea Righi
@ 2026-02-02 10:02           ` Christian Loehle
  2026-02-02 15:32             ` Andrea Righi
  0 siblings, 1 reply; 81+ messages in thread
From: Christian Loehle @ 2026-02-02 10:02 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On 2/2/26 09:26, Andrea Righi wrote:
> On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
> ...
>>> So I have finally gotten around updating scx_storm to the new semantics,
>>> see:
>>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
>>>
>>> I don't think the new ops.dequeue() are enough to make inserts to local-on
>>> from anywhere safe, because it's still racing with dequeue from another CPU?
>>
>> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
>> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
>> ops.dispatch().
>>
>> When task properties change between scx_bpf_dsq_insert() and the actual
>> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
>> scx_error().
>>
>> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the
>> property change, so it can't prevent already-queued dispatches from
>> failing. The race window is between ops.dispatch() returning and
>> dispatch_to_local_dsq() executing.
>>
>> We can address this in a separate patch set. One thing at a time. :)
> 
> Thinking more on this, the problem is that we're passing enforce=true to
> task_can_run_on_remote_rq(), triggering a critical failure - scx_error().
> There's a logic in task_can_run_on_remote_rq() to fallback to the global
> DSQ, that doesn't happen if we pass enforce=true, due to scx_error().
> 
> However, instead of the global DSQ fallback, I was wondering if it'd be
> better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
> target local DSQ isn't valid anymore when the dispatch is finalized.
> 
> In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
> trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
> affinity / migration disabled changes) and the BPF scheduler can handle
> that in another ops.enqueue().
> 
> What do you think?

I think that's a lot more versatile for the BPF scheduler than using the
global DSQ as fallback in that case, so yeah I'm all for it!

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 10:02           ` Christian Loehle
@ 2026-02-02 15:32             ` Andrea Righi
  0 siblings, 0 replies; 81+ messages in thread
From: Andrea Righi @ 2026-02-02 15:32 UTC (permalink / raw)
To: Christian Loehle
Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 02, 2026 at 10:02:30AM +0000, Christian Loehle wrote:
> On 2/2/26 09:26, Andrea Righi wrote:
> > On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
> > ...
> >>> So I have finally gotten around updating scx_storm to the new semantics,
> >>> see:
> >>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
> >>>
> >>> I don't think the new ops.dequeue() are enough to make inserts to local-on
> >>> from anywhere safe, because it's still racing with dequeue from another CPU?
> >>
> >> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> >> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> >> ops.dispatch().
> >>
> >> When task properties change between scx_bpf_dsq_insert() and the actual
> >> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> >> scx_error().
> >>
> >> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the
> >> property change, so it can't prevent already-queued dispatches from
> >> failing. The race window is between ops.dispatch() returning and
> >> dispatch_to_local_dsq() executing.
> >>
> >> We can address this in a separate patch set. One thing at a time. :)
> > 
> > Thinking more on this, the problem is that we're passing enforce=true to
> > task_can_run_on_remote_rq(), triggering a critical failure - scx_error().
> > There's a logic in task_can_run_on_remote_rq() to fallback to the global
> > DSQ, that doesn't happen if we pass enforce=true, due to scx_error().
> > 
> > However, instead of the global DSQ fallback, I was wondering if it'd be
> > better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
> > target local DSQ isn't valid anymore when the dispatch is finalized.
> > 
> > In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
> > trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
> > affinity / migration disabled changes) and the BPF scheduler can handle
> > that in another ops.enqueue().
> > 
> > What do you think?
> 
> I think that's a lot more versatile for the BPF scheduler than using the
> global DSQ as fallback in that case, so yeah I'm all for it!

Ack, I already have a working patch to do this, I'll post it as a
separate patch set.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  7:45       ` Andrea Righi
  2026-02-02  9:26         ` Andrea Righi
@ 2026-02-02 10:09         ` Christian Loehle
  2026-02-02 13:59         ` Kuba Piecuch
  2 siblings, 0 replies; 81+ messages in thread
From: Christian Loehle @ 2026-02-02 10:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On 2/2/26 07:45, Andrea Righi wrote:
> Hi Christian,
> 
> On Sun, Feb 01, 2026 at 10:47:22PM +0000, Christian Loehle wrote:
>> On 2/1/26 09:08, Andrea Righi wrote:
>>> Currently, ops.dequeue() is only invoked when the sched_ext core knows
>>> that a task resides in BPF-managed data structures, which causes it to
>>> miss scheduling property change events. In addition, ops.dequeue()
>>> callbacks are completely skipped when tasks are dispatched to non-local
>>> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
>>> track task state.
>>>
>>> Fix this by guaranteeing that each task entering the BPF scheduler's
>>> custody triggers exactly one ops.dequeue() call when it leaves that
>>> custody, whether the exit is due to a dispatch (regular or via a core
>>> scheduling pick) or to a scheduling property change (e.g.
>>> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
>>> balancing, etc.).
>>>
>>> BPF scheduler custody concept: a task is considered to be in "BPF
>>> scheduler's custody" when it has been queued in BPF-managed data
>>> structures and the BPF scheduler is responsible for its lifecycle.
>>> Custody ends when the task is dispatched to a local DSQ, selected by
>>> core scheduling, or removed due to a property change.
>>>
>>> Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
>>> %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
>>> custody. As a result, ops.dequeue() is not invoked for these tasks.
>>>
>>> To identify dequeues triggered by scheduling property changes, introduce
>>> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
>>> the dequeue was caused by a scheduling property change.
>>>
>>> New ops.dequeue() semantics:
>>>  - ops.dequeue() is invoked exactly once when the task leaves the BPF
>>>    scheduler's custody, in one of the following cases:
>>>    a) regular dispatch: task was dispatched to a non-local DSQ (global
>>>       or user DSQ), ops.dequeue() called without any special flags set
>>>    b) core scheduling dispatch: core-sched picks task before dispatch,
>>>       dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set
>>>    c) property change: task properties modified before dispatch,
>>>       dequeue called with %SCX_DEQ_SCHED_CHANGE flag set
>>>
>>> This allows BPF schedulers to:
>>>  - reliably track task ownership and lifecycle,
>>>  - maintain accurate accounting of managed tasks,
>>>  - update internal state when tasks change properties.
>>>
>>
>> So I have finally gotten around updating scx_storm to the new semantics,
>> see:
>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
>>
>> I don't think the new ops.dequeue() are enough to make inserts to local-on
>> from anywhere safe, because it's still racing with dequeue from another CPU?
> 
> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> ops.dispatch().
> 
> When task properties change between scx_bpf_dsq_insert() and the actual
> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> scx_error().
> 
> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the
> property change, so it can't prevent already-queued dispatches from
> failing. The race window is between ops.dispatch() returning and
> dispatch_to_local_dsq() executing.
> 
> We can address this in a separate patch set. One thing at a time. :)
> 
>>
>> Furthermore I can reproduce the following with this patch applied quite easily
>> with something like
>>
>> hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm
>>
>> [   44.356878] sched_ext: BPF scheduler "simple" enabled
>> [   59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)
>> [   85.366747] sched_ext: BPF scheduler "storm" enabled
>> [   85.371324] ------------[ cut here ]------------
>> [   85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111
> 
> Ah yes! I think I see it, can you try this on top?
> 
> Thanks,
> -Andrea
> 
>  kernel/sched/ext.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6d6f1253039d8..d8fed4a49195d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>  		} else {
>  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> -				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> 
>  			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>  		}

Yup, that fixes it!

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  7:45       ` Andrea Righi
  2026-02-02  9:26         ` Andrea Righi
  2026-02-02 10:09         ` Christian Loehle
@ 2026-02-02 13:59         ` Kuba Piecuch
  2026-02-04  9:36           ` Andrea Righi
  2 siblings, 1 reply; 81+ messages in thread
From: Kuba Piecuch @ 2026-02-02 13:59 UTC (permalink / raw)
To: Andrea Righi, Christian Loehle
Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hi Andrea,

On Mon Feb 2, 2026 at 7:45 AM UTC, Andrea Righi wrote:
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6d6f1253039d8..d8fed4a49195d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>  		} else {
>  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> -				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> 
>  			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>  		}

This looks risky from a locking perspective. Are we relying on
SCX_OPSS_DISPATCHING to protect against racing dequeues? If so, it might
be worth calling out in a comment.

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 13:59         ` Kuba Piecuch
@ 2026-02-04  9:36           ` Andrea Righi
  2026-02-04  9:51             ` Kuba Piecuch
  0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-04  9:36 UTC (permalink / raw)
To: Kuba Piecuch
Cc: Christian Loehle, Tejun Heo, David Vernet, Changwoo Min,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hi Kuba,

sorry for the late response.

On Mon, Feb 02, 2026 at 01:59:24PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> On Mon Feb 2, 2026 at 7:45 AM UTC, Andrea Righi wrote:
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 6d6f1253039d8..d8fed4a49195d 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> >  		} else {
> >  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > -				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > 
> >  			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> >  		}
> 
> This looks risky from a locking perspective. Are we relying on
> SCX_OPSS_DISPATCHING to protect against racing dequeues? If so, it might
> be worth calling out in a comment.

You're right, we're relying on SCX_OPSS_DISPATCHING to protect against
racing dequeues and this definitely deserves a comment.

How about something like the following?

Thanks,
-Andrea

---
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 292adf10fee1b..b189339e74101 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2260,6 +2260,15 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 		if (!is_terminal_dsq(dsq_id)) {
 			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
 		} else {
+			/*
+			 * Locking: we're holding the @rq lock (the
+			 * dispatch CPU's rq), but not necessarily
+			 * task_rq(p), since @p may be from a remote CPU.
+			 *
+			 * This is safe because SCX_OPSS_DISPATCHING state
+			 * prevents racing dequeues, any concurrent
+			 * ops_dequeue() will wait for this state to clear.
+			 */
 			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
 				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);

^ permalink raw reply related	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04  9:36           ` Andrea Righi
@ 2026-02-04  9:51             ` Kuba Piecuch
  0 siblings, 0 replies; 81+ messages in thread
From: Kuba Piecuch @ 2026-02-04  9:51 UTC (permalink / raw)
To: Andrea Righi, Kuba Piecuch
Cc: Christian Loehle, Tejun Heo, David Vernet, Changwoo Min,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Wed Feb 4, 2026 at 9:36 AM UTC, Andrea Righi wrote:
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 292adf10fee1b..b189339e74101 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2260,6 +2260,15 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  		if (!is_terminal_dsq(dsq_id)) {
>  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>  		} else {
> +			/*
> +			 * Locking: we're holding the @rq lock (the
> +			 * dispatch CPU's rq), but not necessarily
> +			 * task_rq(p), since @p may be from a remote CPU.
> +			 *
> +			 * This is safe because SCX_OPSS_DISPATCHING state
> +			 * prevents racing dequeues, any concurrent
> +			 * ops_dequeue() will wait for this state to clear.
> +			 */
>  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
>  				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);

Looks good, thanks :)

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
  2026-02-01 22:47   ` Christian Loehle
@ 2026-02-02 11:56   ` Kuba Piecuch
  2026-02-04 10:11     ` Andrea Righi
  1 sibling, 1 reply; 81+ messages in thread
From: Kuba Piecuch @ 2026-02-02 11:56 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Hi Andrea,

Looks good overall, but we need to settle on the global DSQ semantics, plus
some edge cases that need clearing up.

On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..6d9e82e6ca9d4 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> +     entirely and goes straight to the CPU's local run queue. The task
> +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> +     the task enters the BPF scheduler's custody. When the task later
> +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> +     called exactly once.
> +
> +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> +     custody, ``ops.dequeue()`` will be called when it leaves.
> +
> +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> +   scheduler no longer needs to track it.
> +
> +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> +   get ``ops.dequeue()`` when it leaves. This provides the performance
> +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> +   correct state tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch workflow**: when the task is dispatched from a
> +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> +      ``ops.dequeue()`` is triggered without any special flags.

Maybe add a note that this can happen asynchronously, without the BPF
scheduler explicitly dispatching the task to a local DSQ, when the task
is on a global DSQ? Or maybe make that case into a separate dequeue reason
with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..0d003d2845393 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */

Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
"ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
as a whole.

> @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody
> +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> +		 *
> +		 * If the task has already been dispatched to a local DSQ
> +		 * (left BPF custody), the flag will be clear and we skip
> +		 * ops.dequeue()
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;

I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
tasks being picked from a global DSQ being migrated from a remote rq to the
local rq, which, while technically correct since the task is migrating rqs,
may be confusing, since it fits two cases in the documentation:

* Since the task is leaving BPF custody for execution, ops.dequeue() should be
  called without any special flags.
* Since the task is being migrated between rqs, ops.dequeue() should be called
  with SCX_DEQ_SCHED_CHANGE.

> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 81+ messages in thread
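The flag derivation in the quoted ops_dequeue() hunk can be modeled in plain C. The MODEL_* values below are made up for illustration (the real flags are kernel enum values with different numeric encodings); only the selection logic mirrors the patch: a dequeue that is neither a sleep nor a core-sched pick is classified as a property change.

```c
/*
 * Model of the SCX_DEQ_SCHED_CHANGE classification from the patch.
 * Flag values are illustrative stand-ins, not the kernel's.
 */
#define MODEL_DEQUEUE_SLEEP		(1ULL << 0)	/* stand-in for DEQUEUE_SLEEP */
#define MODEL_DEQ_CORE_SCHED_EXEC	(1ULL << 1)	/* stand-in for SCX_DEQ_CORE_SCHED_EXEC */
#define MODEL_DEQ_SCHED_CHANGE		(1ULL << 2)	/* stand-in for SCX_DEQ_SCHED_CHANGE */

/* Returns the flags the model would pass to ops.dequeue(). */
unsigned long long model_deq_flags(unsigned long long deq_flags)
{
	unsigned long long flags = deq_flags;

	/* Neither a sleep nor a core-sched pick: must be a property change. */
	if (!(deq_flags & (MODEL_DEQUEUE_SLEEP | MODEL_DEQ_CORE_SCHED_EXEC)))
		flags |= MODEL_DEQ_SCHED_CHANGE;

	return flags;
}
```

Note how the classification is by elimination, which is exactly what Kuba's comment probes: any dequeue that the core cannot attribute to sleep or core-sched gets the property-change flag.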
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 11:56   ` Kuba Piecuch
@ 2026-02-04 10:11     ` Andrea Righi
  2026-02-04 10:33       ` Kuba Piecuch
  0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-02-04 10:11 UTC (permalink / raw)
To: Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 02, 2026 at 11:56:43AM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> Looks good overall, but we need to settle on the global DSQ semantics, plus
> some edge cases that need clearing up.

On this one I think we settled on the assumption that SCX_DSQ_GLOBAL can be
considered a "terminal DSQ", so we won't trigger ops.dequeue().

> 
> On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..6d9e82e6ca9d4 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
> >  
> >     * Queue the task on the BPF side.
> >  
> > +   **Task State Tracking and ops.dequeue() Semantics**
> > +
> > +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> > +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> > +
> > +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> > +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> > +     entirely and goes straight to the CPU's local run queue. The task
> > +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> > +
> > +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> > +     the task enters the BPF scheduler's custody. When the task later
> > +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> > +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> > +     called exactly once.
> > +
> > +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> > +     custody, ``ops.dequeue()`` will be called when it leaves.
> > +
> > +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> > +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> > +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> > +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> > +   scheduler no longer needs to track it.
> > +
> > +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> > +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> > +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> > +   get ``ops.dequeue()`` when it leaves. This provides the performance
> > +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> > +   correct state tracking.
> > +
> > +   The dequeue can happen for different reasons, distinguished by flags:
> > +
> > +   1. **Regular dispatch workflow**: when the task is dispatched from a
> > +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> > +      ``ops.dequeue()`` is triggered without any special flags.
> 
> Maybe add a note that this can happen asynchronously, without the BPF
> scheduler explicitly dispatching the task to a local DSQ, when the task
> is on a global DSQ? Or maybe make that case into a separate dequeue reason
> with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

And I guess we don't need this if we consider SCX_DSQ_GLOBAL as a terminal
DSQ, because we won't trigger ops.dequeue().

> 
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..0d003d2845393 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */
> 
> Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
> "ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
> as a whole.

Ack. Will change that.

> 
> > @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * Task is not in BPF data structures (either dispatched to
> > +		 * a DSQ or running). Only call ops.dequeue() if the task
> > +		 * is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> > +		 *
> > +		 * If the task has already been dispatched to a local DSQ
> > +		 * (left BPF custody), the flag will be clear and we skip
> > +		 * ops.dequeue()
> > +		 *
> > +		 * If this is a property change (not sleep/core-sched) and
> > +		 * the task is still in BPF custody, set the
> > +		 * %SCX_DEQ_SCHED_CHANGE flag.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> 
> I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
> tasks being picked from a global DSQ being migrated from a remote rq to the
> local rq, which, while technically correct since the task is migrating rqs,
> may be confusing, since it fits two cases in the documentation:
> 
> * Since the task is leaving BPF custody for execution, ops.dequeue() should be
>   called without any special flags.
> * Since the task is being migrated between rqs, ops.dequeue() should be called
>   with SCX_DEQ_SCHED_CHANGE.

This should also be fixed with the new logic, because a task dispatched to
a global DSQ is considered outside of the BPF scheduler's custody, so
ops.dequeue() is not invoked at all.

I'll post a new patch set later today, so we can better discuss if all
these assumptions have been addressed properly. :)

> 
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> 
> Thanks,
> Kuba

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 10:11     ` Andrea Righi
@ 2026-02-04 10:33       ` Kuba Piecuch
  0 siblings, 0 replies; 81+ messages in thread
From: Kuba Piecuch @ 2026-02-04 10:33 UTC (permalink / raw)
To: Andrea Righi, Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed Feb 4, 2026 at 10:11 AM UTC, Andrea Righi wrote:
> On Mon, Feb 02, 2026 at 11:56:43AM +0000, Kuba Piecuch wrote:
>> Hi Andrea,
>>
>> Looks good overall, but we need to settle on the global DSQ semantics, plus
>> some edge cases that need clearing up.
>
> On this one I think we settled on the assumption that SCX_DSQ_GLOBAL can be
> considered a "terminal DSQ", so we won't trigger ops.dequeue().

Correct, I made this comment before we settled it.

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 81+ messages in thread
* [PATCHSET v3 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics
@ 2026-01-26 8:41 Andrea Righi
2026-01-26 8:41 ` [PATCH 1/2] " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-01-26 8:41 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext,
linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue(), ensuring that every
ops.enqueue() is balanced by a corresponding ops.dequeue() invocation. In
addition, ops.dequeue() is now properly invoked when tasks are removed from
the sched_ext class, such as on task property changes.
To distinguish a "regular" dequeue from a property change dequeue, a new
dequeue flag is introduced: %SCX_DEQ_SCHED_CHANGE. It is left unset for
regular dispatch dequeues and set for property change dequeues.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com/
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
Documentation/scheduler/sched-ext.rst | 33 ++++
include/linux/sched/ext.h | 11 ++
kernel/sched/ext.c | 89 +++++++++-
kernel/sched/ext_internal.h | 7 +
tools/sched_ext/include/scx/enum_defs.autogen.h | 2 +
tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 209 ++++++++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 182 +++++++++++++++++++++
10 files changed, 534 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply [flat|nested] 81+ messages in thread* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-26 8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi @ 2026-01-26 8:41 ` Andrea Righi 2026-01-27 16:38 ` Emil Tsalapatis ` (2 more replies) 0 siblings, 3 replies; 81+ messages in thread From: Andrea Righi @ 2026-01-26 8:41 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change scenarios. As a result, BPF schedulers cannot reliably track task state. In addition, some ops.dequeue() callbacks can be skipped (e.g., during direct dispatch), so ops.enqueue() calls are not always paired with a corresponding ops.dequeue(), potentially breaking accounting logic. Fix this by guaranteeing that every ops.enqueue() is matched with a corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE flag to distinguish dequeues triggered by scheduling property changes from those occurring in the normal dispatch/execution workflow. New semantics: 1. ops.enqueue() is called when a task enters the BPF scheduler 2. ops.dequeue() is called when the task leaves the BPF scheduler in the following cases: a) regular dispatch workflow: task dispatched to a DSQ, b) core scheduling pick: core-sched picks task before dispatch, c) property change: task properties modified. A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF schedulers to distinguish between: - normal dispatch/execution workflow (dispatch, core-sched pick), - property changes that require state updates (e.g., sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, CPU migrations, etc.). 
With this, BPF schedulers can: - reliably track task ownership and lifecycle, - maintain accurate accounting of enqueue/dequeue pairs, - distinguish between execution events and property changes, - update internal state appropriately for each dequeue type. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- Documentation/scheduler/sched-ext.rst | 33 +++++++ include/linux/sched/ext.h | 11 +++ kernel/sched/ext.c | 89 ++++++++++++++++++- kernel/sched/ext_internal.h | 7 ++ .../sched_ext/include/scx/enum_defs.autogen.h | 2 + .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + tools/sched_ext/include/scx/enums.autogen.h | 1 + 7 files changed, 142 insertions(+), 3 deletions(-) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index 404fe6126a769..ed6bf7d9e6e8c 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed. * Queue the task on the BPF side. + Once ``ops.enqueue()`` is called, the task enters the "enqueued state". + The task remains in this state until ``ops.dequeue()`` is called, which + happens in the following cases: + + 1. **Regular dispatch workflow**: when the task is successfully + dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()`` + is triggered immediately to notify the BPF scheduler. + + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and + core scheduling picks a task for execution before it has been + dispatched, ``ops.dequeue()`` is called with the + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. + + 3. 
**Scheduling property change**: when a task property changes (via + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, + priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called + with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. + + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, + regardless of whether the task is still on a BPF data structure, or it + has already been dispatched to a DSQ. This guarantees that every + ``ops.enqueue()`` will eventually be followed by a corresponding + ``ops.dequeue()``. + + This makes it reliable for BPF schedulers to track the enqueued state + and maintain accurate accounting. + + BPF schedulers can choose not to implement ``ops.dequeue()`` if they + don't need to track these transitions. The sched_ext core will safely + handle all dequeue operations regardless. + 3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. If there still isn't a task to run, ``ops.dispatch()`` is invoked which can use the following two @@ -319,6 +350,8 @@ by a sched_ext scheduler: /* Any usable CPU becomes available */ ops.dispatch(); /* Task is moved to a local DSQ */ + + ops.dequeue(); /* Exiting BPF scheduler */ } ops.running(); /* Task starts running on its assigned CPU */ while (task->scx.slice > 0 && task is runnable) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index bcb962d5ee7d8..59446cd0373fa 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -84,8 +84,19 @@ struct scx_dispatch_q { /* scx_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + /* + * Set when ops.enqueue() is called; used to determine if ops.dequeue() + * should be invoked when transitioning out of SCX_OPSS_NONE state. 
+ */ + SCX_TASK_OPS_ENQUEUED = 1 << 1, SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ + /* + * Set when ops.dequeue() is called after successful dispatch; used to + * distinguish dispatch dequeues from property change dequeues and + * prevent duplicate dequeue calls. + */ + SCX_TASK_DISPATCH_DEQUEUED = 1 << 4, SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ SCX_TASK_STATE_BITS = 2, diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index afe28c04d5aa7..18bca2b83f5c5 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, p->scx.ddsp_enq_flags |= enq_flags; + /* + * The task is about to be dispatched. If ops.enqueue() was called, + * notify the BPF scheduler by calling ops.dequeue(). + * + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. + * Mark that the dispatch dequeue has been called to distinguish + * from property change dequeues. + */ + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; + } + /* * We are in the enqueue path with @rq locked and pinned, and thus can't * double lock a remote rq and enqueue to its local DSQ. For @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); + /* + * Mark that ops.enqueue() is being called for this task. + * Clear the dispatch dequeue flag for the new enqueue cycle. + * Only track these flags if ops.dequeue() is implemented. 
+ */ + if (SCX_HAS_OP(sch, dequeue)) { + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED; + } + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) switch (opss & SCX_OPSS_STATE_MASK) { case SCX_OPSS_NONE: + if (SCX_HAS_OP(sch, dequeue) && + p->scx.flags & SCX_TASK_OPS_ENQUEUED) { + /* + * Task was already dispatched. Only call ops.dequeue() + * if it hasn't been called yet (check DISPATCH_DEQUEUED). + * This can happen when: + * 1. Core-sched picks a task that was dispatched + * 2. Property changes occur after dispatch + */ + if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) { + /* + * ops.dequeue() wasn't called during dispatch. + * This shouldn't normally happen, but call it now. + */ + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, + p, deq_flags); + } else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) { + /* + * This is a property change after + * dispatch. Call ops.dequeue() again with + * %SCX_DEQ_SCHED_CHANGE. + */ + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, + p, deq_flags | SCX_DEQ_SCHED_CHANGE); + } + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | + SCX_TASK_DISPATCH_DEQUEUED); + } break; case SCX_OPSS_QUEUEING: /* @@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) */ BUG(); case SCX_OPSS_QUEUED: - if (SCX_HAS_OP(sch, dequeue)) - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, - p, deq_flags); + /* + * Task is still on the BPF scheduler (not dispatched yet). + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE + * only for property changes, not for core-sched picks. + */ + if (SCX_HAS_OP(sch, dequeue)) { + u64 flags = deq_flags; + /* + * Add %SCX_DEQ_SCHED_CHANGE for property changes, + * but not for core-sched picks or sleep. 
+ */ + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) + flags |= SCX_DEQ_SCHED_CHANGE; + + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | + SCX_TASK_DISPATCH_DEQUEUED); + } if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); + /* + * The task is about to be dispatched. If ops.enqueue() was called, + * notify the BPF scheduler by calling ops.dequeue(). + * + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. + * Mark that the dispatch dequeue has been called to distinguish + * from property change dequeues. + */ + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { + struct rq *task_rq = task_rq(p); + + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0); + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; + } + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); if (dsq->id == SCX_DSQ_LOCAL) diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 386c677e4c9a0..befa9a5d6e53f 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -982,6 +982,13 @@ enum scx_deq_flags { * it hasn't been dispatched yet. Dequeue from the BPF side. */ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, + + /* + * The task is being dequeued due to a property change (e.g., + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), + * etc.). 
+ */ + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, }; enum scx_pick_idle_cpu_flags { diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h index c2c33df9292c2..8284f717ff05e 100644 --- a/tools/sched_ext/include/scx/enum_defs.autogen.h +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h @@ -21,6 +21,7 @@ #define HAVE_SCX_CPU_PREEMPT_UNKNOWN #define HAVE_SCX_DEQ_SLEEP #define HAVE_SCX_DEQ_CORE_SCHED_EXEC +#define HAVE_SCX_DEQ_SCHED_CHANGE #define HAVE_SCX_DSQ_FLAG_BUILTIN #define HAVE_SCX_DSQ_FLAG_LOCAL_ON #define HAVE_SCX_DSQ_INVALID @@ -48,6 +49,7 @@ #define HAVE_SCX_TASK_QUEUED #define HAVE_SCX_TASK_RESET_RUNNABLE_AT #define HAVE_SCX_TASK_DEQD_FOR_SLEEP +#define HAVE_SCX_TASK_DISPATCH_DEQUEUED #define HAVE_SCX_TASK_STATE_SHIFT #define HAVE_SCX_TASK_STATE_BITS #define HAVE_SCX_TASK_STATE_MASK diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h index 2f8002bcc19ad..5da50f9376844 100644 --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h index fedec938584be..fc9a7a4d9dea5 100644 --- a/tools/sched_ext/include/scx/enums.autogen.h +++ b/tools/sched_ext/include/scx/enums.autogen.h @@ -46,4 +46,5 @@ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ } while (0) -- 2.52.0 ^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-26 8:41 ` [PATCH 1/2] " Andrea Righi @ 2026-01-27 16:38 ` Emil Tsalapatis 2026-01-27 16:41 ` Kuba Piecuch 2026-01-28 21:21 ` Tejun Heo 2 siblings, 0 replies; 81+ messages in thread From: Emil Tsalapatis @ 2026-01-27 16:38 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel On Mon Jan 26, 2026 at 3:41 AM EST, Andrea Righi wrote: > Currently, ops.dequeue() is only invoked when the sched_ext core knows > that a task resides in BPF-managed data structures, which causes it to > miss scheduling property change scenarios. As a result, BPF schedulers > cannot reliably track task state. > > In addition, some ops.dequeue() callbacks can be skipped (e.g., during > direct dispatch), so ops.enqueue() calls are not always paired with a > corresponding ops.dequeue(), potentially breaking accounting logic. > > Fix this by guaranteeing that every ops.enqueue() is matched with a > corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE > flag to distinguish dequeues triggered by scheduling property changes > from those occurring in the normal dispatch/execution workflow. > > New semantics: > 1. ops.enqueue() is called when a task enters the BPF scheduler > 2. ops.dequeue() is called when the task leaves the BPF scheduler in > the following cases: > a) regular dispatch workflow: task dispatched to a DSQ, > b) core scheduling pick: core-sched picks task before dispatch, > c) property change: task properties modified. > > A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF > schedulers to distinguish between: > - normal dispatch/execution workflow (dispatch, core-sched pick), > - property changes that require state updates (e.g., > sched_setaffinity(), sched_setscheduler(), set_user_nice(), > NUMA balancing, CPU migrations, etc.). 
> > With this, BPF schedulers can: > - reliably track task ownership and lifecycle, > - maintain accurate accounting of enqueue/dequeue pairs, > - distinguish between execution events and property changes, > - update internal state appropriately for each dequeue type. > > Cc: Tejun Heo <tj@kernel.org> > Cc: Emil Tsalapatis <emil@etsalapatis.com> > Cc: Kuba Piecuch <jpiecuch@google.com> > Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Looks great overall. Following up on our off-list chat about whether SCX_TASK_DISPATCH_DEQUEUED is necessary: We need it for the new DEQ_STATE change flag so no need to consider removing it imo. > --- > Documentation/scheduler/sched-ext.rst | 33 +++++++ > include/linux/sched/ext.h | 11 +++ > kernel/sched/ext.c | 89 ++++++++++++++++++- > kernel/sched/ext_internal.h | 7 ++ > .../sched_ext/include/scx/enum_defs.autogen.h | 2 + > .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + > tools/sched_ext/include/scx/enums.autogen.h | 1 + > 7 files changed, 142 insertions(+), 3 deletions(-) > > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst > index 404fe6126a769..ed6bf7d9e6e8c 100644 > --- a/Documentation/scheduler/sched-ext.rst > +++ b/Documentation/scheduler/sched-ext.rst > @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed. > > * Queue the task on the BPF side. > > + Once ``ops.enqueue()`` is called, the task enters the "enqueued state". > + The task remains in this state until ``ops.dequeue()`` is called, which > + happens in the following cases: > + > + 1. **Regular dispatch workflow**: when the task is successfully > + dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()`` > + is triggered immediately to notify the BPF scheduler. > + > + 2. 
**Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and > + core scheduling picks a task for execution before it has been > + dispatched, ``ops.dequeue()`` is called with the > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. > + > + 3. **Scheduling property change**: when a task property changes (via > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, > + priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called > + with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. > + > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, > + regardless of whether the task is still on a BPF data structure, or it > + has already been dispatched to a DSQ. This guarantees that every > + ``ops.enqueue()`` will eventually be followed by a corresponding > + ``ops.dequeue()``. > + > + This makes it reliable for BPF schedulers to track the enqueued state > + and maintain accurate accounting. > + > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they > + don't need to track these transitions. The sched_ext core will safely > + handle all dequeue operations regardless. > + > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If > empty, it then looks at the global DSQ. 
If there still isn't a task to > run, ``ops.dispatch()`` is invoked which can use the following two > @@ -319,6 +350,8 @@ by a sched_ext scheduler: > /* Any usable CPU becomes available */ > > ops.dispatch(); /* Task is moved to a local DSQ */ > + > + ops.dequeue(); /* Exiting BPF scheduler */ > } > ops.running(); /* Task starts running on its assigned CPU */ > while (task->scx.slice > 0 && task is runnable) > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index bcb962d5ee7d8..59446cd0373fa 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -84,8 +84,19 @@ struct scx_dispatch_q { > /* scx_entity.flags */ > enum scx_ent_flags { > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > + /* > + * Set when ops.enqueue() is called; used to determine if ops.dequeue() > + * should be invoked when transitioning out of SCX_OPSS_NONE state. > + */ > + SCX_TASK_OPS_ENQUEUED = 1 << 1, > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ > + /* > + * Set when ops.dequeue() is called after successful dispatch; used to > + * distinguish dispatch dequeues from property change dequeues and > + * prevent duplicate dequeue calls. > + */ > + SCX_TASK_DISPATCH_DEQUEUED = 1 << 4, > > SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ > SCX_TASK_STATE_BITS = 2, > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index afe28c04d5aa7..18bca2b83f5c5 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, > > p->scx.ddsp_enq_flags |= enq_flags; > > + /* > + * The task is about to be dispatched. If ops.enqueue() was called, > + * notify the BPF scheduler by calling ops.dequeue(). > + * > + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property > + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. 
> + * Mark that the dispatch dequeue has been called to distinguish > + * from property change dequeues. > + */ > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); > + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; > + } > + > /* > * We are in the enqueue path with @rq locked and pinned, and thus can't > * double lock a remote rq and enqueue to its local DSQ. For > @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); > > + /* > + * Mark that ops.enqueue() is being called for this task. > + * Clear the dispatch dequeue flag for the new enqueue cycle. > + * Only track these flags if ops.dequeue() is implemented. > + */ > + if (SCX_HAS_OP(sch, dequeue)) { > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; > + p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED; > + } > + > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); > WARN_ON_ONCE(*ddsp_taskp); > *ddsp_taskp = p; > @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > > switch (opss & SCX_OPSS_STATE_MASK) { > case SCX_OPSS_NONE: > + if (SCX_HAS_OP(sch, dequeue) && > + p->scx.flags & SCX_TASK_OPS_ENQUEUED) { > + /* > + * Task was already dispatched. Only call ops.dequeue() > + * if it hasn't been called yet (check DISPATCH_DEQUEUED). > + * This can happen when: > + * 1. Core-sched picks a task that was dispatched > + * 2. Property changes occur after dispatch > + */ > + if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) { > + /* > + * ops.dequeue() wasn't called during dispatch. > + * This shouldn't normally happen, but call it now. 
> + */ > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > + p, deq_flags); > + } else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) { > + /* > + * This is a property change after > + * dispatch. Call ops.dequeue() again with > + * %SCX_DEQ_SCHED_CHANGE. > + */ > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > + p, deq_flags | SCX_DEQ_SCHED_CHANGE); > + } > + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | > + SCX_TASK_DISPATCH_DEQUEUED); > + } > break; > case SCX_OPSS_QUEUEING: > /* > @@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > */ > BUG(); > case SCX_OPSS_QUEUED: > - if (SCX_HAS_OP(sch, dequeue)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > - p, deq_flags); > + /* > + * Task is still on the BPF scheduler (not dispatched yet). > + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE > + * only for property changes, not for core-sched picks. > + */ > + if (SCX_HAS_OP(sch, dequeue)) { > + u64 flags = deq_flags; > + /* > + * Add %SCX_DEQ_SCHED_CHANGE for property changes, > + * but not for core-sched picks or sleep. > + */ > + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) > + flags |= SCX_DEQ_SCHED_CHANGE; > + > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); > + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | > + SCX_TASK_DISPATCH_DEQUEUED); > + } > > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, > SCX_OPSS_NONE)) > @@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, > > BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); > > + /* > + * The task is about to be dispatched. If ops.enqueue() was called, > + * notify the BPF scheduler by calling ops.dequeue(). > + * > + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property > + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. > + * Mark that the dispatch dequeue has been called to distinguish > + * from property change dequeues. 
> + */ > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { > + struct rq *task_rq = task_rq(p); > + > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0); > + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; > + } > + > dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); > > if (dsq->id == SCX_DSQ_LOCAL) > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 386c677e4c9a0..befa9a5d6e53f 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -982,6 +982,13 @@ enum scx_deq_flags { > * it hasn't been dispatched yet. Dequeue from the BPF side. > */ > SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, > + > + /* > + * The task is being dequeued due to a property change (e.g., > + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), > + * etc.). > + */ > + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, > }; > > enum scx_pick_idle_cpu_flags { > diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h > index c2c33df9292c2..8284f717ff05e 100644 > --- a/tools/sched_ext/include/scx/enum_defs.autogen.h > +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h > @@ -21,6 +21,7 @@ > #define HAVE_SCX_CPU_PREEMPT_UNKNOWN > #define HAVE_SCX_DEQ_SLEEP > #define HAVE_SCX_DEQ_CORE_SCHED_EXEC > +#define HAVE_SCX_DEQ_SCHED_CHANGE > #define HAVE_SCX_DSQ_FLAG_BUILTIN > #define HAVE_SCX_DSQ_FLAG_LOCAL_ON > #define HAVE_SCX_DSQ_INVALID > @@ -48,6 +49,7 @@ > #define HAVE_SCX_TASK_QUEUED > #define HAVE_SCX_TASK_RESET_RUNNABLE_AT > #define HAVE_SCX_TASK_DEQD_FOR_SLEEP > +#define HAVE_SCX_TASK_DISPATCH_DEQUEUED > #define HAVE_SCX_TASK_STATE_SHIFT > #define HAVE_SCX_TASK_STATE_BITS > #define HAVE_SCX_TASK_STATE_MASK > diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h > index 2f8002bcc19ad..5da50f9376844 100644 > --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h > +++ 
b/tools/sched_ext/include/scx/enums.autogen.bpf.h > @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; > const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; > #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ > > +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; > +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE > diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h > index fedec938584be..fc9a7a4d9dea5 100644 > --- a/tools/sched_ext/include/scx/enums.autogen.h > +++ b/tools/sched_ext/include/scx/enums.autogen.h > @@ -46,4 +46,5 @@ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ > + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ > } while (0) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-26 8:41 ` [PATCH 1/2] " Andrea Righi 2026-01-27 16:38 ` Emil Tsalapatis @ 2026-01-27 16:41 ` Kuba Piecuch 2026-01-30 7:34 ` Andrea Righi 2026-01-28 21:21 ` Tejun Heo 2 siblings, 1 reply; 81+ messages in thread From: Kuba Piecuch @ 2026-01-27 16:41 UTC (permalink / raw) To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hi Andrea, On Mon Jan 26, 2026 at 8:41 AM UTC, Andrea Righi wrote: > Currently, ops.dequeue() is only invoked when the sched_ext core knows > that a task resides in BPF-managed data structures, which causes it to > miss scheduling property change scenarios. As a result, BPF schedulers > cannot reliably track task state. > > In addition, some ops.dequeue() callbacks can be skipped (e.g., during > direct dispatch), so ops.enqueue() calls are not always paired with a > corresponding ops.dequeue(), potentially breaking accounting logic. > > Fix this by guaranteeing that every ops.enqueue() is matched with a > corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE > flag to distinguish dequeues triggered by scheduling property changes > from those occurring in the normal dispatch/execution workflow. > > New semantics: > 1. ops.enqueue() is called when a task enters the BPF scheduler > 2. ops.dequeue() is called when the task leaves the BPF scheduler in > the following cases: > a) regular dispatch workflow: task dispatched to a DSQ, > b) core scheduling pick: core-sched picks task before dispatch, > c) property change: task properties modified. > > A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF > schedulers to distinguish between: > - normal dispatch/execution workflow (dispatch, core-sched pick), > - property changes that require state updates (e.g., > sched_setaffinity(), sched_setscheduler(), set_user_nice(), > NUMA balancing, CPU migrations, etc.). 
> > With this, BPF schedulers can: > - reliably track task ownership and lifecycle, > - maintain accurate accounting of enqueue/dequeue pairs, > - distinguish between execution events and property changes, > - update internal state appropriately for each dequeue type. > > Cc: Tejun Heo <tj@kernel.org> > Cc: Emil Tsalapatis <emil@etsalapatis.com> > Cc: Kuba Piecuch <jpiecuch@google.com> > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > Documentation/scheduler/sched-ext.rst | 33 +++++++ > include/linux/sched/ext.h | 11 +++ > kernel/sched/ext.c | 89 ++++++++++++++++++- > kernel/sched/ext_internal.h | 7 ++ > .../sched_ext/include/scx/enum_defs.autogen.h | 2 + > .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + > tools/sched_ext/include/scx/enums.autogen.h | 1 + > 7 files changed, 142 insertions(+), 3 deletions(-) > > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst > index 404fe6126a769..ed6bf7d9e6e8c 100644 > --- a/Documentation/scheduler/sched-ext.rst > +++ b/Documentation/scheduler/sched-ext.rst > @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed. > > * Queue the task on the BPF side. > > + Once ``ops.enqueue()`` is called, the task enters the "enqueued state". > + The task remains in this state until ``ops.dequeue()`` is called, which > + happens in the following cases: > + > + 1. **Regular dispatch workflow**: when the task is successfully > + dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()`` > + is triggered immediately to notify the BPF scheduler. > + > + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and > + core scheduling picks a task for execution before it has been > + dispatched, ``ops.dequeue()`` is called with the > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. > + > + 3. 
**Scheduling property change**: when a task property changes (via > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, > + priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called > + with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. > + > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, > + regardless of whether the task is still on a BPF data structure, or it > + has already been dispatched to a DSQ. This guarantees that every > + ``ops.enqueue()`` will eventually be followed by a corresponding > + ``ops.dequeue()``. Not sure I follow this paragraph, specifically the first sentence (starting with ``ops.dequeue()`` is called ...). It seems to imply that a task that has already been dispatched to a DSQ still counts as enqueued, but the preceding text contradicts that by saying that a task is in an "enqueued state" from the time ops.enqueue() is called until (among other things) it's successfully dispatched to a DSQ. This would make sense if this paragraph used "enqueued" in the SCX_TASK_QUEUED sense, while the first paragraph used the SCX_OPSS_QUEUED sense, but if that's the case, it's quite confusing and should be clarified IMO. > + > + This makes it reliable for BPF schedulers to track the enqueued state > + and maintain accurate accounting. > + > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they > + don't need to track these transitions. The sched_ext core will safely > + handle all dequeue operations regardless. > + > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If > empty, it then looks at the global DSQ. 
If there still isn't a task to > run, ``ops.dispatch()`` is invoked which can use the following two > @@ -319,6 +350,8 @@ by a sched_ext scheduler: > /* Any usable CPU becomes available */ > > ops.dispatch(); /* Task is moved to a local DSQ */ > + > + ops.dequeue(); /* Exiting BPF scheduler */ > } > ops.running(); /* Task starts running on its assigned CPU */ > while (task->scx.slice > 0 && task is runnable) > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > index bcb962d5ee7d8..59446cd0373fa 100644 > --- a/include/linux/sched/ext.h > +++ b/include/linux/sched/ext.h > @@ -84,8 +84,19 @@ struct scx_dispatch_q { > /* scx_entity.flags */ > enum scx_ent_flags { > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > + /* > + * Set when ops.enqueue() is called; used to determine if ops.dequeue() > + * should be invoked when transitioning out of SCX_OPSS_NONE state. > + */ > + SCX_TASK_OPS_ENQUEUED = 1 << 1, > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ > + /* > + * Set when ops.dequeue() is called after successful dispatch; used to > + * distinguish dispatch dequeues from property change dequeues and > + * prevent duplicate dequeue calls. > + */ What counts as a duplicate dequeue call? Looking at the code, we can clearly have ops.dequeue(SCHED_CHANGE) called after ops.dequeue(0) without an intervening call to ops.enqueue(). > + SCX_TASK_DISPATCH_DEQUEUED = 1 << 4, > > SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ > SCX_TASK_STATE_BITS = 2, > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index afe28c04d5aa7..18bca2b83f5c5 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, > > p->scx.ddsp_enq_flags |= enq_flags; > > + /* > + * The task is about to be dispatched. 
If ops.enqueue() was called, > + * notify the BPF scheduler by calling ops.dequeue(). > + * > + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property > + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. > + * Mark that the dispatch dequeue has been called to distinguish > + * from property change dequeues. > + */ > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); > + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; > + } > + > /* > * We are in the enqueue path with @rq locked and pinned, and thus can't > * double lock a remote rq and enqueue to its local DSQ. For > @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); > > + /* > + * Mark that ops.enqueue() is being called for this task. > + * Clear the dispatch dequeue flag for the new enqueue cycle. > + * Only track these flags if ops.dequeue() is implemented. > + */ > + if (SCX_HAS_OP(sch, dequeue)) { > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; > + p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED; > + } > + > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); > WARN_ON_ONCE(*ddsp_taskp); > *ddsp_taskp = p; > @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > > switch (opss & SCX_OPSS_STATE_MASK) { > case SCX_OPSS_NONE: > + if (SCX_HAS_OP(sch, dequeue) && > + p->scx.flags & SCX_TASK_OPS_ENQUEUED) { > + /* > + * Task was already dispatched. Only call ops.dequeue() > + * if it hasn't been called yet (check DISPATCH_DEQUEUED). > + * This can happen when: > + * 1. Core-sched picks a task that was dispatched > + * 2. Property changes occur after dispatch > + */ > + if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) { > + /* > + * ops.dequeue() wasn't called during dispatch. 
> + * This shouldn't normally happen, but call it now. > + */ Should we add a warning here? > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > + p, deq_flags); > + } else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) { > + /* > + * This is a property change after > + * dispatch. Call ops.dequeue() again with > + * %SCX_DEQ_SCHED_CHANGE. > + */ > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > + p, deq_flags | SCX_DEQ_SCHED_CHANGE); > + } > + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | > + SCX_TASK_DISPATCH_DEQUEUED); > + } If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called for a task at most once between it being dispatched and taken off the CPU, even if its properties are changed multiple times while it's on CPU. Is that intentional? I don't see it documented. To illustrate, assume we have a task p that has been enqueued, dispatched, and is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags. When a property of p is changed while it runs on the CPU, the sequence of calls is: dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) => (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) => set_next_task_scx(p). dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls ops.dequeue(p, ... | SCHED_CHANGE) and clears SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags. put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by dequeue_task_scx(). enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to jump straight to local_norefill, skipping the call to ops.enqueue(), leaving SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ. 
set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though this is not a core-sched pick, but it won't do much because the ops_state is SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls dispatch_dequeue(p) which then removes the task from the local DSQ it was just inserted into. So, we end up in a state where any subsequent property change while the task is still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are unset in p->scx.flags. I really hope I didn't mess anything up when tracing the code, but of course I'm happy to be corrected. > break; > case SCX_OPSS_QUEUEING: > /* > @@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > */ > BUG(); > case SCX_OPSS_QUEUED: > - if (SCX_HAS_OP(sch, dequeue)) > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > - p, deq_flags); > + /* > + * Task is still on the BPF scheduler (not dispatched yet). > + * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE > + * only for property changes, not for core-sched picks. > + */ > + if (SCX_HAS_OP(sch, dequeue)) { > + u64 flags = deq_flags; > + /* > + * Add %SCX_DEQ_SCHED_CHANGE for property changes, > + * but not for core-sched picks or sleep. > + */ > + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) > + flags |= SCX_DEQ_SCHED_CHANGE; > + > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags); > + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | > + SCX_TASK_DISPATCH_DEQUEUED); > + } > > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, > SCX_OPSS_NONE)) > @@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, > > BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); > > + /* > + * The task is about to be dispatched. If ops.enqueue() was called, > + * notify the BPF scheduler by calling ops.dequeue(). 
> + * > + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property > + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. > + * Mark that the dispatch dequeue has been called to distinguish > + * from property change dequeues. > + */ > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { > + struct rq *task_rq = task_rq(p); > + > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0); > + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; > + } > + > dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); > > if (dsq->id == SCX_DSQ_LOCAL) > diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h > index 386c677e4c9a0..befa9a5d6e53f 100644 > --- a/kernel/sched/ext_internal.h > +++ b/kernel/sched/ext_internal.h > @@ -982,6 +982,13 @@ enum scx_deq_flags { > * it hasn't been dispatched yet. Dequeue from the BPF side. > */ > SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, > + > + /* > + * The task is being dequeued due to a property change (e.g., > + * sched_setaffinity(), sched_setscheduler(), set_user_nice(), > + * etc.). 
> + */ > + SCX_DEQ_SCHED_CHANGE = 1LLU << 33, > }; > > enum scx_pick_idle_cpu_flags { > diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h > index c2c33df9292c2..8284f717ff05e 100644 > --- a/tools/sched_ext/include/scx/enum_defs.autogen.h > +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h > @@ -21,6 +21,7 @@ > #define HAVE_SCX_CPU_PREEMPT_UNKNOWN > #define HAVE_SCX_DEQ_SLEEP > #define HAVE_SCX_DEQ_CORE_SCHED_EXEC > +#define HAVE_SCX_DEQ_SCHED_CHANGE > #define HAVE_SCX_DSQ_FLAG_BUILTIN > #define HAVE_SCX_DSQ_FLAG_LOCAL_ON > #define HAVE_SCX_DSQ_INVALID > @@ -48,6 +49,7 @@ > #define HAVE_SCX_TASK_QUEUED > #define HAVE_SCX_TASK_RESET_RUNNABLE_AT > #define HAVE_SCX_TASK_DEQD_FOR_SLEEP > +#define HAVE_SCX_TASK_DISPATCH_DEQUEUED > #define HAVE_SCX_TASK_STATE_SHIFT > #define HAVE_SCX_TASK_STATE_BITS > #define HAVE_SCX_TASK_STATE_MASK > diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h > index 2f8002bcc19ad..5da50f9376844 100644 > --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h > +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h > @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; > const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; > #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ > > +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak; > +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE > diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h > index fedec938584be..fc9a7a4d9dea5 100644 > --- a/tools/sched_ext/include/scx/enums.autogen.h > +++ b/tools/sched_ext/include/scx/enums.autogen.h > @@ -46,4 +46,5 @@ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ > SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ > + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \ > } while (0) Thanks, Kuba ^ permalink raw 
reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-27 16:41 ` Kuba Piecuch @ 2026-01-30 7:34 ` Andrea Righi 2026-01-30 13:14 ` Kuba Piecuch 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-01-30 7:34 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hi Kuba, On Tue, Jan 27, 2026 at 04:41:43PM +0000, Kuba Piecuch wrote: ... > > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst > > index 404fe6126a769..ed6bf7d9e6e8c 100644 > > --- a/Documentation/scheduler/sched-ext.rst > > +++ b/Documentation/scheduler/sched-ext.rst > > @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed. > > > > * Queue the task on the BPF side. > > > > + Once ``ops.enqueue()`` is called, the task enters the "enqueued state". > > + The task remains in this state until ``ops.dequeue()`` is called, which > > + happens in the following cases: > > + > > + 1. **Regular dispatch workflow**: when the task is successfully > > + dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()`` > > + is triggered immediately to notify the BPF scheduler. > > + > > + 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and > > + core scheduling picks a task for execution before it has been > > + dispatched, ``ops.dequeue()`` is called with the > > + ``SCX_DEQ_CORE_SCHED_EXEC`` flag. > > + > > + 3. **Scheduling property change**: when a task property changes (via > > + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, > > + priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called > > + with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``. > > + > > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, > > + regardless of whether the task is still on a BPF data structure, or it > > + has already been dispatched to a DSQ. 
This guarantees that every > > + ``ops.enqueue()`` will eventually be followed by a corresponding > > + ``ops.dequeue()``. > > Not sure I follow this paragraph, specifically the first sentence > (starting with ``ops.dequeue()`` is called ...). > It seems to imply that a task that has already been dispatched to a DSQ still > counts as enqueued, but the preceding text contradicts that by saying that > a task is in an "enqueued state" from the time ops.enqueue() is called until > (among other things) it's successfully dispatched to a DSQ. > > This would make sense if this paragraph used "enqueued" in the SCX_TASK_QUEUED > sense, while the first paragraph used the SCX_OPSS_QUEUED sense, but if that's > the case, it's quite confusing and should be clarified IMO. Good point, the confusion is on my side, the documentation overloads the term "enqueued" and doesn't clearly distinguish the different contexts. In that paragraph, "enqueued" refers to the ops lifecycle (i.e., a task for which ops.enqueue() has been called and whose scheduler-visible state is being tracked), not to the task being queued on a DSQ or having SCX_TASK_QUEUED set. The intent is to treat ops.enqueue() and ops.dequeue() as the boundaries of a scheduler-visible lifecycle, regardless of whether the task is eventually queued on a DSQ or dispatched directly. And as noted by Tejun in his last email, skipping ops.dequeue() for direct dispatches also makes sense, since in that case no new ops lifecycle is established (direct dispatch in ops.select_cpu() or ops.enqueue() can be seen as a shortcut to bypass the scheduler). I'll update the patch and documentation accordingly to make this distinction more explicit. > > > + > > + This makes it reliable for BPF schedulers to track the enqueued state > > + and maintain accurate accounting. > > + > > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they > > + don't need to track these transitions. 
The sched_ext core will safely > > + handle all dequeue operations regardless. > > + > > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If > > empty, it then looks at the global DSQ. If there still isn't a task to > > run, ``ops.dispatch()`` is invoked which can use the following two > > @@ -319,6 +350,8 @@ by a sched_ext scheduler: > > /* Any usable CPU becomes available */ > > > > ops.dispatch(); /* Task is moved to a local DSQ */ > > + > > + ops.dequeue(); /* Exiting BPF scheduler */ > > } > > ops.running(); /* Task starts running on its assigned CPU */ > > while (task->scx.slice > 0 && task is runnable) > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h > > index bcb962d5ee7d8..59446cd0373fa 100644 > > --- a/include/linux/sched/ext.h > > +++ b/include/linux/sched/ext.h > > @@ -84,8 +84,19 @@ struct scx_dispatch_q { > > /* scx_entity.flags */ > > enum scx_ent_flags { > > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ > > + /* > > + * Set when ops.enqueue() is called; used to determine if ops.dequeue() > > + * should be invoked when transitioning out of SCX_OPSS_NONE state. > > + */ > > + SCX_TASK_OPS_ENQUEUED = 1 << 1, > > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ > > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ > > + /* > > + * Set when ops.dequeue() is called after successful dispatch; used to > > + * distinguish dispatch dequeues from property change dequeues and > > + * prevent duplicate dequeue calls. > > + */ > > What counts as a duplicate dequeue call? Looking at the code, we can clearly > have ops.dequeue(SCHED_CHANGE) called after ops.dequeue(0) without an > intervening call to ops.enqueue(). Yeah SCHED_CHANGE dequeues are the exception, and it's acceptable to have ops.dequeue(0) + ops.dequeue(SCHED_CHANGE). The idea is to catch potential duplicate dispatch dequeues. I'll clarify this. 
> > > + SCX_TASK_DISPATCH_DEQUEUED = 1 << 4, > > > > SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ > > SCX_TASK_STATE_BITS = 2, > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index afe28c04d5aa7..18bca2b83f5c5 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, > > > > p->scx.ddsp_enq_flags |= enq_flags; > > > > + /* > > + * The task is about to be dispatched. If ops.enqueue() was called, > > + * notify the BPF scheduler by calling ops.dequeue(). > > + * > > + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property > > + * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE. > > + * Mark that the dispatch dequeue has been called to distinguish > > + * from property change dequeues. > > + */ > > + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { > > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); > > + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; > > + } > > + > > /* > > * We are in the enqueue path with @rq locked and pinned, and thus can't > > * double lock a remote rq and enqueue to its local DSQ. For > > @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); > > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); > > > > + /* > > + * Mark that ops.enqueue() is being called for this task. > > + * Clear the dispatch dequeue flag for the new enqueue cycle. > > + * Only track these flags if ops.dequeue() is implemented. 
> > + */ > > + if (SCX_HAS_OP(sch, dequeue)) { > > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; > > + p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED; > > + } > > + > > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); > > WARN_ON_ONCE(*ddsp_taskp); > > *ddsp_taskp = p; > > @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) > > > > switch (opss & SCX_OPSS_STATE_MASK) { > > case SCX_OPSS_NONE: > > + if (SCX_HAS_OP(sch, dequeue) && > > + p->scx.flags & SCX_TASK_OPS_ENQUEUED) { > > + /* > > + * Task was already dispatched. Only call ops.dequeue() > > + * if it hasn't been called yet (check DISPATCH_DEQUEUED). > > + * This can happen when: > > + * 1. Core-sched picks a task that was dispatched > > + * 2. Property changes occur after dispatch > > + */ > > + if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) { > > + /* > > + * ops.dequeue() wasn't called during dispatch. > > + * This shouldn't normally happen, but call it now. > > + */ > > Should we add a warning here? Good idea, I'll add a WARN_ON_ONCE(). > > > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > > + p, deq_flags); > > + } else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) { > > + /* > > + * This is a property change after > > + * dispatch. Call ops.dequeue() again with > > + * %SCX_DEQ_SCHED_CHANGE. > > + */ > > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, > > + p, deq_flags | SCX_DEQ_SCHED_CHANGE); > > + } > > + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | > > + SCX_TASK_DISPATCH_DEQUEUED); > > + } > > If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called > for a task at most once between it being dispatched and taken off the CPU, > even if its properties are changed multiple times while it's on CPU. > Is that intentional? I don't see it documented. 
> > To illustrate, assume we have a task p that has been enqueued, dispatched, and > is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and > SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags. > > When a property of p is changed while it runs on the CPU, > the sequence of calls is: > dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) => > (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) => > set_next_task_scx(p). > > dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls > ops.dequeue(p, ... | SCHED_CHANGE) and clears > SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags. > > put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by > dequeue_task_scx(). > > enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is > currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to > jump straight to local_norefill, skipping the call to ops.enqueue(), leaving > SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ. > > set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though > this is not a core-sched pick, but it won't do much because the ops_state is > SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls > dispatch_dequeue(p) which the removes the task from the local DSQ it was just > inserted into. > > > So, we end up in a state where any subsequent property change while the task is > still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being > called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are > unset in p->scx.flags. > > I really hope I didn't mess anything up when tracing the code, but of course > I'm happy to be corrected. Correct. And the enqueue/dequeue balancing is preserved here. In the scenario you describe, subsequent property changes while the task remains running go through ENQUEUE_RESTORE, which intentionally skips ops.enqueue(). 
Since no new enqueue cycle is started, there is no corresponding ops.dequeue() to deliver either. In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the scheduler state established by the last ops.enqueue(), not with every individual property change. Multiple property changes while the task stays on CPU are coalesced and the enqueue/dequeue pairing remains balanced. I agree this distinction isn't obvious from the current documentation, I'll clarify that SCX_DEQ_SCHED_CHANGE is edge-triggered per enqueue/run cycle, not per property change. Do you see any practical use case where it'd be beneficial to tie individual ops.dequeue() calls to every property change, as opposed to the current coalesced behavior?? Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-30 7:34 ` Andrea Righi @ 2026-01-30 13:14 ` Kuba Piecuch 2026-01-31 6:54 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Kuba Piecuch @ 2026-01-30 13:14 UTC (permalink / raw) To: Andrea Righi, Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hi Andrea, On Fri Jan 30, 2026 at 7:34 AM UTC, Andrea Righi wrote: ... > Good point, the confusion is on my side, the documentation overloads the > term "enqueued" and doesn't clearly distinguish the different contexts. > > In that paragraph, "enqueued" refers to the ops lifecycle (i.e., a task for > which ops.enqueue() has been called and whose scheduler-visible state is > being tracked), not to the task being queued on a DSQ or having > SCX_TASK_QUEUED set. > > The intent is to treat ops.enqueue() and ops.dequeue() as the boundaries of > a scheduler-visible lifecycle, regardless of whether the task is eventually > queued on a DSQ or dispatched directly. > > And as noted by Tejun in his last email, skipping ops.dequeue() for direct > dispatches also makes sense, since in that case no new ops lifecycle is > established (direct dispatch in ops.select_cpu() or ops.enqueue() can be > seen as a shortcut to bypass the scheduler). Right, skipping ops.dequeue() for direct dispatches makes sense, provided the task is being dispatched to a local/global DSQ. Or at least that's my takeaway after reading Tejun's email. ... >> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called >> for a task at most once between it being dispatched and taken off the CPU, >> even if its properties are changed multiple times while it's on CPU. >> Is that intentional? I don't see it documented. 
>> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags. >> >> When a property of p is changed while it runs on the CPU, >> the sequence of calls is: >> dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) => >> (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) => >> set_next_task_scx(p). >> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls >> ops.dequeue(p, ... | SCHED_CHANGE) and clears >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags. >> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by >> dequeue_task_scx(). >> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ. >> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though >> this is not a core-sched pick, but it won't do much because the ops_state is >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls >> dispatch_dequeue(p) which the removes the task from the local DSQ it was just >> inserted into. >> >> >> So, we end up in a state where any subsequent property change while the task is >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are >> unset in p->scx.flags. >> >> I really hope I didn't mess anything up when tracing the code, but of course >> I'm happy to be corrected. > > Correct. And the enqueue/dequeue balancing is preserved here. 
In the > scenario you describe, subsequent property changes while the task remains > running go through ENQUEUE_RESTORE, which intentionally skips > ops.enqueue(). Since no new enqueue cycle is started, there is no > corresponding ops.dequeue() to deliver either. > > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the > scheduler state established by the last ops.enqueue(), not with every > individual property change. Multiple property changes while the task stays > on CPU are coalesced and the enqueue/dequeue pairing remains balanced. Ok, I think I understand the logic behind this, here's how I understand it: The BPF scheduler is naturally going to have some internal per-task state. That state may be expensive to compute from scratch, so we don't want to completely discard it when the BPF scheduler loses ownership of the task. ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler: "Hey, some scheduling properties of the task are about to change, so you probably should invalidate whatever state you have for that task which depends on these properties." That way, the BPF scheduler will know to recompute the invalidated state on the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the BPF scheduler knows that none of the task's fundamental scheduling properties (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing the state. Of course, the potential for savings depends on the particular scheduler's policy. This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while a task is running: for subsequent calls, the BPF scheduler had already been notified to invalidate its state, so there's no use in notifying it again. However, I feel like there's a hidden assumption here that the BPF scheduler doesn't recompute its state for the task before the next ops.enqueue(). 
What if the scheduler wanted to immediately react to the priority of a task being decreased by preempting it? You might say "hook into ops.set_weight()", but then doesn't that obviate the need for ops.dequeue(SCHED_CHANGE)? I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property changes that happen under ``scoped_guard (sched_change, ...)`` which don't have a dedicated ops callback, but I wasn't able to find any such properties which would be relevant to SCX. Another thought on the design: currently, the exact meaning of ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF scheduler: * When it's owned, it combines two notifications: BPF scheduler losing ownership AND that it should invalidate task state. * When it's not owned, it only serves as an "invalidate" notification, the ownership status doesn't change. Wouldn't it be more elegant to have another callback, say ops.property_change(), which would only serve as the "invalidate" notification, and leave ops.dequeue() only for tracking ownership? That would mean calling ops.dequeue() followed by ops.property_change() when changing properties of a task owned by the BPF scheduler, as opposed to a single call to ops.dequeue(SCHED_CHANGE). But honestly, when I put it like this, it gets harder to justify having this callback over just using ops.set_weight() etc. > > I agree this distinction isn't obvious from the current documentation, I'll > clarify that SCX_DEQ_SCHED_CHANGE is edge-triggered per enqueue/run cycle, > not per property change. > > Do you see any practical use case where it'd be beneficial to tie > individual ops.dequeue() calls to every property change, as opposed to the > current coalesced behavior?? I don't know how practical it is, but in my comment above I mention a BPF scheduler wanting to immediately preempt a running task on priority decrease, but in that case we need to hook into ops.set_weight() anyway to find out whether the priority was decreased. 
Thanks, Kuba
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-30 13:14 ` Kuba Piecuch @ 2026-01-31 6:54 ` Andrea Righi 2026-01-31 16:45 ` Kuba Piecuch 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-01-31 6:54 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hi Kuba, On Fri, Jan 30, 2026 at 01:14:23PM +0000, Kuba Piecuch wrote: ... > >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called > >> for a task at most once between it being dispatched and taken off the CPU, > >> even if its properties are changed multiple times while it's on CPU. > >> Is that intentional? I don't see it documented. > >> > >> To illustrate, assume we have a task p that has been enqueued, dispatched, and > >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and > >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags. > >> > >> When a property of p is changed while it runs on the CPU, > >> the sequence of calls is: > >> dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) => > >> (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) => > >> set_next_task_scx(p). > >> > >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls > >> ops.dequeue(p, ... | SCHED_CHANGE) and clears > >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags. > >> > >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by > >> dequeue_task_scx(). > >> > >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is > >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to > >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving > >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ. 
> >>
> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
> >> this is not a core-sched pick, but it won't do much because the ops_state is
> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
> >> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
> >> inserted into.
> >>
> >> So, we end up in a state where any subsequent property change while the task is
> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
> >> unset in p->scx.flags.
> >>
> >> I really hope I didn't mess anything up when tracing the code, but of course
> >> I'm happy to be corrected.
> >
> > Correct. And the enqueue/dequeue balancing is preserved here. In the
> > scenario you describe, subsequent property changes while the task remains
> > running go through ENQUEUE_RESTORE, which intentionally skips
> > ops.enqueue(). Since no new enqueue cycle is started, there is no
> > corresponding ops.dequeue() to deliver either.
> >
> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
> > scheduler state established by the last ops.enqueue(), not with every
> > individual property change. Multiple property changes while the task stays
> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
>
> Ok, I think I understand the logic behind this, here's how I understand it:
>
> The BPF scheduler is naturally going to have some internal per-task state.
> That state may be expensive to compute from scratch, so we don't want to
> completely discard it when the BPF scheduler loses ownership of the task.
>
> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> "Hey, some scheduling properties of the task are about to change, so you
> probably should invalidate whatever state you have for that task which depends
> on these properties."

Correct.
And it's also a way to notify that the task has left the BPF scheduler, so
if the task is stored in any internal queue it can/should be removed.

> That way, the BPF scheduler will know to recompute the invalidated state on
> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> BPF scheduler knows that none of the task's fundamental scheduling properties
> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> the state. Of course, the potential for savings depends on the particular
> scheduler's policy.
>
> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> a task is running: for subsequent calls, the BPF scheduler had already been
> notified to invalidate its state, so there's no use in notifying it again.

Actually I think the proper behavior would be to trigger
ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
scheduler. While running, tasks are outside the BPF scheduler ownership, so
ops.dequeue() shouldn't be triggered at all.

> However, I feel like there's a hidden assumption here that the BPF scheduler
> doesn't recompute its state for the task before the next ops.enqueue().

And that should be the proper behavior. The BPF scheduler should recompute a
task's state only when the task is re-enqueued after a property change.

> What if the scheduler wanted to immediately react to the priority of a task
> being decreased by preempting it? You might say "hook into
> ops.set_weight()", but then doesn't that obviate the need for
> ops.dequeue(SCHED_CHANGE)?

If a scheduler wants to implement preemption on property change, it can do
so in ops.enqueue(): after a property change, the task is re-enqueued,
triggering ops.enqueue(), at which point the BPF scheduler can decide
whether and how to preempt currently running tasks.
If a property change does not result in an ops.enqueue() call, it means the
task is not runnable yet (or does not intend to run), so attempting to
trigger a preemption at that point would be pointless.

> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
> a dedicated ops callback, but I wasn't able to find any such properties which
> would be relevant to SCX.
>
> Another thought on the design: currently, the exact meaning of
> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
> scheduler:
>
> * When it's owned, it combines two notifications: BPF scheduler losing
>   ownership AND that it should invalidate task state.
> * When it's not owned, it only serves as an "invalidate" notification,
>   the ownership status doesn't change.

When it's not owned I think ops.dequeue() shouldn't be triggered at all.

> Wouldn't it be more elegant to have another callback, say
> ops.property_change(), which would only serve as the "invalidate" notification,
> and leave ops.dequeue() only for tracking ownership?
> That would mean calling ops.dequeue() followed by ops.property_change() when
> changing properties of a task owned by the BPF scheduler, as opposed to a
> single call to ops.dequeue(SCHED_CHANGE).

We could provide an ops.property_change(), but honestly I don't see any
practical usage of this callback.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-31 6:54 ` Andrea Righi @ 2026-01-31 16:45 ` Kuba Piecuch 2026-01-31 17:24 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Kuba Piecuch @ 2026-01-31 16:45 UTC (permalink / raw) To: Andrea Righi, Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hi Andrea, On Sat Jan 31, 2026 at 6:54 AM UTC, Andrea Righi wrote: >> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called >> >> for a task at most once between it being dispatched and taken off the CPU, >> >> even if its properties are changed multiple times while it's on CPU. >> >> Is that intentional? I don't see it documented. >> >> >> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and >> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and >> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags. >> >> >> >> When a property of p is changed while it runs on the CPU, >> >> the sequence of calls is: >> >> dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) => >> >> (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) => >> >> set_next_task_scx(p). >> >> >> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls >> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears >> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags. >> >> >> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by >> >> dequeue_task_scx(). >> >> >> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is >> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to >> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving >> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ. 
>> >> >> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though >> >> this is not a core-sched pick, but it won't do much because the ops_state is >> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls >> >> dispatch_dequeue(p) which the removes the task from the local DSQ it was just >> >> inserted into. >> >> >> >> >> >> So, we end up in a state where any subsequent property change while the task is >> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being >> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are >> >> unset in p->scx.flags. >> >> >> >> I really hope I didn't mess anything up when tracing the code, but of course >> >> I'm happy to be corrected. >> > >> > Correct. And the enqueue/dequeue balancing is preserved here. In the >> > scenario you describe, subsequent property changes while the task remains >> > running go through ENQUEUE_RESTORE, which intentionally skips >> > ops.enqueue(). Since no new enqueue cycle is started, there is no >> > corresponding ops.dequeue() to deliver either. >> > >> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the >> > scheduler state established by the last ops.enqueue(), not with every >> > individual property change. Multiple property changes while the task stays >> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced. >> >> Ok, I think I understand the logic behind this, here's how I understand it: >> >> The BPF scheduler is naturally going to have some internal per-task state. >> That state may be expensive to compute from scratch, so we don't want to >> completely discard it when the BPF scheduler loses ownership of the task. 
>> >> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler: >> "Hey, some scheduling properties of the task are about to change, so you >> probably should invalidate whatever state you have for that task which depends >> on these properties." > > Correct. And it's also a way to notify that the task has left the BPF > scheduler, so if the task is stored in any internal queue it can/should be > removed. Right, unless the task has already been dispatched, in which case it's just an invalidation notification. > >> >> That way, the BPF scheduler will know to recompute the invalidated state on >> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the >> BPF scheduler knows that none of the task's fundamental scheduling properties >> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing >> the state. Of course, the potential for savings depends on the particular >> scheduler's policy. >> >> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while >> a task is running: for subsequent calls, the BPF scheduler had already been >> notified to invalidate its state, so there's no use in notifying it again. > > Actually I think the proper behavior would be to trigger > ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF > scheduler. While running, tasks are outside the BPF scheduler ownership, so > ops.dequeue() shouldn't be triggered at all. > I don't think this is what the current implementation does, right? >> >> However, I feel like there's a hidden assumption here that the BPF scheduler >> doesn't recompute its state for the task before the next ops.enqueue(). > > And that should be the proper behavior. BPF scheduler should recompute a > task state only when the task is re-enqueued after a property change. 
> That would make sense if ops.enqueue() was called immediately after a property change when a task is running, but I believe that's currently not the case, see my attempt at tracing the enqueue-dequeue cycle on property change in my first reply. >> What if the scheduler wanted to immediately react to the priority of a task >> being decreased by preempting it? You might say "hook into >> ops.set_weight()", but then doesn't that obviate the need for >> ops.dequeue(SCHED_CHANGE)? > > If a scheduler wants to implement preemption on property change, it can do > so in ops.enqueue(): after a property change, the task is re-enqueued, > triggering ops.enqueue(), at which point the BPF scheduler can decide > whether and how to preempt currently running tasks. > > If a property change does not result in an ops.enqueue() call, it means the > task is not runnable yet (or does not intend to run), so attempting to > trigger a preemption at that point would be pointless. > IIUC a dequeue-enqueue cycle on a running task during property change doesn't result in a call to ops.enqueue(), so if the BPF scheduler recomputed its state only in ops.enqueue(), then it wouldn't be able to react immediately. >> >> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property >> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have >> a dedicated ops callback, but I wasn't able to find any such properties which >> would be relevant to SCX. >> >> Another thought on the design: currently, the exact meaning of >> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF >> scheduler: >> >> * When it's owned, it combines two notifications: BPF scheduler losing >> ownership AND that it should invalidate task state. >> * When it's not owned, it only serves as an "invalidate" notification, >> the ownership status doesn't change. > > When it's not owned I think ops.dequeue() shouldn't be triggered at all. 
> >> >> Wouldn't it be more elegant to have another callback, say >> ops.property_change(), which would only serve as the "invalidate" notification, >> and leave ops.dequeue() only for tracking ownership? >> That would mean calling ops.dequeue() followed by ops.property_change() when >> changing properties of a task owned by the BPF scheduler, as opposed to a >> single call to ops.dequeue(SCHED_CHANGE). > > We could provide an ops.property_change(), but honestly I don't see any > practical usage of this callback. > Neither do I, I just made it up for the sake of argument :-) Thanks, Kuba ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-31 16:45 ` Kuba Piecuch
@ 2026-01-31 17:24 ` Andrea Righi
  0 siblings, 0 replies; 81+ messages in thread
From: Andrea Righi @ 2026-01-31 17:24 UTC (permalink / raw)
To: Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
    Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Kuba,

On Sat, Jan 31, 2026 at 04:45:59PM +0000, Kuba Piecuch wrote:
...
> >> The BPF scheduler is naturally going to have some internal per-task state.
> >> That state may be expensive to compute from scratch, so we don't want to
> >> completely discard it when the BPF scheduler loses ownership of the task.
> >>
> >> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> >> "Hey, some scheduling properties of the task are about to change, so you
> >> probably should invalidate whatever state you have for that task which depends
> >> on these properties."
> >
> > Correct. And it's also a way to notify that the task has left the BPF
> > scheduler, so if the task is stored in any internal queue it can/should be
> > removed.
>
> Right, unless the task has already been dispatched, in which case it's just
> an invalidation notification.

Right, but if the task has already been dispatched I don't think we should
trigger ops.dequeue(SCHED_CHANGE), because it's no longer under the BPF
scheduler's custody (not the way it's implemented right now, I'm just
trying to define the proper semantics based on the latest discussions).

> >> That way, the BPF scheduler will know to recompute the invalidated state on
> >> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> >> BPF scheduler knows that none of the task's fundamental scheduling properties
> >> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> >> the state. Of course, the potential for savings depends on the particular
> >> scheduler's policy.
> >>
> >> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> >> a task is running: for subsequent calls, the BPF scheduler had already been
> >> notified to invalidate its state, so there's no use in notifying it again.
> >
> > Actually I think the proper behavior would be to trigger
> > ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> > scheduler. While running, tasks are outside the BPF scheduler ownership, so
> > ops.dequeue() shouldn't be triggered at all.
>
> I don't think this is what the current implementation does, right?

Right, sorry, I wasn't clear. I'm just trying to define the behavior that
makes more sense (see below).

> >> However, I feel like there's a hidden assumption here that the BPF scheduler
> >> doesn't recompute its state for the task before the next ops.enqueue().
> >
> > And that should be the proper behavior. BPF scheduler should recompute a
> > task state only when the task is re-enqueued after a property change.
>
> That would make sense if ops.enqueue() was called immediately after a property
> change when a task is running, but I believe that's currently not the case,
> see my attempt at tracing the enqueue-dequeue cycle on property change in my
> first reply.

Yeah, that's right.

I have a new patch set where I've implemented the following semantics
(which should also match Tejun's requirements).

With the new semantics:
 - for running tasks: property changes do NOT trigger
   ops.dequeue(SCHED_CHANGE)
 - once a task leaves BPF custody (dispatched to local DSQ), the BPF
   scheduler no longer manages it
 - property changes on running tasks don't affect the BPF scheduler

Key principle: ops.dequeue() is only called when a task leaves the BPF
scheduler's custody. A running task has already left BPF custody, so
property changes don't trigger ops.dequeue().
Therefore, `ops.dequeue(SCHED_CHANGE)` gets called only when:
 - the task is in BPF data structures (QUEUED state), or
 - the task is on a non-local DSQ (still in BPF custody)

In this case (BPF scheduler custody), if a property change happens,
ops.dequeue(SCHED_CHANGE) is called to notify the BPF scheduler.

Then, if you want to react immediately to property changes for running
tasks, we have:
 - ops.set_cpumask(): CPU affinity changes
 - ops.set_weight(): priority/nice changes
 - ops.cgroup_*(): cgroup changes

In conclusion, we don't need ops.dequeue(SCHED_CHANGE) for running tasks:
the dedicated callbacks (ops.set_cpumask(), ops.set_weight(), ...) already
provide comprehensive coverage for property changes on all tasks,
regardless of whether they're running or in BPF custody. And the new
ops.dequeue(SCHED_CHANGE) semantics only notify for property changes when
tasks are actively managed by the BPF scheduler (in QUEUED state or on
non-local DSQs).

Do you think it's reasonable enough / do you see any flaws?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
  2026-01-27 16:38 ` Emil Tsalapatis
  2026-01-27 16:41 ` Kuba Piecuch
@ 2026-01-28 21:21 ` Tejun Heo
  2026-01-30 11:54 ` Kuba Piecuch
  2 siblings, 1 reply; 81+ messages in thread
From: Tejun Heo @ 2026-01-28 21:21 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle,
    Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hello,

On Mon, Jan 26, 2026 at 09:41:49AM +0100, Andrea Righi wrote:
> @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>
>  	p->scx.ddsp_enq_flags |= enq_flags;
>
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}

1. When to call ops.dequeue()?

I'm not sure about deciding whether to call ops.dequeue() solely on whether
ops.enqueue() was called. Direct dispatch has been expanded to include other
DSQs but was originally added as a way to shortcut the dispatch path and
"dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie.
When a task is dispatched directly to a local DSQ, the BPF scheduler is done
with that task - the task is now in the same state as tasks that get
dispatched to a local DSQ from ops.dispatch().

ie.
What effectively decides whether a task left the BPF scheduler is whether
the task reached a local DSQ or not, and direct dispatching into a local
DSQ shouldn't trigger ops.dequeue() - the task never really "queues" on the
BPF scheduler.

This creates another discrepancy - From ops.enqueue(), direct dispatching
into a non-local DSQ clearly makes the task enter the BPF scheduler and
thus its departure should trigger ops.dequeue(). What about a task which is
direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially,
the right thing to do seems to skip ops.dequeue(). After all, the task has
never been ops.enqueue()'d. However, I think this is another case where
what's obvious doesn't agree with what's happening underneath.

ops.select_cpu() cannot actually queue anything. It's too early. Direct
dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch
once the enqueue path is invoked so that the BPF scheduler can avoid
invocation of ops.enqueue() when the decision has already been made. While
this shortcut was added for convenience (so that e.g. the BPF scheduler
doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has
real performance implications as it does save a roundtrip through
ops.enqueue() and we know that such overheads do matter for some use cases
(e.g. maximizing FPS on certain games).

So, while more subtle on the surface, I think the right thing to do is
basing the decision to call ops.dequeue() on the task's actual state -
ops.dequeue() should be called if the task is "on" the BPF scheduler - ie.
if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local
DSQ or on the BPF side.

The subtlety would need clear documentation and we probably want to allow
ops.dequeue() to distinguish different cases. If you boil it down to the
actual task state, I don't think it's that subtle - if a task is in the
custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not.
Note that, this way, whether ops.dequeue() needs to be called agrees with
whether the task needs to be dispatched to run.

2. Why keep %SCX_TASK_OPS_ENQUEUED for %SCX_DEQ_SCHED_CHANGE?

Wouldn't that lead to calling ops.dequeue() more than once for the same
enqueue event? If the BPF scheduler is told that the task has left it
already, why does it matter whether the task gets dequeued for sched change
afterwards? e.g. from the BPF sched's POV, it shouldn't matter whether the
task is still on the local DSQ or already running, in which case the sched
class's dequeue() wouldn't be called in the first place, no?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-28 21:21 ` Tejun Heo @ 2026-01-30 11:54 ` Kuba Piecuch 2026-01-31 9:02 ` Andrea Righi 2026-02-01 17:43 ` Tejun Heo 0 siblings, 2 replies; 81+ messages in thread From: Kuba Piecuch @ 2026-01-30 11:54 UTC (permalink / raw) To: Tejun Heo, Andrea Righi Cc: David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hi Tejun, On Wed Jan 28, 2026 at 9:21 PM UTC, Tejun Heo wrote: ... > 1. When to call ops.dequeue()? > > I'm not sure whether deciding whether to call ops.dequeue() solely onwhether > ops.enqueue() was called. Direct dispatch has been expanded to include other > DSQs but was originally added as a way to shortcut the dispatch path and > "dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie. > When a task is dispatched directly to a local DSQ, the BPF scheduler is done > with that task - the task is now in the same state with tasks that get > dispatched to a local DSQ from ops.dispatch(). > > ie. What effectively decides whether a task left the BPF scheduler is > whether the task reached a local DSQ or not, and direct dispatching into a > local DSQ shouldn't trigger ops.dequeue() - the task never really "queues" > on the BPF scheduler. Is "local" short for "local or global", i.e. not user-created? Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(), since dispatch isn't necessary for the task to run. This follows from the last paragraph: Note that, this way, whether ops.dequeue() needs to be called agrees with whether the task needs to be dispatched to run. I agree with your points, just wanted to clarify this one thing. > > This creates another discrepancy - From ops.enqueue(), direct dispatching > into a non-local DSQ clearly makes the task enter the BPF scheduler and thus > its departure should trigger ops.dequeue(). 
What about a task which is > direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially, > the right thing to do seems to skip ops.dequeue(). After all, the task has > never been ops.enqueue()'d. However, I think this is another case where > what's obvious doesn't agree with what's happening underneath. > > ops.select_cpu() cannot actually queue anything. It's too early. Direct > dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch > once the enqueue path is invoked so that the BPF scheudler can avoid > invocation of ops.enqueue() when the decision has already been made. While > this shortcut was added for convenience (so that e.g. the BPF scheduler > doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has > real performance implications as it does save a roundtrip through > ops.enqueue() and we know that such overheads do matter for some use cases > (e.g. maximizing FPS on certain games). > > So, while more subtle on the surface, I think the right thing to do is > basing the decision to call ops.dequeue() on the task's actual state - > ops.dequeue() should be called if the task is "on" the BPF scheduler - ie. > if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local > DSQ or on the BPF side. > > The subtlety would need clear documentation and we probably want to allow > ops.dequeue() to distinguish different cases. If you boil it down to the > actual task state, I don't think it's that subtle - if a task is in the > custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not. > Note that, this way, whether ops.dequeue() needs to be called agrees with > whether the task needs to be dispatched to run. Here's my attempt at documenting this behavior: After ops.enqueue() is called on a task, the task is owned by the BPF scheduler, provided the task wasn't direct-dispatched to a local/global DSQ. 
When a task is owned by the BPF scheduler, the scheduler needs to dispatch the task to a local/global DSQ in order for it to run. When the BPF scheduler loses ownership of the task, either due to dispatching it to a local/global DSQ or due to external events (core-sched pick, CPU migration, scheduling property changes), the BPF scheduler is notified through ops.dequeue() with appropriate flags (TBD). Thanks, Kuba ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-30 11:54 ` Kuba Piecuch @ 2026-01-31 9:02 ` Andrea Righi 2026-01-31 17:53 ` Kuba Piecuch 2026-02-01 17:43 ` Tejun Heo 1 sibling, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-01-31 9:02 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote: > Hi Tejun, > > On Wed Jan 28, 2026 at 9:21 PM UTC, Tejun Heo wrote: > ... > > 1. When to call ops.dequeue()? > > > > I'm not sure whether deciding whether to call ops.dequeue() solely onwhether > > ops.enqueue() was called. Direct dispatch has been expanded to include other > > DSQs but was originally added as a way to shortcut the dispatch path and > > "dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie. > > When a task is dispatched directly to a local DSQ, the BPF scheduler is done > > with that task - the task is now in the same state with tasks that get > > dispatched to a local DSQ from ops.dispatch(). > > > > ie. What effectively decides whether a task left the BPF scheduler is > > whether the task reached a local DSQ or not, and direct dispatching into a > > local DSQ shouldn't trigger ops.dequeue() - the task never really "queues" > > on the BPF scheduler. > > Is "local" short for "local or global", i.e. not user-created? > Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(), > since dispatch isn't necessary for the task to run. This follows from the last > paragraph: > > Note that, this way, whether ops.dequeue() needs to be called agrees with > whether the task needs to be dispatched to run. > > I agree with your points, just wanted to clarify this one thing. I think this should be interpreted as local DSQs only (SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. 
SCX_DSQ_GLOBAL is essentially a built-in user DSQ, provided for convenience, it's not really a "direct dispatch" DSQ. > > > > > This creates another discrepancy - From ops.enqueue(), direct dispatching > > into a non-local DSQ clearly makes the task enter the BPF scheduler and thus > > its departure should trigger ops.dequeue(). What about a task which is > > direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially, > > the right thing to do seems to skip ops.dequeue(). After all, the task has > > never been ops.enqueue()'d. However, I think this is another case where > > what's obvious doesn't agree with what's happening underneath. > > > > ops.select_cpu() cannot actually queue anything. It's too early. Direct > > dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch > > once the enqueue path is invoked so that the BPF scheudler can avoid > > invocation of ops.enqueue() when the decision has already been made. While > > this shortcut was added for convenience (so that e.g. the BPF scheduler > > doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has > > real performance implications as it does save a roundtrip through > > ops.enqueue() and we know that such overheads do matter for some use cases > > (e.g. maximizing FPS on certain games). > > > > So, while more subtle on the surface, I think the right thing to do is > > basing the decision to call ops.dequeue() on the task's actual state - > > ops.dequeue() should be called if the task is "on" the BPF scheduler - ie. > > if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local > > DSQ or on the BPF side. > > > > The subtlety would need clear documentation and we probably want to allow > > ops.dequeue() to distinguish different cases. If you boil it down to the > > actual task state, I don't think it's that subtle - if a task is in the > > custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not. 
> > Note that, this way, whether ops.dequeue() needs to be called agrees with
> > whether the task needs to be dispatched to run.
>
> Here's my attempt at documenting this behavior:
>
> After ops.enqueue() is called on a task, the task is owned by the BPF
> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ.
> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the
> task to a local/global DSQ in order for it to run.
> When the BPF scheduler loses ownership of the task, either due to dispatching it
> to a local/global DSQ or due to external events (core-sched pick, CPU
> migration, scheduling property changes), the BPF scheduler is notified through
> ops.dequeue() with appropriate flags (TBD).

This looks good overall, except for the global DSQ part. Also, it might be
better to avoid the term "owned", internally the kernel already uses the
concept of "task ownership" with a different meaning (see
https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing
it here could be misleading.

With that in mind, I'd probably rephrase your documentation along these
lines:

  After ops.enqueue() is called, the task is considered *enqueued* by the BPF
  scheduler, unless it is directly dispatched to a local DSQ (via
  SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON).

  While a task is enqueued, the BPF scheduler must explicitly dispatch it to
  a DSQ in order for it to run.

  When a task leaves the enqueued state (either because it is dispatched to a
  non-local DSQ, or due to external events such as a core-sched pick, CPU
  migration, or scheduling property changes), ops.dequeue() is invoked to
  notify the BPF scheduler, with flags indicating the reason for the dequeue:
  regular dispatch dequeues have no flags set, whereas dequeues triggered by
  scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE.

What do you think?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-31 9:02 ` Andrea Righi @ 2026-01-31 17:53 ` Kuba Piecuch 2026-01-31 20:26 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Kuba Piecuch @ 2026-01-31 17:53 UTC (permalink / raw) To: Andrea Righi, Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis On Sat Jan 31, 2026 at 9:02 AM UTC, Andrea Righi wrote: > On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote: >> Is "local" short for "local or global", i.e. not user-created? >> Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(), >> since dispatch isn't necessary for the task to run. This follows from the last >> paragraph: >> >> Note that, this way, whether ops.dequeue() needs to be called agrees with >> whether the task needs to be dispatched to run. >> >> I agree with your points, just wanted to clarify this one thing. > > I think this should be interpreted as local DSQs only > (SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. SCX_DSQ_GLOBAL is > essentially a built-in user DSQ, provided for convenience, it's not really > a "direct dispatch" DSQ. SCX_DSQ_GLOBAL is significantly different from user DSQs, because balance_one() can pull tasks directly from SCX_DSQ_GLOBAL, while it cannot pull tasks from user-created DSQs. If a BPF scheduler puts a task onto SCX_DSQ_GLOBAL, then it _must_ be ok with balance_one() coming along and pulling that task without the BPF scheduler's intervention, so in that way I believe SCX_DSQ_GLOBAL is semantically quite similar to local DSQs. >> Here's my attempt at documenting this behavior: >> >> After ops.enqueue() is called on a task, the task is owned by the BPF >> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ. >> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the >> task to a local/global DSQ in order for it to run. 
>> When the BPF scheduler loses ownership of the task, either due to dispatching it >> to a local/global DSQ or due to external events (core-sched pick, CPU >> migration, scheduling property changes), the BPF scheduler is notified through >> ops.dequeue() with appropriate flags (TBD). > > This looks good overall, except for the global DSQ part. Also, it might be > better to avoid the term “owned”, internally the kernel already uses the > concept of "task ownership" with a different meaning (see > https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing > it here could be misleading. > > With that in mind, I'd probably rephrase your documentation along these > lines: > > After ops.enqueue() is called, the task is considered *enqueued* by the BPF > scheduler, unless it is directly dispatched to a local DSQ (via > SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON). > > While a task is enqueued, the BPF scheduler must explicitly dispatch it to > a DSQ in order for it to run. > > When a task leaves the enqueued state (either because it is dispatched to a > non-local DSQ, or due to external events such as a core-sched pick, CPU Shouldn't it be "dispatched to a local DSQ"? > migration, or scheduling property changes), ops.dequeue() is invoked to > notify the BPF scheduler, with flags indicating the reason for the dequeue: > regular dispatch dequeues have no flags set, whereas dequeues triggered by > scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE. Core-sched dequeues also have a dedicated flag, it should probably be included here. > > What do you think? I think using the term "enqueued" isn't very good either since it results in two ways in which a task can be considered enqueued: 1. Between ops.enqueue() and ops.dequeue() 2. Between enqueue_task_scx() and dequeue_task_scx() The two are not equivalent, since a task that's running is not enqueued according to 1. but is enqueued according to 2. 
I would be ok with it if we change it to something unambiguous, e.g. "BPF-enqueued", although that poses a risk of people getting lazy and using "enqueued" anyway. Some potential alternative terms: "resident"/"BPF-resident", "managed"/"BPF-managed", "dispatchable", "pending dispatch", or simply "pending". Thanks, Kuba ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-31 17:53 ` Kuba Piecuch @ 2026-01-31 20:26 ` Andrea Righi 2026-02-02 15:19 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-01-31 20:26 UTC (permalink / raw) To: Kuba Piecuch Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis On Sat, Jan 31, 2026 at 05:53:27PM +0000, Kuba Piecuch wrote: > On Sat Jan 31, 2026 at 9:02 AM UTC, Andrea Righi wrote: > > On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote: > >> Is "local" short for "local or global", i.e. not user-created? > >> Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(), > >> since dispatch isn't necessary for the task to run. This follows from the last > >> paragraph: > >> > >> Note that, this way, whether ops.dequeue() needs to be called agrees with > >> whether the task needs to be dispatched to run. > >> > >> I agree with your points, just wanted to clarify this one thing. > > > > I think this should be interpreted as local DSQs only > > (SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. SCX_DSQ_GLOBAL is > > essentially a built-in user DSQ, provided for convenience, it's not really > > a "direct dispatch" DSQ. > > SCX_DSQ_GLOBAL is significantly different from user DSQs, because balance_one() > can pull tasks directly from SCX_DSQ_GLOBAL, while it cannot pull tasks from > user-created DSQs. > > If a BPF scheduler puts a task onto SCX_DSQ_GLOBAL, then it _must_ be ok with > balance_one() coming along and pulling that task without the BPF scheduler's > intervention, so in that way I believe SCX_DSQ_GLOBAL is semantically quite > similar to local DSQs. I agree that SCX_DSQ_GLOBAL behaves differently from user-created DSQs at the implementation level, but I think that difference shouldn't leak into the logical model. 
From a semantic point of view, dispatching a task to SCX_DSQ_GLOBAL does not mean that the task leaves the "enqueued by BPF" state. The task is still under the BPF scheduler's custody, not directly dispatched to a specific CPU, and remains sched_ext-managed. The scheduler has queued the task and it hasn't relinquished control over it. That said, I don't have a strong opinion here. If we prefer to treat SCX_DSQ_GLOBAL as a "direct dispatch" DSQ for the purposes of ops.dequeue() semantics, then I'm fine with adjusting the logic accordingly (with proper documentation). Tejun, thoughts? > > >> Here's my attempt at documenting this behavior: > >> > >> After ops.enqueue() is called on a task, the task is owned by the BPF > >> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ. > >> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the > >> task to a local/global DSQ in order for it to run. > >> When the BPF scheduler loses ownership of the task, either due to dispatching it > >> to a local/global DSQ or due to external events (core-sched pick, CPU > >> migration, scheduling property changes), the BPF scheduler is notified through > >> ops.dequeue() with appropriate flags (TBD). > > > > This looks good overall, except for the global DSQ part. Also, it might be > > better to avoid the term “owned”, internally the kernel already uses the > > concept of "task ownership" with a different meaning (see > > https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing > > it here could be misleading. > > > > With that in mind, I'd probably rephrase your documentation along these > > lines: > > > > After ops.enqueue() is called, the task is considered *enqueued* by the BPF > > scheduler, unless it is directly dispatched to a local DSQ (via > > SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON). > > > > While a task is enqueued, the BPF scheduler must explicitly dispatch it to > > a DSQ in order for it to run. 
> > > > When a task leaves the enqueued state (either because it is dispatched to a > > non-local DSQ, or due to external events such as a core-sched pick, CPU > > Shouldn't it be "dispatched to a local DSQ"? Oh yes, sorry, it should be "dispatched to a local DSQ, ...". > > > migration, or scheduling property changes), ops.dequeue() is invoked to > > notify the BPF scheduler, with flags indicating the reason for the dequeue: > > regular dispatch dequeues have no flags set, whereas dequeues triggered by > > scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE. > > Core-sched dequeues also have a dedicated flag, it should probably be included > here. Right, core-sched dequeues should be mentioned as well. > > > > > What do you think? > > I think using the term "enqueued" isn't very good either since it results in > two ways in which a task can be considered enqueued: > > 1. Between ops.enqueue() and ops.dequeue() > 2. Between enqueue_task_scx() and dequeue_task_scx() > > The two are not equivalent, since a task that's running is not enqueued > according to 1. but is enqueued according to 2. > > I would be ok with it if we change it to something unambiguous, e.g. > "BPF-enqueued", although that poses a risk of people getting lazy and using > "enqueued" anyway. > > Some potential alternative terms: "resident"/"BPF-resident", > "managed"/"BPF-managed", "dispatchable", "pending dispatch", > or simply "pending". I agree that enqueued is a very ambiguous term and we probably need something more BPF-specific. How about a task "under BPF custody"? Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-31 20:26 ` Andrea Righi @ 2026-02-02 15:19 ` Tejun Heo 2026-02-02 15:30 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-02 15:19 UTC (permalink / raw) To: Andrea Righi Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hello, On Sat, Jan 31, 2026 at 09:26:56PM +0100, Andrea Righi wrote: > I agree that SCX_DSQ_GLOBAL behaves differently from user-created DSQs at > the implementation level, but I think that difference shouldn't leak into > the logical model. > > From a semantic point of view, dispatching a task to SCX_DSQ_GLOBAL does > not mean that the task leaves the "enqueued by BPF" state. The task is > still under the BPF scheduler's custody, not directly dispatched to a > specific CPU, and remains sched_ext-managed. The scheduler has queued the > task and it hasn't relinquished control over it. > > That said, I don't have a strong opinion here. If we prefer to treat > SCX_DSQ_GLOBAL as a "direct dispatch" DSQ for the purposes of ops.dequeue() > semantics, then I'm fine with adjusting the logic accordingly (with proper > documentation). > > Tejun, thoughts? I think putting a task into GLOBAL means that the BPF scheduler is done with it. Another data point in this direction is that when insertion into a local DSQ can't be done, the task falls back to the global DSQ although all the current ones also trigger error. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-02 15:19 ` Tejun Heo @ 2026-02-02 15:30 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2026-02-02 15:30 UTC (permalink / raw) To: Tejun Heo Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis On Mon, Feb 02, 2026 at 05:19:51AM -1000, Tejun Heo wrote: > Hello, > > On Sat, Jan 31, 2026 at 09:26:56PM +0100, Andrea Righi wrote: > > I agree that SCX_DSQ_GLOBAL behaves differently from user-created DSQs at > > the implementation level, but I think that difference shouldn't leak into > > the logical model. > > > > From a semantic point of view, dispatching a task to SCX_DSQ_GLOBAL does > > not mean that the task leaves the "enqueued by BPF" state. The task is > > still under the BPF scheduler's custody, not directly dispatched to a > > specific CPU, and remains sched_ext-managed. The scheduler has queued the > > task and it hasn't relinquished control over it. > > > > That said, I don't have a strong opinion here. If we prefer to treat > > SCX_DSQ_GLOBAL as a "direct dispatch" DSQ for the purposes of ops.dequeue() > > semantics, then I'm fine with adjusting the logic accordingly (with proper > > documentation). > > > > Tejun, thoughts? > > I think putting a task into GLOBAL means that the BPF scheduler is done with > it. Another data point in this direction is that when insertion into a local > DSQ can't be done, the task falls back to the global DSQ although all the > current ones also trigger error. Alright, it seems that the general consensus, based on your feedback and Kuba's, is to treat SCX_DSQ_GLOBAL as a "terminal" DSQ for the purpose of triggering ops.dequeue(). I'll update the logic to do the following: - When a task is dispatched to SCX_DSQ_GLOBAL, the BPF scheduler is considered done with it (similar to local DSQ dispatches). - ops.dequeue() will not be called for SCX_DSQ_GLOBAL dispatches. 
- This aligns with the fallback behavior where tasks that fail local DSQ insertion end up in the global DSQ as a terminal destination. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-30 11:54 ` Kuba Piecuch 2026-01-31 9:02 ` Andrea Righi @ 2026-02-01 17:43 ` Tejun Heo 2026-02-02 15:52 ` Andrea Righi 1 sibling, 1 reply; 81+ messages in thread From: Tejun Heo @ 2026-02-01 17:43 UTC (permalink / raw) To: Kuba Piecuch Cc: Andrea Righi, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis Hello, Sorry about tardiness. On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote: > Is "local" short for "local or global", i.e. not user-created? Yes, maybe it'd be useful to come up with a terminology for them. e.g. terminal - once a task reaches a terminal DSQ, the only way that the BPF scheduler can affect the task is by triggering re-enqueue (although we don't yet support reenqueueing global DSQs). Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-01 17:43 ` Tejun Heo @ 2026-02-02 15:52 ` Andrea Righi 2026-02-02 16:23 ` Kuba Piecuch 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2026-02-02 15:52 UTC (permalink / raw) To: Tejun Heo Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis On Sun, Feb 01, 2026 at 07:43:33AM -1000, Tejun Heo wrote: > Hello, > > Sorry about tardiness. > > On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote: > > Is "local" short for "local or global", i.e. not user-created? > > Yes, maybe it'd be useful to come up with a terminology for them. e.g. > terminal - once a task reaches a terminal DSQ, the only way that the BPF > scheduler can affect the task is by triggering re-enqueue (although we don't > yet support reenqueueing global DSQs). I like "terminal DSQ", if there's no objection I'll update the documentation using this terminology. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-02-02 15:52 ` Andrea Righi @ 2026-02-02 16:23 ` Kuba Piecuch 0 siblings, 0 replies; 81+ messages in thread From: Kuba Piecuch @ 2026-02-02 16:23 UTC (permalink / raw) To: Andrea Righi, Tejun Heo Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle, Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis On Mon Feb 2, 2026 at 3:52 PM UTC, Andrea Righi wrote: > On Sun, Feb 01, 2026 at 07:43:33AM -1000, Tejun Heo wrote: >> On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote: >> > Is "local" short for "local or global", i.e. not user-created? >> >> Yes, maybe it'd be useful to come up with a terminology for them. e.g. >> terminal - once a task reaches a terminal DSQ, the only way that the BPF >> scheduler can affect the task is by triggering re-enqueue (although we don't >> yet support reenqueueing global DSQs). > > I like "terminal DSQ", if there's no objection I'll update the > documentation using this terminology. "Built-in" would also work and avoids introducing new terminology, but it doesn't provide any insight into why these DSQs are special, whereas "terminal" suggests there's some finality to inserting a task there. I'm slightly leaning towards "terminal". Thanks, Kuba ^ permalink raw reply [flat|nested] 81+ messages in thread
* [PATCHSET v2 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics
@ 2026-01-21 12:25 Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2026-01-21 12:25 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the BPF scheduler, either due to
dispatch or a property change, visibility of the task is lost and the
sched_ext core may not invoke ops.dequeue(). This breaks accurate
accounting (i.e., per-DSQ queued runtime sums) and prevents reliable
tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue(), ensuring that every
ops.enqueue() is balanced by a corresponding ops.dequeue() invocation. In
addition, ops.dequeue() is now properly invoked when tasks are removed from
the sched_ext class, such as on task property changes.
To distinguish between a dispatch dequeue and a property change dequeue,
introduce a new dequeue flag: SCX_DEQ_ASYNC (I'm open to suggestions to
find a better name for this flag). BPF schedulers can use this flag to
distinguish between regular dispatch dequeues (SCX_DEQ_ASYNC unset) and
property change dequeues (SCX_DEQ_ASYNC set).
In addition, a kselftest is provided to validate the behavior of
ops.dequeue() in all the different cases.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue() semantics
Documentation/scheduler/sched-ext.rst | 33 ++++
include/linux/sched/ext.h | 11 ++
kernel/sched/ext.c | 63 ++++++-
kernel/sched/ext_internal.h | 6 +
tools/sched_ext/include/scx/enum_defs.autogen.h | 2 +
tools/sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 209 ++++++++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 182 +++++++++++++++++++++
10 files changed, 508 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply [flat|nested] 81+ messages in thread* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi @ 2026-01-21 12:25 ` Andrea Righi 2026-01-21 12:54 ` Christian Loehle 2026-01-22 9:28 ` Kuba Piecuch 0 siblings, 2 replies; 81+ messages in thread From: Andrea Righi @ 2026-01-21 12:25 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change scenarios. As a result, BPF schedulers cannot reliably track task state. In addition, some ops.dequeue() callbacks can be skipped (e.g., during direct dispatch), so ops.enqueue() calls are not always paired with a corresponding ops.dequeue(), potentially breaking accounting logic. Fix this by guaranteeing that every ops.enqueue() is matched with a corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to distinguish dequeues triggered by scheduling property changes from those occurring in the normal dispatch workflow. New semantics: 1. ops.enqueue() is called when a task enters the BPF scheduler 2. ops.dequeue() is called when the task leaves the BPF scheduler, because it is dispatched to a DSQ (regular workflow) 3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF scheduler, because a task property is changed (sched_change) The SCX_DEQ_ASYNC flag allows BPF schedulers to distinguish between a regular dispatch workflow and a task property changes (e.g., sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, CPU migrations, etc.). This allows BPF schedulers to: - reliably track task ownership and lifecycle, - maintain accurate accounting of enqueue/dequeue pairs, - update internal state when tasks change properties. 
Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- Documentation/scheduler/sched-ext.rst | 33 ++++++++++ include/linux/sched/ext.h | 11 ++++ kernel/sched/ext.c | 63 ++++++++++++++++++- kernel/sched/ext_internal.h | 6 ++ .../sched_ext/include/scx/enum_defs.autogen.h | 2 + .../sched_ext/include/scx/enums.autogen.bpf.h | 2 + tools/sched_ext/include/scx/enums.autogen.h | 1 + 7 files changed, 116 insertions(+), 2 deletions(-) diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst index 404fe6126a769..960125c1439ab 100644 --- a/Documentation/scheduler/sched-ext.rst +++ b/Documentation/scheduler/sched-ext.rst @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed. * Queue the task on the BPF side. + Once ``ops.enqueue()`` is called, the task enters the "enqueued state". + The task remains in this state until ``ops.dequeue()`` is called, which + happens in two cases: + + 1. **Regular dispatch workflow**: when the task is successfully + dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()`` + is triggered immediately to notify the BPF scheduler. + + 2. **Scheduling property change**: when a task property changes (via + operations like ``sched_setaffinity()``, ``sched_setscheduler()``, + priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called + with the ``SCX_DEQ_ASYNC`` flag set in ``deq_flags``. + + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, + regardless of whether the task is still on a BPF data structure, or it + has already been dispatched to a DSQ. This guarantees that every + ``ops.enqueue()`` will eventually be followed by a corresponding + ``ops.dequeue()``. 
+ + The ``SCX_DEQ_ASYNC`` flag allows BPF schedulers to distinguish between: + - normal dispatch workflow (task successfully dispatched to a DSQ), + - asynchronous dequeues (``SCX_DEQ_ASYNC``): task property changes that + require the scheduler to update its internal state. + + This makes it reliable for BPF schedulers to track the enqueued state + and maintain accurate accounting. + + BPF schedulers can choose not to implement ``ops.dequeue()`` if they + don't need to track these transitions. The sched_ext core will safely + handle all dequeue operations regardless. + 3. When a CPU is ready to schedule, it first looks at its local DSQ. If empty, it then looks at the global DSQ. If there still isn't a task to run, ``ops.dispatch()`` is invoked which can use the following two @@ -319,6 +350,8 @@ by a sched_ext scheduler: /* Any usable CPU becomes available */ ops.dispatch(); /* Task is moved to a local DSQ */ + + ops.dequeue(); /* Exiting BPF scheduler */ } ops.running(); /* Task starts running on its assigned CPU */ while (task->scx.slice > 0 && task is runnable) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index bcb962d5ee7d8..f3094b4a72a56 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -84,8 +84,19 @@ struct scx_dispatch_q { /* scx_entity.flags */ enum scx_ent_flags { SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ + /* + * Set when ops.enqueue() is called; used to determine if ops.dequeue() + * should be invoked when transitioning out of SCX_OPSS_NONE state. + */ + SCX_TASK_OPS_ENQUEUED = 1 << 1, SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ + /* + * Set when ops.dequeue() is called after successful dispatch; used to + * distinguish dispatch dequeues from async dequeues (property changes) + * and to prevent duplicate dequeue calls. 
+ */ + SCX_TASK_DISPATCH_DEQUEUED = 1 << 4, SCX_TASK_STATE_SHIFT = 8, /* bit 8 and 9 are used to carry scx_task_state */ SCX_TASK_STATE_BITS = 2, diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 809f774183202..ac13115c463d2 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1289,6 +1289,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p, p->scx.ddsp_enq_flags |= enq_flags; + /* + * The task is about to be dispatched. If ops.enqueue() was called, + * notify the BPF scheduler by calling ops.dequeue(). + * + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property + * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that + * the dispatch dequeue has been called to distinguish from + * property change dequeues. + */ + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0); + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; + } + /* * We are in the enqueue path with @rq locked and pinned, and thus can't * double lock a remote rq and enqueue to its local DSQ. For @@ -1393,6 +1407,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); + /* + * Mark that ops.enqueue() is being called for this task. + * Clear the dispatch dequeue flag for the new enqueue cycle. + * Only track these flags if ops.dequeue() is implemented. 
+ */ + if (SCX_HAS_OP(sch, dequeue)) { + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; + p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED; + } + ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; @@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) switch (opss & SCX_OPSS_STATE_MASK) { case SCX_OPSS_NONE: + if (SCX_HAS_OP(sch, dequeue) && + p->scx.flags & SCX_TASK_OPS_ENQUEUED) { + bool is_async_dequeue = + !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)); + + if (is_async_dequeue) + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, + p, deq_flags | SCX_DEQ_ASYNC); + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | + SCX_TASK_DISPATCH_DEQUEUED); + } break; case SCX_OPSS_QUEUEING: /* @@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) */ BUG(); case SCX_OPSS_QUEUED: - if (SCX_HAS_OP(sch, dequeue)) + /* + * Task is in the enqueued state. This is a property change + * dequeue before dispatch completes. Notify the BPF scheduler + * with SCX_DEQ_ASYNC flag. + */ + if (SCX_HAS_OP(sch, dequeue)) { SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, - p, deq_flags); + p, deq_flags | SCX_DEQ_ASYNC); + p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED | + SCX_TASK_DISPATCH_DEQUEUED); + } if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -2113,6 +2156,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq, BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED)); + /* + * The task is about to be dispatched. If ops.enqueue() was called, + * notify the BPF scheduler by calling ops.dequeue(). + * + * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property + * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that + * the dispatch dequeue has been called to distinguish from + * property change dequeues. 
+ */ + if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) { + struct rq *task_rq = task_rq(p); + + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0); + p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED; + } + dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p); if (dsq->id == SCX_DSQ_LOCAL) diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h index 386c677e4c9a0..068c7c2892a16 100644 --- a/kernel/sched/ext_internal.h +++ b/kernel/sched/ext_internal.h @@ -982,6 +982,12 @@ enum scx_deq_flags { * it hasn't been dispatched yet. Dequeue from the BPF side. */ SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32, + + /* + * The task is being dequeued due to an asynchronous event (e.g., + * property change via sched_setaffinity(), priority change, etc.). + */ + SCX_DEQ_ASYNC = 1LLU << 33, }; enum scx_pick_idle_cpu_flags { diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h index c2c33df9292c2..17d8f4324b856 100644 --- a/tools/sched_ext/include/scx/enum_defs.autogen.h +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h @@ -21,6 +21,7 @@ #define HAVE_SCX_CPU_PREEMPT_UNKNOWN #define HAVE_SCX_DEQ_SLEEP #define HAVE_SCX_DEQ_CORE_SCHED_EXEC +#define HAVE_SCX_DEQ_ASYNC #define HAVE_SCX_DSQ_FLAG_BUILTIN #define HAVE_SCX_DSQ_FLAG_LOCAL_ON #define HAVE_SCX_DSQ_INVALID @@ -48,6 +49,7 @@ #define HAVE_SCX_TASK_QUEUED #define HAVE_SCX_TASK_RESET_RUNNABLE_AT #define HAVE_SCX_TASK_DEQD_FOR_SLEEP +#define HAVE_SCX_TASK_DISPATCH_DEQUEUED #define HAVE_SCX_TASK_STATE_SHIFT #define HAVE_SCX_TASK_STATE_BITS #define HAVE_SCX_TASK_STATE_MASK diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h index 2f8002bcc19ad..b3ecd6783d1e5 100644 --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak; const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak; 
#define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ +const volatile u64 __SCX_DEQ_ASYNC __weak; +#define SCX_DEQ_ASYNC __SCX_DEQ_ASYNC diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h index fedec938584be..89359ab65cd3c 100644 --- a/tools/sched_ext/include/scx/enums.autogen.h +++ b/tools/sched_ext/include/scx/enums.autogen.h @@ -46,4 +46,5 @@ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \ SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \ + SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_ASYNC); \ } while (0) -- 2.52.0 ^ permalink raw reply related [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
@ 2026-01-21 12:54   ` Christian Loehle
  2026-01-21 12:57     ` Andrea Righi
  2026-01-22  9:28   ` Kuba Piecuch
  1 sibling, 1 reply; 81+ messages in thread
From: Christian Loehle @ 2026-01-21 12:54 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On 1/21/26 12:25, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> distinguish dequeues triggered by scheduling property changes from those
> occurring in the normal dispatch workflow.
>
> New semantics:
>  1. ops.enqueue() is called when a task enters the BPF scheduler
>  2. ops.dequeue() is called when the task leaves the BPF scheduler,
>     because it is dispatched to a DSQ (regular workflow)
>  3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
>     scheduler, because a task property is changed (sched_change)
>
> The SCX_DEQ_ASYNC flag allows BPF schedulers to distinguish between a
> regular dispatch workflow and a task property changes (e.g.,
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, CPU migrations, etc.).
>
> This allows BPF schedulers to:
>  - reliably track task ownership and lifecycle,
>  - maintain accurate accounting of enqueue/dequeue pairs,
>  - update internal state when tasks change properties.
> [snip]

Cool, so with this patch I should be able to fix my scx_storm BPF
scheduler doing local inserts, as long as I track all the task's status
that are not in a DSQ?
https://github.com/cloehle/scx/commit/25ea91d8f7fea1f31cf426561b432180fb9cf76a
mentioned in
https://github.com/sched-ext/scx/issues/2825

Let me give that a go and report back!

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:54 ` Christian Loehle
@ 2026-01-21 12:57   ` Andrea Righi
  0 siblings, 0 replies; 81+ messages in thread
From: Andrea Righi @ 2026-01-21 12:57 UTC (permalink / raw)
To: Christian Loehle
Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Daniel Hodges, sched-ext, linux-kernel

On Wed, Jan 21, 2026 at 12:54:42PM +0000, Christian Loehle wrote:
...
> > This allows BPF schedulers to:
> >  - reliably track task ownership and lifecycle,
> >  - maintain accurate accounting of enqueue/dequeue pairs,
> >  - update internal state when tasks change properties.
> > [snip]
>
> Cool, so with this patch I should be able to fix my scx_storm BPF
> scheduler doing local inserts, as long as I track all the task's status
> that are not in a DSQ?
> https://github.com/cloehle/scx/commit/25ea91d8f7fea1f31cf426561b432180fb9cf76a
> mentioned in
> https://github.com/sched-ext/scx/issues/2825

In theory, yes...

>
> Let me give that a go and report back!

Let me know how it goes.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
  2026-01-21 12:54   ` Christian Loehle
@ 2026-01-22  9:28   ` Kuba Piecuch
  2026-01-23 13:32     ` Andrea Righi
  1 sibling, 1 reply; 81+ messages in thread
From: Kuba Piecuch @ 2026-01-22 9:28 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

[Resending with reply-all, messed up the first time, apologies.]

Hi Andrea,

On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> distinguish dequeues triggered by scheduling property changes from those
> occurring in the normal dispatch workflow.
>
> New semantics:
>  1. ops.enqueue() is called when a task enters the BPF scheduler
>  2. ops.dequeue() is called when the task leaves the BPF scheduler,
>     because it is dispatched to a DSQ (regular workflow)
>  3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
>     scheduler, because a task property is changed (sched_change)

What about the case where ops.dequeue() is called due to core-sched picking
the task through sched_core_find()? If I understand core-sched correctly, it
can happen without prior dispatch, so it doesn't fit case 2, and we're not
changing task properties, so it doesn't fit case 3 either.

> +	/*
> +	 * Set when ops.dequeue() is called after successful dispatch; used to
> +	 * distinguish dispatch dequeues from async dequeues (property changes)
> +	 * and to prevent duplicate dequeue calls.
> +	 */
> +	SCX_TASK_DISPATCH_DEQUEUED	= 1 << 4,

I see this flag being set and cleared in several places, but I don't see it
actually being read, is that intentional?

> @@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			bool is_async_dequeue =
> +				!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
> +
> +			if (is_async_dequeue)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +						 p, deq_flags | SCX_DEQ_ASYNC);
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> +		/*
> +		 * Task is in the enqueued state. This is a property change
> +		 * dequeue before dispatch completes. Notify the BPF scheduler
> +		 * with SCX_DEQ_ASYNC flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
>  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +					 p, deq_flags | SCX_DEQ_ASYNC);
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}

A core-sched pick of a task queued on the BPF scheduler will result in
entering the SCX_OPSS_QUEUED case, which in turn will call
ops.dequeue(SCX_DEQ_ASYNC). This seems to conflict with the is_async_dequeue
check above, which treats SCX_DEQ_CORE_SCHED_EXEC as a synchronous dequeue.

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-22  9:28 ` Kuba Piecuch
@ 2026-01-23 13:32   ` Andrea Righi
  0 siblings, 0 replies; 81+ messages in thread
From: Andrea Righi @ 2026-01-23 13:32 UTC (permalink / raw)
To: Kuba Piecuch
Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Daniel Hodges, sched-ext, linux-kernel

On Thu, Jan 22, 2026 at 09:28:39AM +0000, Kuba Piecuch wrote:
> [Resending with reply-all, messed up the first time, apologies.]

Re-sending my reply as well, just for the record. :)

>
> Hi Andrea,
>
> On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change scenarios. As a result, BPF schedulers
> > cannot reliably track task state.
> >
> > In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> > direct dispatch), so ops.enqueue() calls are not always paired with a
> > corresponding ops.dequeue(), potentially breaking accounting logic.
> >
> > Fix this by guaranteeing that every ops.enqueue() is matched with a
> > corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> > distinguish dequeues triggered by scheduling property changes from those
> > occurring in the normal dispatch workflow.
> >
> > New semantics:
> >  1. ops.enqueue() is called when a task enters the BPF scheduler
> >  2. ops.dequeue() is called when the task leaves the BPF scheduler,
> >     because it is dispatched to a DSQ (regular workflow)
> >  3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
> >     scheduler, because a task property is changed (sched_change)
>
> What about the case where ops.dequeue() is called due to core-sched picking
> the task through sched_core_find()? If I understand core-sched correctly, it
> can happen without prior dispatch, so it doesn't fit case 2, and we're not
> changing task properties, so it doesn't fit case 3 either.

You're absolutely right, core-sched picks are inconsistently handled:
they're treated as property change dequeues in the SCX_OPSS_QUEUED case
and as dispatch dequeues in SCX_OPSS_NONE. Core-sched picks should be
treated consistently as regular dequeues, since they're not property
changes.

I'll fix this in the next version (adding a SCX_DEQ_CORE_SCHED_EXEC check
in the SCX_OPSS_QUEUED case should make the core-sched case consistent).

> > +	/*
> > +	 * Set when ops.dequeue() is called after successful dispatch; used to
> > +	 * distinguish dispatch dequeues from async dequeues (property changes)
> > +	 * and to prevent duplicate dequeue calls.
> > +	 */
> > +	SCX_TASK_DISPATCH_DEQUEUED	= 1 << 4,
>
> I see this flag being set and cleared in several places, but I don't see it
> actually being read, is that intentional?

And you're right here as well. At some point this was used to distinguish
dispatch dequeues vs async dequeues, but it isn't actually used anymore.
I'll clean this up in the next version.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* [PATCH 0/2] sched_ext: Implement proper ops.dequeue() semantics
@ 2025-12-19 22:43 Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2025-12-19 22:43 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel
Currently, ops.dequeue() is only invoked when tasks are still owned by the
BPF scheduler (i.e., not yet dispatched to any DSQ). However, BPF
schedulers may need to track task ownership transitions reliably.
The issue is that once a task is dispatched, the BPF scheduler loses
visibility of when the task leaves its ownership. This makes it impossible
to maintain accurate accounting (e.g., per-DSQ queued runtime sums) or
properly track task lifecycle events.
This fixes the semantics of ops.dequeue() to ensure that every
ops.enqueue() is properly balanced by a corresponding ops.dequeue() call.
With this, a task is considered "enqueued" from the moment ops.enqueue() is
called until it either:
1. Gets dispatched (moved to a local DSQ for execution) or,
2. Is removed from the scheduler (e.g., blocks, or properties like CPU
affinity or priority are changed)
When either happens, ops.dequeue() is invoked, ensuring reliable 1:1
pairing between enqueue and dequeue operations.
This allows BPF schedulers to reliably track task ownership and maintain
accurate accounting.
Andrea Righi (2):
sched_ext: Fix ops.dequeue() semantics
selftests/sched_ext: Add test to validate ops.dequeue()
Documentation/scheduler/sched-ext.rst | 22 +++
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 27 +++-
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/dequeue.bpf.c | 139 +++++++++++++++++++
tools/testing/selftests/sched_ext/dequeue.c | 172 ++++++++++++++++++++++++
6 files changed, 361 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
@ 2025-12-19 22:43 ` Andrea Righi
  2025-12-28  3:20   ` Emil Tsalapatis
                     ` (3 more replies)
  0 siblings, 4 replies; 81+ messages in thread
From: Andrea Righi @ 2025-12-19 22:43 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Properly implement ops.dequeue() to ensure every ops.enqueue() is
balanced by a corresponding ops.dequeue() call, regardless of whether
the task is on a BPF data structure or already dispatched to a DSQ.

A task is considered enqueued when it is owned by the BPF scheduler.
This ownership persists until the task is either dispatched (moved to a
local DSQ for execution) or removed from the BPF scheduler, such as when
it blocks waiting for an event or when its properties (for example, CPU
affinity or priority) are updated.

When the task enters the BPF scheduler, ops.enqueue() is invoked; when
it leaves BPF scheduler ownership, ops.dequeue() is invoked.

This allows BPF schedulers to reliably track task ownership and maintain
accurate accounting.

Cc: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
 include/linux/sched/ext.h             |  1 +
 kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..3ed4be53f97da 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+     Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
+     is owned by the BPF scheduler. Ownership is retained until the task is
+     either dispatched (moved to a local DSQ for execution) or dequeued
+     (removed from the scheduler due to a blocking event, or to modify a
+     property, like CPU affinity, priority, etc.). When the task leaves the
+     BPF scheduler ``ops.dequeue()`` is invoked.
+
+     **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+     regardless of whether the task is still on a BPF data structure, or it
+     is already dispatched to a DSQ (global, local, or user DSQ)
+
+     This guarantees that every ``ops.enqueue()`` will eventually be followed
+     by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
+     track task ownership and maintain accurate accounting, such as per-DSQ
+     queued runtime sums.
+
+     BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+     don't need to track these transitions. The sched_ext core will safely
+     handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +339,8 @@ by a sched_ext scheduler:
             /* Any usable CPU becomes available */
 
             ops.dispatch();     /* Task is moved to a local DSQ */
+
+            ops.dequeue();      /* Exiting BPF scheduler */
         }
         ops.running();          /* Task starts running on its assigned CPU */
         while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..334c3692a9c62 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED			= 1 << 0, /* on ext runqueue */
+	SCX_TASK_OPS_ENQUEUED		= 1 << 1, /* ops.enqueue() was called */
 	SCX_TASK_RESET_RUNNABLE_AT	= 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP		= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 94164f2dec6dc..985d75d374385 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
+	/* Mark that ops.enqueue() is being called for this task */
+	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
@@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not currently being enqueued or queued on the BPF
+		 * scheduler. Check if ops.enqueue() was called for this task.
+		 */
+		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
+		    SCX_HAS_OP(sch, dequeue)) {
+			/*
+			 * ops.enqueue() was called and the task was dispatched.
+			 * Call ops.dequeue() to notify the BPF scheduler that
+			 * the task is leaving.
+			 */
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+					 p, deq_flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
+		/*
+		 * Task is owned by the BPF scheduler. Call ops.dequeue()
+		 * to notify the BPF scheduler that the task is being
+		 * removed.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
 			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
 					 p, deq_flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
-- 
2.52.0

^ permalink raw reply related	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
@ 2025-12-28  3:20   ` Emil Tsalapatis
  2025-12-29 16:36     ` Andrea Righi
  2025-12-28 17:19   ` Tejun Heo
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 81+ messages in thread
From: Emil Tsalapatis @ 2025-12-28 3:20 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
Cc: Daniel Hodges, sched-ext, linux-kernel

On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> Properly implement ops.dequeue() to ensure every ops.enqueue() is
> balanced by a corresponding ops.dequeue() call, regardless of whether
> the task is on a BPF data structure or already dispatched to a DSQ.
>
> A task is considered enqueued when it is owned by the BPF scheduler.
> This ownership persists until the task is either dispatched (moved to a
> local DSQ for execution) or removed from the BPF scheduler, such as when
> it blocks waiting for an event or when its properties (for example, CPU
> affinity or priority) are updated.
>
> When the task enters the BPF scheduler ops.enqueue() is invoked, when it
> leaves BPF scheduler ownership, ops.dequeue() is invoked.
>
> This allows BPF schedulers to reliably track task ownership and maintain
> accurate accounting.
>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---

Hi Andrea,

This change looks reasonable to me. Some comments inline:

>  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
>  include/linux/sched/ext.h             |  1 +
>  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
>  3 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..3ed4be53f97da 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
>
>     * Queue the task on the BPF side.
>
> +     Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> +     is owned by the BPF scheduler. Ownership is retained until the task is
> +     either dispatched (moved to a local DSQ for execution) or dequeued
> +     (removed from the scheduler due to a blocking event, or to modify a
> +     property, like CPU affinity, priority, etc.). When the task leaves the
> +     BPF scheduler ``ops.dequeue()`` is invoked.
> +

Can we say "leaves the scx class" instead? On direct dispatch we
technically never insert the task into the BPF scheduler.

> +     **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +     regardless of whether the task is still on a BPF data structure, or it
> +     is already dispatched to a DSQ (global, local, or user DSQ)
> +
> +     This guarantees that every ``ops.enqueue()`` will eventually be followed
> +     by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> +     track task ownership and maintain accurate accounting, such as per-DSQ
> +     queued runtime sums.
> +
> +     BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> +     don't need to track these transitions. The sched_ext core will safely
> +     handle all dequeue operations regardless.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +339,8 @@ by a sched_ext scheduler:
>              /* Any usable CPU becomes available */
>
>              ops.dispatch();     /* Task is moved to a local DSQ */
> +
> +            ops.dequeue();      /* Exiting BPF scheduler */
>          }
>          ops.running();          /* Task starts running on its assigned CPU */
>          while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..334c3692a9c62 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED			= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED		= 1 << 1, /* ops.enqueue() was called */

Can we rename this flag? For direct dispatch we never got enqueued.
Something like "DEQ_ON_DISPATCH" would show the purpose of the
flag more clearly.

>  	SCX_TASK_RESET_RUNNABLE_AT	= 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP		= 1 << 3, /* last dequeue was for SLEEP */
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 94164f2dec6dc..985d75d374385 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>
> +	/* Mark that ops.enqueue() is being called for this task */
> +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +

Can we avoid setting this flag when we have no .dequeue() method?
Otherwise it stays set forever AFAICT, even after the task has been
sent to the runqueues.

>  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>  	WARN_ON_ONCE(*ddsp_taskp);
>  	*ddsp_taskp = p;
> @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not currently being enqueued or queued on the BPF
> +		 * scheduler. Check if ops.enqueue() was called for this task.
> +		 */
> +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> +		    SCX_HAS_OP(sch, dequeue)) {
> +			/*
> +			 * ops.enqueue() was called and the task was dispatched.
> +			 * Call ops.dequeue() to notify the BPF scheduler that
> +			 * the task is leaving.
> +			 */
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +					 p, deq_flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> +		/*
> +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
> +		 * to notify the BPF scheduler that the task is being
> +		 * removed.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {

Edge case, but if we have a .dequeue() method but not an .enqueue() we
still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an
extra condition to be consistent with the SCX_OPSS_NONE case above?

>  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>  					 p, deq_flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-28  3:20 ` Emil Tsalapatis
@ 2025-12-29 16:36   ` Andrea Righi
  2025-12-29 18:35     ` Emil Tsalapatis
  0 siblings, 1 reply; 81+ messages in thread
From: Andrea Righi @ 2025-12-29 16:36 UTC (permalink / raw)
To: Emil Tsalapatis
Cc: Tejun Heo, David Vernet, Changwoo Min, Daniel Hodges, sched-ext,
	linux-kernel

Hi Emil,

On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote:
> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> > Properly implement ops.dequeue() to ensure every ops.enqueue() is
> > balanced by a corresponding ops.dequeue() call, regardless of whether
> > the task is on a BPF data structure or already dispatched to a DSQ.
> >
> > A task is considered enqueued when it is owned by the BPF scheduler.
> > This ownership persists until the task is either dispatched (moved to a
> > local DSQ for execution) or removed from the BPF scheduler, such as when
> > it blocks waiting for an event or when its properties (for example, CPU
> > affinity or priority) are updated.
> >
> > When the task enters the BPF scheduler ops.enqueue() is invoked, when it
> > leaves BPF scheduler ownership, ops.dequeue() is invoked.
> >
> > This allows BPF schedulers to reliably track task ownership and maintain
> > accurate accounting.
> >
> > Cc: Emil Tsalapatis <emil@etsalapatis.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
>
> Hi Andrea,
>
> This change looks reasonable to me. Some comments inline:
>
> >  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
> >  include/linux/sched/ext.h             |  1 +
> >  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
> >  3 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..3ed4be53f97da 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
> >
> >     * Queue the task on the BPF side.
> >
> > +     Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> > +     is owned by the BPF scheduler. Ownership is retained until the task is
> > +     either dispatched (moved to a local DSQ for execution) or dequeued
> > +     (removed from the scheduler due to a blocking event, or to modify a
> > +     property, like CPU affinity, priority, etc.). When the task leaves the
> > +     BPF scheduler ``ops.dequeue()`` is invoked.
> > +
>
> Can we say "leaves the scx class" instead? On direct dispatch we
> technically never insert the task into the BPF scheduler.

Hm.. I agree that'd be more accurate, but it might also be slightly
misleading, as it could be interpreted as the task being moved to a
different scheduling class. How about saying "leaves the enqueued state"
instead, where enqueued means ops.enqueue() being called... I can't find a
better name for this state, like "ops_enqueued", but that'd be even more
confusing. :)

>
> > +     **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> > +     regardless of whether the task is still on a BPF data structure, or it
> > +     is already dispatched to a DSQ (global, local, or user DSQ)
> > +
> > +     This guarantees that every ``ops.enqueue()`` will eventually be followed
> > +     by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> > +     track task ownership and maintain accurate accounting, such as per-DSQ
> > +     queued runtime sums.
> > +
> > +     BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> > +     don't need to track these transitions. The sched_ext core will safely
> > +     handle all dequeue operations regardless.
> > +
> >  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> >     empty, it then looks at the global DSQ. If there still isn't a task to
> >     run, ``ops.dispatch()`` is invoked which can use the following two
> > @@ -319,6 +339,8 @@ by a sched_ext scheduler:
> >              /* Any usable CPU becomes available */
> >
> >              ops.dispatch();     /* Task is moved to a local DSQ */
> > +
> > +            ops.dequeue();      /* Exiting BPF scheduler */
> >          }
> >          ops.running();          /* Task starts running on its assigned CPU */
> >          while (task->scx.slice > 0 && task is runnable)
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..334c3692a9c62 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED			= 1 << 0, /* on ext runqueue */
> >  	SCX_TASK_OPS_ENQUEUED		= 1 << 1, /* ops.enqueue() was called */
>
> Can we rename this flag? For direct dispatch we never got enqueued.
> Something like "DEQ_ON_DISPATCH" would show the purpose of the
> flag more clearly.

Good point. However, ops.dequeue() isn't only called on dispatch, it can
also be triggered when a task property is changed. So the flag should
represent the "enqueued state" in the sense that ops.enqueue() has been
called and a corresponding ops.dequeue() is expected. This is a lifecycle
state, not an indication that the task is in any queue.

Would a more descriptive comment clarify this? Something like:

	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* Task in enqueued state:
						 ops.enqueue() called,
						 ops.dequeue() will be called
						 when task leaves this state. */

>
> >  	SCX_TASK_RESET_RUNNABLE_AT	= 1 << 2, /* runnable_at should be reset */
> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 94164f2dec6dc..985d75d374385 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> >  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> >  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >
> > +	/* Mark that ops.enqueue() is being called for this task */
> > +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +
>
> Can we avoid setting this flag when we have no .dequeue() method?
> Otherwise it stays set forever AFAICT, even after the task has been
> sent to the runqueues.

Good catch! Definitely we don't need to set this for schedulers that
don't implement ops.dequeue().

>
> >  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> >  	WARN_ON_ONCE(*ddsp_taskp);
> >  	*ddsp_taskp = p;
> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * Task is not currently being enqueued or queued on the BPF
> > +		 * scheduler. Check if ops.enqueue() was called for this task.
> > +		 */
> > +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> > +		    SCX_HAS_OP(sch, dequeue)) {
> > +			/*
> > +			 * ops.enqueue() was called and the task was dispatched.
> > +			 * Call ops.dequeue() to notify the BPF scheduler that
> > +			 * the task is leaving.
> > +			 */
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > +					 p, deq_flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  		 */
> >  		BUG();
> >  	case SCX_OPSS_QUEUED:
> > -		if (SCX_HAS_OP(sch, dequeue))
> > +		/*
> > +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
> > +		 * to notify the BPF scheduler that the task is being
> > +		 * removed.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue)) {
>
> Edge case, but if we have a .dequeue() method but not an .enqueue() we
> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an
> extra condition to be consistent with the SCX_OPSS_NONE case above?

Also good catch. Will add that.

>
> >  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> >  					 p, deq_flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >
> >  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> >  					    SCX_OPSS_NONE))

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-29 16:36 ` Andrea Righi @ 2025-12-29 18:35 ` Emil Tsalapatis 0 siblings, 0 replies; 81+ messages in thread From: Emil Tsalapatis @ 2025-12-29 18:35 UTC (permalink / raw) To: Andrea Righi Cc: Tejun Heo, David Vernet, Changwoo Min, Daniel Hodges, sched-ext, linux-kernel On Mon Dec 29, 2025 at 11:36 AM EST, Andrea Righi wrote: > Hi Emil, > > On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote: >> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote: >> > Properly implement ops.dequeue() to ensure every ops.enqueue() is >> > balanced by a corresponding ops.dequeue() call, regardless of whether >> > the task is on a BPF data structure or already dispatched to a DSQ. >> > >> > A task is considered enqueued when it is owned by the BPF scheduler. >> > This ownership persists until the task is either dispatched (moved to a >> > local DSQ for execution) or removed from the BPF scheduler, such as when >> > it blocks waiting for an event or when its properties (for example, CPU >> > affinity or priority) are updated. >> > >> > When the task enters the BPF scheduler ops.enqueue() is invoked, when it >> > leaves BPF scheduler ownership, ops.dequeue() is invoked. >> > >> > This allows BPF schedulers to reliably track task ownership and maintain >> > accurate accounting. >> > >> > Cc: Emil Tsalapatis <emil@etsalapatis.com> >> > Signed-off-by: Andrea Righi <arighi@nvidia.com> >> > --- >> >> >> Hi Andrea, >> >> This change looks reasonable to me. 
Some comments inline: >> >> > Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++ >> > include/linux/sched/ext.h | 1 + >> > kernel/sched/ext.c | 27 ++++++++++++++++++++++++++- >> > 3 files changed, 49 insertions(+), 1 deletion(-) >> > >> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst >> > index 404fe6126a769..3ed4be53f97da 100644 >> > --- a/Documentation/scheduler/sched-ext.rst >> > +++ b/Documentation/scheduler/sched-ext.rst >> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed. >> > >> > * Queue the task on the BPF side. >> > >> > + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and >> > + is owned by the BPF scheduler. Ownership is retained until the task is >> > + either dispatched (moved to a local DSQ for execution) or dequeued >> > + (removed from the scheduler due to a blocking event, or to modify a >> > + property, like CPU affinity, priority, etc.). When the task leaves the >> > + BPF scheduler ``ops.dequeue()`` is invoked. >> > + >> >> Can we say "leaves the scx class" instead? On direct dispatch we >> technically never insert the task into the BPF scheduler. > > Hm.. I agree that'd be more accurate, but it might also be slightly > misleading, as it could be interpreted as the task being moved to a > different scheduling class. How about saying "leaves the enqueued state" > instead, where enqueued means ops.enqueue() being called... I can't find a > better name for this state, like "ops_enqueued", but that'd be even more > confusing. :) > I like "leaves the enqueued state", it implies that the task has no state in the scx scheduler.
>> >> > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, >> > + regardless of whether the task is still on a BPF data structure, or it >> > + is already dispatched to a DSQ (global, local, or user DSQ) >> > + >> > + This guarantees that every ``ops.enqueue()`` will eventually be followed >> > + by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to >> > + track task ownership and maintain accurate accounting, such as per-DSQ >> > + queued runtime sums. >> > + >> > + BPF schedulers can choose not to implement ``ops.dequeue()`` if they >> > + don't need to track these transitions. The sched_ext core will safely >> > + handle all dequeue operations regardless. >> > + >> > 3. When a CPU is ready to schedule, it first looks at its local DSQ. If >> > empty, it then looks at the global DSQ. If there still isn't a task to >> > run, ``ops.dispatch()`` is invoked which can use the following two >> > @@ -319,6 +339,8 @@ by a sched_ext scheduler: >> > /* Any usable CPU becomes available */ >> > >> > ops.dispatch(); /* Task is moved to a local DSQ */ >> > + >> > + ops.dequeue(); /* Exiting BPF scheduler */ >> > } >> > ops.running(); /* Task starts running on its assigned CPU */ >> > while (task->scx.slice > 0 && task is runnable) >> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h >> > index bcb962d5ee7d8..334c3692a9c62 100644 >> > --- a/include/linux/sched/ext.h >> > +++ b/include/linux/sched/ext.h >> > @@ -84,6 +84,7 @@ struct scx_dispatch_q { >> > /* scx_entity.flags */ >> > enum scx_ent_flags { >> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */ >> > + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* ops.enqueue() was called */ >> >> Can we rename this flag? For direct dispatch we never got enqueued. >> Something like "DEQ_ON_DISPATCH" would show the purpose of the >> flag more clearly. > > Good point. However, ops.dequeue() isn't only called on dispatch, it can > also be triggered when a task property is changed. 
> > So the flag should represent the "enqueued state" in the sense that > ops.enqueue() has been called and a corresponding ops.dequeue() is > expected. This is a lifecycle state, not an indication that the task is in > any queue. > > Would a more descriptive comment clarify this? Something like: > > SCX_TASK_OPS_ENQUEUED = 1 << 1, /* Task in enqueued state: ops.enqueue() > called, ops.dequeue() will be called > when task leaves this state. */ > That makes sense, my reasoning was that what we actually use the flag for is not whether the task is enqueued, but rather whether we need to call the dequeue callback when dequeueing from the SCX_OPSS_NONE state. Can the comment maybe more concretely explain this? As an aside, I think this change makes it so we can remove the _OPSS_ state machine with some more refactoring. >> >> > SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */ >> > SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */ >> > >> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c >> > index 94164f2dec6dc..985d75d374385 100644 >> > --- a/kernel/sched/ext.c >> > +++ b/kernel/sched/ext.c >> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, >> > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); >> > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); >> > >> > + /* Mark that ops.enqueue() is being called for this task */ >> > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; >> > + >> >> Can we avoid setting this flag when we have no .dequeue() method? >> Otherwise it stays set forever AFAICT, even after the task has been >> sent to the runqueues. > > Good catch! Definitely we don't need to set this for schedulers that don't > implement ops.dequeue().
> >> >> > ddsp_taskp = this_cpu_ptr(&direct_dispatch_task); >> > WARN_ON_ONCE(*ddsp_taskp); >> > *ddsp_taskp = p; >> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) >> > >> > switch (opss & SCX_OPSS_STATE_MASK) { >> > case SCX_OPSS_NONE: >> > + /* >> > + * Task is not currently being enqueued or queued on the BPF >> > + * scheduler. Check if ops.enqueue() was called for this task. >> > + */ >> > + if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) && >> > + SCX_HAS_OP(sch, dequeue)) { >> > + /* >> > + * ops.enqueue() was called and the task was dispatched. >> > + * Call ops.dequeue() to notify the BPF scheduler that >> > + * the task is leaving. >> > + */ >> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, >> > + p, deq_flags); >> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; >> > + } >> > break; >> > case SCX_OPSS_QUEUEING: >> > /* >> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) >> > */ >> > BUG(); >> > case SCX_OPSS_QUEUED: >> > - if (SCX_HAS_OP(sch, dequeue)) >> > + /* >> > + * Task is owned by the BPF scheduler. Call ops.dequeue() >> > + * to notify the BPF scheduler that the task is being >> > + * removed. >> > + */ >> > + if (SCX_HAS_OP(sch, dequeue)) { >> >> Edge case, but if we have a .dequeue() method but not an .enqueue() we >> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an >> extra condition to be consistent with the SCX_OPSS_NONE case above? > > Also good catch. Will add that. > >> >> > SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, >> > p, deq_flags); >> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED; >> > + } >> > >> > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, >> > SCX_OPSS_NONE)) >> > > Thanks, > -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi 2025-12-28 3:20 ` Emil Tsalapatis @ 2025-12-28 17:19 ` Tejun Heo 2025-12-28 23:28 ` Tejun Heo 2025-12-28 23:42 ` Tejun Heo 2025-12-29 0:06 ` Tejun Heo 3 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2025-12-28 17:19 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, Andrea. On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote: ... > + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and > + is owned by the BPF scheduler. Ownership is retained until the task is > + either dispatched (moved to a local DSQ for execution) or dequeued > + (removed from the scheduler due to a blocking event, or to modify a > + property, like CPU affinity, priority, etc.). When the task leaves the > + BPF scheduler ``ops.dequeue()`` is invoked. > + > + **Important**: ``ops.dequeue()`` is called for *any* enqueued task, > + regardless of whether the task is still on a BPF data structure, or it > + is already dispatched to a DSQ (global, local, or user DSQ) > + > + This guarantees that every ``ops.enqueue()`` will eventually be followed > + by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to > + track task ownership and maintain accurate accounting, such as per-DSQ > + queued runtime sums. While this works, from the BPF sched's POV, there's no way to tell whether an ops.dequeue() call is from the task being actually dequeued or the follow-up to the dispatch operation it just did. This won't make much difference if ops.dequeue() is just used for accounting purposes, but, a scheduler which uses an arena data structure for queueing would likely need to perform extra tests to tell whether the task needs to be dequeued from the arena side. 
I *think* hot path (ops.dequeue() following the task's dispatch) can be a simple lockless test, so this may be okay, but from API POV, it can probably be better. The counter interlocking point is scx_bpf_dsq_insert(). If we can synchronize scx_bpf_dsq_insert() and dequeue so that ops.dequeue() is not called for a successfully inserted task, I think the semantics would be neater - an enqueued task is either dispatched or dequeued. Due to the async dispatch operation, this likely is difficult to do without adding extra sync operations in scx_bpf_dsq_insert(). However, I *think* we may be able to get rid of dspc and async inserting if we call ops.dispatch() w/ rq lock dropped. That may make the whole dispatch path simpler and the behavior neater too. What do you think? Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-28 17:19 ` Tejun Heo @ 2025-12-28 23:28 ` Tejun Heo 2025-12-28 23:38 ` Tejun Heo 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2025-12-28 23:28 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, On Sun, Dec 28, 2025 at 07:19:46AM -1000, Tejun Heo wrote: > While this works, from the BPF sched's POV, there's no way to tell whether > an ops.dequeue() call is from the task being actually dequeued or the > follow-up to the dispatch operation it just did. This won't make much > difference if ops.dequeue() is just used for accounting purposes, but, a > scheduler which uses an arena data structure for queueing would likely need > to perform extra tests to tell whether the task needs to be dequeued from > the arena side. I *think* hot path (ops.dequeue() following the task's > dispatch) can be a simple lockless test, so this may be okay, but from API > POV, it can probably be better. > > The counter interlocking point is scx_bpf_dsq_insert(). If we can > synchronize scx_bpf_dsq_insert() and dequeue so that ops.dequeue() is not > called for a successfully inserted task, I think the semantics would be > neater - an enqueued task is either dispatched or dequeued. Due to the async > dispatch operation, this likely is difficult to do without adding extra sync > operations in scx_bpf_dsq_insert(). However, I *think* we may be able to get > rid of dspc and async inserting if we call ops.dispatch() w/ rq lock > dropped. That may make the whole dispatch path simpler and the behavior > neater too. What do you think? I sat down and went through the code to see whether I was actually making sense, and I wasn't: The async dispatch buffering is necessary to avoid lock inversion between rq lock and whatever locks the BPF scheduler might be using internally. This is necessary because enqueue path runs with rq lock held. 
Thus, any lock that BPF sched uses in the enqueue path has to nest inside rq lock. In dispatch, scx_bpf_dsq_insert() is likely to be called with the same BPF sched side lock held. If we try to do rq lock dancing synchronously, we can end up trying to grab rq lock while holding BPF side lock leading to deadlock. Kernel side has no control over BPF side locking, so the asynchronous operation is there to side-step the issue. I don't see a good way to make this synchronous. So, please ignore that part. That's non-sense. I still wonder whether we can create some interlocking between scx_bpf_dsq_insert() and ops.dequeue() without making hot path slower. I'll think more about it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-28 23:28 ` Tejun Heo @ 2025-12-28 23:38 ` Tejun Heo 2025-12-29 17:07 ` Andrea Righi 0 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2025-12-28 23:38 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello again, again. On Sun, Dec 28, 2025 at 01:28:04PM -1000, Tejun Heo wrote: ... > So, please ignore that part. That's non-sense. I still wonder whether we can > create some interlocking between scx_bpf_dsq_insert() and ops.dequeue() > without making hot path slower. I'll think more about it. And we can't create an interlocking between scx_bpf_dsq_insert() and ops.dequeue() without adding extra atomic operations in hot paths. The only thing shared is task rq lock and dispatch path can't do that synchronously. So, yeah, it looks like the best we can do is always letting the BPF sched know and let it figure out locking and whether the task needs to be dequeued from BPF side. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-28 23:38 ` Tejun Heo @ 2025-12-29 17:07 ` Andrea Righi 2025-12-29 18:55 ` Emil Tsalapatis 0 siblings, 1 reply; 81+ messages in thread From: Andrea Righi @ 2025-12-29 17:07 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hi Tejun, On Sun, Dec 28, 2025 at 01:38:01PM -1000, Tejun Heo wrote: > Hello again, again. > > On Sun, Dec 28, 2025 at 01:28:04PM -1000, Tejun Heo wrote: > ... > > So, please ignore that part. That's non-sense. I still wonder whether we can > > create some interlocking between scx_bpf_dsq_insert() and ops.dequeue() > > without making hot path slower. I'll think more about it. > > And we can't create an interlocking between scx_bpf_dsq_insert() and > ops.dequeue() without adding extra atomic operations in hot paths. The only > thing shared is task rq lock and dispatch path can't do that synchronously. > So, yeah, it looks like the best we can do is always letting the BPF sched > know and let it figure out locking and whether the task needs to be > dequeued from BPF side. How about setting a flag in deq_flags to distinguish between a "dispatch" dequeue vs a real dequeue (due to property changes or other reasons)? We should be able to pass this information in a reliable way without any additional synchronization in the hot paths. This would let schedulers that use arena data structures check the flag instead of doing their own internal lookups. And it would also allow us to provide both semantics: 1) Catch real dequeues that need special BPF-side actions (check the flag) 2) Track all ops.enqueue()/ops.dequeue() pairs for accounting purposes (ignore the flag) Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-29 17:07 ` Andrea Righi @ 2025-12-29 18:55 ` Emil Tsalapatis 0 siblings, 0 replies; 81+ messages in thread From: Emil Tsalapatis @ 2025-12-29 18:55 UTC (permalink / raw) To: Andrea Righi, Tejun Heo Cc: David Vernet, Changwoo Min, Daniel Hodges, sched-ext, linux-kernel On Mon Dec 29, 2025 at 12:07 PM EST, Andrea Righi wrote: > Hi Tejun, > > On Sun, Dec 28, 2025 at 01:38:01PM -1000, Tejun Heo wrote: >> Hello again, again. >> >> On Sun, Dec 28, 2025 at 01:28:04PM -1000, Tejun Heo wrote: >> ... >> > So, please ignore that part. That's non-sense. I still wonder whether we can >> > create some interlocking between scx_bpf_dsq_insert() and ops.dequeue() >> > without making hot path slower. I'll think more about it. >> >> And we can't create an interlocking between scx_bpf_dsq_insert() and >> ops.dequeue() without adding extra atomic operations in hot paths. The only >> thing shared is task rq lock and dispatch path can't do that synchronously. >> So, yeah, it looks like the best we can do is always letting the BPF sched >> know and let it figure out locking and whether the task needs to be >> dequeued from BPF side. > > How about setting a flag in deq_flags to distinguish between a "dispatch" > dequeue vs a real dequeue (due to property changes or other reasons)? > > We should be able to pass this information in a reliable way without any > additional synchronization in the hot paths. This would let schedulers that > use arena data structures check the flag instead of doing their own > internal lookups. 
> > And it would also allow us to provide both semantics: > 1) Catch real dequeues that need special BPF-side actions (check the flag) > 2) Track all ops.enqueue()/ops.dequeue() pairs for accounting purposes > (ignore the flag) > IMO the extra flag suffices for arena-based queueing; the arena data structures have to track the state of the task already: Even without the flag it should be possible to infer the state the task is in from inside the BPF code. For example, calling .dequeue() while the task is not in an arena queue means the task got dequeued _after_ being dispatched, while calling .dequeue() on a queued task means we are removing it because of a true dequeue event (e.g. sched_setaffinity() was called). The only edge case in the logic is if a true dequeue event happens between .dispatch() and .dequeue(), but a new flag would take care of that. > Thanks, > -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi 2025-12-28 3:20 ` Emil Tsalapatis 2025-12-28 17:19 ` Tejun Heo @ 2025-12-28 23:42 ` Tejun Heo 2025-12-29 17:17 ` Andrea Righi 2025-12-29 0:06 ` Tejun Heo 3 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2025-12-28 23:42 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hello, On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote: > + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and > + is owned by the BPF scheduler. Ownership is retained until the task is Can we avoid using "ownership" for this? From user's POV, this is fine but kernel side internally uses the word for different purposes - e.g. we say the BPF side owns the task if the task's SCX_OPSS_QUEUED is set (ie. it's on BPF data structure, not on a DSQ). Here, the ownership encompasses both kernel-side and BPF-side queueing, so the term becomes rather confusing. Maybe we can stick with "queued" or "enqueued"? Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-28 23:42 ` Tejun Heo @ 2025-12-29 17:17 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2025-12-29 17:17 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Hi, On Sun, Dec 28, 2025 at 01:42:28PM -1000, Tejun Heo wrote: > Hello, > > On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote: > > + Once ``ops.enqueue()`` is called, the task is considered "enqueued" and > > + is owned by the BPF scheduler. Ownership is retained until the task is > > Can we avoid using "ownership" for this? From user's POV, this is fine but > kernel side internally uses the word for different purposes - e.g. we say > the BPF side owns the task if the task's SCX_OPSS_QUEUED is set (ie. it's on > BPF data structure, not on a DSQ). Here, the ownership encompasses both > kernel-side and BPF-side queueing, so the term becomes rather confusing. > Maybe we can stick with "queued" or "enqueued"? Agreed. I can't find a better term to describe this phase of the lifecycle, where ops.enqueue() has been called and the task remains in that state until the corresponding ops.dequeue() occurs (either due to a "dispatch" dequeue or "real" dequeue). So maybe we should stick with "enqueued" and clarify exactly what this state means. Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi ` (2 preceding siblings ...) 2025-12-28 23:42 ` Tejun Heo @ 2025-12-29 0:06 ` Tejun Heo 2025-12-29 18:56 ` Andrea Righi 3 siblings, 1 reply; 81+ messages in thread From: Tejun Heo @ 2025-12-29 0:06 UTC (permalink / raw) To: Andrea Righi Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel Sorry about the million replies. Pretty squirrel brained right now. On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote: > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); > > + /* Mark that ops.enqueue() is being called for this task */ > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; Is this guaranteed to be cleared after dispatch? ops_dequeue() is called from dequeue_task_scx() and set_next_task_scx(). It looks like the call from set_next_task_scx() may end up calling ops.dequeue() when the task starts running, this seems mostly accidental. - The BPF sched probably expects ops.dequeue() call immediately after dispatch rather than on the running transition. e.g. imagine a scenario where a BPF sched dispatches multiple tasks to a local DSQ. Wouldn't the expectation be that ops.dequeue() is called as soon as a task is dispatched into a local DSQ? - If this depends on the ops_dequeue() call from set_next_task_scx(), it'd also be using the wrong DEQ flag - SCX_DEQ_CORE_SCHED_EXEC - for regular ops.dequeue() following a dispatch. That call there is that way only because ops_dequeue() didn't do anything when OPSS_NONE. Thanks. -- tejun ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics 2025-12-29 0:06 ` Tejun Heo @ 2025-12-29 18:56 ` Andrea Righi 0 siblings, 0 replies; 81+ messages in thread From: Andrea Righi @ 2025-12-29 18:56 UTC (permalink / raw) To: Tejun Heo Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel On Sun, Dec 28, 2025 at 02:06:19PM -1000, Tejun Heo wrote: > Sorry about the million replies. Pretty squirrel brained right now. > > On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote: > > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > > WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE); > > atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq); > > > > + /* Mark that ops.enqueue() is being called for this task */ > > + p->scx.flags |= SCX_TASK_OPS_ENQUEUED; > > Is this guaranteed to be cleared after dispatch? ops_dequeue() is called > from dequeue_task_scx() and set_next_task_scx(). It looks like the call from > set_next_task_scx() may end up calling ops.dequeue() when the task starts > running, this seems mostly accidental. > > - The BPF sched probably expects ops.dequeue() call immediately after > dispatch rather than on the running transition. e.g. imagine a scenario > where a BPF sched dispatches multiple tasks to a local DSQ. Wouldn't the > expectation be that ops.dequeue() is called as soon as a task is > dispatched into a local DSQ? > > - If this depends on the ops_dequeue() call from set_next_task_scx(), it'd > also be using the wrong DEQ flag - SCX_DEQ_CORE_SCHED_EXEC - for regular > ops.dequeue() following a dispatch. That call there is that way only > because ops_dequeue() didn't do anything when OPSS_NONE. You're right, the flag should be cleared and ops.dequeue() should be called immediately when the async dispatch completes and the task is inserted into the DSQ. I'll add an explicit ops.dequeue() call in the dispatch completion path. 
Thanks, -Andrea ^ permalink raw reply [flat|nested] 81+ messages in thread