public inbox for linux-kernel@vger.kernel.org
* [PATCHSET v10] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-18  8:32 Andrea Righi
From: Andrea Righi @ 2026-02-18  8:32 UTC
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change), the BPF scheduler loses visibility of
the task, and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g., sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property-change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
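
As a rough illustration of the accounting this is meant to enable, here is a
small userspace model in plain C. It is not sched_ext or BPF code: all
model_* names are hypothetical, MODEL_DEQ_SCHED_CHANGE is only a stand-in
for %SCX_DEQ_SCHED_CHANGE (its value here is arbitrary), and the in_custody
field merely imitates the role of SCX_TASK_IN_CUSTODY. The sketch assumes
that a queued-runtime sum is incremented when a task enters custody and
decremented on the single dequeue event that ends custody.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the dequeue flag; NOT the kernel's value. */
#define MODEL_DEQ_SCHED_CHANGE (1ULL << 0)

struct model_task {
	uint64_t runtime;	/* runtime charged while the task is queued */
	bool in_custody;	/* imitates the role of SCX_TASK_IN_CUSTODY */
};

struct model_stats {
	uint64_t queued_runtime;	/* sum over tasks currently in custody */
	unsigned int nr_dispatch_deq;	/* custody ended by a dispatch */
	unsigned int nr_change_deq;	/* custody ended by a property change */
};

/* The task enters the (modelled) BPF scheduler's custody. */
static void model_enqueue(struct model_stats *s, struct model_task *p)
{
	assert(!p->in_custody);
	p->in_custody = true;
	s->queued_runtime += p->runtime;
}

/*
 * Models the dequeue notification. The "exactly once" guarantee, which the
 * core would provide, is folded into this function: a task that already
 * left custody produces no further event.
 */
static void model_dequeue(struct model_stats *s, struct model_task *p,
			  uint64_t deq_flags)
{
	if (!p->in_custody)
		return;
	p->in_custody = false;
	s->queued_runtime -= p->runtime;
	if (deq_flags & MODEL_DEQ_SCHED_CHANGE)
		s->nr_change_deq++;
	else
		s->nr_dispatch_deq++;
}

/* Runs a short scenario and returns the remaining queued runtime. */
static uint64_t model_demo(struct model_stats *s)
{
	struct model_task a = { .runtime = 100 };
	struct model_task b = { .runtime = 250 };

	model_enqueue(s, &a);
	model_enqueue(s, &b);
	model_dequeue(s, &a, 0);			/* dispatched */
	model_dequeue(s, &b, MODEL_DEQ_SCHED_CHANGE);	/* property change */
	model_dequeue(s, &b, 0);			/* ignored: not in custody */
	return s->queued_runtime;
}
```

In this model the sum returns to zero only because every enqueue is paired
with one dequeue; a missed or duplicated event would leave it skewed, which
is the failure mode described above.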

This patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dequeue

Changes in v10:
 - Rely only on p->scx.sticky_cpu to detect in-progress SCX internal
   migrations
 - Set SCX_TASK_IN_CUSTODY only on the non-local path and only for
   non-terminal DSQs
 - Centralize "call ops.dequeue() when leaving custody" in one place,
   call_task_dequeue(), after the task state switch
 - Fix a p->scx.flags race between dispatch_enqueue() and
   dequeue_task_scx()
 - Add proper synchronization in the dequeue kselftest to validate the
   usage of SCX_DSQ_LOCAL_ON from ops.dispatch()
 - Link to v9:
   https://lore.kernel.org/all/20260215191933.2358161-1-arighi@nvidia.com

Changes in v9:
 - Ignore internal SCX migrations (do not notify BPF schedulers for
   internal enqueue/dequeue events)
 - Rely on sticky_cpu to determine when a task is doing an internal
   migration
 - Trigger ops.dequeue() consistently from ops_dequeue() or when directly
   dispatching to terminal DSQs
 - Add preliminary patches to refactor dispatch_enqueue() and mark internal
   migrations using sticky_cpu
 - Link to v8:
   https://lore.kernel.org/all/20260210212813.796548-1-arighi@nvidia.com

Changes in v8:
 - Rename SCX_TASK_NEED_DEQ -> SCX_TASK_IN_CUSTODY and set/clear this flag
   also when ops.dequeue() is not implemented (can be used for other
   purposes in the future)
 - Clarify ops.select_cpu() behavior: dispatch to terminal DSQs doesn't
   trigger ops.dequeue(), dispatch to user DSQs triggers ops.dequeue(),
   store to BPF-internal data structure is discouraged
 - Link to v7:
   https://lore.kernel.org/all/20260206135742.2339918-1-arighi@nvidia.com

Changes in v7:
 - Handle tasks stored to BPF internal data structures (trigger
   ops.dequeue())
 - Add a kselftest scenario with a BPF queue to verify ops.dequeue()
   behavior with tasks stored in internal BPF data structures
 - Link to v6:
   https://lore.kernel.org/all/20260205153304.1996142-1-arighi@nvidia.com

Changes in v6:
 - Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DEQ
 - Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
   DSQs (local, global, bypass)
 - Centralize ops.dequeue() logic in dispatch_enqueue()
 - Remove "Property Change Notifications for Running Tasks" section from
   the documentation
 - The kselftest now validates the right behavior both from ops.enqueue()
   and ops.select_cpu()
 - Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com

Changes in v5:
 - Introduce the concept of "terminal DSQ" (when a task is dispatched to a
   terminal DSQ, the task leaves the BPF scheduler's custody)
 - Consider SCX_DSQ_GLOBAL as a terminal DSQ
 - Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for direct dispatches to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (4):
      sched_ext: Properly mark SCX-internal migrations via sticky_cpu
      sched_ext: Add rq parameter to dispatch_enqueue()
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  78 ++++-
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 164 ++++++++--
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 394 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 274 ++++++++++++++++
 10 files changed, 890 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c


end of thread, other threads:[~2026-02-23 20:11 UTC | newest]

2026-02-18  8:32 [PATCHSET v10] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-18  8:32 ` [PATCH 1/4] sched_ext: Properly mark SCX-internal migrations via sticky_cpu Andrea Righi
2026-02-18  8:32 ` [PATCH 2/4] sched_ext: Add rq parameter to dispatch_enqueue() Andrea Righi
2026-02-18  8:32 ` [PATCH 3/4] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-18  8:32 ` [PATCH 4/4] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-21  2:26   ` Daniel Jordan
2026-02-21 20:02     ` [PATCH v2 " Andrea Righi
2026-02-23 20:11 ` [PATCHSET v10] sched_ext: Fix " Tejun Heo
