public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET v7] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-06 13:54 Andrea Righi
  2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
  2026-02-06 13:54 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 2 replies; 33+ messages in thread
From: Andrea Righi @ 2026-02-06 13:54 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g.  sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property change dequeues a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v7:
 - Handle tasks stored to BPF internal data structures (trigger
   ops.dequeue())
 - Add a new kselftest scenario to verify ops.dequeue() behavior with tasks
   stored to internal BPF data structures
 - Link to v6:
   https://lore.kernel.org/all/20260205153304.1996142-1-arighi@nvidia.com

Changes in v6:
 - Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DSQ
 - Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
   DSQs (local, global, bypass)
 - centralize ops.dequeue() logic in dispatch_enqueue()
 - Remove "Property Change Notifications for Running Tasks" section from
   the documentation
 - The kselftest now validates the right behavior both from ops.enqueue()
   and ops.select_cpu()
 - Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com

Changes in v5:
 - Introduce the concept of "terminal DSQ" (when a task is dispatched to a
   terminal DSQ, the task leaves the BPF scheduler's custody)
 - Consider SCX_DSQ_GLOBAL as a terminal DSQ
 - Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for direct dispatches to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  58 ++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 157 ++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 403 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 258 +++++++++++++++
 10 files changed, 875 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-10 21:26 Andrea Righi
  2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 1 reply; 33+ messages in thread
From: Andrea Righi @ 2026-02-10 21:26 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g., sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property change dequeues a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v8:
 - Rename SCX_TASK_NEED_DEQ -> SCX_TASK_IN_CUSTODY and set/clear this flag
   also when ops.dequeue() is not implemented (can be used for other
   purposes in the future)
 - Clarify ops.select_cpu() behavior: dispatch to terminal DSQs doesn't
   trigger ops.dequeue(), dispatch to user DSQs triggers ops.dequeue(),
   store to BPF-internal data structure is discouraged
 - Link to v7:
   https://lore.kernel.org/all/20260206135742.2339918-1-arighi@nvidia.com

Changes in v7:
 - Handle tasks stored to BPF internal data structures (trigger
   ops.dequeue())
 - Add a kselftest scenario with a BPF queue to verify ops.dequeue()
   behavior with tasks stored in internal BPF data structures
 - Link to v6:
   https://lore.kernel.org/all/20260205153304.1996142-1-arighi@nvidia.com

Changes in v6:
 - Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DSQ
 - Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
   DSQs (local, global, bypass)
 - centralize ops.dequeue() logic in dispatch_enqueue()
 - Remove "Property Change Notifications for Running Tasks" section from
   the documentation
 - The kselftest now validates the right behavior both from ops.enqueue()
   and ops.select_cpu()
 - Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com

Changes in v5:
 - Introduce the concept of "terminal DSQ" (when a task is dispatched to a
   terminal DSQ, the task leaves the BPF scheduler's custody)
 - Consider SCX_DSQ_GLOBAL as a terminal DSQ
 - Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for direct dispatches to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  78 ++++-
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 155 ++++++++--
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 368 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 265 +++++++++++++++++
 10 files changed, 855 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCHSET v6] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-05 15:32 Andrea Righi
  2026-02-05 15:32 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 1 reply; 33+ messages in thread
From: Andrea Righi @ 2026-02-05 15:32 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g.  sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property change dequeues a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v6:
 - Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DSQ
 - Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
   DSQs (local, global, bypass)
 - centralize ops.dequeue() logic in dispatch_enqueue()
 - Remove "Property Change Notifications for Running Tasks" section from
   the documentation
 - The kselftest now validates the right behavior both from ops.enqueue()
   and ops.select_cpu()
 - Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com

Changes in v5:
 - Introduce the concept of "terminal DSQ" (when a task is dispatched to a
   terminal DSQ, the task leaves the BPF scheduler's custody)
 - Consider SCX_DSQ_GLOBAL as a terminal DSQ
 - Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for direct dispatches to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  53 ++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 130 ++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 334 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 222 ++++++++++++++++
 10 files changed, 739 insertions(+), 13 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCHSET v5] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-04 16:05 Andrea Righi
  2026-02-04 16:05 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 1 reply; 33+ messages in thread
From: Andrea Righi @ 2026-02-04 16:05 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g.  sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property change dequeues a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v5:
 - Introduce the concept of "terminal DSQ" (when a task is dispatched to a
   terminal DSQ, the task leaves the BPF scheduler's custody)
 - Consider SCX_DSQ_GLOBAL as a terminal DSQ
 - Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for direct dispatches to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  74 +++++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 186 +++++++++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 269 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 207 ++++++++++++++++++
 10 files changed, 746 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-01  9:08 Andrea Righi
  2026-02-01  9:08 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 1 reply; 33+ messages in thread
From: Andrea Righi @ 2026-02-01  9:08 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g.  sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property change dequeues a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

= Open issues =

Even though a few refinements are still pending, I'm sending a new
patchset, so we can comment more effectively based on the latest
agreed-upon semantics.

Open issues that still need agreement:

 - Should we trigger ops.dequeue() for tasks dispatched to SCX_DSQ_GLOBAL?
   (do we treat SCX_DSQ_GLOBAL as a local DSQ or a user DSQ?). In the
   current implementation, SCX_DSQ_GLOBAL is treated like a built-in user
   DSQ -> ops.dequeue() invoked for tasks dispatched to SCX_DSQ_GLOBAL.

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for tasks directly dispatched to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  76 ++++++++
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 168 ++++++++++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 234 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 198 ++++++++++++++++++++
 10 files changed, 686 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCHSET v3 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics
@ 2026-01-26  8:41 Andrea Righi
  2026-01-26  8:41 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 1 reply; 33+ messages in thread
From: Andrea Righi @ 2026-01-26  8:41 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext,
	linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (i.e., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), ensuring that every
ops.enqueue() is balanced by a corresponding ops.dequeue() invocation. In
addition, ops.dequeue() is now properly invoked when tasks are removed from
the sched_ext class, such as on task property changes.

To distinguish between a "regular" dequeue and a property change dequeue a
new dequeue flag is introduced: %SCX_DEQ_SCHED_CHANGE. BPF schedulers can
use this flag to distinguish between regular dispatch dequeues
(%SCX_DEQ_SCHED_CHANGE unset) and property change dequeues
(%SCX_DEQ_SCHED_CHANGE set).

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com/

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  33 ++++
 include/linux/sched/ext.h                       |  11 ++
 kernel/sched/ext.c                              |  89 +++++++++-
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 209 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 182 +++++++++++++++++++++
 10 files changed, 534 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread
* [PATCHSET v2 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics
@ 2026-01-21 12:25 Andrea Righi
  2026-01-21 12:25 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 1 reply; 33+ messages in thread
From: Andrea Righi @ 2026-01-21 12:25 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the BPF scheduler, either due to
dispatch or a property change, visibility of the task is lost and the
sched_ext core may not invoke ops.dequeue(). This breaks accurate
accounting (i.e., per-DSQ queued runtime sums) and prevents reliable
tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue(), ensuring that every
ops.enqueue() is balanced by a corresponding ops.dequeue() invocation. In
addition, ops.dequeue() is now properly invoked when tasks are removed from
the sched_ext class, such as on task property changes.

To distinguish between a dispatch dequeue and a property change dequeue,
introduce a new dequeue flag: SCX_DEQ_ASYNC (I'm open to suggestions to
find a better name for this flag). BPF schedulers can use this flag to
distinguish between regular dispatch dequeues (SCX_DSQ_ASYNC unset) and
property change dequeues (SCX_DEQ_ASYNC set).

In addition, a kselftest is provided to validate the behavior of
ops.dequeue() in all the different cases.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DSQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  33 ++++
 include/linux/sched/ext.h                       |  11 ++
 kernel/sched/ext.c                              |  63 ++++++-
 kernel/sched/ext_internal.h                     |   6 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 209 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 182 +++++++++++++++++++++
 10 files changed, 508 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2026-02-12 18:25 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-06 13:54 [PATCHSET v7] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-06 13:54 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-06 20:10   ` Emil Tsalapatis
2026-02-07  9:16     ` Andrea Righi
2026-02-08  5:11       ` Emil Tsalapatis
2026-02-08  9:02         ` Andrea Righi
2026-02-08 10:26           ` Andrea Righi
2026-02-08 13:55             ` Andrea Righi
2026-02-08 17:59               ` Emil Tsalapatis
2026-02-08 20:08                 ` Andrea Righi
2026-02-09 10:20                   ` Andrea Righi
2026-02-09 15:00                     ` Emil Tsalapatis
2026-02-09 15:43                       ` Andrea Righi
2026-02-09 17:23                         ` Tejun Heo
2026-02-09 19:17                           ` Andrea Righi
2026-02-09 20:10                             ` Tejun Heo
2026-02-09 22:22                               ` Andrea Righi
2026-02-10  0:42                                 ` Tejun Heo
2026-02-10  7:29                                   ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-12 17:15   ` Christian Loehle
2026-02-12 18:25     ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] sched_ext: Fix " Andrea Righi
2026-02-05 15:32 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-04 16:05 [PATCHSET v5] sched_ext: Fix " Andrea Righi
2026-02-04 16:05 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix " Andrea Righi
2026-02-01  9:08 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] sched_ext: Fix " Andrea Righi
2026-01-26  8:41 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-01-27 16:53   ` Emil Tsalapatis
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] sched_ext: Fix " Andrea Righi
2026-01-21 12:25 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox