The Linux Kernel Mailing List
* [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext
@ 2026-07-02 17:09 Andrea Righi
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

This series enables using proxy execution with sched_ext and is based on early
work by John Stultz [1].

Background
==========

Proxy execution (proxy-exec) lets a waiting task ("donor") donate its execution
context to a mutex owner, so the owner can run while the donor stays eligible on
the runqueue.

Currently, proxy execution and sched_ext are mutually exclusive at build time:
we can't enable CONFIG_SCHED_PROXY_EXEC=y and CONFIG_SCHED_CLASS_EXT=y in the
same kernel.

This restriction can be problematic for Linux distributions and for anyone who
wants to ship one kernel and choose features at runtime.

Why are they mutually exclusive?
================================

sched_ext schedulers drive dispatch through their own interfaces. A proxy-exec
handoff can run a task that the BPF scheduler never dispatched through that
path. sched_ext callbacks then observe a "current" task that does not match what
the BPF side considers running, so kfuncs and helper state can see an
inconsistent view of the executing task.

sched_ext also tracks runnable work through Dispatch Queues (DSQs) and
BPF-chosen dispatch rules, while the core scheduler still maintains classic per-CPU
runqueues and pick paths. A proxy handoff can therefore switch the CPU to a task
that the BPF scheduler never inserted or ordered through its DSQ interface.

DSQ state, vtime, and "who is running" bookkeeping inside the BPF program can
then disagree with what the core actually executes, so helpers and kfuncs that
assume their dispatched task is current may observe stale or inconsistent state.

Design: supporting proxy execution with sched_ext
=================================================

Proxy execution support is an optional per-scheduler capability: a BPF scheduler
can set SCX_OPS_ENQ_BLOCKED to receive mutex-blocked tasks. Without this flag,
mutex waiters block normally and proxy execution is automatically disabled. When
this flag is set, blocked donors are passed through ops.enqueue(), where
scx_bpf_task_is_blocked() lets BPF recognize them and apply its own admission
and ordering policy.
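
As a rough illustration of the opt-in rule above, here is a small user-space
model. It is not the kernel implementation: MODEL_OPS_ENQ_BLOCKED, struct
model_task and model_admit() are hypothetical names, and the real flag value and
enqueue path are defined by the patches in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for SCX_OPS_ENQ_BLOCKED; the real bit is set by the series. */
#define MODEL_OPS_ENQ_BLOCKED	(1ULL << 0)

enum model_action {
	MODEL_BLOCK_NORMALLY,	/* waiter sleeps, no proxy execution */
	MODEL_PASS_TO_ENQUEUE,	/* task is handed to the scheduler's enqueue */
};

/* Hypothetical task model: only the "blocked on a mutex" state matters here. */
struct model_task {
	bool blocked_on_mutex;
};

/* Decide what happens to a task given the flags the scheduler registered with. */
static enum model_action model_admit(uint64_t ops_flags,
				     const struct model_task *p)
{
	/* Runnable tasks always go through the normal enqueue path. */
	if (!p->blocked_on_mutex)
		return MODEL_PASS_TO_ENQUEUE;

	/* Blocked donors are only visible to schedulers that opted in. */
	if (ops_flags & MODEL_OPS_ENQ_BLOCKED)
		return MODEL_PASS_TO_ENQUEUE;

	return MODEL_BLOCK_NORMALLY;
}
```

In this model a scheduler that never sets the flag sees no blocked tasks at all,
which matches the intent that existing BPF schedulers keep their current
behavior.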

Knowing the mutex owner's location is optional: BPF may enqueue a donor on its
current CPU's local DSQ and let the core resolve the owner and perform the proxy
exec handoff. Schedulers that want to reduce handoff latency can use
scx_bpf_task_proxy_cpu() or scx_bpf_task_proxy_cid() to obtain the owner
location and steer the donation there when affinity constraints and policy allow
it. The core still revalidates the owner relationship to perform the actual
migration.
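
The steering decision described above can be sketched in plain C. This is a
simplified model under assumptions: affinity is a 64-bit mask, a negative
owner_cpu stands for "owner location not known", and steer_pick_cpu() is a
hypothetical helper, not one of the kfuncs added by the series.

```c
#include <stdint.h>

/*
 * Pick the CPU whose local DSQ should receive the donor.
 *
 * donor_allowed: bitmask of CPUs the donor may run on (bit N = CPU N)
 * donor_cpu:     CPU the donor is currently on
 * owner_cpu:     CPU of the mutex owner, or negative if unknown (assumption)
 */
static int steer_pick_cpu(uint64_t donor_allowed, int donor_cpu, int owner_cpu)
{
	/* Prefer the owner's CPU, but only if the donor's affinity allows it. */
	if (owner_cpu >= 0 && owner_cpu < 64 &&
	    (donor_allowed & (1ULL << owner_cpu)))
		return owner_cpu;

	/* Otherwise stay local and let the core resolve the owner. */
	return donor_cpu;
}
```

Falling back to the donor's current CPU is always valid in this model, because
the core still resolves the owner and performs the handoff on its own.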

The donor-to-owner handoff is modeled like a "function call" from the
scheduler's perspective. The donor remains the running scheduling entity
selected by BPF: its scheduling context, runtime and slice are consumed while
the core temporarily invokes the mutex owner's code to make the critical section
progress. It is not a scheduler-visible switch to the owner.

Accordingly, the donor remains the scheduling context presented to sched_ext,
while the mutex owner is treated solely as the execution context selected
internally by the core scheduler. Scheduling state is accounted against
rq->donor where appropriate, while rq->curr identifies the execution context.
The internal owner substitution does not generate synthetic sched_ext callbacks
for a task that BPF did not dispatch.
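
A toy model may help to visualize the donor/curr split. The structures below
are hypothetical and much simpler than struct rq and struct task_struct; the
only point is that time is charged to the donor while a different task supplies
the execution context.

```c
#include <stdint.h>

/* Hypothetical, minimal task: just the accounting fields used below. */
struct acct_task {
	uint64_t runtime_ns;	/* total time charged to this task */
	uint64_t slice_ns;	/* remaining time slice */
};

/* Hypothetical runqueue model with separate scheduling and execution contexts. */
struct acct_rq {
	struct acct_task *donor;	/* scheduling context selected by BPF */
	struct acct_task *curr;		/* execution context (the mutex owner) */
};

/* Charge delta_ns of CPU time: the donor pays, whoever is executing. */
static void acct_tick(struct acct_rq *rq, uint64_t delta_ns)
{
	struct acct_task *d = rq->donor;

	d->runtime_ns += delta_ns;
	d->slice_ns = d->slice_ns > delta_ns ? d->slice_ns - delta_ns : 0;
}
```

With donor and curr pointing at different tasks, calling acct_tick() leaves the
owner's fields untouched, which is the "function call" view: the owner's code
runs, but on the donor's budget.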

The sched_ext callback bookkeeping is adjusted accordingly. Blocked proxy donors
do not generate spurious ops.running() callbacks and ops.stopping() is only
called when ops.running() was emitted. The normal sched_ext migration path also
leaves blocked donors to the proxy machinery, and cross-CPU migration is avoided
for migration-disabled and single-CPU tasks.
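
The running/stopping pairing rule can be modeled with a single flag. This is
only a sketch: struct pair_task and the two helpers are hypothetical, and the
exact conditions used by the kernel are those in the corresponding patch, not
the simplified check shown here.

```c
#include <stdbool.h>

/* Hypothetical per-task state for tracking callback pairing. */
struct pair_task {
	bool blocked;		/* task is a blocked proxy donor */
	bool running_emitted;	/* a running callback was issued and not yet paired */
	int nr_running;		/* number of running callbacks seen by BPF */
	int nr_stopping;	/* number of stopping callbacks seen by BPF */
};

/* Task becomes the selected one on the CPU. */
static void pair_set_next(struct pair_task *p)
{
	/* A blocked donor does not execute, so no running callback is issued. */
	if (p->blocked)
		return;

	p->running_emitted = true;
	p->nr_running++;
}

/* Task is switched out. */
static void pair_put_prev(struct pair_task *p)
{
	/* Only issue stopping if a matching running was issued earlier. */
	if (!p->running_emitted)
		return;

	p->running_emitted = false;
	p->nr_stopping++;
}
```

The invariant in this model is that nr_stopping never exceeds nr_running, even
if pair_put_prev() is called for a task that never reported running.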

scx_qmap is modified to demonstrate the opt-in policy: the scheduler immediately
admits blocked donors, prefers the owner's cid when allowed by the donor's
affinity and inserts the donor at the head of the selected local DSQ.

A new kselftest (enq_blocked) is also introduced to validate the proxy execution
support with sched_ext. The test creates a three-task priority inversion on one
CPU: a low-priority owner (nice +19) holds a kernel mutex, a high-priority donor
(nice -20) blocks on it, and a nice 0 contender competes for the CPU. The test
runs the same workload with SCX_OPS_ENQ_BLOCKED disabled and enabled, validates
blocked donor admission and the reported proxy CPU, and prints average mutex
hold/wait times with their deltas for manual comparison, without enforcing
performance thresholds. Access to the mutex is provided by a loadable kernel
module built via TEST_GEN_MODS_DIR, with the test responsible for loading and
unloading the module.

Example kselftest run:

  $ sudo tools/testing/selftests/sched_ext/runner -t enq_blocked
  ===== START =====
  TEST: enq_blocked
  DESCRIPTION: Verify BPF-driven proxy donor admission
  OUTPUT:

  [SCX_OPS_ENQ_BLOCKED=disabled]
    proxy_exec=enabled
    owner_nice=19
    donor_nice=-20
    contender_nice=0
    mutex_hold_avg_ns=254084719 (254.084 ms, samples=10)
    mutex_wait_avg_ns=254095120 (254.095 ms, samples=10)
    nr_blocked_enqueues=0

  [SCX_OPS_ENQ_BLOCKED=enabled]
    proxy_exec=enabled
    owner_nice=19
    donor_nice=-20
    contender_nice=0
    mutex_hold_avg_ns=228884734 (228.884 ms, samples=10)
    mutex_wait_avg_ns=207903720 (207.903 ms, samples=10)
    nr_blocked_enqueues=51

  [delta: enabled - disabled]
    mutex_hold_delta_ns=-25199985 (-9.92%)
    mutex_wait_delta_ns=-46191400 (-18.18%)
  ok 1 enq_blocked #
  =====  END  =====


  =============================

  RESULTS:

  PASSED:  1
  SKIPPED: 0
  FAILED:  0
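
The delta lines in the output above are consistent with a plain "enabled minus
disabled" difference, expressed as a percentage of the disabled baseline. A
minimal sketch of that arithmetic follows; the helper names are hypothetical and
the selftest may compute the values differently.

```c
#include <stdint.h>

/* Signed difference between the enabled and disabled averages, in ns. */
static int64_t hold_delta_ns(uint64_t enabled_ns, uint64_t disabled_ns)
{
	return (int64_t)enabled_ns - (int64_t)disabled_ns;
}

/*
 * Same difference as a percentage of the disabled baseline.
 * Assumes disabled_ns is non-zero.
 */
static double hold_delta_pct(uint64_t enabled_ns, uint64_t disabled_ns)
{
	return 100.0 * (double)hold_delta_ns(enabled_ns, disabled_ns) /
	       (double)disabled_ns;
}
```

For the run shown above, hold_delta_ns(228884734, 254084719) gives -25199985,
and the wait-time averages give -46191400; the percentages follow by dividing by
the disabled averages.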

References
==========

[1] https://lore.kernel.org/all/20251206001451.1418225-1-jstultz@google.com

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-proxy-exec

Changes in v2:
 - Rebased onto sched_ext/for-7.3 and adapted the series to the split
   sched_ext implementation and cid-form scheduler interfaces.
 - Replaced the global sched_proxy_exec_scx boot-time opt-in with the
   per-scheduler SCX_OPS_ENQ_BLOCKED capability, allowing BPF to control donor
   admission and ordering through ops.enqueue().
 - Added scx_bpf_task_is_blocked(), scx_bpf_task_proxy_cpu(), and
   scx_bpf_task_proxy_cid(); enforce CPU/cid API separation for cid-form
   schedulers.
 - Added proxy exec support to scx_qmap, including optional owner-cid steering,
   affinity validation, and fallback to the donor's current cid.
 - Added a kselftest with a kernel mutex test module and a three-task priority
   inversion workload; the test runs with blocked task admission disabled and
   enabled, validates the behavior, and reports hold/wait-time deltas.
 - Link to v1: https://lore.kernel.org/all/20260506174639.535232-1-arighi@nvidia.com/

Andrea Righi (10):
      sched/core: Skip migration disabled tasks in proxy execution
      sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
      sched_ext: Fix TOCTOU race in consume_remote_task()
      sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
      sched_ext: Save/restore kf_tasks[] when task ops nest
      sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
      sched_ext: Delegate proxy donor admission to BPF schedulers
      sched_ext: Add selftest for blocked donor admission
      sched_ext: scx_qmap: Add proxy execution support
      sched: Allow enabling proxy exec with sched_ext

John Stultz (2):
      sched/ext: Split curr|donor references properly
      sched/ext: Avoid migrating blocked tasks with proxy execution

 include/linux/sched/ext.h                     |   9 +
 init/Kconfig                                  |   2 -
 kernel/sched/core.c                           |  59 +-
 kernel/sched/ext/ext.c                        | 209 +++++-
 kernel/sched/ext/ext.h                        |   8 +
 kernel/sched/ext/internal.h                   |  58 +-
 kernel/sched/sched.h                          |   6 +
 tools/sched_ext/include/scx/common.bpf.h      |   3 +
 tools/sched_ext/include/scx/compat.h          |   1 +
 tools/sched_ext/scx_qmap.bpf.c                |  18 +-
 tools/testing/selftests/sched_ext/.gitignore  |   4 +
 tools/testing/selftests/sched_ext/Makefile    |   2 +
 tools/testing/selftests/sched_ext/config      |   2 +
 .../selftests/sched_ext/enq_blocked.bpf.c     | 116 +++
 .../testing/selftests/sched_ext/enq_blocked.c | 682 ++++++++++++++++++
 .../testing/selftests/sched_ext/enq_blocked.h |  21 +
 .../selftests/sched_ext/test_modules/Makefile |  13 +
 .../test_modules/scx_enq_blocked_test.c       | 134 ++++
 18 files changed, 1299 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.c
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.h
 create mode 100644 tools/testing/selftests/sched_ext/test_modules/Makefile
 create mode 100644 tools/testing/selftests/sched_ext/test_modules/scx_enq_blocked_test.c

Thread overview: 25+ messages
2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-07-02 18:17   ` K Prateek Nayak
2026-07-02 18:37     ` Andrea Righi
2026-07-02 18:21   ` Peter Zijlstra
2026-07-02 18:34     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
2026-07-02 18:24   ` Peter Zijlstra
2026-07-02 18:46     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 03/12] sched_ext: Split curr|donor references properly Andrea Righi
2026-07-03  6:10   ` Aiqun(Maria) Yu
2026-07-03  8:37     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
2026-07-03  8:02   ` Aiqun(Maria) Yu
2026-07-03 20:05     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 05/12] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
2026-07-02 17:09 ` [PATCH 06/12] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
2026-07-02 17:09 ` [PATCH 07/12] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
2026-07-02 17:09 ` [PATCH 08/12] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
2026-07-02 17:09 ` [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers Andrea Righi
2026-07-02 18:41   ` K Prateek Nayak
2026-07-02 19:10     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 10/12] sched_ext: Add selftest for blocked donor admission Andrea Righi
2026-07-02 17:09 ` [PATCH 11/12] sched_ext: scx_qmap: Add proxy execution support Andrea Righi
2026-07-02 17:09 ` [PATCH 12/12] sched: Allow enabling proxy exec with sched_ext Andrea Righi
