[PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext

* [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext
@ 2026-07-02 17:09 Andrea Righi
  2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
                   ` (11 more replies)
  0 siblings, 12 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

This series enables using proxy execution with sched_ext and is based on early
work by John Stultz [1].

Background
==========

Proxy execution (proxy-exec) lets a waiting task ("donor") donate its execution
context to a mutex owner, so the owner can run while the donor stays eligible on
the runqueue.

Currently, proxy execution and sched_ext are mutually exclusive at build time:
we can't enable CONFIG_SCHED_PROXY_EXEC=y and CONFIG_SCHED_CLASS_EXT=y in the
same kernel.

This restriction can be problematic for Linux distributions and for anyone who
wants to ship one kernel and choose features at runtime.

Why are they mutually exclusive?
================================

sched_ext schedulers drive dispatch through their own interfaces. A proxy-exec
handoff can run a task that the BPF scheduler never dispatched through that
path. sched_ext callbacks then observe a "current" task that does not match what
the BPF side considers running, so kfuncs and helper state can see an
inconsistent view of the executing task.

sched_ext also tracks runnable work through Dispatch Queues (DSQs) and
BPF-chosen dispatch rules, while the core scheduler still maintains classic
per-CPU runqueues and pick paths. A proxy handoff can therefore switch the CPU
to a task that the BPF scheduler never inserted or ordered through its DSQ
interface.

DSQ state, vtime, and "who is running" bookkeeping inside the BPF program can
then disagree with what the core actually executes, so helpers and kfuncs that
assume their dispatched task is current may observe stale or inconsistent state.

Design: supporting proxy execution with sched_ext
=================================================

Proxy execution support is an optional per-scheduler capability: a BPF scheduler
can set SCX_OPS_ENQ_BLOCKED to receive mutex-blocked tasks. Without this flag,
mutex waiters block normally and proxy execution is automatically disabled. When
this flag is set, blocked donors are passed through ops.enqueue(), where
scx_bpf_task_is_blocked() lets BPF recognize them and apply its own admission
and ordering policy.

Knowing the mutex owner's location is optional: BPF may enqueue a donor on its
current CPU's local DSQ and let the core resolve the owner and perform the proxy
exec handoff. Schedulers that want to reduce handoff latency can use
scx_bpf_task_proxy_cpu() or scx_bpf_task_proxy_cid() to obtain the owner
location and steer the donation there when affinity constraints and policy allow
it. The core still revalidates the owner relationship to perform the actual
migration.
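
As an illustration only, the steering decision can be modeled in plain
userspace C. This is a sketch under stated assumptions: struct donor_model,
cpu_allowed(), pick_donor_cpu() and the 64-bit affinity mask are hypothetical
names made up for this example and are not part of the series. In a real BPF
scheduler the owner location would come from scx_bpf_task_proxy_cpu() or
scx_bpf_task_proxy_cid(), whose signatures are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct donor_model {
	int cur_cpu;		/* CPU the blocked donor is currently queued on */
	uint64_t allowed_mask;	/* bit N set: donor may run on CPU N (N < 64) */
	bool migration_disabled;
};

static bool cpu_allowed(const struct donor_model *d, int cpu)
{
	return cpu >= 0 && cpu < 64 &&
	       (d->allowed_mask & (UINT64_C(1) << cpu)) != 0;
}

/*
 * Pick the CPU whose local DSQ should receive the donor.
 * @owner_cpu: where the mutex owner is believed to be, negative if unknown.
 *
 * Steering is only an optimization: falling back to the donor's current CPU
 * is always valid, because the core resolves the owner on its own.
 */
static int pick_donor_cpu(const struct donor_model *d, int owner_cpu)
{
	assert(d->cur_cpu >= 0);

	if (owner_cpu < 0 || d->migration_disabled)
		return d->cur_cpu;
	if (!cpu_allowed(d, owner_cpu))
		return d->cur_cpu;
	return owner_cpu;
}
```

With a donor queued on CPU 1 and allowed on CPUs 0-1, an owner on CPU 0 steers
the donation to CPU 0, while an owner on CPU 5 (outside the affinity mask) or
an unknown owner leaves the donor on CPU 1.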

The donor-to-owner handoff is modeled like a "function call" from the
scheduler's perspective. The donor remains the running scheduling entity
selected by BPF: its scheduling context, runtime and slice are consumed while
the core temporarily invokes the mutex owner's code to make the critical section
progress. It is not a scheduler-visible switch to the owner.

Accordingly, the donor remains the scheduling context presented to sched_ext,
while the mutex owner is treated solely as the execution context selected
internally by the core scheduler. Scheduling state is accounted against
rq->donor where appropriate, while rq->curr identifies the execution context.
The internal owner substitution does not generate synthetic sched_ext callbacks
for a task that BPF did not dispatch.
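
The donor/curr split can be pictured with a small userspace model. This is
only a sketch: struct task_model, struct rq_model, MODEL_SLICE_INF and
charge_slice() are hypothetical names for this example. It loosely mirrors the
slice handling that the series moves from rq->curr to rq->donor in
update_curr_scx(), and nothing more.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an infinite slice; not a kernel symbol. */
#define MODEL_SLICE_INF UINT64_MAX

struct task_model {
	uint64_t slice_ns;	/* remaining time slice */
};

struct rq_model {
	struct task_model *donor;	/* scheduling context chosen by BPF */
	struct task_model *curr;	/* execution context (mutex owner) */
};

/* Charge elapsed runtime to the donor, never to the task that executes. */
static void charge_slice(struct rq_model *rq, uint64_t delta_ns)
{
	struct task_model *donor = rq->donor;
	uint64_t used;

	assert(rq->donor && rq->curr);

	if (donor->slice_ns == MODEL_SLICE_INF)
		return;

	/* Clamp so the slice never underflows below zero. */
	used = delta_ns < donor->slice_ns ? delta_ns : donor->slice_ns;
	donor->slice_ns -= used;
}
```

In this model, running the owner for 300 ns on behalf of a donor with a
1000 ns slice leaves the donor with 700 ns and the owner's own slice
untouched.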

The sched_ext callback bookkeeping is adjusted accordingly. Blocked proxy donors
do not generate spurious ops.running() callbacks and ops.stopping() is only
called when ops.running() was emitted. The normal sched_ext migration path also
leaves blocked donors to the proxy machinery, and cross-CPU migration is avoided
for migration-disabled and single-CPU tasks.
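
The running/stopping pairing rule can be sketched as a tiny state machine in
userspace C. The names below (struct pair_task, MODEL_TASK_IS_RUNNING,
model_set_next_task(), model_put_prev_task()) are hypothetical and exist only
in this example; the counters stand in for the BPF callbacks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag for this model; it only plays the role of a
 * "running callback was emitted" marker. */
#define MODEL_TASK_IS_RUNNING (1u << 6)

struct pair_task {
	unsigned int flags;
	bool blocked;		/* task is a mutex-blocked proxy donor */
	int nr_running_cb;	/* how many times "running" was emitted */
	int nr_stopping_cb;	/* how many times "stopping" was emitted */
};

/* Called when the task is picked: blocked donors get no running callback. */
static void model_set_next_task(struct pair_task *p)
{
	if (p->blocked)
		return;

	p->nr_running_cb++;
	p->flags |= MODEL_TASK_IS_RUNNING;
}

/* Called when the task is put: stopping is emitted only after running. */
static void model_put_prev_task(struct pair_task *p)
{
	if (!(p->flags & MODEL_TASK_IS_RUNNING))
		return;

	p->nr_stopping_cb++;
	p->flags &= ~MODEL_TASK_IS_RUNNING;

	/* Invariant of the model: stopping never outnumbers running. */
	assert(p->nr_stopping_cb <= p->nr_running_cb);
}
```

A task that blocks after it started running still receives its stopping
callback, because the flag records what was emitted rather than the task's
current blocked state.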

scx_qmap is modified to demonstrate the opt-in policy: the scheduler immediately
admits blocked donors, prefers the owner's cid when allowed by the donor's
affinity and inserts the donor at the head of the selected local DSQ.

A new kselftest (enq_blocked) is also introduced to validate the proxy execution
support with sched_ext. The test creates a three-task priority inversion on one
CPU: a low-priority owner (nice +19) holds a kernel mutex, a high-priority donor
(nice -20) blocks on it and a nice 0 contender competes for the CPU. It runs the
same workload with SCX_OPS_ENQ_BLOCKED disabled and enabled, validates blocked
donor admission and the reported proxy CPU, and prints average mutex hold/wait
times with their deltas for manual comparison, without enforcing performance
thresholds. Access to the mutex is provided by a loadable kernel module built
via TEST_GEN_MODS_DIR, with the test responsible for loading, unloading, and
managing the module's lifecycle.

Example kselftest run:

  $ sudo tools/testing/selftests/sched_ext/runner -t enq_blocked
  ===== START =====
  TEST: enq_blocked
  DESCRIPTION: Verify BPF-driven proxy donor admission
  OUTPUT:

  [SCX_OPS_ENQ_BLOCKED=disabled]
    proxy_exec=enabled
    owner_nice=19
    donor_nice=-20
    contender_nice=0
    mutex_hold_avg_ns=254084719 (254.084 ms, samples=10)
    mutex_wait_avg_ns=254095120 (254.095 ms, samples=10)
    nr_blocked_enqueues=0

  [SCX_OPS_ENQ_BLOCKED=enabled]
    proxy_exec=enabled
    owner_nice=19
    donor_nice=-20
    contender_nice=0
    mutex_hold_avg_ns=228884734 (228.884 ms, samples=10)
    mutex_wait_avg_ns=207903720 (207.903 ms, samples=10)
    nr_blocked_enqueues=51

  [delta: enabled - disabled]
    mutex_hold_delta_ns=-25199985 (-9.92%)
    mutex_wait_delta_ns=-46191400 (-18.18%)
  ok 1 enq_blocked #
  =====  END  =====

  =============================

  RESULTS:

  PASSED:  1
  SKIPPED: 0
  FAILED:  0
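
The deltas in the sample output follow directly from the two averages
(enabled minus disabled, relative to the disabled baseline). A minimal sketch
of that arithmetic, assuming hypothetical names (struct delta_model,
compute_delta()); the selftest's own implementation is not shown in this
excerpt and may differ:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for this example only. */
struct delta_model {
	int64_t delta_ns;	/* enabled - disabled */
	double pct;		/* delta relative to the disabled baseline */
};

static struct delta_model compute_delta(int64_t disabled_ns, int64_t enabled_ns)
{
	struct delta_model d;

	/* A percentage is only meaningful against a positive baseline. */
	assert(disabled_ns > 0);

	d.delta_ns = enabled_ns - disabled_ns;
	d.pct = 100.0 * (double)d.delta_ns / (double)disabled_ns;
	return d;
}
```

Feeding in the hold averages above (254084719 ns disabled, 228884734 ns
enabled) gives a delta of -25199985 ns, about -9.92%; the wait averages
(254095120 ns and 207903720 ns) give -46191400 ns, about -18.18%.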

References
==========

[1] https://lore.kernel.org/all/20251206001451.1418225-1-jstultz@google.com

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-proxy-exec

Changes in v2:
 - Rebased onto sched_ext/for-7.3 and adapted the series to the split
   sched_ext implementation and cid-form scheduler interfaces.
 - Replaced the global sched_proxy_exec_scx boot-time opt-in with the
   per-scheduler SCX_OPS_ENQ_BLOCKED capability, allowing BPF to control donor
   admission and ordering through ops.enqueue().
 - Added scx_bpf_task_is_blocked(), scx_bpf_task_proxy_cpu(), and
   scx_bpf_task_proxy_cid(); enforce CPU/cid API separation for cid-form
   schedulers.
 - Added proxy exec support to scx_qmap, including optional owner-cid steering,
   affinity validation, and fallback to the donor's current cid.
 - Added a kselftest with a kernel mutex test module and a three-task priority
   inversion workload; the test runs with blocked task admission disabled and
   enabled, validates the behavior, and reports hold/wait-time deltas.
 - Link to v1: https://lore.kernel.org/all/20260506174639.535232-1-arighi@nvidia.com/

Andrea Righi (10):
      sched/core: Skip migration disabled tasks in proxy execution
      sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
      sched_ext: Fix TOCTOU race in consume_remote_task()
      sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
      sched_ext: Save/restore kf_tasks[] when task ops nest
      sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
      sched_ext: Delegate proxy donor admission to BPF schedulers
      sched_ext: Add selftest for blocked donor admission
      sched_ext: scx_qmap: Add proxy execution support
      sched: Allow enabling proxy exec with sched_ext

John Stultz (2):
      sched/ext: Split curr|donor references properly
      sched/ext: Avoid migrating blocked tasks with proxy execution

 include/linux/sched/ext.h                     |   9 +
 init/Kconfig                                  |   2 -
 kernel/sched/core.c                           |  59 +-
 kernel/sched/ext/ext.c                        | 209 +++++-
 kernel/sched/ext/ext.h                        |   8 +
 kernel/sched/ext/internal.h                   |  58 +-
 kernel/sched/sched.h                          |   6 +
 tools/sched_ext/include/scx/common.bpf.h      |   3 +
 tools/sched_ext/include/scx/compat.h          |   1 +
 tools/sched_ext/scx_qmap.bpf.c                |  18 +-
 tools/testing/selftests/sched_ext/.gitignore  |   4 +
 tools/testing/selftests/sched_ext/Makefile    |   2 +
 tools/testing/selftests/sched_ext/config      |   2 +
 .../selftests/sched_ext/enq_blocked.bpf.c     | 116 +++
 .../testing/selftests/sched_ext/enq_blocked.c | 682 ++++++++++++++++++
 .../testing/selftests/sched_ext/enq_blocked.h |  21 +
 .../selftests/sched_ext/test_modules/Makefile |  13 +
 .../test_modules/scx_enq_blocked_test.c       | 134 ++++
 18 files changed, 1299 insertions(+), 48 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.c
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.h
 create mode 100644 tools/testing/selftests/sched_ext/test_modules/Makefile
 create mode 100644 tools/testing/selftests/sched_ext/test_modules/scx_enq_blocked_test.c

* [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 18:17   ` K Prateek Nayak
  2026-07-02 18:21   ` Peter Zijlstra
  2026-07-02 17:09 ` [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

Never attempt to migrate migration-disabled tasks or tasks that can only
run on a single CPU when switching a donor's execution context, since
doing so would violate task pinning.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3cc6fb1d20547..8a3eecc7caf5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6936,6 +6936,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			 */
 			if (curr_in_chain)
 				return proxy_resched_idle(rq);
+			/*
+			 * Tasks pinned to a single CPU (per-CPU kthreads via
+			 * kthread_bind(), tasks under migrate_disable()) cannot
+			 * be moved to @owner_cpu. proxy_migrate_task() uses
+			 * __set_task_cpu() which would silently violate the
+			 * pinning and leave the task to run on a CPU outside
+			 * its cpus_ptr once it is unblocked. Deactivate it on
+			 * this CPU; the owner running elsewhere will wake @p
+			 * back up when the mutex becomes available.
+			 */
+			if (p->nr_cpus_allowed == 1 || is_migration_disabled(p)) {
+				__clear_task_blocked_on(p, NULL);
+				goto deactivate;
+			}
 			goto migrate_task;
 		}
 
-- 
2.55.0


* [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
  2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 18:24   ` Peter Zijlstra
  2026-07-02 17:09 ` [PATCH 03/12] sched_ext: Split curr|donor references properly Andrea Righi
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

In __schedule(), the proxy-exec donor-stabilization block calls
put_prev_task() and set_next_task() when rq->donor == prev_donor and
prev != next.

For sched_ext tasks, re-entering set_next_task_scx() for a donor that
BPF has already seen in ops.running() via the normal pick path is
unsafe: it fires SCX_CALL_OP_TASK(sch, running, rq, donor) a second
time, and sch->ops dispatch can land on a vtable slot in a state that
yields a NULL function pointer or corrupts the stack.

Fix this by skipping the put_prev_task/set_next_task re-entry when the
donor is in the ext_sched_class, since sched_ext tracks curr/donor
itself.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8a3eecc7caf5d..2e4c98b4ea6b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7185,9 +7185,14 @@ static void __sched notrace __schedule(int sched_mode)
 			 * anything, since B == B. However, A might have
 			 * missed a RT/DL balance opportunity due to being
 			 * on_cpu.
+			 *
+			 * sched_ext tracks curr/donor itself; re-entering set_next_task_scx
+			 * here dispatches through a stale/NULL BPF ops vtable.
 			 */
-			donor->sched_class->put_prev_task(rq, donor, donor);
-			donor->sched_class->set_next_task(rq, donor, true);
+			if (donor->sched_class != &ext_sched_class) {
+				donor->sched_class->put_prev_task(rq, donor, donor);
+				donor->sched_class->set_next_task(rq, donor, true);
+			}
 		}
 	} else {
 		rq_set_donor(rq, next);
-- 
2.55.0


* [PATCH 03/12] sched_ext: Split curr|donor references properly
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
  2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
  2026-07-02 17:09 ` [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-03  6:10   ` Aiqun(Maria) Yu
  2026-07-02 17:09 ` [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

From: John Stultz <jstultz@google.com>

With proxy-exec, we want to do the accounting against the donor most of
the time. Without proxy-exec, there should be no difference as the
rq->donor and rq->curr are the same.

So rework the logic to reference the rq->donor where appropriate.

Also add donor info to scx_dump_state().

Since CONFIG_SCHED_PROXY_EXEC currently depends on
!CONFIG_SCHED_CLASS_EXT, this should have no effect (other than the
extra donor output in scx_dump_state), but this is one step needed to
eventually remove that constraint for proxy-exec.

Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/ext/ext.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 1a0ec985da77d..1588565050679 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -1145,17 +1145,17 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
 
 static void update_curr_scx(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *donor = rq->donor;
 	s64 delta_exec;
 
 	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0))
 		return;
 
-	if (curr->scx.slice != SCX_SLICE_INF) {
-		curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
-		if (!curr->scx.slice)
-			touch_core_sched(rq, curr);
+	if (donor->scx.slice != SCX_SLICE_INF) {
+		donor->scx.slice -= min_t(u64, donor->scx.slice, delta_exec);
+		if (!donor->scx.slice)
+			touch_core_sched(rq, donor);
 	}
 
 	dl_server_update(&rq->ext_server, delta_exec);
@@ -1316,8 +1316,8 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
 	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
 		return;
 
-	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
-	    rq->curr->sched_class == &ext_sched_class) {
+	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->donor &&
+	    rq->donor->sched_class == &ext_sched_class) {
 		rq->curr->scx.slice = 0;
 		resched_curr(rq);
 	}
@@ -2464,7 +2464,8 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		}
 
 		/* if the destination CPU is idle, wake it up */
-		if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+		if (!fallback && sched_class_above(p->sched_class,
+						      dst_rq->donor->sched_class))
 			resched_curr(dst_rq);
 	}
 
@@ -2876,7 +2877,7 @@ static struct task_struct *first_local_task(struct rq *rq)
 static struct task_struct *
 do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
 {
-	struct task_struct *prev = rq->curr;
+	struct task_struct *prev = rq->donor;
 	bool keep_prev;
 	struct task_struct *p;
 
@@ -4029,7 +4030,7 @@ static void run_deferred(struct rq *rq)
 #ifdef CONFIG_NO_HZ_FULL
 bool scx_can_stop_tick(struct rq *rq)
 {
-	struct task_struct *p = rq->curr;
+	struct task_struct *p = rq->donor;
 	struct scx_sched *sch = scx_task_sched(p);
 
 	if (p->sched_class != &ext_sched_class)
@@ -6007,6 +6008,9 @@ static void scx_dump_cpu(struct scx_sched *sch, struct seq_buf *s,
 	dump_line(&ns, "          curr=%s[%d] class=%ps",
 		  rq->curr->comm, rq->curr->pid,
 		  rq->curr->sched_class);
+	dump_line(&ns, "          donor=%s[%d] class=%ps",
+		  rq->donor->comm, rq->donor->pid,
+		  rq->donor->sched_class);
 	if (!cpumask_empty(rq->scx.cpus_to_kick))
 		dump_line(&ns, "  cpus_to_kick   : %*pb",
 			  cpumask_pr_args(rq->scx.cpus_to_kick));
@@ -7452,7 +7456,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 	unsigned long flags;
 
 	raw_spin_rq_lock_irqsave(rq, flags);
-	cur_class = rq->curr->sched_class;
+	cur_class = rq->donor->sched_class;
 
 	/*
 	 * During CPU hotplug, a CPU may depend on kicking itself to make
@@ -7464,7 +7468,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 	    !sched_class_above(cur_class, &ext_sched_class)) {
 		if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
 			if (cur_class == &ext_sched_class)
-				rq->curr->scx.slice = 0;
+				rq->donor->scx.slice = 0;
 			cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
 		}
 
-- 
2.55.0


* [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (2 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 03/12] sched_ext: Split curr|donor references properly Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-03  8:02   ` Aiqun(Maria) Yu
  2026-07-02 17:09 ` [PATCH 05/12] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

From: John Stultz <jstultz@google.com>

With proxy execution enabled, mutex-blocked tasks stay on the runqueue.
Later, with donor migration, they will be migrated when necessary by the
core scheduler to boost lock owners.

Don't try to migrate mutex-blocked tasks; the proxy logic will handle
that.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/ext/ext.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 1588565050679..9a672b9a55f6e 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -2150,6 +2150,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
 
 	WARN_ON_ONCE(task_cpu(p) == cpu);
 
+	/* Make sure tasks aren't on a cpu */
+	if (task_on_cpu(task_rq(p), p))
+		return false;
+
+	/* Don't migrate blocked tasks, proxy-exec will handle this */
+	if (task_is_blocked(p))
+		return false;
+
 	/*
 	 * If @p has migration disabled, @p->cpus_ptr is updated to contain only
 	 * the pinned CPU in migrate_disable_switch() while @p is being switched
@@ -2784,6 +2792,23 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 	if (p->scx.flags & SCX_TASK_QUEUED) {
 		set_task_runnable(rq, p);
 
+		/*
+		 * Mutex-blocked donors stay queued on the runqueue under proxy
+		 * execution, but the donor never runs as itself, proxy-exec
+		 * walks the blocked_on chain on the next __schedule() and runs
+		 * the lock owner in its place.
+		 *
+		 * Put the donor on the local DSQ directly, so pick_next_task()
+		 * can still see it, find_proxy_task() will be invoked on
+		 * next->blocked_on and either run the chain owner here, or call
+		 * proxy_force_return() and let BPF make a new dispatch decision
+		 * once the task is no longer blocked.
+		 */
+		if (task_is_blocked(p)) {
+			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
+			goto switch_class;
+		}
+
 		/*
 		 * If @p has slice left and is being put, @p is getting
 		 * preempted by a higher priority scheduler class or core-sched
-- 
2.55.0


* [PATCH 05/12] sched_ext: Fix TOCTOU race in consume_remote_task()
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (3 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 17:09 ` [PATCH 06/12] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

When pulling a task from a non-local DSQ, consume_dispatch_q() checks if
the task can run on the destination rq via task_can_run_on_remote_rq().
However, it then drops the destination rq lock and locks the source rq
in consume_remote_task() -> unlink_dsq_and_lock_src_rq(). During this
window, the task might have become migration disabled, making it invalid
to migrate it to the destination rq.

Fix this by re-evaluating task_can_run_on_remote_rq() in
consume_remote_task() after the source rq is locked. If the task can no
longer be migrated, we clear its DSQ association, reset the holding CPU,
and enqueue it to the source rq's local DSQ instead.

While the destination rq lock is dropped, clear the tracked rq state and
restore it after reacquiring the lock. Otherwise, a nested ops.dequeue()
callback can attempt to restore an rq which is no longer locked.
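
The check/recheck pattern can be illustrated with a single-threaded userspace
model, where a callback stands in for whatever happens while no lock is held.
All names here (struct mig_task, can_run_remote(), consume_remote(),
model_disable_migration()) are hypothetical and only exist in this sketch; real
locking and DSQ handling are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model for this example only. */
struct mig_task {
	bool migration_disabled;
	int cpu;
};

/* Stand-in for work done by others while the lock is dropped. */
typedef void (*window_fn)(struct mig_task *p);

static void model_disable_migration(struct mig_task *p)
{
	p->migration_disabled = true;
}

static bool can_run_remote(const struct mig_task *p, int dst_cpu)
{
	/* Callers only ask about a CPU other than the task's own. */
	assert(p->cpu != dst_cpu);
	return !p->migration_disabled;
}

/* Returns true if @p was moved to @dst_cpu, false if it stays put. */
static bool consume_remote(struct mig_task *p, int dst_cpu, window_fn window)
{
	/* First check: made while the destination side is locked. */
	if (!can_run_remote(p, dst_cpu))
		return false;

	/* Destination lock dropped: the task's state may change here. */
	if (window)
		window(p);

	/* Source side locked: the earlier answer is stale, so check again. */
	if (!can_run_remote(p, dst_cpu))
		return false;

	p->cpu = dst_cpu;
	return true;
}
```

Passing model_disable_migration() as the window makes the second check fail,
so the task stays on its source CPU instead of being moved in violation of
its pinning.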

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext/ext.c | 43 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 9a672b9a55f6e..189ba9c42043a 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -2248,20 +2248,49 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
 		!WARN_ON_ONCE(src_rq != task_rq(p));
 }
 
-static bool consume_remote_task(struct rq *this_rq,
+static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
 				struct task_struct *p, u64 enq_flags,
 				struct scx_dispatch_q *dsq, struct rq *src_rq)
 {
+	struct rq *tracked_rq = scx_locked_rq();
+	bool consumed = false;
+
+	/*
+	 * consume_remote_task() may be called from an SCX op with @this_rq
+	 * recorded as the currently locked rq. Clear the tracking while the rq
+	 * lock is dropped so nested callbacks don't save and later try to restore
+	 * an rq which isn't locked anymore.
+	 */
+	if (tracked_rq) {
+		WARN_ON_ONCE(tracked_rq != this_rq);
+		update_locked_rq(NULL);
+	}
 	raw_spin_rq_unlock(this_rq);
 
 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+		if (unlikely(!task_can_run_on_remote_rq(sch, p, this_rq, true))) {
+			p->scx.dsq = NULL;
+			p->scx.holding_cpu = -1;
+			scx_dispatch_enqueue(sch, src_rq, &src_rq->scx.local_dsq, p,
+					     enq_flags | SCX_ENQ_CLEAR_OPSS);
+			if (sched_class_above(p->sched_class, src_rq->donor->sched_class))
+				resched_curr(src_rq);
+			raw_spin_rq_unlock(src_rq);
+			goto relock;
+		}
 		move_remote_task_to_local_dsq(p, enq_flags, src_rq, this_rq);
-		return true;
-	} else {
-		raw_spin_rq_unlock(src_rq);
-		raw_spin_rq_lock(this_rq);
-		return false;
+		consumed = true;
+		goto restore;
 	}
+	raw_spin_rq_unlock(src_rq);
+
+relock:
+	raw_spin_rq_lock(this_rq);
+restore:
+	if (tracked_rq)
+		update_locked_rq(tracked_rq);
+
+	return consumed;
 }
 
 /**
@@ -2371,7 +2400,7 @@ bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 		}
 
 		if (task_can_run_on_remote_rq(sch, p, rq, false)) {
-			if (likely(consume_remote_task(rq, p, enq_flags, dsq, task_rq)))
+			if (likely(consume_remote_task(sch, rq, p, enq_flags, dsq, task_rq)))
 				return true;
 			goto retry;
 		}
-- 
2.55.0


* [PATCH 06/12] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (4 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 05/12] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 17:09 ` [PATCH 07/12] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

With proxy-exec, pick_next_task() can return a task with blocked_on set
(a proxy donor); put_prev_set_next_task() then calls set_next_task_scx()
on this "ghost" task, which fires ops.running(). However, the task never
actually runs.

If we simply short-circuit set_next_task_scx() for blocked tasks, we
break DSQ bookkeeping. If we only skip ops.running(), we create an
ops.enqueue() -> ops.stopping() pair without running, because
ops.stopping() is still called in put_prev_task_scx().

Fix this by introducing a new flag SCX_TASK_IS_RUNNING to track whether
ops.running() was actually called. Skip ops.running() for blocked tasks,
and only call ops.stopping() if SCX_TASK_IS_RUNNING is set. This ensures
that running and stopping callbacks are perfectly paired even when a
blocked task is picked as a proxy donor.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 include/linux/sched/ext.h |  2 ++
 kernel/sched/core.c       |  2 +-
 kernel/sched/ext/ext.c    | 14 +++++++++++---
 kernel/sched/ext/ext.h    |  6 ++++++
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 75cb8b119fb79..e599bb86f8acd 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -102,6 +102,8 @@ enum scx_ent_flags {
 	SCX_TASK_SUB_INIT	= 1 << 4, /* task being initialized for a sub sched */
 	SCX_TASK_IMMED		= 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */
 
+	SCX_TASK_IS_RUNNING	= 1 << 6, /* ops.running() has been called */
+
 	/*
 	 * Bits 8 to 10 are used to carry task state:
 	 *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4c98b4ea6b0..6aedb26c08ee7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7189,7 +7189,7 @@ static void __sched notrace __schedule(int sched_mode)
 			 * sched_ext tracks curr/donor itself; re-entering set_next_task_scx
 			 * here dispatches through a stale/NULL BPF ops vtable.
 			 */
-			if (donor->sched_class != &ext_sched_class) {
+			if (!is_ext_class(donor)) {
 				donor->sched_class->put_prev_task(rq, donor, donor);
 				donor->sched_class->set_next_task(rq, donor, true);
 			}
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 189ba9c42043a..b0ec579e3a3ef 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -1985,9 +1985,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
 	 * information meaningful to the BPF scheduler and can be suppressed by
 	 * skipping the callbacks if the task is !QUEUED.
 	 */
-	if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) {
+	if (SCX_HAS_OP(sch, stopping) && task_current(rq, p) &&
+	    (p->scx.flags & SCX_TASK_IS_RUNNING)) {
 		update_curr_scx(rq);
 		SCX_CALL_OP_TASK(sch, stopping, rq, p, false);
+		p->scx.flags &= ~SCX_TASK_IS_RUNNING;
 	}
 
 	if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p))
@@ -2725,8 +2727,11 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 	p->se.exec_start = rq_clock_task(rq);
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
-	if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED))
+	if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED) &&
+	    !task_is_blocked(p)) {
 		SCX_CALL_OP_TASK(sch, running, rq, p);
+		p->scx.flags |= SCX_TASK_IS_RUNNING;
+	}
 
 	clr_task_runnable(p, true);
 
@@ -2815,8 +2820,11 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 	update_curr_scx(rq);
 
 	/* see dequeue_task_scx() on why we skip when !QUEUED */
-	if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED))
+	if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED) &&
+	    (p->scx.flags & SCX_TASK_IS_RUNNING)) {
 		SCX_CALL_OP_TASK(sch, stopping, rq, p, true);
+		p->scx.flags &= ~SCX_TASK_IS_RUNNING;
+	}
 
 	if (p->scx.flags & SCX_TASK_QUEUED) {
 		set_task_runnable(rq, p);
diff --git a/kernel/sched/ext/ext.h b/kernel/sched/ext/ext.h
index 0b7fc46aee08c..c7fa4d06ac7d3 100644
--- a/kernel/sched/ext/ext.h
+++ b/kernel/sched/ext/ext.h
@@ -35,6 +35,11 @@ static inline bool task_on_scx(const struct task_struct *p)
 	return scx_enabled() && p->sched_class == &ext_sched_class;
 }
 
+static inline bool is_ext_class(const struct task_struct *p)
+{
+	return p->sched_class == &ext_sched_class;
+}
+
 #ifdef CONFIG_SCHED_CORE
 bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
 		   bool in_fi);
@@ -53,6 +58,7 @@ static inline void scx_rq_activate(struct rq *rq) {}
 static inline void scx_rq_deactivate(struct rq *rq) {}
 static inline int scx_check_setscheduler(struct task_struct *p, int policy) { return 0; }
 static inline bool task_on_scx(const struct task_struct *p) { return false; }
+static inline bool is_ext_class(const struct task_struct *p) { return false; }
 static inline bool scx_allow_ttwu_queue(const struct task_struct *p) { return true; }
 static inline void init_sched_ext_class(void) {}
 
-- 
2.55.0



* [PATCH 07/12] sched_ext: Save/restore kf_tasks[] when task ops nest
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (5 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 06/12] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 17:09 ` [PATCH 08/12] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

SCX_CALL_OP_TASK*() stores the subject task in current->scx.kf_tasks[]
and assumes that task ops never nest. However, a BPF ops.running()
callback can call kfuncs (e.g. scx_bpf_dsq_insert()) that enqueue work
and trigger enqueue_task_scx() -> ops.runnable(). The nested
SCX_CALL_OP_TASK() overwrites kf_tasks[0] and then clears it, leaving
the outer ops.running() context wrong and leading to NULL function
dispatches from BPF helpers.

Save and restore kf_tasks[] (both slots for the two-task variant) around
each invocation so nested task-based ops preserve the outer context.
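
The save/restore pattern can be illustrated with a small user-space
model. This is only a sketch and not the kernel macros: integer task
ids stand in for task_struct pointers, 0 stands in for NULL, and every
name below (kf_tasks, call_op_task, run_nested_demo, ...) is
hypothetical.

```c
#include <assert.h>

/*
 * User-space model, not kernel code: task ids stand in for task_struct
 * pointers and 0 stands in for NULL. All names are hypothetical.
 */
static int kf_tasks[2];		/* models current->scx.kf_tasks[] */
static int seen_after_nested;	/* slot 0 as seen by the outer op */

/* Models SCX_CALL_OP_TASK(): save both slots, install @task, restore. */
static void call_op_task(int task, void (*op)(void))
{
	int sv0 = kf_tasks[0];
	int sv1 = kf_tasks[1];

	kf_tasks[0] = task;
	kf_tasks[1] = 0;
	op();
	kf_tasks[0] = sv0;
	kf_tasks[1] = sv1;
}

static void op_runnable(void)
{
	/* Nested op: while it runs, slot 0 names the nested task. */
	assert(kf_tasks[0] == 2);
}

static void op_running(void)
{
	/* Stand-in for a kfunc that ends up in enqueue -> ops.runnable(). */
	call_op_task(2, op_runnable);
	seen_after_nested = kf_tasks[0];
}

/* Run the outer op for task 1 and report what it saw after nesting. */
static int run_nested_demo(void)
{
	call_op_task(1, op_running);
	return seen_after_nested;
}
```

In this model the outer op still sees task 1 after the nested call
returns. If call_op_task() cleared the slots unconditionally instead of
restoring the saved values, the outer op would see 0 at that point.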

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext/ext.c      | 10 ++++++++--
 kernel/sched/ext/internal.h | 33 +++++++++++++++++++++++----------
 2 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index b0ec579e3a3ef..4cefe5acb36ff 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -410,8 +410,12 @@ static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
 					   struct task_struct *task,
 					   const struct cpumask *cpumask)
 {
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);
+	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];
+	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];
+
+	current->scx.kf_nest++;
 	current->scx.kf_tasks[0] = task;
+	current->scx.kf_tasks[1] = NULL;
 	if (rq)
 		update_locked_rq(rq);
 
@@ -430,7 +434,9 @@ static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
 
 	if (rq)
 		update_locked_rq(NULL);
-	current->scx.kf_tasks[0] = NULL;
+	current->scx.kf_tasks[0] = __scx_kf0_sv;
+	current->scx.kf_tasks[1] = __scx_kf1_sv;
+	current->scx.kf_nest--;
 }
 
 enum scx_dsq_iter_flags {
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index f9fe7c6ebc4be..7d92826fc3a5f 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1792,37 +1792,50 @@ do {										\
  * pi_lock held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu.
  * So if kf_tasks[] is set, @p's scheduler-protected fields are stable.
  *
- * kf_tasks[] can not stack, so task-based SCX ops must not nest. The
- * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
- * while a previous one is still in progress.
+ * Task-based SCX ops may nest (e.g. ops.running() calling a kfunc that ends up
+ * in enqueue_task_scx() -> ops.runnable()). Save and restore kf_tasks[] around
+ * each invocation so the outer op's context is restored for kfuncs and for
+ * further nested calls. Single-task ops save/restore both slots and clear
+ * kf_tasks[1] while active so a nested call under SCX_CALL_OP_2TASKS_RET does
+ * not leave the outer pair's second task authenticated for kfuncs.
  */
 #define SCX_CALL_OP_TASK(sch, op, locked_rq, task, args...)			\
 do {										\
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);					\
+	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];		\
+	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];		\
+										\
 	current->scx.kf_tasks[0] = task;					\
+	current->scx.kf_tasks[1] = NULL;					\
 	SCX_CALL_OP((sch), op, locked_rq, task, ##args);			\
-	current->scx.kf_tasks[0] = NULL;					\
+	current->scx.kf_tasks[0] = __scx_kf0_sv;				\
+	current->scx.kf_tasks[1] = __scx_kf1_sv;				\
 } while (0)
 
 #define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...)			\
 ({										\
 	__typeof__((sch)->ops.op(task, ##args)) __ret;				\
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);					\
+	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];		\
+	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];		\
+										\
 	current->scx.kf_tasks[0] = task;					\
+	current->scx.kf_tasks[1] = NULL;					\
 	__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args);		\
-	current->scx.kf_tasks[0] = NULL;					\
+	current->scx.kf_tasks[0] = __scx_kf0_sv;				\
+	current->scx.kf_tasks[1] = __scx_kf1_sv;				\
 	__ret;									\
 })
 
 #define SCX_CALL_OP_2TASKS_RET(sch, op, locked_rq, task0, task1, args...)	\
 ({										\
 	__typeof__((sch)->ops.op(task0, task1, ##args)) __ret;			\
-	WARN_ON_ONCE(current->scx.kf_tasks[0]);					\
+	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];		\
+	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];		\
+										\
 	current->scx.kf_tasks[0] = task0;					\
 	current->scx.kf_tasks[1] = task1;					\
 	__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args);	\
-	current->scx.kf_tasks[0] = NULL;					\
-	current->scx.kf_tasks[1] = NULL;					\
+	current->scx.kf_tasks[0] = __scx_kf0_sv;				\
+	current->scx.kf_tasks[1] = __scx_kf1_sv;				\
 	__ret;									\
 })
 
-- 
2.55.0



* [PATCH 08/12] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (6 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 07/12] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 17:09 ` [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers Andrea Righi
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

ops.running() can pull in enqueue_task_scx() -> ops.runnable() on the
same current task. Saving and restoring kf_tasks[] alone is not
sufficient for every BPF/kfunc combination, and the nested callback can
still lead to NULL dispatches and stack corruption.

Track SCX_CALL_OP_TASK nesting in current->scx.kf_nest (incremented by
all SCX_CALL_OP_TASK* macros) and omit the ops.runnable() callback when
it is non-zero. The full enqueue path, including ops.enqueue(), still
runs; only the runnable hook is skipped in this case.
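
As a rough user-space sketch of the nesting counter (not the kernel
code; struct scx_model and all function names below are hypothetical),
the skip can be modeled like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-task state touched by this patch. */
struct scx_model {
	unsigned int kf_nest;		/* models current->scx.kf_nest */
	unsigned int nr_runnable;	/* ops.runnable() invocations */
	unsigned int nr_enqueued;	/* completed enqueue paths */
};

/* Models the SCX_CALL_OP_TASK*() macros bumping the nesting depth. */
static void call_op_task(struct scx_model *m, void (*op)(struct scx_model *))
{
	m->kf_nest++;
	op(m);
	m->kf_nest--;
}

static void op_runnable(struct scx_model *m)
{
	m->nr_runnable++;
}

/* Models the enqueue path: skip only the runnable hook when nested. */
static void enqueue_task(struct scx_model *m)
{
	if (!m->kf_nest)
		call_op_task(m, op_runnable);
	m->nr_enqueued++;	/* the rest of the enqueue path still runs */
}

/* Stand-in for an ops.running() that indirectly enqueues a task. */
static void op_running(struct scx_model *m)
{
	enqueue_task(m);
}

/* Count runnable callbacks for a top-level or a nested enqueue. */
static unsigned int runnable_calls(bool nested)
{
	struct scx_model m = { 0 };

	if (nested)
		call_op_task(&m, op_running);
	else
		enqueue_task(&m);

	assert(m.nr_enqueued == 1);
	assert(m.kf_nest == 0);
	return m.nr_runnable;
}
```

A top-level enqueue invokes the runnable hook once, while an enqueue
reached from inside another task op completes without invoking it.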

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 include/linux/sched/ext.h   | 7 +++++++
 kernel/sched/ext/ext.c      | 3 ++-
 kernel/sched/ext/internal.h | 6 ++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index e599bb86f8acd..a0c2077216094 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -201,6 +201,13 @@ struct sched_ext_entity {
 	s32			holding_cpu;
 	s32			selected_cpu;
 	struct task_struct	*kf_tasks[2];	/* see SCX_CALL_OP_TASK() */
+	/*
+	 * Nesting depth of SCX_CALL_OP_TASK() on this task as %current (e.g.
+	 * during schedule() %current is still the previous task). Used to skip
+	 * ops.runnable() when invoked from inside another task op such as
+	 * ops.running() to avoid breaking BPF re-entrance guarantees.
+	 */
+	u32			kf_nest;
 
 	struct list_head	runnable_node;	/* rq->scx.runnable_list */
 	unsigned long		runnable_at;
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4cefe5acb36ff..c48d043dbe58f 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -1862,7 +1862,8 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
 	rq->scx.nr_running++;
 	add_nr_running(rq, 1);
 
-	if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p))
+	if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p) &&
+	    !READ_ONCE(current->scx.kf_nest))
 		SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags);
 
 	if (enq_flags & SCX_ENQ_WAKEUP)
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 7d92826fc3a5f..1774e44aebcf7 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1804,11 +1804,13 @@ do {										\
 	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];		\
 	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];		\
 										\
+	current->scx.kf_nest++;							\
 	current->scx.kf_tasks[0] = task;					\
 	current->scx.kf_tasks[1] = NULL;					\
 	SCX_CALL_OP((sch), op, locked_rq, task, ##args);			\
 	current->scx.kf_tasks[0] = __scx_kf0_sv;				\
 	current->scx.kf_tasks[1] = __scx_kf1_sv;				\
+	current->scx.kf_nest--;							\
 } while (0)
 
 #define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...)			\
@@ -1817,11 +1819,13 @@ do {										\
 	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];		\
 	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];		\
 										\
+	current->scx.kf_nest++;							\
 	current->scx.kf_tasks[0] = task;					\
 	current->scx.kf_tasks[1] = NULL;					\
 	__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args);		\
 	current->scx.kf_tasks[0] = __scx_kf0_sv;				\
 	current->scx.kf_tasks[1] = __scx_kf1_sv;				\
+	current->scx.kf_nest--;							\
 	__ret;									\
 })
 
@@ -1831,11 +1835,13 @@ do {										\
 	struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];		\
 	struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];		\
 										\
+	current->scx.kf_nest++;							\
 	current->scx.kf_tasks[0] = task0;					\
 	current->scx.kf_tasks[1] = task1;					\
 	__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args);	\
 	current->scx.kf_tasks[0] = __scx_kf0_sv;				\
 	current->scx.kf_tasks[1] = __scx_kf1_sv;				\
+	current->scx.kf_nest--;							\
 	__ret;									\
 })
 
-- 
2.55.0



* [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (7 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 08/12] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 18:41   ` K Prateek Nayak
  2026-07-02 17:09 ` [PATCH 10/12] sched_ext: Add selftest for blocked donor admission Andrea Righi
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

Proxy execution keeps a blocked donor runnable so its scheduling context
can execute the mutex owner. Dispatching sched_ext donors directly to a
local DSQ bypasses the BPF scheduler's ordering policy and can give
donors more CPU priority than intended just to perform the handoff to
the mutex owner.

Add SCX_OPS_ENQ_BLOCKED as an explicit proxy execution capability.
Tasks owned by schedulers without the flag block normally. Schedulers
with the flag receive blocked donors through ops.enqueue() and can use
scx_bpf_task_is_blocked() to apply their own admission policy.
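
The resulting decision for a task that is about to sleep can be
sketched in user space as follows. This only models the condition that
__schedule() passes to try_to_block_task() in this patch; struct
task_model and its fields are hypothetical stand-ins for the real task
state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the task state consulted by the check. */
struct task_model {
	bool blocked_on_mutex;	/* models task_is_blocked(p) */
	bool on_scx;		/* models task_on_scx(p) */
	bool enq_blocked;	/* owning scheduler set SCX_OPS_ENQ_BLOCKED */
};

/* Models scx_allow_proxy_exec(): non-SCX tasks are always allowed. */
static bool allow_proxy_exec(const struct task_model *p)
{
	if (!p->on_scx)
		return true;
	return p->enq_blocked;
}

/*
 * true:  the task is dequeued and blocks normally.
 * false: the task stays queued as a proxy donor.
 */
static bool should_block(const struct task_model *p)
{
	return !p->blocked_on_mutex || !allow_proxy_exec(p);
}

/* Convenience wrapper so callers can pass the three inputs directly. */
static bool should_block_case(bool blocked, bool on_scx, bool enq_blocked)
{
	struct task_model p = {
		.blocked_on_mutex = blocked,
		.on_scx = on_scx,
		.enq_blocked = enq_blocked,
	};

	return should_block(&p);
}
```

Only a mutex-blocked task whose scheduler opted in (or that is not on
sched_ext at all) stays queued; every other sleeping task blocks as
before.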

The donor starts associated with its original CPU. A BPF scheduler may
dispatch it directly into that CPU's local DSQ to let the core resolve
the mutex owner and execute it with the donor's scheduling context.

Knowing the mutex owner's location in BPF is not strictly required, but
steering the donation there can reduce handoff latency when migration
fits the donor's affinity constraints and the scheduler's policy. Add
scx_bpf_task_proxy_cpu() and scx_bpf_task_proxy_cid() to report that
destination in the corresponding address space, and let the BPF
scheduler decide whether to migrate the donor to the mutex owner's
location.
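
One possible steering policy, similar in spirit to what the selftest in
this series does, is sketched below as plain user-space C. It is not
BPF code: the 64-bit affinity mask and the pick_donor_cpu() helper are
assumptions made for illustration, and the negative inputs stand for
the errno values returned by task_proxy_cpu().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical affinity check on a mask covering CPUs 0..63. */
static bool cpu_allowed(uint64_t allowed_mask, int cpu)
{
	return cpu >= 0 && cpu < 64 &&
	       (allowed_mask & (UINT64_C(1) << cpu)) != 0;
}

/*
 * Pick the CPU whose local DSQ should receive the blocked donor.
 *
 * @proxy_cpu:    result of the proxy CPU query, or a negative errno
 * @donor_cpu:    CPU the donor is currently associated with
 * @allowed_mask: donor's affinity mask (model only)
 *
 * Fall back to the donor's own CPU when there is no valid hint or when
 * the hint violates the donor's affinity.
 */
static int pick_donor_cpu(int proxy_cpu, int donor_cpu, uint64_t allowed_mask)
{
	if (proxy_cpu < 0 || !cpu_allowed(allowed_mask, proxy_cpu))
		return donor_cpu;
	return proxy_cpu;
}
```

Because the owner relationship can change right after the query, a
policy like this treats the returned CPU purely as a hint and always
has the donor's current CPU as a safe fallback.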

Reschedule a retained donor when its mutex wakes it so ops.dispatch()
can reconsider the now-unblocked task. Make SCX_OPS_ENQ_BLOCKED
override the exiting and migration-disabled enqueue fallbacks so
opted-in schedulers receive all eligible donor requests.
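
The precedence over the two fallbacks can be modeled with the following
user-space sketch. It covers only the two checks changed in
scx_do_enqueue_task() and ignores the other paths of that function; the
flag values and names are model-only and do not match the kernel's bit
assignments.

```c
#include <assert.h>
#include <stdbool.h>

/* Model-only flag bits, not the kernel's values. */
#define M_ENQ_EXITING			(1u << 0)
#define M_ENQ_MIGRATION_DISABLED	(1u << 1)
#define M_ENQ_BLOCKED			(1u << 2)

enum enq_path {
	ENQ_LOCAL_FALLBACK,	/* task goes straight to the local DSQ */
	ENQ_BPF,		/* task is passed to ops.enqueue() */
};

/* Decide where a task goes given the scheduler flags and task state. */
static enum enq_path enqueue_path(unsigned int ops_flags, bool blocked,
				  bool exiting, bool migration_disabled)
{
	/* A blocked donor of an opted-in scheduler skips both fallbacks. */
	bool enq_blocked = blocked && (ops_flags & M_ENQ_BLOCKED);

	if (!enq_blocked && !(ops_flags & M_ENQ_EXITING) && exiting)
		return ENQ_LOCAL_FALLBACK;

	if (!enq_blocked && !(ops_flags & M_ENQ_MIGRATION_DISABLED) &&
	    migration_disabled)
		return ENQ_LOCAL_FALLBACK;

	return ENQ_BPF;
}
```

In this model an exiting or migration-disabled task still takes the
local fallback unless it is a blocked donor and its scheduler set the
blocked-enqueue flag.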

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/core.c                      |  36 +++++++-
 kernel/sched/ext/ext.c                   | 108 +++++++++++++++++++----
 kernel/sched/ext/ext.h                   |   2 +
 kernel/sched/ext/internal.h              |  19 +++-
 kernel/sched/sched.h                     |   6 ++
 tools/sched_ext/include/scx/common.bpf.h |   3 +
 tools/sched_ext/include/scx/compat.h     |   1 +
 7 files changed, 156 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6aedb26c08ee7..f0edfbe8ce232 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7018,6 +7018,39 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	proxy_migrate_task(rq, rf, p, owner_cpu);
 	return NULL;
 }
+
+int task_proxy_cpu(struct task_struct *p)
+{
+	struct task_struct *owner;
+	struct mutex *mutex;
+	int owner_cpu;
+
+	if (!task_is_blocked(p))
+		return -ENOENT;
+
+	mutex = READ_ONCE(p->blocked_on);
+	if (!mutex)
+		return -ENOENT;
+
+	guard(raw_spinlock)(&mutex->wait_lock);
+	guard(raw_spinlock)(&p->blocked_lock);
+
+	if (mutex != __get_task_blocked_on(p))
+		return -EAGAIN;
+
+	owner = __mutex_owner(mutex);
+	if (!owner)
+		return -ENOENT;
+	if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed)
+		return -EAGAIN;
+
+	owner_cpu = task_cpu(owner);
+	if (owner_cpu != task_cpu(p) &&
+	    (p->nr_cpus_allowed == 1 || is_migration_disabled(p)))
+		return -EOPNOTSUPP;
+
+	return owner_cpu;
+}
 #else /* SCHED_PROXY_EXEC */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
@@ -7148,7 +7181,8 @@ static void __sched notrace __schedule(int sched_mode)
 		 * task_is_blocked() will always be false).
 		 */
 		try_to_block_task(rq, prev, &prev_state,
-				  !task_is_blocked(prev));
+				  !task_is_blocked(prev) ||
+				  !scx_allow_proxy_exec(prev));
 		switch_count = &prev->nvcsw;
 	}
 
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index c48d043dbe58f..8ffbf857acd51 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -23,6 +23,14 @@
 
 DEFINE_RAW_SPINLOCK(scx_sched_lock);
 
+bool scx_allow_proxy_exec(const struct task_struct *p)
+{
+	if (!task_on_scx(p))
+		return true;
+
+	return scx_task_sched(p)->ops.flags & SCX_OPS_ENQ_BLOCKED;
+}
+
 /*
  * NOTE: sched_ext is in the process of growing multiple scheduler support and
  * scx_root usage is in a transitional state. Naked dereferences are safe if the
@@ -1700,6 +1708,7 @@ static void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_fl
 	struct scx_sched *sch = scx_task_sched(p);
 	struct task_struct **ddsp_taskp;
 	struct scx_dispatch_q *dsq;
+	bool enq_blocked;
 	unsigned long qseq;
 
 	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
@@ -1732,15 +1741,20 @@ static void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_fl
 	if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID)
 		goto direct;
 
+	/* %SCX_OPS_ENQ_BLOCKED takes precedence over the fallbacks below. */
+	enq_blocked = task_is_blocked(p) &&
+		      (sch->ops.flags & SCX_OPS_ENQ_BLOCKED);
+
 	/* see %SCX_OPS_ENQ_EXITING */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
+	if (!enq_blocked && !(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
 	    unlikely(p->flags & PF_EXITING)) {
 		__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
 		goto local;
 	}
 
 	/* see %SCX_OPS_ENQ_MIGRATION_DISABLED */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
+	if (!enq_blocked &&
+	    !(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
 	    is_migration_disabled(p)) {
 		__scx_add_event(sch, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED, 1);
 		goto local;
@@ -2042,11 +2056,18 @@ static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_fl
 {
 	/*
 	 * Preemption between SCX tasks is implemented by resetting the victim
-	 * task's slice to 0 and triggering reschedule on the target CPU.
-	 * Nothing to do.
-	 */
-	if (p->sched_class == &ext_sched_class)
+	 * task's slice to 0 and triggering reschedule on the target CPU. A
+	 * mutex-blocked task is kept queued for proxy execution, so its wakeup
+	 * doesn't go through enqueue_task_scx(). If the BPF scheduler manages
+	 * blocked donors, reschedule explicitly so that it can reconsider a
+	 * donor it declined to dispatch while blocked.
+	 */
+	if (p->sched_class == &ext_sched_class) {
+		if (p->is_blocked &&
+		    (scx_task_sched(p)->ops.flags & SCX_OPS_ENQ_BLOCKED))
+			resched_curr(rq);
 		return;
+	}
 
 	/*
 	 * Getting preempted by a higher-priority class. Reenqueue IMMED tasks.
@@ -2837,19 +2858,12 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		set_task_runnable(rq, p);
 
 		/*
-		 * Mutex-blocked donors stay queued on the runqueue under proxy
-		 * execution, but the donor never runs as itself, proxy-exec
-		 * walks the blocked_on chain on the next __schedule() and runs
-		 * the lock owner in its place.
-		 *
-		 * Put the donor on the local DSQ directly, so pick_next_task()
-		 * can still see it, find_proxy_task() will be invoked on
-		 * next->blocked_on and either run the chain owner here, or call
-		 * proxy_force_return() and let BPF make a new dispatch decision
-		 * once the task is no longer blocked.
+		 * Mutex-blocked donors only stay queued when their BPF scheduler
+		 * enables %SCX_OPS_ENQ_BLOCKED, so always delegate their admission.
 		 */
 		if (task_is_blocked(p)) {
-			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
+			WARN_ON_ONCE(!(sch->ops.flags & SCX_OPS_ENQ_BLOCKED));
+			scx_do_enqueue_task(rq, p, 0, -1);
 			goto switch_class;
 		}
 
@@ -6581,6 +6595,11 @@ int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 		return -EINVAL;
 	}
 
+	if ((ops->flags & SCX_OPS_ENQ_BLOCKED) && !ops->enqueue) {
+		scx_error(sch, "SCX_OPS_ENQ_BLOCKED requires ops.enqueue() to be implemented");
+		return -EINVAL;
+	}
+
 	/*
 	 * SCX_OPS_TID_TO_TASK is enabled by the root scheduler. A sub-sched
 	 * may set it to declare a dependency; reject if the root hasn't
@@ -9294,6 +9313,57 @@ __bpf_kfunc bool scx_bpf_task_running(const struct task_struct *p)
 	return task_rq(p)->curr == p;
 }
 
+/**
+ * scx_bpf_task_is_blocked - Is a task currently blocked?
+ * @p: task of interest
+ *
+ * A BPF scheduler using %SCX_OPS_ENQ_BLOCKED receives blocked donors through
+ * ops.enqueue() and can decide when to make them available for proxy
+ * execution.
+ */
+__bpf_kfunc bool scx_bpf_task_is_blocked(struct task_struct *p)
+{
+	return task_is_blocked(p);
+}
+
+/**
+ * scx_bpf_task_proxy_cpu - Return the next proxy execution CPU
+ * @p: task of interest
+ *
+ * Return the CPU of the mutex owner toward which @p's scheduling context
+ * would next be migrated for proxy execution. The owner relationship can
+ * change after this function returns, so the result is only a scheduling
+ * hint. Returns a negative errno if no valid proxy destination is available.
+ */
+__bpf_kfunc s32 scx_bpf_task_proxy_cpu(struct task_struct *p)
+{
+	return task_proxy_cpu(p);
+}
+
+/**
+ * scx_bpf_task_proxy_cid - Return the next proxy execution cid
+ * @p: task of interest
+ *
+ * cid-addressed equivalent of scx_bpf_task_proxy_cpu(). Return the cid of the
+ * mutex owner toward which @p's scheduling context would next be migrated for
+ * proxy execution. The owner relationship can change after this function
+ * returns, so the result is only a scheduling hint. Returns a negative errno
+ * if no valid proxy destination or cid mapping is available.
+ */
+__bpf_kfunc s32 scx_bpf_task_proxy_cid(struct task_struct *p)
+{
+	s16 *tbl = READ_ONCE(scx_cpu_to_cid_tbl);
+	s32 cpu;
+
+	cpu = task_proxy_cpu(p);
+	if (cpu < 0)
+		return cpu;
+	if (!tbl)
+		return -EINVAL;
+
+	return READ_ONCE(tbl[cpu]);
+}
+
 /**
  * scx_bpf_task_cpu - CPU a task is currently associated with
  * @p: task of interest
@@ -9597,6 +9667,9 @@ BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
 BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_is_blocked, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_proxy_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_proxy_cid, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_task_cid, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL)
@@ -9632,6 +9705,7 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = {
  */
 BTF_KFUNCS_START(scx_kfunc_ids_cpu_only)
 BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_task_proxy_cpu, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
 BTF_ID_FLAGS(func, scx_bpf_cpu_node, KF_IMPLICIT_ARGS)
diff --git a/kernel/sched/ext/ext.h b/kernel/sched/ext/ext.h
index c7fa4d06ac7d3..08c64547b0143 100644
--- a/kernel/sched/ext/ext.h
+++ b/kernel/sched/ext/ext.h
@@ -20,6 +20,7 @@ void scx_rq_deactivate(struct rq *rq);
 int scx_check_setscheduler(struct task_struct *p, int policy);
 bool task_should_scx(int policy);
 bool scx_allow_ttwu_queue(const struct task_struct *p);
+bool scx_allow_proxy_exec(const struct task_struct *p);
 void init_sched_ext_class(void);
 
 static inline u32 scx_cpuperf_target(s32 cpu)
@@ -60,6 +61,7 @@ static inline int scx_check_setscheduler(struct task_struct *p, int policy) { re
 static inline bool task_on_scx(const struct task_struct *p) { return false; }
 static inline bool is_ext_class(const struct task_struct *p) { return false; }
 static inline bool scx_allow_ttwu_queue(const struct task_struct *p) { return true; }
+static inline bool scx_allow_proxy_exec(const struct task_struct *p) { return true; }
 static inline void init_sched_ext_class(void) {}
 
 #endif	/* CONFIG_SCHED_CLASS_EXT */
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 1774e44aebcf7..a958c9d3144b3 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -212,6 +212,22 @@ enum scx_ops_flags {
 	 */
 	SCX_OPS_TID_TO_TASK		= 1LLU << 8,
 
+	/*
+	 * If set, mutex-blocked tasks remain runnable as proxy donors and are
+	 * passed to ops.enqueue(). The BPF scheduler can identify them with
+	 * scx_bpf_task_is_blocked() and query the next mutex owner's CPU or cid
+	 * with scx_bpf_task_proxy_cpu() or scx_bpf_task_proxy_cid(). It controls
+	 * when donors are dispatched and whether they should preempt work on the
+	 * owner's CPU.
+	 *
+	 * If clear, mutex-blocked tasks are removed from the runqueue normally
+	 * and cannot donate their scheduling context through proxy execution.
+	 *
+	 * For blocked donors, this flag takes precedence over
+	 * %SCX_OPS_ENQ_EXITING and %SCX_OPS_ENQ_MIGRATION_DISABLED.
+	 */
+	SCX_OPS_ENQ_BLOCKED		= 1LLU << 9,
+
 	SCX_OPS_ALL_FLAGS		= SCX_OPS_KEEP_BUILTIN_IDLE |
 					  SCX_OPS_ENQ_LAST |
 					  SCX_OPS_ENQ_EXITING |
@@ -220,7 +236,8 @@ enum scx_ops_flags {
 					  SCX_OPS_SWITCH_PARTIAL |
 					  SCX_OPS_BUILTIN_IDLE_PER_NODE |
 					  SCX_OPS_ALWAYS_ENQ_IMMED |
-					  SCX_OPS_TID_TO_TASK,
+					  SCX_OPS_TID_TO_TASK |
+					  SCX_OPS_ENQ_BLOCKED,
 
 	/* high 8 bits are internal, don't include in SCX_OPS_ALL_FLAGS */
 	__SCX_OPS_INTERNAL_MASK		= 0xffLLU << 56,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 56acf502ba260..c4eb70371f7af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2470,6 +2470,12 @@ static inline bool task_is_blocked(struct task_struct *p)
 	return !!p->blocked_on;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+int task_proxy_cpu(struct task_struct *p);
+#else
+static inline int task_proxy_cpu(struct task_struct *p) { return -EOPNOTSUPP; }
+#endif
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 	return p->on_cpu;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index bd51986c4c42e..4b82af377a15d 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -95,6 +95,9 @@ s32 scx_bpf_pick_idle_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed, int node, u64 flags) __ksym __weak;
 s32 scx_bpf_pick_any_cpu(const cpumask_t *cpus_allowed, u64 flags) __ksym;
 bool scx_bpf_task_running(const struct task_struct *p) __ksym;
+bool scx_bpf_task_is_blocked(struct task_struct *p) __ksym __weak;
+s32 scx_bpf_task_proxy_cpu(struct task_struct *p) __ksym __weak;
+s32 scx_bpf_task_proxy_cid(struct task_struct *p) __ksym __weak;
 s32 scx_bpf_task_cpu(const struct task_struct *p) __ksym;
 struct rq *scx_bpf_locked_rq(void) __ksym;
 struct task_struct *scx_bpf_cpu_curr(s32 cpu) __ksym __weak;
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h
index 23d9ef3e4c9d2..8d5606e465080 100644
--- a/tools/sched_ext/include/scx/compat.h
+++ b/tools/sched_ext/include/scx/compat.h
@@ -117,6 +117,7 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
 #define SCX_OPS_ALLOW_QUEUED_WAKEUP SCX_OPS_FLAG(SCX_OPS_ALLOW_QUEUED_WAKEUP)
 #define SCX_OPS_BUILTIN_IDLE_PER_NODE SCX_OPS_FLAG(SCX_OPS_BUILTIN_IDLE_PER_NODE)
 #define SCX_OPS_ALWAYS_ENQ_IMMED SCX_OPS_FLAG(SCX_OPS_ALWAYS_ENQ_IMMED)
+#define SCX_OPS_ENQ_BLOCKED SCX_OPS_FLAG(SCX_OPS_ENQ_BLOCKED)
 
 #define SCX_PICK_IDLE_FLAG(name) __COMPAT_ENUM_OR_ZERO("scx_pick_idle_cpu_flags", #name)
 
-- 
2.55.0



* [PATCH 10/12] sched_ext: Add selftest for blocked donor admission
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (8 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 17:09 ` [PATCH 11/12] sched_ext: scx_qmap: Add proxy execution support Andrea Righi
  2026-07-02 17:09 ` [PATCH 12/12] sched: Allow enabling proxy exec with sched_ext Andrea Righi
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

SCX_OPS_ENQ_BLOCKED allows BPF schedulers to receive blocked proxy
donors through ops.enqueue(), while scx_bpf_task_is_blocked() identifies
donor admission requests and scx_bpf_task_proxy_cpu() returns the CPU
of the mutex owner. Add test coverage for this interface, including
validation of the owner's CPU.

Exercise a three-task priority inversion on a single CPU using a
weighted vruntime BPF scheduler. A nice +19 owner holds a shared mutex,
a nice -20 donor blocks on it and a nice 0 CPU contender delays the
owner.

Run the workload with SCX_OPS_ENQ_BLOCKED first disabled and then
enabled. In the enabled run, count blocked donor enqueues, validate the
proxy CPU, and dispatch the donor there to facilitate proxy execution.
Report average mutex hold and wait times and their deltas without
enforcing performance results.

Proxy execution coverage requires CONFIG_SCHED_PROXY_EXEC=y, which the
selftest config selects. Expect blocked enqueue events when generic
proxy execution is enabled and none when proxy execution is disabled.

The mutex is provided via a loadable kernel module, built via
TEST_GEN_MODS_DIR, with the test responsible for loading, unloading, and
managing its lifecycle.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/.gitignore  |   4 +
 tools/testing/selftests/sched_ext/Makefile    |   2 +
 tools/testing/selftests/sched_ext/config      |   2 +
 .../selftests/sched_ext/enq_blocked.bpf.c     | 116 +++
 .../testing/selftests/sched_ext/enq_blocked.c | 682 ++++++++++++++++++
 .../testing/selftests/sched_ext/enq_blocked.h |  21 +
 .../selftests/sched_ext/test_modules/Makefile |  13 +
 .../test_modules/scx_enq_blocked_test.c       | 134 ++++
 8 files changed, 974 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.c
 create mode 100644 tools/testing/selftests/sched_ext/enq_blocked.h
 create mode 100644 tools/testing/selftests/sched_ext/test_modules/Makefile
 create mode 100644 tools/testing/selftests/sched_ext/test_modules/scx_enq_blocked_test.c

diff --git a/tools/testing/selftests/sched_ext/.gitignore b/tools/testing/selftests/sched_ext/.gitignore
index ae5491a114c09..54a1fd2af713d 100644
--- a/tools/testing/selftests/sched_ext/.gitignore
+++ b/tools/testing/selftests/sched_ext/.gitignore
@@ -4,3 +4,7 @@
 !Makefile
 !.gitignore
 !config
+!test_modules/
+!test_modules/scx_enq_blocked_test.c
+!test_modules/Makefile
+test_modules/*.mod.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 5d2dffca0e918..8eeece67d1de1 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -5,6 +5,7 @@ include ../../../scripts/Makefile.arch
 include ../../../scripts/Makefile.include
 
 TEST_GEN_PROGS := runner
+TEST_GEN_MODS_DIR := test_modules
 
 # override lib.mk's default rules
 OVERRIDE_TARGETS := 1
@@ -164,6 +165,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
 auto-test-targets :=			\
 	create_dsq			\
 	dequeue				\
+	enq_blocked			\
 	enq_last_no_enq_fails		\
 	ddsp_bogus_dsq_fail		\
 	ddsp_vtimelocal_fail		\
diff --git a/tools/testing/selftests/sched_ext/config b/tools/testing/selftests/sched_ext/config
index aa901b05c8ad6..affa3cf33470a 100644
--- a/tools/testing/selftests/sched_ext/config
+++ b/tools/testing/selftests/sched_ext/config
@@ -6,3 +6,5 @@ CONFIG_BPF=y
 CONFIG_BPF_SYSCALL=y
 CONFIG_DEBUG_INFO=y
 CONFIG_DEBUG_INFO_BTF=y
+CONFIG_EXPERT=y
+CONFIG_SCHED_PROXY_EXEC=y
diff --git a/tools/testing/selftests/sched_ext/enq_blocked.bpf.c b/tools/testing/selftests/sched_ext/enq_blocked.bpf.c
new file mode 100644
index 0000000000000..c18be4a10f35e
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_blocked.bpf.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES
+ *
+ * Verify that SCX_OPS_ENQ_BLOCKED passes blocked proxy donors through
+ * ops.enqueue().
+ */
+
+#include <scx/common.bpf.h>
+
+#define SHARED_DSQ 0
+
+char _license[] SEC("license") = "GPL";
+
+s32 donor_pid;
+s32 expected_proxy_cpu;
+u64 nr_blocked_enqueues;
+u64 nr_bad_proxy_cpus;
+static u64 vtime_now;
+
+UEI_DEFINE(uei);
+
+s32 BPF_STRUCT_OPS(enq_blocked_select_cpu,
+		   struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+	return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(enq_blocked_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	if (scx_bpf_task_is_blocked(p)) {
+		int cpu = scx_bpf_task_proxy_cpu(p);
+
+		if (p->pid == donor_pid) {
+			__sync_fetch_and_add(&nr_blocked_enqueues, 1);
+			if (cpu != expected_proxy_cpu)
+				__sync_fetch_and_add(&nr_bad_proxy_cpus, 1);
+		}
+
+		/*
+		 * Try to migrate the donor to the owner's CPU if possible, to
+		 * speed up the proxy exec switch and reduce handoff latency.
+		 */
+		if (cpu < 0 || !bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
+			cpu = scx_bpf_task_cpu(p);
+
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, enq_flags);
+		return;
+	}
+
+	/* Limit the amount of budget an idling task can accumulate. */
+	u64 vtime = p->scx.dsq_vtime;
+
+	if (time_before(vtime, vtime_now - SCX_SLICE_DFL))
+		vtime = vtime_now - SCX_SLICE_DFL;
+
+	scx_bpf_dsq_insert_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
+				 enq_flags);
+	if (enq_flags & SCX_ENQ_LAST)
+		scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
+}
+
+void BPF_STRUCT_OPS(enq_blocked_dispatch, s32 cpu, struct task_struct *prev)
+{
+	scx_bpf_dsq_move_to_local(SHARED_DSQ, 0);
+}
+
+void BPF_STRUCT_OPS(enq_blocked_running, struct task_struct *p)
+{
+	if (time_before(vtime_now, p->scx.dsq_vtime))
+		vtime_now = p->scx.dsq_vtime;
+}
+
+void BPF_STRUCT_OPS(enq_blocked_stopping, struct task_struct *p, bool runnable)
+{
+	u64 delta = scale_by_task_weight_inverse(p,
+					 SCX_SLICE_DFL - p->scx.slice);
+
+	scx_bpf_task_set_dsq_vtime(p, p->scx.dsq_vtime + delta);
+}
+
+void BPF_STRUCT_OPS(enq_blocked_enable, struct task_struct *p)
+{
+	scx_bpf_task_set_dsq_vtime(p, vtime_now);
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(enq_blocked_init)
+{
+	int ret;
+
+	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+	if (ret) {
+		scx_bpf_error("failed to create DSQ %d (%d)", SHARED_DSQ, ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(enq_blocked_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops enq_blocked_ops = {
+	.select_cpu		= (void *)enq_blocked_select_cpu,
+	.enqueue		= (void *)enq_blocked_enqueue,
+	.dispatch		= (void *)enq_blocked_dispatch,
+	.running		= (void *)enq_blocked_running,
+	.stopping		= (void *)enq_blocked_stopping,
+	.enable			= (void *)enq_blocked_enable,
+	.init			= (void *)enq_blocked_init,
+	.exit			= (void *)enq_blocked_exit,
+	.name			= "enq_blocked",
+};
diff --git a/tools/testing/selftests/sched_ext/enq_blocked.c b/tools/testing/selftests/sched_ext/enq_blocked.c
new file mode 100644
index 0000000000000..cb1c164d0a127
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_blocked.c
@@ -0,0 +1,682 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES
+ *
+ * Exercise a three-task priority inversion on a single CPU. A high-priority
+ * donor blocks on a mutex held by a low-priority owner while a medium-priority
+ * contender consumes CPU. A weighted-vruntime BPF scheduler runs the workload
+ * with SCX_OPS_ENQ_BLOCKED first disabled and then enabled. When enabled, the
+ * scheduler observes the blocked donor in ops.enqueue(), validates the proxy
+ * CPU, and dispatches the donor there to facilitate proxy execution.
+ *
+ * Report the average mutex hold and wait times for both runs and their deltas.
+ * The timing data is informational; only blocked-donor admission and proxy CPU
+ * selection are validated. CONFIG_SCHED_PROXY_EXEC=y is required to exercise
+ * the proxy-execution paths.
+ */
+#define _GNU_SOURCE
+
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <pthread.h>
+#include <sched.h>
+#include <scx/common.h>
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/resource.h>
+#include <sys/syscall.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "enq_blocked.bpf.skel.h"
+#include "enq_blocked.h"
+#include "scx_test.h"
+
+#define MODULE_NAME	"scx_enq_blocked_test"
+#define MODULE_FILE	"test_modules/" MODULE_NAME ".ko"
+#define DEVICE_PATH	"/dev/scx_enq_blocked"
+#define WAIT_STEP_US	1000
+#define WAIT_TIMEOUT_MS	2000
+#define NR_TRIALS	10
+#define JOIN_TIMEOUT_MS	((NR_TRIALS + 1) * WAIT_TIMEOUT_MS)
+#define OWNER_NICE	19
+#define DONOR_NICE	-20
+#define CONTENDER_NICE	0
+
+struct thread_ctx {
+	atomic_bool start_donor;
+	atomic_bool abort;
+	atomic_bool stop_contender;
+	atomic_int contender_status;
+	atomic_int donor_pid;
+	atomic_int donor_completed;
+	int fd;
+	int cpu;
+};
+
+static bool parse_bool(const char *value, bool *result)
+{
+	if (!strcasecmp(value, "1") || !strcasecmp(value, "y") ||
+	    !strcasecmp(value, "yes") || !strcasecmp(value, "on") ||
+	    !strcasecmp(value, "true")) {
+		*result = true;
+		return true;
+	}
+
+	if (!strcasecmp(value, "0") || !strcasecmp(value, "n") ||
+	    !strcasecmp(value, "no") || !strcasecmp(value, "off") ||
+	    !strcasecmp(value, "false")) {
+		*result = false;
+		return true;
+	}
+
+	return false;
+}
+
+static bool cmdline_bool(const char *name, bool default_value)
+{
+	char cmdline[4096], *newline, *saveptr = NULL, *token;
+	size_t name_len = strlen(name);
+	bool value = default_value;
+	FILE *file;
+
+	file = fopen("/proc/cmdline", "r");
+	if (!file)
+		return default_value;
+
+	if (!fgets(cmdline, sizeof(cmdline), file)) {
+		fclose(file);
+		return default_value;
+	}
+	fclose(file);
+	newline = strchr(cmdline, '\n');
+	if (newline)
+		*newline = '\0';
+
+	for (token = strtok_r(cmdline, " ", &saveptr); token;
+	     token = strtok_r(NULL, " ", &saveptr)) {
+		bool parsed;
+
+		if (strncmp(token, name, name_len) || token[name_len] != '=')
+			continue;
+		if (parse_bool(token + name_len + 1, &parsed))
+			value = parsed;
+	}
+
+	return value;
+}
+
+static int module_path(char *path, size_t size)
+{
+	ssize_t len;
+	char *slash;
+
+	len = readlink("/proc/self/exe", path, size - 1);
+	if (len < 0)
+		return -errno;
+	path[len] = '\0';
+
+	slash = strrchr(path, '/');
+	if (!slash)
+		return -EINVAL;
+	*slash = '\0';
+
+	if (snprintf(slash, size - (slash - path), "/%s", MODULE_FILE) >=
+	    size - (slash - path))
+		return -ENAMETOOLONG;
+
+	return 0;
+}
+
+static int load_test_module(bool *loaded_here)
+{
+	char path[PATH_MAX];
+	int fd, err;
+
+	err = module_path(path, sizeof(path));
+	if (err)
+		return err;
+
+	fd = open(path, O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		return -errno;
+
+	if (syscall(SYS_finit_module, fd, "", 0)) {
+		err = errno;
+		close(fd);
+		if (err == EEXIST)
+			return 0;
+		return -err;
+	}
+
+	close(fd);
+	*loaded_here = true;
+	return 0;
+}
+
+static void unload_test_module(bool loaded_here)
+{
+	if (loaded_here && syscall(SYS_delete_module, MODULE_NAME, O_NONBLOCK))
+		SCX_ERR("Failed to unload %s (%d)", MODULE_NAME, errno);
+}
+
+static int pin_to_cpu(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	return sched_setaffinity(0, sizeof(mask), &mask) ? errno : 0;
+}
+
+static int set_nice(int nice)
+{
+	return setpriority(PRIO_PROCESS, 0, nice) ? errno : 0;
+}
+
+static bool wait_for_pid(atomic_int *pid)
+{
+	int waited_ms;
+
+	for (waited_ms = 0; waited_ms < WAIT_TIMEOUT_MS; waited_ms++) {
+		if (atomic_load_explicit(pid, memory_order_acquire) > 0)
+			return true;
+		usleep(WAIT_STEP_US);
+	}
+
+	return false;
+}
+
+static int wait_for_contender(struct thread_ctx *ctx)
+{
+	int status, waited_ms;
+
+	for (waited_ms = 0; waited_ms < WAIT_TIMEOUT_MS; waited_ms++) {
+		status = atomic_load_explicit(&ctx->contender_status,
+					      memory_order_acquire);
+		if (status)
+			return status;
+		usleep(WAIT_STEP_US);
+	}
+
+	return -ETIMEDOUT;
+}
+
+static bool wait_for_donor(struct thread_ctx *ctx, int trial)
+{
+	int waited_ms;
+
+	for (waited_ms = 0; waited_ms < WAIT_TIMEOUT_MS; waited_ms++) {
+		if (atomic_load_explicit(&ctx->donor_completed,
+					 memory_order_acquire) >= trial)
+			return true;
+		if (atomic_load_explicit(&ctx->abort, memory_order_relaxed))
+			return false;
+		usleep(WAIT_STEP_US);
+	}
+
+	return false;
+}
+
+static void *contender_fn(void *arg)
+{
+	struct thread_ctx *ctx = arg;
+	int err;
+
+	err = pin_to_cpu(ctx->cpu);
+	if (!err)
+		err = set_nice(CONTENDER_NICE);
+	atomic_store_explicit(&ctx->contender_status, err ? -err : 1,
+			      memory_order_release);
+	if (err)
+		return (void *)(uintptr_t)err;
+
+	while (!atomic_load_explicit(&ctx->stop_contender,
+				     memory_order_relaxed))
+		;
+
+	return NULL;
+}
+
+static void *owner_fn(void *arg)
+{
+	struct thread_ctx *ctx = arg;
+	int err, i;
+
+	err = pin_to_cpu(ctx->cpu);
+	if (err)
+		return (void *)(uintptr_t)err;
+	err = set_nice(OWNER_NICE);
+	if (err)
+		return (void *)(uintptr_t)err;
+
+	for (i = 0; i < NR_TRIALS; i++) {
+		if (ioctl(ctx->fd, ENQ_BLOCKED_IOCTL_OWNER))
+			return (void *)(uintptr_t)errno;
+		if (!wait_for_donor(ctx, i + 1))
+			return (void *)(uintptr_t)ETIMEDOUT;
+	}
+
+	return NULL;
+}
+
+static int run_donor_trial(struct thread_ctx *ctx)
+{
+	int waited_ms;
+
+	for (waited_ms = 0; waited_ms < WAIT_TIMEOUT_MS; waited_ms++) {
+		if (!ioctl(ctx->fd, ENQ_BLOCKED_IOCTL_DONOR))
+			return 0;
+		if (errno != EAGAIN)
+			return -errno;
+		usleep(WAIT_STEP_US);
+	}
+
+	return -ETIMEDOUT;
+}
+
+static void *donor_fn(void *arg)
+{
+	struct thread_ctx *ctx = arg;
+	int err, i;
+
+	err = pin_to_cpu(ctx->cpu);
+	if (err)
+		return (void *)(uintptr_t)err;
+	err = set_nice(DONOR_NICE);
+	if (err)
+		return (void *)(uintptr_t)err;
+
+	atomic_store_explicit(&ctx->donor_pid, syscall(SYS_gettid),
+			      memory_order_release);
+	while (!atomic_load_explicit(&ctx->start_donor, memory_order_acquire) &&
+	       !atomic_load_explicit(&ctx->abort, memory_order_relaxed))
+		sched_yield();
+
+	if (atomic_load_explicit(&ctx->abort, memory_order_relaxed))
+		return NULL;
+
+	for (i = 0; i < NR_TRIALS; i++) {
+		err = run_donor_trial(ctx);
+		if (err)
+			return (void *)(uintptr_t)-err;
+		atomic_store_explicit(&ctx->donor_completed, i + 1,
+				      memory_order_release);
+	}
+
+	return NULL;
+}
+
+static void print_avg_time(const char *name, u64 total_ns, u64 samples)
+{
+	u64 avg_ns = samples ? total_ns / samples : 0;
+
+	printf("  %s_avg_ns=%llu (%llu.%03llu ms, samples=%llu)\n", name,
+	       (unsigned long long)avg_ns,
+	       (unsigned long long)(avg_ns / 1000000),
+	       (unsigned long long)((avg_ns / 1000) % 1000),
+	       (unsigned long long)samples);
+}
+
+static void print_avg_delta(const char *name, u64 disabled_total,
+			    u64 disabled_samples, u64 enabled_total,
+			    u64 enabled_samples)
+{
+	u64 disabled_avg, enabled_avg;
+	s64 delta_ns;
+	double delta_pct;
+
+	if (!disabled_samples || !enabled_samples)
+		return;
+
+	disabled_avg = disabled_total / disabled_samples;
+	enabled_avg = enabled_total / enabled_samples;
+	delta_ns = (s64)enabled_avg - (s64)disabled_avg;
+	delta_pct = disabled_avg ? 100.0 * delta_ns / disabled_avg : 0.0;
+
+	printf("  %s_delta_ns=%+lld (%+.2f%%)\n", name,
+	       (long long)delta_ns, delta_pct);
+}
+
+static int join_thread(pthread_t thread, const struct timespec *deadline,
+		       int *thread_err)
+{
+	void *result;
+	int err;
+
+	err = pthread_timedjoin_np(thread, &result, deadline);
+	if (err)
+		return err;
+
+	*thread_err = (int)(uintptr_t)result;
+	return 0;
+}
+
+static void set_join_deadline(struct timespec *deadline)
+{
+	clock_gettime(CLOCK_REALTIME, deadline);
+	deadline->tv_sec += JOIN_TIMEOUT_MS / 1000;
+	deadline->tv_nsec += (JOIN_TIMEOUT_MS % 1000) * 1000000;
+	if (deadline->tv_nsec >= 1000000000) {
+		deadline->tv_sec++;
+		deadline->tv_nsec -= 1000000000;
+	}
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct enq_blocked *skel;
+	u64 flag;
+
+	skel = enq_blocked__open();
+	SCX_FAIL_IF(!skel, "Failed to open skel");
+	SCX_ENUM_INIT(skel);
+
+	flag = SCX_OPS_ENQ_BLOCKED;
+	if (!flag) {
+		enq_blocked__destroy(skel);
+		fprintf(stderr, "SKIP: SCX_OPS_ENQ_BLOCKED is unavailable\n");
+		return SCX_TEST_SKIP;
+	}
+
+	enq_blocked__destroy(skel);
+	*ctx = NULL;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run_one(bool enq_blocked,
+				    struct enq_blocked_stats *result)
+{
+	struct enq_blocked *skel;
+	struct thread_ctx thread_ctx = {};
+	struct bpf_link *link = NULL;
+	pthread_t owner, donor, contender;
+	struct timespec join_deadline;
+	cpu_set_t mask;
+	bool module_loaded = false;
+	bool owner_started = false, donor_started = false;
+	bool contender_started = false;
+	bool join_timed_out = false;
+	bool proxy_enabled;
+	enum scx_test_status status = SCX_TEST_PASS;
+	int cpu, donor_pid, err, thread_err;
+	u64 nr_blocked;
+	struct enq_blocked_stats stats;
+
+	skel = enq_blocked__open();
+	if (!skel) {
+		SCX_ERR("Failed to open skel");
+		return SCX_TEST_FAIL;
+	}
+	SCX_ENUM_INIT(skel);
+	skel->struct_ops.enq_blocked_ops->flags =
+		SCX_OPS_ENQ_LAST |
+		(enq_blocked ? SCX_OPS_ENQ_BLOCKED : 0);
+	if (enq_blocked__load(skel)) {
+		SCX_ERR("Failed to load skel");
+		status = SCX_TEST_FAIL;
+		goto out_skel;
+	}
+
+	proxy_enabled = cmdline_bool("sched_proxy_exec", true);
+
+	err = load_test_module(&module_loaded);
+	if (err == -EPERM || err == -ENOENT) {
+		fprintf(stderr, "SKIP: cannot load mutex fixture (%d)\n", -err);
+		status = SCX_TEST_SKIP;
+		goto out_skel;
+	}
+	if (err) {
+		SCX_ERR("Failed to load mutex fixture (%d)", -err);
+		status = SCX_TEST_FAIL;
+		goto out_skel;
+	}
+
+	thread_ctx.fd = open(DEVICE_PATH, O_RDONLY | O_CLOEXEC);
+	if (thread_ctx.fd < 0) {
+		SCX_ERR("Failed to open %s (%d)", DEVICE_PATH, errno);
+		status = SCX_TEST_FAIL;
+		goto out_module;
+	}
+	if (ioctl(thread_ctx.fd, ENQ_BLOCKED_IOCTL_RESET_STATS)) {
+		SCX_ERR("Failed to reset mutex statistics (%d)", errno);
+		status = SCX_TEST_FAIL;
+		goto out_fd;
+	}
+
+	if (sched_getaffinity(0, sizeof(mask), &mask)) {
+		SCX_ERR("Failed to get CPU affinity (%d)", errno);
+		status = SCX_TEST_FAIL;
+		goto out_fd;
+	}
+	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+		if (CPU_ISSET(cpu, &mask))
+			break;
+	}
+	if (cpu == CPU_SETSIZE) {
+		status = SCX_TEST_SKIP;
+		goto out_fd;
+	}
+	thread_ctx.cpu = cpu;
+	skel->bss->expected_proxy_cpu = cpu;
+
+	link = bpf_map__attach_struct_ops(skel->maps.enq_blocked_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		status = SCX_TEST_FAIL;
+		goto out_fd;
+	}
+
+	err = pthread_create(&contender, NULL, contender_fn, &thread_ctx);
+	if (err) {
+		SCX_ERR("Failed to create contender thread (%d)", err);
+		status = SCX_TEST_FAIL;
+		goto out;
+	}
+	contender_started = true;
+
+	err = wait_for_contender(&thread_ctx);
+	if (err != 1) {
+		SCX_ERR("Contender thread failed (%d)", -err);
+		status = SCX_TEST_FAIL;
+		goto out;
+	}
+
+	err = pthread_create(&owner, NULL, owner_fn, &thread_ctx);
+	if (err) {
+		SCX_ERR("Failed to create owner thread (%d)", err);
+		status = SCX_TEST_FAIL;
+		goto out;
+	}
+	owner_started = true;
+
+	err = pthread_create(&donor, NULL, donor_fn, &thread_ctx);
+	if (err) {
+		SCX_ERR("Failed to create donor thread (%d)", err);
+		status = SCX_TEST_FAIL;
+		goto out;
+	}
+	donor_started = true;
+
+	if (!wait_for_pid(&thread_ctx.donor_pid)) {
+		SCX_ERR("Timed out waiting for donor thread");
+		status = SCX_TEST_FAIL;
+		goto out;
+	}
+
+	donor_pid = atomic_load_explicit(&thread_ctx.donor_pid,
+					 memory_order_acquire);
+	skel->bss->donor_pid = donor_pid;
+	atomic_store_explicit(&thread_ctx.start_donor, true,
+			      memory_order_release);
+
+out:
+	if (status != SCX_TEST_PASS) {
+		atomic_store_explicit(&thread_ctx.abort, true, memory_order_release);
+		atomic_store_explicit(&thread_ctx.start_donor, true,
+				      memory_order_release);
+	}
+
+	set_join_deadline(&join_deadline);
+	if (donor_started) {
+		err = join_thread(donor, &join_deadline, &thread_err);
+		if (err == ETIMEDOUT) {
+			SCX_ERR("Timed out waiting for donor thread");
+			join_timed_out = true;
+			status = SCX_TEST_FAIL;
+		} else if (err) {
+			SCX_ERR("Failed to join donor thread (%d)", err);
+			status = SCX_TEST_FAIL;
+		} else {
+			donor_started = false;
+			if (thread_err) {
+				SCX_ERR("Donor thread failed (%d)", thread_err);
+				status = SCX_TEST_FAIL;
+			}
+		}
+	}
+	if (!join_timed_out && owner_started) {
+		err = join_thread(owner, &join_deadline, &thread_err);
+		if (err == ETIMEDOUT) {
+			SCX_ERR("Timed out waiting for owner thread");
+			join_timed_out = true;
+			status = SCX_TEST_FAIL;
+		} else if (err) {
+			SCX_ERR("Failed to join owner thread (%d)", err);
+			status = SCX_TEST_FAIL;
+		} else {
+			owner_started = false;
+			if (thread_err) {
+				SCX_ERR("Owner thread failed (%d)", thread_err);
+				status = SCX_TEST_FAIL;
+			}
+		}
+	}
+	atomic_store_explicit(&thread_ctx.stop_contender, true,
+			      memory_order_release);
+	if (!join_timed_out && contender_started) {
+		err = join_thread(contender, &join_deadline, &thread_err);
+		if (err == ETIMEDOUT) {
+			SCX_ERR("Timed out waiting for contender thread");
+			join_timed_out = true;
+			status = SCX_TEST_FAIL;
+		} else if (err) {
+			SCX_ERR("Failed to join contender thread (%d)", err);
+			status = SCX_TEST_FAIL;
+		} else {
+			contender_started = false;
+			if (thread_err) {
+				SCX_ERR("Contender thread failed (%d)", thread_err);
+				status = SCX_TEST_FAIL;
+			}
+		}
+	}
+
+	/* Restore the fair scheduler before waiting for any stranded thread. */
+	if (join_timed_out) {
+		atomic_store_explicit(&thread_ctx.abort, true,
+				      memory_order_release);
+		if (link) {
+			bpf_link__destroy(link);
+			link = NULL;
+		}
+		if (donor_started)
+			pthread_join(donor, NULL);
+		if (owner_started)
+			pthread_join(owner, NULL);
+		if (contender_started)
+			pthread_join(contender, NULL);
+	}
+
+	if (ioctl(thread_ctx.fd, ENQ_BLOCKED_IOCTL_GET_STATS, &stats)) {
+		SCX_ERR("Failed to read mutex statistics (%d)", errno);
+		status = SCX_TEST_FAIL;
+	} else {
+		*result = stats;
+		printf("\n[SCX_OPS_ENQ_BLOCKED=%s]\n",
+		       enq_blocked ? "enabled" : "disabled");
+		printf("  proxy_exec=%s\n",
+		       proxy_enabled ? "enabled" : "disabled");
+		printf("  owner_nice=%d\n", OWNER_NICE);
+		printf("  donor_nice=%d\n", DONOR_NICE);
+		printf("  contender_nice=%d\n", CONTENDER_NICE);
+		print_avg_time("mutex_hold", stats.hold_time_ns, stats.nr_holds);
+		print_avg_time("mutex_wait", stats.wait_time_ns, stats.nr_waits);
+	}
+
+	nr_blocked = skel->bss->nr_blocked_enqueues;
+	printf("  nr_blocked_enqueues=%llu\n",
+	       (unsigned long long)nr_blocked);
+	if (status == SCX_TEST_PASS) {
+		if (enq_blocked && proxy_enabled && !nr_blocked) {
+			SCX_ERR("ops.enqueue() did not receive the blocked donor");
+			status = SCX_TEST_FAIL;
+		} else if ((!enq_blocked || !proxy_enabled) && nr_blocked) {
+			SCX_ERR("ops.enqueue() unexpectedly received %llu blocked donors",
+				(unsigned long long)nr_blocked);
+			status = SCX_TEST_FAIL;
+		} else if (skel->bss->nr_bad_proxy_cpus) {
+			SCX_ERR("scx_bpf_task_proxy_cpu() returned an unexpected CPU %llu times",
+				(unsigned long long)skel->bss->nr_bad_proxy_cpus);
+			status = SCX_TEST_FAIL;
+		}
+	}
+
+	if (skel->data->uei.kind != EXIT_KIND(SCX_EXIT_NONE)) {
+		SCX_ERR("Scheduler exited unexpectedly (kind=%llu code=%lld)",
+			(unsigned long long)skel->data->uei.kind,
+			(long long)skel->data->uei.exit_code);
+		status = SCX_TEST_FAIL;
+	}
+
+	if (link)
+		bpf_link__destroy(link);
+out_fd:
+	close(thread_ctx.fd);
+out_module:
+	unload_test_module(module_loaded);
+out_skel:
+	enq_blocked__destroy(skel);
+	return status;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct enq_blocked_stats disabled = {}, enabled = {};
+	enum scx_test_status status;
+
+	(void)ctx;
+
+	status = run_one(false, &disabled);
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_one(true, &enabled);
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	printf("\n[delta: enabled - disabled]\n");
+	print_avg_delta("mutex_hold", disabled.hold_time_ns,
+			disabled.nr_holds, enabled.hold_time_ns,
+			enabled.nr_holds);
+	print_avg_delta("mutex_wait", disabled.wait_time_ns,
+			disabled.nr_waits, enabled.wait_time_ns,
+			enabled.nr_waits);
+
+	return SCX_TEST_PASS;
+}
+
+struct scx_test enq_blocked = {
+	.name = "enq_blocked",
+	.description = "Verify BPF-driven proxy donor admission",
+	.setup = setup,
+	.run = run,
+};
+
+REGISTER_SCX_TEST(&enq_blocked)
diff --git a/tools/testing/selftests/sched_ext/enq_blocked.h b/tools/testing/selftests/sched_ext/enq_blocked.h
new file mode 100644
index 0000000000000..e2f9be0f57c9d
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/enq_blocked.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES */
+#ifndef __ENQ_BLOCKED_H
+#define __ENQ_BLOCKED_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+struct enq_blocked_stats {
+	__u64 hold_time_ns;
+	__u64 wait_time_ns;
+	__u64 nr_holds;
+	__u64 nr_waits;
+};
+
+#define ENQ_BLOCKED_IOCTL_OWNER	_IO('s', 1)
+#define ENQ_BLOCKED_IOCTL_DONOR	_IO('s', 2)
+#define ENQ_BLOCKED_IOCTL_RESET_STATS	_IO('s', 3)
+#define ENQ_BLOCKED_IOCTL_GET_STATS	_IOR('s', 4, struct enq_blocked_stats)
+
+#endif /* __ENQ_BLOCKED_H */
diff --git a/tools/testing/selftests/sched_ext/test_modules/Makefile b/tools/testing/selftests/sched_ext/test_modules/Makefile
new file mode 100644
index 0000000000000..a0e9e9401ead6
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/test_modules/Makefile
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES
+
+TESTMODS_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
+KDIR ?= $(if $(O),$(O),$(realpath ../../../../..))
+
+obj-m += scx_enq_blocked_test.o
+
+all:
+	+$(Q)$(MAKE) -C $(KDIR) M=$(TESTMODS_DIR) modules
+
+clean:
+	+$(Q)$(MAKE) -C $(KDIR) M=$(TESTMODS_DIR) clean
diff --git a/tools/testing/selftests/sched_ext/test_modules/scx_enq_blocked_test.c b/tools/testing/selftests/sched_ext/test_modules/scx_enq_blocked_test.c
new file mode 100644
index 0000000000000..9662d7872b03f
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/test_modules/scx_enq_blocked_test.c
@@ -0,0 +1,134 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES
+ *
+ * Kernel mutex fixture for the sched_ext SCX_OPS_ENQ_BLOCKED selftest.
+ */
+
+#include <linux/atomic.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/ktime.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+
+#include "../enq_blocked.h"
+
+#define DONOR_WAIT_TIMEOUT	msecs_to_jiffies(2000)
+#define MUTEX_HOLD_TIME		msecs_to_jiffies(200)
+
+static DEFINE_MUTEX(test_mutex);
+static atomic_t owner_ready = ATOMIC_INIT(0);
+static atomic_t donor_started = ATOMIC_INIT(0);
+static atomic64_t hold_time_ns = ATOMIC64_INIT(0);
+static atomic64_t wait_time_ns = ATOMIC64_INIT(0);
+static atomic64_t nr_holds = ATOMIC64_INIT(0);
+static atomic64_t nr_waits = ATOMIC64_INIT(0);
+
+static long run_owner(void)
+{
+	unsigned long timeout;
+	u64 start_ns;
+	long ret = 0;
+
+	atomic_set(&donor_started, 0);
+	mutex_lock(&test_mutex);
+	start_ns = ktime_get_ns();
+	atomic_set(&owner_ready, 1);
+
+	timeout = jiffies + DONOR_WAIT_TIMEOUT;
+	while (!atomic_read(&donor_started)) {
+		if (time_after(jiffies, timeout)) {
+			ret = -ETIMEDOUT;
+			goto out;
+		}
+		cond_resched();
+	}
+
+	/* Keep yielding while the donor blocks on test_mutex. */
+	timeout = jiffies + MUTEX_HOLD_TIME;
+	while (time_before(jiffies, timeout))
+		cond_resched();
+
+out:
+	atomic_set(&owner_ready, 0);
+	atomic64_add(ktime_get_ns() - start_ns, &hold_time_ns);
+	atomic64_inc(&nr_holds);
+	mutex_unlock(&test_mutex);
+	return ret;
+}
+
+static long run_donor(void)
+{
+	u64 start_ns;
+
+	if (!atomic_read(&owner_ready))
+		return -EAGAIN;
+
+	atomic_set(&donor_started, 1);
+	start_ns = ktime_get_ns();
+	mutex_lock(&test_mutex);
+	atomic64_add(ktime_get_ns() - start_ns, &wait_time_ns);
+	atomic64_inc(&nr_waits);
+	mutex_unlock(&test_mutex);
+	return 0;
+}
+
+static void reset_stats(void)
+{
+	atomic64_set(&hold_time_ns, 0);
+	atomic64_set(&wait_time_ns, 0);
+	atomic64_set(&nr_holds, 0);
+	atomic64_set(&nr_waits, 0);
+}
+
+static long get_stats(unsigned long arg)
+{
+	struct enq_blocked_stats stats = {
+		.hold_time_ns = atomic64_read(&hold_time_ns),
+		.wait_time_ns = atomic64_read(&wait_time_ns),
+		.nr_holds = atomic64_read(&nr_holds),
+		.nr_waits = atomic64_read(&nr_waits),
+	};
+
+	return copy_to_user((void __user *)arg, &stats, sizeof(stats)) ?
+		-EFAULT : 0;
+}
+
+static long enq_blocked_ioctl(struct file *file, unsigned int cmd,
+			      unsigned long arg)
+{
+	switch (cmd) {
+	case ENQ_BLOCKED_IOCTL_OWNER:
+		return run_owner();
+	case ENQ_BLOCKED_IOCTL_DONOR:
+		return run_donor();
+	case ENQ_BLOCKED_IOCTL_RESET_STATS:
+		reset_stats();
+		return 0;
+	case ENQ_BLOCKED_IOCTL_GET_STATS:
+		return get_stats(arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct file_operations enq_blocked_fops = {
+	.owner			= THIS_MODULE,
+	.unlocked_ioctl		= enq_blocked_ioctl,
+};
+
+static struct miscdevice enq_blocked_device = {
+	.minor	= MISC_DYNAMIC_MINOR,
+	.name	= "scx_enq_blocked",
+	.fops	= &enq_blocked_fops,
+	.mode	= 0600,
+};
+
+module_misc_device(enq_blocked_device);
+MODULE_AUTHOR("Andrea Righi <arighi@nvidia.com>");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("sched_ext blocked donor test module");
-- 
2.55.0



* [PATCH 11/12] sched_ext: scx_qmap: Add proxy execution support
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (9 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 10/12] sched_ext: Add selftest for blocked donor admission Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  2026-07-02 17:09 ` [PATCH 12/12] sched: Allow enabling proxy exec with sched_ext Andrea Righi
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

Blocked sched_ext tasks must remain available to the scheduler before
the core can use their scheduling contexts to execute mutex owners. BPF
schedulers opt into receiving these donor requests through
SCX_OPS_ENQ_BLOCKED and are responsible for deciding when to admit them.

Enable SCX_OPS_ENQ_BLOCKED in scx_qmap and immediately admit every
blocked donor. Query the mutex owner's cid and use it when allowed by
the donor's affinity, falling back to the donor's current cid otherwise.
Insert the donor at the head of the selected local DSQ.

Applying the same policy after a proxy migration expedites the donor on
the owner's CPU so proxy execution can begin without waiting behind
unrelated local work.
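
The CPU selection described above can be sketched in plain userspace C.
This is only an illustration of the fallback rule, not the scx_qmap code:
pick_donor_cpu(), cpu_mask_t and the 64-bit affinity mask are
hypothetical stand-ins for scx_bpf_task_proxy_cid(), cmask_test() and
scx_bpf_task_cid() used in the patch below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical affinity mask: bit N set means CPU N is allowed. */
typedef uint64_t cpu_mask_t;

/*
 * Pick the CPU whose local DSQ should receive a blocked donor.
 *
 * @allowed:   donor's affinity (illustrative 64-CPU bitmask)
 * @owner_cpu: CPU of the mutex owner, negative if unknown
 * @donor_cpu: CPU the donor is currently on
 *
 * Prefer the owner's CPU when it is known and permitted by the donor's
 * affinity, otherwise fall back to the donor's current CPU.
 */
static int pick_donor_cpu(cpu_mask_t allowed, int owner_cpu, int donor_cpu)
{
	if (owner_cpu < 0 || owner_cpu >= 64)
		return donor_cpu;
	if (!(allowed & (UINT64_C(1) << owner_cpu)))
		return donor_cpu;
	return owner_cpu;
}

/* Walk through the three cases the commit message distinguishes. */
void pick_donor_cpu_selftest(void)
{
	/* Affinity {1, 3}: owner CPU 3 is allowed, so use it. */
	assert(pick_donor_cpu(0x0a, 3, 1) == 3);
	/* Affinity {1}: owner CPU 3 is not allowed, stay on CPU 1. */
	assert(pick_donor_cpu(0x02, 3, 1) == 1);
	/* Owner CPU unknown: stay on CPU 1. */
	assert(pick_donor_cpu(0x0a, -1, 1) == 1);
}
```

In the scheduler the result feeds SCX_DSQ_LOCAL_ON together with
SCX_ENQ_HEAD, so the fallback keeps the donor on a CPU it is allowed to
use rather than skipping the insertion.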

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/sched_ext/scx_qmap.bpf.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index fd9a82a676278..5ecd2ffe05270 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -396,6 +396,21 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	 */
 	taskc->core_sched_seq = qa.core_sched_tail_seqs[idx]++;
 
+	/*
+	 * Admit blocked proxy donors immediately. Dispatch directly on the mutex
+	 * owner's cid when available to facilitate the proxy execution switch.
+	 * SCX_ENQ_HEAD also puts the donor at the head of the local DSQ and
+	 * expedites the destination enqueue after a proxy migration.
+	 */
+	if (scx_bpf_task_is_blocked(p)) {
+		cid = scx_bpf_task_proxy_cid(p);
+		if (cid < 0 || !cmask_test(cid, &taskc->cpus_allowed))
+			cid = scx_bpf_task_cid(p);
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cid,
+				   slice_ns, enq_flags | SCX_ENQ_HEAD);
+		return;
+	}
+
 	/*
 	 * IMMED stress testing: Every immed_stress_nth'th enqueue, dispatch
 	 * directly to prev_cpu's local DSQ even when busy to force dsq->nr > 1
@@ -1216,7 +1231,8 @@ void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args)
 }
 
 SCX_OPS_CID_DEFINE(qmap_ops,
-	       .flags			= SCX_OPS_ENQ_EXITING | SCX_OPS_TID_TO_TASK,
+	       .flags			= SCX_OPS_ENQ_EXITING | SCX_OPS_TID_TO_TASK |
+					  SCX_OPS_ENQ_BLOCKED,
 	       .select_cid		= (void *)qmap_select_cid,
 	       .enqueue			= (void *)qmap_enqueue,
 	       .dequeue			= (void *)qmap_dequeue,
-- 
2.55.0



* [PATCH 12/12] sched: Allow enabling proxy exec with sched_ext
  2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
                   ` (10 preceding siblings ...)
  2026-07-02 17:09 ` [PATCH 11/12] sched_ext: scx_qmap: Add proxy execution support Andrea Righi
@ 2026-07-02 17:09 ` Andrea Righi
  11 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 17:09 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

Now that sched_ext supports proxy execution, allow enabling both options
together.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 init/Kconfig | 2 --
 1 file changed, 2 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 5230d4879b1c8..0817e62266e03 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -934,8 +934,6 @@ config SCHED_PROXY_EXEC
 	bool "Proxy Execution"
 	# Avoid some build failures w/ PREEMPT_RT until it can be fixed
 	depends on !PREEMPT_RT
-	# Need to investigate how to inform sched_ext of split contexts
-	depends on !SCHED_CLASS_EXT
 	# Not particularly useful until we get to multi-rq proxying
 	depends on EXPERT
 	help
-- 
2.55.0



* Re: [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution
  2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
@ 2026-07-02 18:17   ` K Prateek Nayak
  2026-07-02 18:37     ` Andrea Righi
  2026-07-02 18:21   ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2026-07-02 18:17 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, David Dai, Koba Ko,
	Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

Hello Andrea,

On 7/2/2026 10:39 PM, Andrea Righi wrote:
> Never attempt to migrate migration-disabled tasks or tasks that can only
> run on a single CPU when switching donor's execution context, preventing
> task pinning violations.
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/core.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3cc6fb1d20547..8a3eecc7caf5d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6936,6 +6936,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  			 */
>  			if (curr_in_chain)
>  				return proxy_resched_idle(rq);
> +			/*
> +			 * Tasks pinned to a single CPU (per-CPU kthreads via
> +			 * kthread_bind(), tasks under migrate_disable()) cannot
> +			 * be moved to @owner_cpu. proxy_migrate_task() uses

We only move it to donate the vruntime context. It is never actually
run there.

> +			 * __set_task_cpu() which would silently violate the
> +			 * pinning and leave the task to run on a CPU outside
> +			 * its cpus_ptr once it is unblocked. Deactivate it on

For the task to run as normal, p->is_blocked needs to be cleared. It is
only done in the wakeup path (and sometimes in find_proxy_task() if
the task is rq->curr) which ensures all the affinity / migrate disable
bits get fixed when the task gets to actually run.

Where / how is this being violated?

> +			 * this CPU; the owner running elsewhere will wake @p
> +			 * back up when the mutex becomes available.
> +			 */
> +			if (p->nr_cpus_allowed == 1 || is_migration_disabled(p)) {
> +				__clear_task_blocked_on(p, NULL);
> +				goto deactivate;
> +			}

Proxy depends on the ability to migrate the donor context to owner's
CPU without actually running the task. The task is superficially on
the CPU only to give its runtime share to the lock owner. Feels like
you are tripping over a shortcoming in ext core if this is a problem.
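
The split described above can be illustrated with a small toy model. This is
only a sketch under stated assumptions: the toy_* names are hypothetical (they
are not kernel symbols), and the model assumes a task pinned to a single CPU
whose scheduling context may sit on another runqueue while it is blocked.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not kernel data structures. */
struct toy_task {
	int home_cpu;	/* the only CPU the task is allowed to execute on */
	int rq_cpu;	/* runqueue currently holding its scheduling context */
	bool blocked;	/* blocked on a mutex, acting as a proxy donor */
};

/* Move only the scheduling context; valid only while the task is blocked. */
static void toy_proxy_migrate(struct toy_task *p, int owner_cpu)
{
	assert(p->blocked);
	p->rq_cpu = owner_cpu;
}

/* Wakeup path: clear the blocked state and go back to an allowed CPU. */
static void toy_wakeup(struct toy_task *p)
{
	p->blocked = false;
	p->rq_cpu = p->home_cpu;
}

/* The task can execute as itself only when unblocked and on its own CPU. */
static bool toy_can_execute_on(const struct toy_task *p, int cpu)
{
	return !p->blocked && p->rq_cpu == cpu && cpu == p->home_cpu;
}
```

In this model a blocked donor can be moved to any CPU without ever being
allowed to execute there, which is the property the pinning check would need
to rely on.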

>  			goto migrate_task;
>  		}
>  

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution
  2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
  2026-07-02 18:17   ` K Prateek Nayak
@ 2026-07-02 18:21   ` Peter Zijlstra
  2026-07-02 18:34     ` Andrea Righi
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-07-02 18:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, David Dai, Koba Ko, Aiqun Yu, Shuah Khan,
	sched-ext, linux-kernel

On Thu, Jul 02, 2026 at 07:09:17PM +0200, Andrea Righi wrote:
> Never attempt to migrate migration-disabled tasks or tasks that can only
> run on a single CPU when switching donor's execution context, preventing
> task pinning violations.
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  kernel/sched/core.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3cc6fb1d20547..8a3eecc7caf5d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6936,6 +6936,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  			 */
>  			if (curr_in_chain)
>  				return proxy_resched_idle(rq);
> +			/*
> +			 * Tasks pinned to a single CPU (per-CPU kthreads via
> +			 * kthread_bind(), tasks under migrate_disable()) cannot
> +			 * be moved to @owner_cpu. proxy_migrate_task() uses
> +			 * __set_task_cpu() which would silently violate the
> +			 * pinning and leave the task to run on a CPU outside
> +			 * its cpus_ptr once it is unblocked. Deactivate it on
> +			 * this CPU; the owner running elsewhere will wake @p
> +			 * back up when the mutex becomes available.
> +			 */

No, this is actually OK. Remember, we only migrate the scheduling
context, but the task as such won't ever execute on the remote CPU.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
  2026-07-02 17:09 ` [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
@ 2026-07-02 18:24   ` Peter Zijlstra
  2026-07-02 18:46     ` Andrea Righi
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2026-07-02 18:24 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, David Dai, Koba Ko, Aiqun Yu, Shuah Khan,
	sched-ext, linux-kernel

On Thu, Jul 02, 2026 at 07:09:18PM +0200, Andrea Righi wrote:
> In __schedule(), the proxy-exec donor-stabilization block calls
> put_prev_task() and set_next_task() when rq->donor == prev_donor and
> prev != next.
> 
> For sched_ext tasks, re-entering set_next_task_scx() for a donor that
> has already been seen by BPF ops.running via the normal pick path causes
> issues. It fires SCX_CALL_OP_TASK(sch, running, rq, donor) a second
> time, and sch->ops dispatch can land on a vtable slot in a state that
> yields a NULL function pointer or corrupts the stack.
> 
> Fix this by skipping the put_prev_task/set_next_task re-entry when the
> donor is in the ext_sched_class, since sched_ext tracks curr/donor
> itself.

This really sounds like a bug in ext; how is this different from the
sched_change pattern doing a put/set cycle?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution
  2026-07-02 18:21   ` Peter Zijlstra
@ 2026-07-02 18:34     ` Andrea Righi
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, David Dai, Koba Ko, Aiqun Yu, Shuah Khan,
	sched-ext, linux-kernel

Hi Peter,

On Thu, Jul 02, 2026 at 08:21:18PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2026 at 07:09:17PM +0200, Andrea Righi wrote:
> > Never attempt to migrate migration-disabled tasks or tasks that can only
> > run on a single CPU when switching donor's execution context, preventing
> > task pinning violations.
> > 
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/core.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3cc6fb1d20547..8a3eecc7caf5d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6936,6 +6936,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >  			 */
> >  			if (curr_in_chain)
> >  				return proxy_resched_idle(rq);
> > +			/*
> > +			 * Tasks pinned to a single CPU (per-CPU kthreads via
> > +			 * kthread_bind(), tasks under migrate_disable()) cannot
> > +			 * be moved to @owner_cpu. proxy_migrate_task() uses
> > +			 * __set_task_cpu() which would silently violate the
> > +			 * pinning and leave the task to run on a CPU outside
> > +			 * its cpus_ptr once it is unblocked. Deactivate it on
> > +			 * this CPU; the owner running elsewhere will wake @p
> > +			 * back up when the mutex becomes available.
> > +			 */
> 
> No, this is actually OK. Remember, we only migrate the scheduling
> context, but the task as such won't ever execute on the remote CPU.

Yeah... makes sense. This patch actually comes from the previous series;
maybe I was hitting a different issue with the sched_ext core. I'll remove this
one and repeat my tests.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution
  2026-07-02 18:17   ` K Prateek Nayak
@ 2026-07-02 18:37     ` Andrea Righi
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 18:37 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Christian Loehle, David Dai, Koba Ko, Aiqun Yu, Shuah Khan,
	sched-ext, linux-kernel

Hi Prateek,

On Thu, Jul 02, 2026 at 11:47:35PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
> 
> On 7/2/2026 10:39 PM, Andrea Righi wrote:
> > Never attempt to migrate migration-disabled tasks or tasks that can only
> > run on a single CPU when switching donor's execution context, preventing
> > task pinning violations.
> > 
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> >  kernel/sched/core.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3cc6fb1d20547..8a3eecc7caf5d 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6936,6 +6936,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> >  			 */
> >  			if (curr_in_chain)
> >  				return proxy_resched_idle(rq);
> > +			/*
> > +			 * Tasks pinned to a single CPU (per-CPU kthreads via
> > +			 * kthread_bind(), tasks under migrate_disable()) cannot
> > +			 * be moved to @owner_cpu. proxy_migrate_task() uses
> 
> We only move it to donate the vruntime context. It is never actually
> run there.
> 
> > +			 * __set_task_cpu() which would silently violate the
> > +			 * pinning and leave the task to run on a CPU outside
> > +			 * its cpus_ptr once it is unblocked. Deactivate it on
> 
> For the task to run as normal, p->is_blocked needs to be cleared. It is
> only done in the wakeup path (and sometimes in find_proxy_task() if
> the task is rq->curr) which ensures all the affinity / migrate disable
> bits get fixed when the task gets to actually run.
> 
> Where / how is this being violated?
> 
> > +			 * this CPU; the owner running elsewhere will wake @p
> > +			 * back up when the mutex becomes available.
> > +			 */
> > +			if (p->nr_cpus_allowed == 1 || is_migration_disabled(p)) {
> > +				__clear_task_blocked_on(p, NULL);
> > +				goto deactivate;
> > +			}
> 
> Proxy depends on the ability to migrate the donor context to owner's
> CPU without actually running the task. The task is superficially on
> the CPU only to give its runtime share to the lock owner. Feels like
> you are tripping over a shortcoming in ext core if this is a problem.

Agree with all of the above. As I mentioned in the other email, I had this in
the previous version because I was hitting migration-disabled errors. Maybe
they were related to an issue in the sched_ext core. I'll remove this and
repeat my tests.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers
  2026-07-02 17:09 ` [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers Andrea Righi
@ 2026-07-02 18:41   ` K Prateek Nayak
  2026-07-02 19:10     ` Andrea Righi
  0 siblings, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2026-07-02 18:41 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Christian Loehle, David Dai, Koba Ko,
	Aiqun Yu, Shuah Khan, sched-ext, linux-kernel

Hello Andrea,

On 7/2/2026 10:39 PM, Andrea Righi wrote:
> @@ -7148,7 +7181,8 @@ static void __sched notrace __schedule(int sched_mode)
>  		 * task_is_blocked() will always be false).
>  		 */
>  		try_to_block_task(rq, prev, &prev_state,
> -				  !task_is_blocked(prev));
> +				  !task_is_blocked(prev) ||
> +				  !scx_allow_proxy_exec(prev));
>  		switch_count = &prev->nvcsw;
>  	}
>  
> diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> index c48d043dbe58f..8ffbf857acd51 100644
> --- a/kernel/sched/ext/ext.c
> +++ b/kernel/sched/ext/ext.c
> @@ -23,6 +23,14 @@
>  
>  DEFINE_RAW_SPINLOCK(scx_sched_lock);
>  
> +bool scx_allow_proxy_exec(const struct task_struct *p)
> +{
> +	if (!task_on_scx(p))
> +		return true;
> +
> +	return scx_task_sched(p)->ops.flags & SCX_OPS_ENQ_BLOCKED;
> +}
> +

What happens when we switch in a sched-ext scheduler that
doesn't support proxy?

There would be a bunch of proxy donors already present at the time of the
scheduler loading but I don't see anything in the enable path that
would take care of those tasks by blocking them completely.

Is it handled elsewhere?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
  2026-07-02 18:24   ` Peter Zijlstra
@ 2026-07-02 18:46     ` Andrea Righi
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 18:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak,
	Christian Loehle, David Dai, Koba Ko, Aiqun Yu, Shuah Khan,
	sched-ext, linux-kernel

On Thu, Jul 02, 2026 at 08:24:06PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2026 at 07:09:18PM +0200, Andrea Righi wrote:
> > In __schedule(), the proxy-exec donor-stabilization block calls
> > put_prev_task() and set_next_task() when rq->donor == prev_donor and
> > prev != next.
> > 
> > For sched_ext tasks, re-entering set_next_task_scx() for a donor that
> > has already been seen by BPF ops.running via the normal pick path causes
> > issues. It fires SCX_CALL_OP_TASK(sch, running, rq, donor) a second
> > time, and sch->ops dispatch can land on a vtable slot in a state that
> > yields a NULL function pointer or corrupts the stack.
> > 
> > Fix this by skipping the put_prev_task/set_next_task re-entry when the
> > donor is in the ext_sched_class, since sched_ext tracks curr/donor
> > itself.
> 
> This really sounds like a bug in ext; how is this different from the
> sched_change pattern doing a put/set cycle?

I think you're right, the patch differs from sched_change because it doesn't
bracket the put/set cycle with a dequeue/enqueue, but sched_ext still needs to
support the put/set contract.

The underlying problem is that set_next_task_scx() triggers ops.running() for a
blocked proxy donor even though the donor never becomes the execution context,
so we need to prevent triggering ops.running() in this case. But this should be
handled in sched_ext by pairing ops.running()/ops.stopping() only when the donor
actually runs. This is addressed later in the series with explicit running-state
tracking. I'll drop this core special case and rework that sched_ext fix.
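
A minimal sketch of the pairing idea, as a toy model rather than the actual
sched_ext change: the toy_* names and counters are hypothetical, and the model
assumes that a blocked donor must never see ops.running() and that
ops.stopping() fires only after a matching ops.running().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not the sched_ext implementation. */
struct toy_task {
	bool blocked;		/* blocked proxy donor, never the exec context */
	bool running_notified;	/* a "running" callback is currently unpaired */
	int running_calls;	/* stands in for ops.running() invocations */
	int stopping_calls;	/* stands in for ops.stopping() invocations */
};

/* Stand-in for the set_next_task() side of a put/set cycle. */
static void toy_set_next_task(struct toy_task *p)
{
	assert(p);
	/* Skip blocked donors, and skip a re-entry without a put in between. */
	if (p->blocked || p->running_notified)
		return;
	p->running_notified = true;
	p->running_calls++;
}

/* Stand-in for the put_prev_task() side of a put/set cycle. */
static void toy_put_prev_task(struct toy_task *p)
{
	assert(p);
	/* Only pair "stopping" with an earlier "running". */
	if (!p->running_notified)
		return;
	p->running_notified = false;
	p->stopping_calls++;
}
```

With this kind of tracking, a put/set cycle on a blocked donor produces no
callbacks at all, while a cycle on a task that really runs stays balanced.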

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers
  2026-07-02 18:41   ` K Prateek Nayak
@ 2026-07-02 19:10     ` Andrea Righi
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-02 19:10 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Christian Loehle, David Dai, Koba Ko, Aiqun Yu, Shuah Khan,
	sched-ext, linux-kernel

Hi Prateek,

On Fri, Jul 03, 2026 at 12:11:40AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
> 
> On 7/2/2026 10:39 PM, Andrea Righi wrote:
> > @@ -7148,7 +7181,8 @@ static void __sched notrace __schedule(int sched_mode)
> >  		 * task_is_blocked() will always be false).
> >  		 */
> >  		try_to_block_task(rq, prev, &prev_state,
> > -				  !task_is_blocked(prev));
> > +				  !task_is_blocked(prev) ||
> > +				  !scx_allow_proxy_exec(prev));
> >  		switch_count = &prev->nvcsw;
> >  	}
> >  
> > diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> > index c48d043dbe58f..8ffbf857acd51 100644
> > --- a/kernel/sched/ext/ext.c
> > +++ b/kernel/sched/ext/ext.c
> > @@ -23,6 +23,14 @@
> >  
> >  DEFINE_RAW_SPINLOCK(scx_sched_lock);
> >  
> > +bool scx_allow_proxy_exec(const struct task_struct *p)
> > +{
> > +	if (!task_on_scx(p))
> > +		return true;
> > +
> > +	return scx_task_sched(p)->ops.flags & SCX_OPS_ENQ_BLOCKED;
> > +}
> > +
> 
> What happens when we switch in a sched-ext scheduler that
> doesn't support proxy?
> 
> There would be a bunch of proxy donors already present at the time of the
> scheduler loading but I don't see anything in the enable path that
> would take care of those tasks by blocking them completely.
> 
> Is it handled elsewhere?

Good catch! It's not currently handled: a proxy donor that blocked before
sched_ext was enabled can remain queued after being transferred to a scheduler
that doesn't have SCX_OPS_ENQ_BLOCKED.

We should probably deactivate existing blocked donors while switching them to a
scheduler without proxy support.

I'll fix this in the next version and add this case to the kselftest.
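
Roughly, the idea could look like the toy model below. This is only a sketch,
not the planned implementation: the toy_* names are hypothetical, and the
enq_blocked parameter merely stands in for the scheduler having
SCX_OPS_ENQ_BLOCKED set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not kernel data structures. */
struct toy_task {
	bool blocked;	/* blocked on a mutex (potential proxy donor) */
	bool queued;	/* still on the runqueue */
};

/*
 * Hypothetical enable-path step: when the incoming scheduler cannot handle
 * blocked donors, take every queued blocked task off the runqueue.
 * Returns the number of tasks that were deactivated.
 */
static size_t toy_switch_tasks(struct toy_task *tasks, size_t nr,
			       bool enq_blocked)
{
	size_t i, deactivated = 0;

	assert(tasks || !nr);

	/* A scheduler with proxy support keeps the donors queued. */
	if (enq_blocked)
		return 0;

	for (i = 0; i < nr; i++) {
		if (tasks[i].blocked && tasks[i].queued) {
			tasks[i].queued = false;
			deactivated++;
		}
	}
	return deactivated;
}
```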

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 03/12] sched_ext: Split curr|donor references properly
  2026-07-02 17:09 ` [PATCH 03/12] sched_ext: Split curr|donor references properly Andrea Righi
@ 2026-07-03  6:10   ` Aiqun(Maria) Yu
  2026-07-03  8:37     ` Andrea Righi
  0 siblings, 1 reply; 25+ messages in thread
From: Aiqun(Maria) Yu @ 2026-07-03  6:10 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Shuah Khan, sched-ext, linux-kernel

On 7/3/2026 1:09 AM, Andrea Righi wrote:
> From: John Stultz <jstultz@google.com>
> 
> With proxy-exec, we want to do the accounting against the donor most of
> the time. Without proxy-exec, there should be no difference as the
> rq->donor and rq->curr are the same.

I'm trying to understand the case where donor and curr are in
different sched_classes, i.e. one is in scx and the other is not.
Could you please explain this case in more detail in the commit message?

> 
> So rework the logic to reference the rq->donor where appropriate.
> 
> Also add donor info to scx_dump_state().
> 
> Since CONFIG_SCHED_PROXY_EXEC currently depends on
> !CONFIG_SCHED_CLASS_EXT, this should have no effect (other than the
> extra donor output in scx_dump_state), but this is one step needed to
> eventually remove that constraint for proxy-exec.
> 
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
>  kernel/sched/ext/ext.c | 28 ++++++++++++++++------------
>  1 file changed, 16 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> index 1a0ec985da77d..1588565050679 100644
> --- a/kernel/sched/ext/ext.c
> +++ b/kernel/sched/ext/ext.c
> @@ -1145,17 +1145,17 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
>  
>  static void update_curr_scx(struct rq *rq)
>  {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *donor = rq->donor;
>  	s64 delta_exec;
>  
>  	delta_exec = update_curr_common(rq);
>  	if (unlikely(delta_exec <= 0))
>  		return;
>  
> -	if (curr->scx.slice != SCX_SLICE_INF) {
> -		curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
> -		if (!curr->scx.slice)
> -			touch_core_sched(rq, curr);
> +	if (donor->scx.slice != SCX_SLICE_INF) {
> +		donor->scx.slice -= min_t(u64, donor->scx.slice, delta_exec);
> +		if (!donor->scx.slice)
> +			touch_core_sched(rq, donor);
>  	}
>  
>  	dl_server_update(&rq->ext_server, delta_exec);
> @@ -1316,8 +1316,8 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
>  	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
>  		return;
>  
> -	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
> -	    rq->curr->sched_class == &ext_sched_class) {
> +	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->donor &&
> +	    rq->donor->sched_class == &ext_sched_class) {
>  		rq->curr->scx.slice = 0;

Did you forget to replace rq->curr with rq->donor here?

>  		resched_curr(rq);
>  	}
> @@ -2464,7 +2464,8 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		}
>  
>  		/* if the destination CPU is idle, wake it up */
> -		if (!fallback && sched_class_above(p->sched_class, dst_rq->curr->sched_class))
> +		if (!fallback && sched_class_above(p->sched_class,
> +						      dst_rq->donor->sched_class))
>  			resched_curr(dst_rq);
>  	}
>  
> @@ -2876,7 +2877,7 @@ static struct task_struct *first_local_task(struct rq *rq)
>  static struct task_struct *
>  do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
>  {
> -	struct task_struct *prev = rq->curr;
> +	struct task_struct *prev = rq->donor;
>  	bool keep_prev;
>  	struct task_struct *p;
>  
> @@ -4029,7 +4030,7 @@ static void run_deferred(struct rq *rq)
>  #ifdef CONFIG_NO_HZ_FULL
>  bool scx_can_stop_tick(struct rq *rq)
>  {
> -	struct task_struct *p = rq->curr;
> +	struct task_struct *p = rq->donor;
>  	struct scx_sched *sch = scx_task_sched(p);
>  
>  	if (p->sched_class != &ext_sched_class)
> @@ -6007,6 +6008,9 @@ static void scx_dump_cpu(struct scx_sched *sch, struct seq_buf *s,
>  	dump_line(&ns, "          curr=%s[%d] class=%ps",
>  		  rq->curr->comm, rq->curr->pid,
>  		  rq->curr->sched_class);
> +	dump_line(&ns, "          donor=%s[%d] class=%ps",
> +		  rq->donor->comm, rq->donor->pid,
> +		  rq->donor->sched_class);
>  	if (!cpumask_empty(rq->scx.cpus_to_kick))
>  		dump_line(&ns, "  cpus_to_kick   : %*pb",
>  			  cpumask_pr_args(rq->scx.cpus_to_kick));
> @@ -7452,7 +7456,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
>  	unsigned long flags;
>  
>  	raw_spin_rq_lock_irqsave(rq, flags);
> -	cur_class = rq->curr->sched_class;
> +	cur_class = rq->donor->sched_class;
>  
>  	/*
>  	 * During CPU hotplug, a CPU may depend on kicking itself to make
> @@ -7464,7 +7468,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
>  	    !sched_class_above(cur_class, &ext_sched_class)) {
>  		if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
>  			if (cur_class == &ext_sched_class)
> -				rq->curr->scx.slice = 0;
> +				rq->donor->scx.slice = 0;
>  			cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
>  		}
>  


-- 
Thx and BRs,
Aiqun(Maria) Yu

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution
  2026-07-02 17:09 ` [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
@ 2026-07-03  8:02   ` Aiqun(Maria) Yu
  2026-07-03 20:05     ` Andrea Righi
  0 siblings, 1 reply; 25+ messages in thread
From: Aiqun(Maria) Yu @ 2026-07-03  8:02 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min, John Stultz
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Christian Loehle, David Dai,
	Koba Ko, Shuah Khan, sched-ext, linux-kernel

On 7/3/2026 1:09 AM, Andrea Righi wrote:
> From: John Stultz <jstultz@google.com>
> 
> With proxy execution enabled, mutex blocked tasks stay on the runqueue.
> Later with donor migration they will be migrated when necessary by the
> core scheduler to boost lock owners.
> 
> Don't try to migrate mutex blocked tasks, the proxy logic will handle
> that.
> 
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: John Stultz <jstultz@google.com>

The last Signed-off-by is supposed to be that of the current submitter.
Not sure if the same rule applies to this subsystem.

My understanding would be:
Signed-off-by: John Stultz <jstultz@google.com>
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>


> ---
>  kernel/sched/ext/ext.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> index 1588565050679..9a672b9a55f6e 100644
> --- a/kernel/sched/ext/ext.c
> +++ b/kernel/sched/ext/ext.c
> @@ -2150,6 +2150,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
>  
>  	WARN_ON_ONCE(task_cpu(p) == cpu);
>  
> +	/* Make sure tasks aren't on a cpu */
> +	if (task_on_cpu(task_rq(p), p))
> +		return false;
> +
> +	/* Don't migrate blocked tasks, proxy-exec will handle this */
> +	if (task_is_blocked(p))
> +		return false;

What about returning true here for owner_cpu, for example
if the owner_cpu is in the move_task_between_dsqs stage?

> +
>  	/*
>  	 * If @p has migration disabled, @p->cpus_ptr is updated to contain only
>  	 * the pinned CPU in migrate_disable_switch() while @p is being switched
> @@ -2784,6 +2792,23 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  	if (p->scx.flags & SCX_TASK_QUEUED) {
>  		set_task_runnable(rq, p);
>  
> +		/*
> +		 * Mutex-blocked donors stay queued on the runqueue under proxy
> +		 * execution, but the donor never runs as itself, proxy-exec
> +		 * walks the blocked_on chain on the next __schedule() and runs
> +		 * the lock owner in its place.
> +		 *
> +		 * Put the donor on the local DSQ directly, so pick_next_task()
> +		 * can still see it, find_proxy_task() will be invoked on
> +		 * next->blocked_on and either run the chain owner here, or call
> +		 * proxy_force_return() and let BPF make a new dispatch decision
> +		 * once the task is no longer blocked.
> +		 */
> +		if (task_is_blocked(p)) {
> +			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
> +			goto switch_class;
> +		}
> +
>  		/*
>  		 * If @p has slice left and is being put, @p is getting
>  		 * preempted by a higher priority scheduler class or core-sched


-- 
Thx and BRs,
Aiqun(Maria) Yu

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 03/12] sched_ext: Split curr|donor references properly
  2026-07-03  6:10   ` Aiqun(Maria) Yu
@ 2026-07-03  8:37     ` Andrea Righi
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-03  8:37 UTC (permalink / raw)
  To: Aiqun(Maria) Yu
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, David Dai, Koba Ko, Shuah Khan,
	sched-ext, linux-kernel

Hi Maria,

On Fri, Jul 03, 2026 at 02:10:05PM +0800, Aiqun(Maria) Yu wrote:
> On 7/3/2026 1:09 AM, Andrea Righi wrote:
> > From: John Stultz <jstultz@google.com>
> > 
> > With proxy-exec, we want to do the accounting against the donor most of
> > the time. Without proxy-exec, there should be no difference as the
> > rq->donor and rq->curr are the same.
> 
> I'm trying to understand the case where donor and curr are in
> different sched_classes, i.e. one is in scx and the other is not.
> Could you please explain this case in more detail in the commit message?

Sure, I'll add more details. Do you think something like the following would
help (maybe in a shorter form)?

Let's use FAIR and EXT, and assume we run the BPF scheduler in partial mode. If
we replace FAIR with RT/deadline, the result is the same.

Terminology:

  D = blocked donor
  M = mutex
  O = mutex owner
  T = competing runnable task

 D -----------------> M -------------> O ----------------> T
 [donor] blocked on [mutex] owned by [owner] preempted by [task]
    \_________________________________^
         donates scheduling context

During a proxy exec switch:
 - D supplies scheduling class, priority and runtime budget
 - O is the task whose code is physically executing
 - T is a competing task that preempts O

Scenarios:

 1) D is EXT, O is EXT, T is EXT

    Result:
    - D can interrupt T depending on the BPF scheduling policy
    - O is executed with D's EXT priority and runtime budget
    - When D runs, T waits in EXT

 2) D is EXT, O is EXT, T is FAIR

    Result:
    - D is visible to the BPF scheduler
    - D cannot preempt T (EXT < FAIR)
    - Once T stops, BPF dispatches D
    - D executes O using D's EXT priority and runtime budget
    - If T becomes runnable again, it preempts the D/O proxy execution

 3) D is EXT, O is FAIR, T is EXT

    Result:
    - Not possible, T can't preempt O (EXT < FAIR)

 4) D is EXT, O is FAIR, T is FAIR

    Result:
    - D cannot boost O because EXT < FAIR
    - O and T continue competing under FAIR
    - O eventually runs and releases M
    - D then wakes and resumes normal EXT scheduling

 5) D is FAIR, O is EXT, T is EXT

    Result:
    - D preempts T immediately (higher sched class)
    - O is executed with D's FAIR priority and runtime budget
    - When D runs, T waits in EXT
    - D is not visible to the BPF scheduler

 6) D is FAIR, O is EXT, T is FAIR

    Result:
    - D runs based on its FAIR deadline (competing with T)
    - O is executed with D's FAIR priority and runtime budget
    - When D runs, T waits in FAIR
    - D is not visible to the BPF scheduler

 7) D is FAIR, O is FAIR, T is EXT

    Result:
    - Not possible, T can't preempt O (EXT < FAIR)

 8) D is FAIR, O is FAIR, T is FAIR

    Result:
    - O, T and D all have FAIR scheduling contexts
    - D remains runnable as a blocked proxy donor
    - When CFS selects D, O executes using D's FAIR scheduling context
    - When CFS selects O, O executes using its own FAIR context
    - When CFS selects T, T executes normally
    - D is not visible to the BPF scheduler
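
The class-ordering rules behind the scenarios above can be condensed into a
small toy model. It is only an illustrative sketch: the toy_* names are made up
(they are not kernel symbols), and it assumes partial mode with just two
classes, where FAIR ranks above EXT.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: a higher value means a higher scheduling class. */
enum toy_class { TOY_EXT = 0, TOY_FAIR = 1 };

static bool toy_class_valid(enum toy_class c)
{
	return c == TOY_EXT || c == TOY_FAIR;
}

/* T can preempt O only if T's class is not below O's class. */
static bool toy_scenario_possible(enum toy_class owner, enum toy_class task)
{
	assert(toy_class_valid(owner) && toy_class_valid(task));
	return task >= owner;
}

/* D can boost O only if D's class is not below O's own class. */
static bool toy_donor_boosts_owner(enum toy_class donor, enum toy_class owner)
{
	assert(toy_class_valid(donor) && toy_class_valid(owner));
	return donor >= owner;
}

/* Only a donor in the EXT class is visible to the BPF scheduler. */
static bool toy_donor_visible_to_bpf(enum toy_class donor)
{
	assert(toy_class_valid(donor));
	return donor == TOY_EXT;
}
```

For example, scenarios 3 and 7 are rejected by toy_scenario_possible(), and
scenario 4 is the only possible case where toy_donor_boosts_owner() is false.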

> 
> > 
> > So rework the logic to reference the rq->donor where appropriate.
> > 
> > Also add donor info to scx_dump_state().
> > 
> > Since CONFIG_SCHED_PROXY_EXEC currently depends on
> > !CONFIG_SCHED_CLASS_EXT, this should have no effect (other than the
> > extra donor output in scx_dump_state), but this is one step needed to
> > eventually remove that constraint for proxy-exec.
> > 
> > Signed-off-by: John Stultz <jstultz@google.com>
> > ---
> >  kernel/sched/ext/ext.c | 28 ++++++++++++++++------------
> >  1 file changed, 16 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> > index 1a0ec985da77d..1588565050679 100644
> > --- a/kernel/sched/ext/ext.c
> > +++ b/kernel/sched/ext/ext.c
> > @@ -1145,17 +1145,17 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
> >  
> >  static void update_curr_scx(struct rq *rq)
> >  {
> > -	struct task_struct *curr = rq->curr;
> > +	struct task_struct *donor = rq->donor;
> >  	s64 delta_exec;
> >  
> >  	delta_exec = update_curr_common(rq);
> >  	if (unlikely(delta_exec <= 0))
> >  		return;
> >  
> > -	if (curr->scx.slice != SCX_SLICE_INF) {
> > -		curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
> > -		if (!curr->scx.slice)
> > -			touch_core_sched(rq, curr);
> > +	if (donor->scx.slice != SCX_SLICE_INF) {
> > +		donor->scx.slice -= min_t(u64, donor->scx.slice, delta_exec);
> > +		if (!donor->scx.slice)
> > +			touch_core_sched(rq, donor);
> >  	}
> >  
> >  	dl_server_update(&rq->ext_server, delta_exec);
> > @@ -1316,8 +1316,8 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
> >  	if (rq->scx.flags & SCX_RQ_IN_BALANCE)
> >  		return;
> >  
> > -	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
> > -	    rq->curr->sched_class == &ext_sched_class) {
> > +	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->donor &&
> > +	    rq->donor->sched_class == &ext_sched_class) {
> >  		rq->curr->scx.slice = 0;
> 
> Did you forget to replace rq->curr with rq->donor here?

Yes, good catch. This should be rq->donor->scx.slice = 0. I'll fix it in the
next version.
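
To make the intended behavior concrete, here is a minimal userspace sketch of
the SCX_ENQ_PREEMPT branch with the fix applied. It is only a model under
stated assumptions: the toy_* types, TOY_ENQ_PREEMPT and
toy_post_enq_preempt() are hypothetical stand-ins, not the actual kernel
definitions, and the return value simply reports whether the branch fired.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for kernel types; only the fields this sketch needs. */
enum toy_class { TOY_CLASS_FAIR, TOY_CLASS_EXT };

struct toy_task {
	enum toy_class sched_class;
	uint64_t slice;		/* models p->scx.slice */
};

struct toy_rq {
	struct toy_task *curr;	/* execution context: task on the CPU */
	struct toy_task *donor;	/* scheduling context: whose slice is used */
};

#define TOY_ENQ_PREEMPT 0x1u	/* hypothetical stand-in for SCX_ENQ_PREEMPT */

/*
 * Models the preempt branch with the fix: both the condition and the
 * slice reset refer to the donor, never to the proxy-executed owner.
 * Returns true when the branch fired.
 */
static bool toy_post_enq_preempt(struct toy_rq *rq, struct toy_task *p,
				 unsigned int enq_flags)
{
	assert(rq->curr && rq->donor);

	if ((enq_flags & TOY_ENQ_PREEMPT) && p != rq->donor &&
	    rq->donor->sched_class == TOY_CLASS_EXT) {
		rq->donor->slice = 0;
		return true;
	}

	return false;
}
```

In this model, when the rq runs a lock owner on behalf of a donor, a preempting
enqueue zeroes the donor's slice and leaves the owner's own slice untouched.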

Thanks,
-Andrea


* Re: [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution
  2026-07-03  8:02   ` Aiqun(Maria) Yu
@ 2026-07-03 20:05     ` Andrea Righi
  0 siblings, 0 replies; 25+ messages in thread
From: Andrea Righi @ 2026-07-03 20:05 UTC (permalink / raw)
  To: Aiqun(Maria) Yu
  Cc: Tejun Heo, David Vernet, Changwoo Min, John Stultz, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak, Christian Loehle, David Dai, Koba Ko, Shuah Khan,
	sched-ext, linux-kernel

Hi Maria,

On Fri, Jul 03, 2026 at 04:02:16PM +0800, Aiqun(Maria) Yu wrote:
> On 7/3/2026 1:09 AM, Andrea Righi wrote:
> > From: John Stultz <jstultz@google.com>
> > 
> > With proxy execution enabled, mutex blocked tasks stay on the runqueue.
> > Later with donor migration they will be migrated when necessary by the
> > core scheduler to boost lock owners.
> > 
> > Don't try to migrate mutex blocked tasks, the proxy logic will handle
> > that.
> > 
> > Co-developed-by: Andrea Righi <arighi@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > Signed-off-by: John Stultz <jstultz@google.com>
> 
> The SOB chain is supposed to have the current submitter on the last line.
> Not sure whether the same rule applies to this subsystem.
> 
> My understanding would be:
> Signed-off-by: John Stultz <jstultz@google.com>
> Co-developed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

You're right, I'll fix it in the next version.

> 
> 
> > ---
> >  kernel/sched/ext/ext.c | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> > index 1588565050679..9a672b9a55f6e 100644
> > --- a/kernel/sched/ext/ext.c
> > +++ b/kernel/sched/ext/ext.c
> > @@ -2150,6 +2150,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
> >  
> >  	WARN_ON_ONCE(task_cpu(p) == cpu);
> >  
> > +	/* Make sure tasks aren't on a cpu */
> > +	if (task_on_cpu(task_rq(p), p))
> > +		return false;
> > +
> > +	/* Don't migrate blocked tasks, proxy-exec will handle this */
> > +	if (task_is_blocked(p))
> > +		return false;
> 
> what about return true here for owner_cpu?
> if the owner_cpu is in the move_task_between_dsqs stage for example.

Good point. Returning false for every blocked task is too restrictive, because
it'd reject potential remote placements requested by the BPF scheduler from
ops.enqueue().

But we can't simply return true either, otherwise we would bypass all the checks
below, allowing sched_ext to move the donor even if the destination is outside
its affinity mask or the destination rq is offline.

I think we should do something like this:

  if (task_is_blocked(p) && task_current_donor(task_rq(p), p))
          return false;

In practice, normal migration can still happen as long as the blocked donor is
not already the rq's current scheduling context.

With that, the BPF scheduler should be able to decide the "call back home" CPU
for the donor by directly dispatching it to a local DSQ other than the local
DSQ of its current CPU.
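
As a rough userspace sketch of that idea (not the actual patch): the mig_*
types and functions below are hypothetical stand-ins, and the affinity and
rq-online checks are simplified placeholders for "all the checks below" in
task_can_run_on_remote_rq().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy task; fields stand in for kernel helper results. */
struct mig_task {
	bool blocked;			/* models task_is_blocked(p) */
	bool on_cpu;			/* models task_on_cpu(rq, p) */
	unsigned long cpus_mask;	/* allowed CPUs, one bit per CPU */
};

struct mig_rq {
	struct mig_task *donor;	/* current scheduling context, may be NULL */
	bool online;
};

/* Models task_current_donor(rq, p): is p the rq's scheduling context? */
static bool mig_task_current_donor(const struct mig_rq *rq,
				   const struct mig_task *p)
{
	return rq->donor == p;
}

/* dst_cpu must be smaller than the number of bits in unsigned long. */
static bool mig_can_run_on_remote_rq(const struct mig_rq *src_rq,
				     const struct mig_rq *dst_rq,
				     unsigned int dst_cpu,
				     const struct mig_task *p)
{
	assert(dst_cpu < 8 * sizeof(unsigned long));

	/* Never move a task that is executing on a CPU. */
	if (p->on_cpu)
		return false;

	/* Only pin a blocked task while it is the active donor. */
	if (p->blocked && mig_task_current_donor(src_rq, p))
		return false;

	/* The remaining checks still apply to blocked donors. */
	if (!(p->cpus_mask & (1UL << dst_cpu)))
		return false;
	if (!dst_rq->online)
		return false;

	return true;
}
```

The point of the ordering is that a blocked donor which is not the current
scheduling context falls through to the affinity and online checks instead of
being accepted or rejected unconditionally.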

Thanks,
-Andrea


Thread overview: 25+ messages
2026-07-02 17:09 [PATCHSET v2 sched_ext/for-7.3] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-07-02 17:09 ` [PATCH 01/12] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-07-02 18:17   ` K Prateek Nayak
2026-07-02 18:37     ` Andrea Righi
2026-07-02 18:21   ` Peter Zijlstra
2026-07-02 18:34     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 02/12] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
2026-07-02 18:24   ` Peter Zijlstra
2026-07-02 18:46     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 03/12] sched_ext: Split curr|donor references properly Andrea Righi
2026-07-03  6:10   ` Aiqun(Maria) Yu
2026-07-03  8:37     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 04/12] sched_ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
2026-07-03  8:02   ` Aiqun(Maria) Yu
2026-07-03 20:05     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 05/12] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
2026-07-02 17:09 ` [PATCH 06/12] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
2026-07-02 17:09 ` [PATCH 07/12] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
2026-07-02 17:09 ` [PATCH 08/12] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
2026-07-02 17:09 ` [PATCH 09/12] sched_ext: Delegate proxy donor admission to BPF schedulers Andrea Righi
2026-07-02 18:41   ` K Prateek Nayak
2026-07-02 19:10     ` Andrea Righi
2026-07-02 17:09 ` [PATCH 10/12] sched_ext: Add selftest for blocked donor admission Andrea Righi
2026-07-02 17:09 ` [PATCH 11/12] sched_ext: scx_qmap: Add proxy execution support Andrea Righi
2026-07-02 17:09 ` [PATCH 12/12] sched: Allow enabling proxy exec with sched_ext Andrea Righi
