* [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext
@ 2026-05-06 17:45 Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
` (9 more replies)
0 siblings, 10 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
This series enables using proxy execution with sched_ext and is based on early
work by John Stultz [1].
Background
==========
Proxy execution (proxy-exec) lets a waiting task ("donor") donate its execution
context to a mutex owner, so the owner can run while the donor stays eligible on
the runqueue.
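At a high level, the handoff happens in the __schedule() pick path: the picked
task keeps its scheduler context as rq->donor, and if it is mutex-blocked the
core walks its blocked_on chain to find the lock owner to actually run. As a
simplified sketch of that flow (pieced together from the hunks later in this
series, not verbatim kernel code):

	next = pick_next_task(rq, prev, &rf);
	rq_set_donor(rq, next);		/* donor keeps its scheduling context */
	if (unlikely(next->blocked_on)) {
		/* mutex-blocked: run the lock owner in the donor's place */
		next = find_proxy_task(rq, next, &rf);
	}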
Currently, proxy execution and sched_ext are mutually exclusive at build time:
we can't enable CONFIG_SCHED_PROXY_EXEC=y and CONFIG_SCHED_CLASS_EXT=y in the
same kernel.
This restriction can be problematic for Linux distributions and for anyone who
wants to ship one kernel and choose features at runtime.
Why are they mutually exclusive?
================================
sched_ext schedulers drive dispatch through their own interfaces. A proxy-exec
handoff can run a task that the BPF scheduler never dispatched through that
path. sched_ext callbacks then observe a "current" task that does not match what
the BPF side considers running, so kfuncs and helper state can see an
inconsistent view of the executing task.
sched_ext also tracks runnable work through Dispatch Queues (DSQs) and
BPF-chosen dispatch rules, while the core scheduler still maintains classic per-CPU
runqueues and pick paths. A proxy handoff can therefore switch the CPU to a task
that the BPF scheduler never inserted or ordered through its DSQ interface.
DSQ state, vtime, and "who is running" bookkeeping inside the BPF program can
then disagree with what the core actually executes, so helpers and kfuncs that
assume their dispatched task is current may observe stale or inconsistent state.
Default behaviour when sched_ext is in use
==========================================
The series relaxes the Kconfig coupling, but keeps proxy-exec context donation
off by default whenever a sched_ext scheduler is loaded: mutex-blocked
tasks are forced to block instead of staying as donors, and the pick path skips
proxy selection; leftover handoff state is cleared so mutex retry paths do not
trip blocked_on consistency checks.
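Concretely, the pick-path half of this default is an early bail-out before
proxy resolution (simplified from patch 09 in this series):

	/* in __schedule(), right after rq_set_donor(rq, next) */
	if (scx_enabled() && !sched_proxy_exec_scx()) {
		if (unlikely(next->blocked_on))
			clear_task_blocked_on(next, PROXY_WAKING);
		goto picked;	/* skip find_proxy_task() entirely */
	}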
Users who accept the semantic mismatch for their BPF scheduler can opt in at
boot:
sched_proxy_exec_scx=0|1 (default 0)
Setting 1 allows donor->owner context switches under sched_ext as well.
This enables Linux distributions to set CONFIG_SCHED_PROXY_EXEC and
CONFIG_SCHED_CLASS_EXT together and ship kernels capable of supporting both
features.
Then users can decide, via sched_proxy_exec and sched_proxy_exec_scx, whether to
enable proxy-exec alongside sched_ext, use sched_ext without proxy-exec, or
disable proxy-exec entirely.
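For example, a kernel built with both options could enable proxy-exec and also
allow it while a sched_ext scheduler is loaded with a command line such as
(purely illustrative):

	sched_proxy_exec=1 sched_proxy_exec_scx=1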
References
==========
[1] https://lore.kernel.org/all/20251206001451.1418225-1-jstultz@google.com
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-proxy-exec
Andrea Righi (8):
sched/core: Skip migration disabled tasks in proxy execution
sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
sched_ext: Fix TOCTOU race in consume_remote_task()
sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
sched_ext: Save/restore kf_tasks[] when task ops nest
sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
sched/core: Disable proxy-exec context switch under sched_ext by default
sched: Allow enabling proxy exec with sched_ext
John Stultz (2):
sched/ext: Split curr|donor references properly
sched/ext: Avoid migrating blocked tasks with proxy execution
Documentation/admin-guide/kernel-parameters.txt | 6 ++
include/linux/sched/ext.h | 9 ++
init/Kconfig | 2 -
kernel/sched/core.c | 78 ++++++++++++--
kernel/sched/ext.c | 138 ++++++++++++++++++------
5 files changed, 193 insertions(+), 40 deletions(-)
* [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 21:09 ` John Stultz
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
` (8 subsequent siblings)
9 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
Never attempt to migrate migration-disabled tasks, or tasks that can only
run on a single CPU, when switching a donor's execution context. This
prevents task pinning violations.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/core.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..75541e5bb66d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6793,9 +6793,13 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
update_rq_clock(task_rq);
deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
+ if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ } else {
+ target_rq = task_rq;
+ }
clear_task_blocked_on(p, NULL);
}
@@ -6893,6 +6897,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
*/
if (curr_in_chain)
return proxy_resched_idle(rq);
+ /*
+ * Tasks pinned to a single CPU (per-CPU kthreads via
+ * kthread_bind(), tasks under migrate_disable()) cannot
+ * be moved to @owner_cpu. proxy_migrate_task() uses
+ * __set_task_cpu() which would silently violate the
+ * pinning and leave the task to run on a CPU outside
+ * its cpus_ptr once it is unblocked. Stay on this CPU
+ * via force_return; the owner running elsewhere will
+ * wake @p back up when the mutex becomes available.
+ */
+ if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
+ goto force_return;
goto migrate_task;
}
--
2.54.0
* [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
` (7 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
In __schedule(), the proxy-exec donor-stabilization block calls
put_prev_task() and set_next_task() when rq->donor == prev_donor and
prev != next.
For sched_ext tasks, re-entering set_next_task_scx() for a donor that
has already been seen by the BPF ops.running() callback via the normal
pick path causes issues: it fires SCX_CALL_OP_TASK(sch, running, rq,
donor) a second time, and the sch->ops dispatch can land on a vtable
slot in a state that yields a NULL function pointer or corrupts the
stack.
Fix this by skipping the put_prev_task/set_next_task re-entry when the
donor is in the ext_sched_class, since sched_ext tracks curr/donor
itself.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 75541e5bb66d1..1c161dd9d7440 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7147,9 +7147,14 @@ static void __sched notrace __schedule(int sched_mode)
* anything, since B == B. However, A might have
* missed a RT/DL balance opportunity due to being
* on_cpu.
+ *
+ * sched_ext tracks curr/donor itself; re-entering set_next_task_scx
+ * here dispatches through a stale/NULL BPF ops vtable.
*/
- donor->sched_class->put_prev_task(rq, donor, donor);
- donor->sched_class->set_next_task(rq, donor, true);
+ if (donor->sched_class != &ext_sched_class) {
+ donor->sched_class->put_prev_task(rq, donor, donor);
+ donor->sched_class->set_next_task(rq, donor, true);
+ }
}
} else {
rq_set_donor(rq, next);
--
2.54.0
* [PATCH 03/10] sched/ext: Split curr|donor references properly
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
` (6 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
From: John Stultz <jstultz@google.com>
With proxy-exec, we want to do the accounting against the donor most of
the time. Without proxy-exec, there should be no difference as the
rq->donor and rq->curr are the same.
So rework the logic to reference the rq->donor where appropriate.
Also add donor info to scx_dump_state().
Since CONFIG_SCHED_PROXY_EXEC currently depends on
!CONFIG_SCHED_CLASS_EXT, this should have no effect (other than the
extra donor output in scx_dump_state), but this is one step needed to
eventually remove that constraint for proxy-exec.
Signed-off-by: John Stultz <jstultz@google.com>
---
kernel/sched/ext.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7ac7d10a41bef..c410afd28fb6d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1370,17 +1370,17 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
static void update_curr_scx(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *donor = rq->donor;
s64 delta_exec;
delta_exec = update_curr_common(rq);
if (unlikely(delta_exec <= 0))
return;
- if (curr->scx.slice != SCX_SLICE_INF) {
- curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
- if (!curr->scx.slice)
- touch_core_sched(rq, curr);
+ if (donor->scx.slice != SCX_SLICE_INF) {
+ donor->scx.slice -= min_t(u64, donor->scx.slice, delta_exec);
+ if (!donor->scx.slice)
+ touch_core_sched(rq, donor);
}
dl_server_update(&rq->ext_server, delta_exec);
@@ -1504,13 +1504,14 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
if (rq->scx.flags & SCX_RQ_IN_BALANCE)
return;
- if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
- rq->curr->sched_class == &ext_sched_class) {
- rq->curr->scx.slice = 0;
+ if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->donor &&
+ rq->donor->sched_class == &ext_sched_class) {
+ rq->donor->scx.slice = 0;
preempt = true;
}
- if (preempt || sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ if (preempt || sched_class_above(&ext_sched_class,
+ rq->donor->sched_class))
resched_curr(rq);
}
@@ -2634,7 +2635,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
}
/* if the destination CPU is idle, wake it up */
- if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+ if (sched_class_above(p->sched_class, dst_rq->donor->sched_class))
resched_curr(dst_rq);
}
@@ -3150,7 +3151,7 @@ static struct task_struct *first_local_task(struct rq *rq)
static struct task_struct *
do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
{
- struct task_struct *prev = rq->curr;
+ struct task_struct *prev = rq->donor;
bool keep_prev;
struct task_struct *p;
@@ -4323,7 +4324,7 @@ static void run_deferred(struct rq *rq)
#ifdef CONFIG_NO_HZ_FULL
bool scx_can_stop_tick(struct rq *rq)
{
- struct task_struct *p = rq->curr;
+ struct task_struct *p = rq->donor;
struct scx_sched *sch = scx_task_sched(p);
if (p->sched_class != &ext_sched_class)
@@ -6355,6 +6356,9 @@ static void scx_dump_cpu(struct scx_sched *sch, struct seq_buf *s,
dump_line(&ns, " curr=%s[%d] class=%ps",
rq->curr->comm, rq->curr->pid,
rq->curr->sched_class);
+ dump_line(&ns, " donor=%s[%d] class=%ps",
+ rq->donor->comm, rq->donor->pid,
+ rq->donor->sched_class);
if (!cpumask_empty(rq->scx.cpus_to_kick))
dump_line(&ns, " cpus_to_kick : %*pb",
cpumask_pr_args(rq->scx.cpus_to_kick));
@@ -7974,7 +7978,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
unsigned long flags;
raw_spin_rq_lock_irqsave(rq, flags);
- cur_class = rq->curr->sched_class;
+ cur_class = rq->donor->sched_class;
/*
* During CPU hotplug, a CPU may depend on kicking itself to make
@@ -7986,7 +7990,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
!sched_class_above(cur_class, &ext_sched_class)) {
if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
if (cur_class == &ext_sched_class)
- rq->curr->scx.slice = 0;
+ rq->donor->scx.slice = 0;
cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
}
--
2.54.0
* [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (2 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
` (5 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
From: John Stultz <jstultz@google.com>
With proxy execution enabled, mutex-blocked tasks stay on the runqueue.
Later, with donor migration, the core scheduler will migrate them when
necessary to boost lock owners.
Don't try to migrate mutex-blocked tasks; the proxy logic will handle
that.
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
kernel/sched/ext.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c410afd28fb6d..d64b1283fa851 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2320,6 +2320,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
WARN_ON_ONCE(task_cpu(p) == cpu);
+ /* Make sure the task isn't currently running on a CPU */
+ if (task_on_cpu(task_rq(p), p))
+ return false;
+
+ /* Don't migrate blocked tasks; proxy-exec will handle this */
+ if (task_is_blocked(p))
+ return false;
+
/*
* If @p has migration disabled, @p->cpus_ptr is updated to contain only
* the pinned CPU in migrate_disable_switch() while @p is being switched
@@ -3063,6 +3071,23 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);
+ /*
+ * Mutex-blocked donors stay queued on the runqueue under proxy
+ * execution, but the donor never runs as itself: proxy-exec
+ * walks the blocked_on chain on the next __schedule() and runs
+ * the lock owner in its place.
+ *
+ * Put the donor on the local DSQ directly so pick_next_task()
+ * can still see it; find_proxy_task() will be invoked on
+ * next->blocked_on and will either run the chain owner here, or
+ * call proxy_force_return() and let BPF make a new dispatch
+ * decision once the task is no longer blocked.
+ */
+ if (task_is_blocked(p)) {
+ dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
+ goto switch_class;
+ }
+
/*
* If @p has slice left and is being put, @p is getting
* preempted by a higher priority scheduler class or core-sched
--
2.54.0
* [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task()
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (3 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
` (4 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
When pulling a task from a non-local DSQ, consume_dispatch_q() checks if
the task can run on the destination rq via task_can_run_on_remote_rq().
However, it then drops the destination rq lock and locks the source rq
in consume_remote_task() -> unlink_dsq_and_lock_src_rq(). During this
window, the task might have become migration disabled, making it invalid
to migrate it to the destination rq.
Fix this by re-evaluating task_can_run_on_remote_rq() in
consume_remote_task() after the source rq is locked. If the task can no
longer be migrated, we clear its DSQ association, reset the holding CPU,
and enqueue it to the source rq's local DSQ instead.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d64b1283fa851..a70f8693b906f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2418,13 +2418,24 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
!WARN_ON_ONCE(src_rq != task_rq(p));
}
-static bool consume_remote_task(struct rq *this_rq,
+static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
struct task_struct *p, u64 enq_flags,
struct scx_dispatch_q *dsq, struct rq *src_rq)
{
raw_spin_rq_unlock(this_rq);
if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+ if (unlikely(!task_can_run_on_remote_rq(sch, p, this_rq, true))) {
+ p->scx.dsq = NULL;
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(sch, src_rq, &src_rq->scx.local_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ if (sched_class_above(p->sched_class, src_rq->donor->sched_class))
+ resched_curr(src_rq);
+ raw_spin_rq_unlock(src_rq);
+ raw_spin_rq_lock(this_rq);
+ return false;
+ }
move_remote_task_to_local_dsq(p, enq_flags, src_rq, this_rq);
return true;
} else {
@@ -2541,7 +2552,7 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
}
if (task_can_run_on_remote_rq(sch, p, rq, false)) {
- if (likely(consume_remote_task(rq, p, enq_flags, dsq, task_rq)))
+ if (likely(consume_remote_task(sch, rq, p, enq_flags, dsq, task_rq)))
return true;
goto retry;
}
--
2.54.0
* [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (4 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
` (3 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
With proxy-exec, pick_next_task() can return a task with blocked_on set
(a proxy donor); put_prev_set_next_task() then calls set_next_task_scx()
on this "ghost" task, which fires ops.running(). However, the task never
actually runs.
If we simply short-circuit set_next_task_scx() for blocked tasks, we
break DSQ bookkeeping. If we only skip ops.running(), we create an
ops.enqueue() -> ops.stopping() sequence with no ops.running() in
between, because ops.stopping() is still called in put_prev_task_scx().
Fix this by introducing a new flag SCX_TASK_IS_RUNNING to track whether
ops.running() was actually called. Skip ops.running() for blocked tasks,
and only call ops.stopping() if SCX_TASK_IS_RUNNING is set. This ensures
that running and stopping callbacks are perfectly paired even when a
blocked task is picked as a proxy donor.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/sched/ext.h | 2 ++
kernel/sched/ext.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d05efcac794d6..5096c05d7a978 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -102,6 +102,8 @@ enum scx_ent_flags {
SCX_TASK_SUB_INIT = 1 << 4, /* task being initialized for a sub sched */
SCX_TASK_IMMED = 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */
+ SCX_TASK_IS_RUNNING = 1 << 6, /* ops.running() has been called */
+
/*
* Bits 8 and 9 are used to carry task state:
*
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a70f8693b906f..b6d29087ec0e8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2164,9 +2164,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
* information meaningful to the BPF scheduler and can be suppressed by
* skipping the callbacks if the task is !QUEUED.
*/
- if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) {
+ if (SCX_HAS_OP(sch, stopping) && task_current(rq, p) &&
+ (p->scx.flags & SCX_TASK_IS_RUNNING)) {
update_curr_scx(rq);
SCX_CALL_OP_TASK(sch, stopping, rq, p, false);
+ p->scx.flags &= ~SCX_TASK_IS_RUNNING;
}
if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p))
@@ -2986,8 +2988,11 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
p->se.exec_start = rq_clock_task(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
- if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED))
+ if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED) &&
+ !task_is_blocked(p)) {
SCX_CALL_OP_TASK(sch, running, rq, p);
+ p->scx.flags |= SCX_TASK_IS_RUNNING;
+ }
clr_task_runnable(p, true);
@@ -3076,8 +3081,11 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
update_curr_scx(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
- if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED))
+ if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED) &&
+ (p->scx.flags & SCX_TASK_IS_RUNNING)) {
SCX_CALL_OP_TASK(sch, stopping, rq, p, true);
+ p->scx.flags &= ~SCX_TASK_IS_RUNNING;
+ }
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);
--
2.54.0
* [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (5 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
` (2 subsequent siblings)
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
SCX_CALL_OP_TASK*() stored the subject task in current->scx.kf_tasks[]
and assumed ops would not nest. However, a BPF ops.running() callback
can call kfuncs (e.g. scx_bpf_dsq_insert) that enqueue work and trigger
enqueue_task_scx() -> ops.runnable(), which used SCX_CALL_OP_TASK again,
overwriting kf_tasks[0] and then clearing it, leaving the running
context wrong and leading to NULL function dispatches from BPF helpers.
Save and restore kf_tasks[] (both slots for the two-task variant) around
each invocation so nested task-based ops preserve the outer context.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b6d29087ec0e8..1ac885eadfa8e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -567,37 +567,50 @@ static s32 scx_cpu_ret(struct scx_sched *sch, s32 cpu_or_cid)
* pi_lock held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu.
* So if kf_tasks[] is set, @p's scheduler-protected fields are stable.
*
- * kf_tasks[] can not stack, so task-based SCX ops must not nest. The
- * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
- * while a previous one is still in progress.
+ * Task-based SCX ops may nest (e.g. ops.running() calling a kfunc that ends up
+ * in enqueue_task_scx() -> ops.runnable()). Save and restore kf_tasks[] around
+ * each invocation so the outer op's context is restored for kfuncs and for
+ * further nested calls. Single-task ops save/restore both slots and clear
+ * kf_tasks[1] while active so a nested call under SCX_CALL_OP_2TASKS_RET does
+ * not leave the outer pair's second task authenticated for kfuncs.
*/
#define SCX_CALL_OP_TASK(sch, op, locked_rq, task, args...) \
do { \
- WARN_ON_ONCE(current->scx.kf_tasks[0]); \
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
+ \
current->scx.kf_tasks[0] = task; \
+ current->scx.kf_tasks[1] = NULL; \
SCX_CALL_OP((sch), op, locked_rq, task, ##args); \
- current->scx.kf_tasks[0] = NULL; \
+ current->scx.kf_tasks[0] = __scx_kf0_sv; \
+ current->scx.kf_tasks[1] = __scx_kf1_sv; \
} while (0)
#define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...) \
({ \
__typeof__((sch)->ops.op(task, ##args)) __ret; \
- WARN_ON_ONCE(current->scx.kf_tasks[0]); \
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
+ \
current->scx.kf_tasks[0] = task; \
+ current->scx.kf_tasks[1] = NULL; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args); \
- current->scx.kf_tasks[0] = NULL; \
+ current->scx.kf_tasks[0] = __scx_kf0_sv; \
+ current->scx.kf_tasks[1] = __scx_kf1_sv; \
__ret; \
})
#define SCX_CALL_OP_2TASKS_RET(sch, op, locked_rq, task0, task1, args...) \
({ \
__typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \
- WARN_ON_ONCE(current->scx.kf_tasks[0]); \
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
+ \
current->scx.kf_tasks[0] = task0; \
current->scx.kf_tasks[1] = task1; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args); \
- current->scx.kf_tasks[0] = NULL; \
- current->scx.kf_tasks[1] = NULL; \
+ current->scx.kf_tasks[0] = __scx_kf0_sv; \
+ current->scx.kf_tasks[1] = __scx_kf1_sv; \
__ret; \
})
@@ -616,8 +629,12 @@ static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
struct task_struct *task,
const struct cpumask *cpumask)
{
- WARN_ON_ONCE(current->scx.kf_tasks[0]);
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];
+
current->scx.kf_tasks[0] = task;
+ current->scx.kf_tasks[1] = NULL;
if (rq)
update_locked_rq(rq);
@@ -633,7 +650,9 @@ static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
if (rq)
update_locked_rq(NULL);
- current->scx.kf_tasks[0] = NULL;
+ current->scx.kf_tasks[0] = __scx_kf0_sv;
+ current->scx.kf_tasks[1] = __scx_kf1_sv;
}
/* see SCX_CALL_OP_TASK() */
--
2.54.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (6 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
A kfunc called from ops.running() can pull in enqueue_task_scx() ->
ops.runnable() on the same current task, and the kf_tasks[] save/restore
alone is still insufficient for every BPF/kfunc combination, leading to
NULL dispatches and stack corruption.
Track SCX_CALL_OP_TASK nesting in current->scx.kf_nest (incremented by
all SCX_CALL_OP_TASK* macros) and omit the ops.runnable() callback when
it is non-zero. The full enqueue path, including ops.enqueue(), still
runs; only the runnable hook is skipped in this case.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/sched/ext.h | 7 +++++++
kernel/sched/ext.c | 9 ++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 5096c05d7a978..8c04edf1bc91a 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -197,6 +197,13 @@ struct sched_ext_entity {
s32 holding_cpu;
s32 selected_cpu;
struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */
+ /*
+ * Nesting depth of SCX_CALL_OP_TASK() invocations with this task as
+ * %current (during schedule(), %current is still the previous task).
+ * Used to skip ops.runnable() when invoked from inside another task op
+ * such as ops.running(), to avoid breaking BPF re-entrance guarantees.
+ */
+ u32 kf_nest;
struct list_head runnable_node; /* rq->scx.runnable_list */
unsigned long runnable_at;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1ac885eadfa8e..af9b10cd82c4a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -579,11 +579,13 @@ do { \
struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
\
+ current->scx.kf_nest++; \
current->scx.kf_tasks[0] = task; \
current->scx.kf_tasks[1] = NULL; \
SCX_CALL_OP((sch), op, locked_rq, task, ##args); \
current->scx.kf_tasks[0] = __scx_kf0_sv; \
current->scx.kf_tasks[1] = __scx_kf1_sv; \
+ current->scx.kf_nest--; \
} while (0)
#define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...) \
@@ -592,11 +594,13 @@ do { \
struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
\
+ current->scx.kf_nest++; \
current->scx.kf_tasks[0] = task; \
current->scx.kf_tasks[1] = NULL; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args); \
current->scx.kf_tasks[0] = __scx_kf0_sv; \
current->scx.kf_tasks[1] = __scx_kf1_sv; \
+ current->scx.kf_nest--; \
__ret; \
})
@@ -606,11 +610,13 @@ do { \
struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
\
+ current->scx.kf_nest++; \
current->scx.kf_tasks[0] = task0; \
current->scx.kf_tasks[1] = task1; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args); \
current->scx.kf_tasks[0] = __scx_kf0_sv; \
current->scx.kf_tasks[1] = __scx_kf1_sv; \
+ current->scx.kf_nest--; \
__ret; \
})
@@ -2067,7 +2073,8 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
rq->scx.nr_running++;
add_nr_running(rq, 1);
- if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p))
+ if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p) &&
+ !READ_ONCE(current->scx.kf_nest))
SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags);
if (enq_flags & SCX_ENQ_WAKEUP)
--
2.54.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (7 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
Proxy execution switches a donor's execution context to the mutex owner,
so the owner can make progress while the donor remains on the runqueue.
This logic might be incompatible with some sched_ext schedulers: the BPF
scheduler picks tasks through its own dispatch interface, and a
proxy-exec switch may end up running a task the BPF scheduler never
dispatched. This mismatch can break BPF context: sched_ext callbacks
fire against a task that isn't the one the BPF scheduler tracks as
running, so any kfunc they invoke operates on an inconsistent view of
the current task.
Therefore, when sched_ext is enabled, disable proxy-exec context
donation by default:
- Force try_to_block_task() to actually block a mutex-blocked prev
instead of keeping it on the rq as a donor.
- Skip find_proxy_task() in the pick path. Clear any leftover
PROXY_WAKING marker set by the mutex handoff, since
find_proxy_task() is no longer there to do it; otherwise the task
trips the blocked_on mismatch WARN in __set_task_blocked_on() when
it resumes the mutex_lock() retry loop.
However, some schedulers may consider proxy execution not as a real
"task switch" but more like a "function call": the donor effectively
executes the lock owner's critical section, so the switch does not
represent a true change in scheduling ownership.
To handle both semantics, add a boot-time knob to enable proxy execution
under sched_ext when explicitly desired:
sched_proxy_exec_scx=0|1
The default is 0, keeping proxy-exec disabled for the reasons described
above. Setting it to 1 allows donor->owner context switch even with
sched_ext enabled.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
.../admin-guide/kernel-parameters.txt | 6 +++
kernel/sched/core.c | 47 ++++++++++++++++++-
2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4510b4b3c4165..f73c12e9645de 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6821,6 +6821,12 @@ Kernel parameters
solution to mutex-based priority inversion.
Format: <bool>
+ sched_proxy_exec_scx= [KNL]
+ Enables or disables proxy execution when sched_ext is
+ enabled. The default is disabled, meaning proxy-exec
+ context donation is suppressed while sched_ext is active.
+ Format: <bool>
+
sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages.
schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1c161dd9d7440..0f714c6613771 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -151,14 +151,52 @@ static int __init setup_proxy_exec(char *str)
}
return 1;
}
+
+DEFINE_STATIC_KEY_FALSE(__sched_proxy_exec_scx);
+static __always_inline bool sched_proxy_exec_scx(void)
+{
+ return static_branch_unlikely(&__sched_proxy_exec_scx);
+}
+
+static int __init setup_proxy_exec_scx(char *str)
+{
+ bool proxy_scx_enable = false;
+
+ if (*str && kstrtobool(str + 1, &proxy_scx_enable)) {
+ pr_warn("Unable to parse sched_proxy_exec_scx=\n");
+ return 0;
+ }
+
+ if (proxy_scx_enable) {
+ pr_info("sched_proxy_exec_scx enabled via boot arg\n");
+ static_branch_enable(&__sched_proxy_exec_scx);
+ } else {
+ pr_info("sched_proxy_exec_scx disabled via boot arg\n");
+ static_branch_disable(&__sched_proxy_exec_scx);
+ }
+
+ return 1;
+}
#else
static int __init setup_proxy_exec(char *str)
{
pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boot time\n");
return 0;
}
+
+static __always_inline bool sched_proxy_exec_scx(void)
+{
+ return false;
+}
+
+static int __init setup_proxy_exec_scx(char *str)
+{
+ pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so sched_proxy_exec_scx= is ignored\n");
+ return 0;
+}
#endif
__setup("sched_proxy_exec", setup_proxy_exec);
+__setup("sched_proxy_exec_scx", setup_proxy_exec_scx);
/*
* Debugging: various feature bits
@@ -7111,7 +7149,8 @@ static void __sched notrace __schedule(int sched_mode)
* task_is_blocked() will always be false).
*/
try_to_block_task(rq, prev, &prev_state,
- !task_is_blocked(prev));
+ !task_is_blocked(prev) ||
+ (scx_enabled() && !sched_proxy_exec_scx()));
switch_count = &prev->nvcsw;
}
@@ -7123,6 +7162,12 @@ static void __sched notrace __schedule(int sched_mode)
struct task_struct *prev_donor = rq->donor;
rq_set_donor(rq, next);
+ if (scx_enabled() && !sched_proxy_exec_scx()) {
+ if (unlikely(next->blocked_on))
+ clear_task_blocked_on(next, PROXY_WAKING);
+ goto picked;
+ }
+
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
if (!next) {
--
2.54.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (8 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
9 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
Now that sched_ext supports proxy execution, allow enabling both options
together.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
init/Kconfig | 2 --
1 file changed, 2 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308aec..6b18ba7263f0b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -934,8 +934,6 @@ config SCHED_PROXY_EXEC
bool "Proxy Execution"
# Avoid some build failures w/ PREEMPT_RT until it can be fixed
depends on !PREEMPT_RT
- # Need to investigate how to inform sched_ext of split contexts
- depends on !SCHED_CLASS_EXT
# Not particularly useful until we get to multi-rq proxying
depends on EXPERT
help
--
2.54.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
@ 2026-05-06 21:09 ` John Stultz
2026-05-07 3:34 ` K Prateek Nayak
0 siblings, 1 reply; 14+ messages in thread
From: John Stultz @ 2026-05-06 21:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Joel Fernandes,
sched-ext, linux-kernel
On Wed, May 6, 2026 at 10:47 AM Andrea Righi <arighi@nvidia.com> wrote:
>
> Never attempt to migrate migration-disabled tasks, or tasks that can only
> run on a single CPU, when switching a donor's execution context. This
> prevents task pinning violations.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/core.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..75541e5bb66d1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6793,9 +6793,13 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>
> update_rq_clock(task_rq);
> deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> - set_task_cpu(p, cpu);
> - target_rq = cpu_rq(cpu);
> + if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> + set_task_cpu(p, cpu);
> + target_rq = cpu_rq(cpu);
> + } else {
> + target_rq = task_rq;
> + }
> clear_task_blocked_on(p, NULL);
> }
>
> @@ -6893,6 +6897,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> */
> if (curr_in_chain)
> return proxy_resched_idle(rq);
> + /*
> + * Tasks pinned to a single CPU (per-CPU kthreads via
> + * kthread_bind(), tasks under migrate_disable()) cannot
> + * be moved to @owner_cpu. proxy_migrate_task() uses
> + * __set_task_cpu() which would silently violate the
> + * pinning and leave the task to run on a CPU outside
> + * its cpus_ptr once it is unblocked. Stay on this CPU
> + * via force_return; the owner running elsewhere will
> + * wake @p back up when the mutex becomes available.
> + */
> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> + goto force_return;
> goto migrate_task;
Hey Andrea!
I'm excited to see this series! Thanks for your efforts here!
Though I'm a bit confused on this patch. I see the patch changes it
so we don't proxy-migrate pinned/migration-disabled tasks, but I'm
not sure I understand why.
We only proxy-migrate blocked_on tasks, which don't run on the cpu
they are migrated to (they are only migrated to be used as a donor).
That's why we have the proxy_force_return() function to return-migrate
them back when they do become runnable.
Could you provide some more details about what motivated this change
(ie: how you tripped a problem that it resolved?).
thanks
-john
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-06 21:09 ` John Stultz
@ 2026-05-07 3:34 ` K Prateek Nayak
2026-05-07 6:31 ` Andrea Righi
0 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2026-05-07 3:34 UTC (permalink / raw)
To: John Stultz, Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hello John, Andrea,
(Full disclaimer: I haven't looked at the entire series)
On 5/7/2026 2:39 AM, John Stultz wrote:
>> + /*
>> + * Tasks pinned to a single CPU (per-CPU kthreads via
>> + * kthread_bind(), tasks under migrate_disable()) cannot
>> + * be moved to @owner_cpu. proxy_migrate_task() uses
>> + * __set_task_cpu() which would silently violate the
>> + * pinning and leave the task to run on a CPU outside
>> + * its cpus_ptr once it is unblocked. Stay on this CPU
>> + * via force_return; the owner running elsewhere will
>> + * wake @p back up when the mutex becomes available.
>> + */
>> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
>> + goto force_return;
>> goto migrate_task;
>
> Hey Andrea!
> I'm excited to see this series! Thanks for your efforts here!
>
> Though I'm a bit confused on this patch. I see the patch changes it
> so we don't proxy-migrate pinned/migration-disabled tasks, but I'm
> not sure I understand why.
>
> We only proxy-migrate blocked_on tasks, which don't run on the cpu
> they are migrated to (they are only migrated to be used as a donor).
> That's why we have the proxy_force_return() function to return-migrate
> them back when they do become runnable.
I agree this shouldn't be a problem from core perspective but there
are some interesting sched-ext interactions possible. More on that
below:
>
> Could you provide some more details about what motivated this change
> (ie: how you tripped a problem that it resolved?).
I think ops.enqueue() always assumes that the task being enqueued is
runnable on the task_cpu(), and when the sched-ext layer tries to
dispatch this task to the local DSQ, the ext core complains and marks
the sched-ext scheduler as buggy.
With sched-ext, even the lock owner's CPU is slightly complicated
since the owner might be associated with a CPU while in fact sitting on a
custom DSQ. After moving the donor to the owner's CPU, we will need the
sched-ext scheduler to guarantee that the owner runs there, else
there is no point in doing a proxy.
scx flow should look something like (please correct me if I'm
wrong):
CPU0: donor CPU1: owner
=========== ===========
/* Donor is retained on rq*/
put_prev_task_scx()
ops.stopping()
ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
do_pick_task_scx()
next = donor;
find_proxy_task()
proxy_migrate_task()
ops.dequeue()
======================> /*
* Moves to owner CPU (May be outside of affinity list)
* ops.enqueue() still happens on CPU0 but I've shown it
* here to depict that the context has moved to the owner's CPU.
*/
ops.enqueue()
scx_bpf_dsq_insert()
/*
* !!! Cannot dispatch to local CPU; Outside affinity !!!
*
* We need to allow local dispatch outside affinity iff:
*
* p->is_blocked && cpu == task_cpu(p)
*
* Since enqueue_task_scx() holds the task's rq_lock, the
* is_blocked indicator should be stable during a dispatch.
*/
ops.dispatch()
do_pick_task_scx()
set_next_task_scx()
ops.running(donor)
find_proxy_task()
next = owner
/*
* !!! Owner starts running without any notification. !!!
*
* If owner blocks, dequeue_task_scx() is executed first and
* the sched-ext scheduler sees:
*
* ops.stopping(owner)
*
* which leads to some asymmetry.
*
* XXX: Below is how I imagine the flow should continue.
*/
ops.quiescent(owner) /* Core is taking back control of owner's running */
/* Runs owner */
ops.runnable(owner) /* Core is giving back control to ext layer */
ops.stopping(donor); /* Accounting symmetry for donor */
I think dequeue_task_scx() should see task_current_donor() before
calling ops.stopping(), else we get some asymmetry. The donor will
anyway be placed back via put_prev_task_scx(), and since it hasn't run,
it cannot block itself, so there should be no dependency on
dequeue_task_scx() for donors.
With the quiescent() + runnable() scheme, the sched-ext schedulers need
to be made aware that a task can go quiescent() and then back to
runnable() while being SCX_TASK_QUEUED, or the ext core has to spoof a
full:
dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
Also since the mutex owner can block, the sched-ext scheduler needs to
be aware of the fact that it can get a dequeue() -> quiescent()
without having stopping() in between if we plan to keep
symmetry.
There might be more issues there that I'm missing.
--
Thanks and Regards,
Prateek
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-07 3:34 ` K Prateek Nayak
@ 2026-05-07 6:31 ` Andrea Righi
0 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-05-07 6:31 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hi John, Prateek,
On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
> Hello John, Andrea,
>
> (Full disclaimer: I haven't looked at the entire series)
>
> On 5/7/2026 2:39 AM, John Stultz wrote:
> >> + /*
> >> + * Tasks pinned to a single CPU (per-CPU kthreads via
> >> + * kthread_bind(), tasks under migrate_disable()) cannot
> >> + * be moved to @owner_cpu. proxy_migrate_task() uses
> >> + * __set_task_cpu() which would silently violate the
> >> + * pinning and leave the task to run on a CPU outside
> >> + * its cpus_ptr once it is unblocked. Stay on this CPU
> >> + * via force_return; the owner running elsewhere will
> >> + * wake @p back up when the mutex becomes available.
> >> + */
> >> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> >> + goto force_return;
> >> goto migrate_task;
> >
> > Hey Andrea!
> > I'm excited to see this series! Thanks for your efforts here!
> >
> > Though I'm a bit confused on this patch. I see the patch changes it
> > so we don't proxy-migrate pinned/migration-disabled tasks, but I'm
> > not sure I understand why.
> >
> > We only proxy-migrate blocked_on tasks, which don't run on the cpu
> > they are migrated to (they are only migrated to be used as a donor).
> > That's why we have the proxy_force_return() function to return-migrate
> > them back when they do become runnable.
>
> I agree this shouldn't be a problem from core perspective but there
> are some interesting sched-ext interactions possible. More on that
> below:
So, I included this patch because, in a previous version of this series, it was
preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task" error.
However, I tried this series again without it and everything seems to work. I
guess this was fixed by "sched/ext: Avoid migrating blocked tasks with proxy
execution", which was not present in my previous early implementation. So, let's
ignore this for now...
>
> >
> > Could you provide some more details about what motivated this change
> > (ie: how you tripped a problem that it resolved?).
>
> I think ops.enqueue() always assumes that the task being enqueued is
> runnable on the task_cpu(), and when the sched-ext layer tries to
> dispatch this task to the local DSQ, the ext core complains and marks
> the sched-ext scheduler as buggy.
It's correct that ops.enqueue() assumes that the task being enqueued is runnable
on task_cpu(), but this should still be true even when the donor is migrated:
proxy-exec should only migrate the donor to the owner's CPU when the placement
is allowed.
>
> With sched-ext, even the lock owner's CPU is slightly complicated
> since the owner might be associated with a CPU while in fact sitting on a
> custom DSQ. After moving the donor to the owner's CPU, we will need the
> sched-ext scheduler to guarantee that the owner runs there, else
> there is no point in doing a proxy.
But a donor is always a running task (by definition), so it can't be on a custom
DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
waiting to be dispatched.
The core keeps the donor logically runnable / on_rq and the ext core always
parks blocked donors on the built-in local DSQ:
put_prev_task_scx():
...
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);
if (task_is_blocked(p)) {
dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
goto switch_class;
}
...
>
> scx flow should look something like (please correct me if I'm
> wrong):
>
> CPU0: donor CPU1: owner
> =========== ===========
>
> /* Donor is retained on rq*/
> put_prev_task_scx()
> ops.stopping()
> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> do_pick_task_scx()
> next = donor;
> find_proxy_task()
> proxy_migrate_task()
> ops.dequeue()
> ======================> /*
> * Moves to owner CPU (May be outside of affinity list)
> * ops.enqueue() still happens on CPU0 but I've shown it
> * here to depict that the context has moved to the owner's CPU.
> */
> ops.enqueue()
> scx_bpf_dsq_insert()
> /*
> * !!! Cannot dispatch to local CPU; Outside affinity !!!
> *
> * We need to allow local dispatch outside affinity iff:
> *
> * p->is_blocked && cpu == task_cpu(p)
> *
> * Since enqueue_task_scx() holds the task's rq_lock, the
> * is_blocked indicator should be stable during a dispatch.
> */
> ops.dispatch()
> do_pick_task_scx()
> set_next_task_scx()
> ops.running(donor)
> find_proxy_task()
> next = owner
> /*
> * !!! Owner starts running without any notification. !!!
> *
> * If owner blocks, dequeue_task_scx() is executed first and
> * the sched-ext scheduler sees:
> *
> * ops.stopping(owner)
> *
> * which leads to some asymmetry.
> *
> * XXX: Below is how I imagine the flow should continue.
> */
> ops.quiescent(owner) /* Core is taking back control of owner's running */
> /* Runs owner */
> ops.runnable(owner) /* Core is giving back control to ext layer */
> ops.stopping(donor); /* Accounting symmetry for donor */
I think the order of operations should be the following:
ops.runnable(donor)
-> ops.enqueue(donor)
-> donor becomes curr
-> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
-> donor executes
-> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
-> __schedule()
-> pick_next -> proxy-exec selects owner as next
-> put_prev_task_scx(donor)
-> ops.stopping(donor)
-> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
-> set_next_task_scx(owner)
-> ops.running(owner)
-> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */
Later, when the owner is switched away (another schedule)
... owner running ...
-> __schedule() / switch away from owner
-> put_prev_task_scx(owner)
-> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
-> set_next_task_scx() /* whoever is next */
Later, mutex is released - donor can run as itself again
-> mutex released / donor unblocked (!task_is_blocked(donor))
-> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
-> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
-> donor executes as rq->curr
> I think dequeue_task_scx() should see task_current_donor() before
> calling ops.stopping(), else we get some asymmetry. The donor will
> anyway be placed back via put_prev_task_scx(), and since it hasn't run,
> it cannot block itself, so there should be no dependency on
> dequeue_task_scx() for donors.
The ops.running/stopping() pair should always be enforced by
SCX_TASK_IS_RUNNING, so we either see a pair of them or none. So in theory,
there shouldn't be any asymmetry.
>
> With the quiescent() + runnable() scheme, the sched-ext schedulers need
> to be made aware that a task can go quiescent() and then back to
> runnable() while being SCX_TASK_QUEUED, or the ext core has to spoof a
> full:
>
> dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
>
> Also since the mutex owner can block, the sched-ext scheduler needs to
> be aware of the fact that it can get a dequeue() -> quiescent()
> without having stopping() in between if we plan to keep
> symmetry.
We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
proxy-exec: if a task becomes runnable and then it's moved to a different sched
class, the BPF scheduler can see ops.runnable/quiescent() without
ops.running/stopping().
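For example, a sequence like the following is possible (illustrative, not an
exhaustive trace):

 ops.runnable(p)   /* p becomes runnable in the ext class */
 /* p is moved to another sched class before ever being picked */
 ops.quiescent(p)  /* no ops.running/stopping() in between */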
As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
think we're fine.
>
> There might be more issues there that I'm missing.
>
Right, I'm still trying to figure out if there's any scenario that can break
some BPF assumptions (kfunc permissions or similar), but considering that the
BPF context is usually associated with a task_struct, I can't see any potential
violation/breakage at the moment.
Thanks,
-Andrea