* [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 21:09 ` John Stultz
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
` (9 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
Never attempt to migrate migration-disabled tasks, or tasks that can only
run on a single CPU, when switching the donor's execution context, to
prevent task pinning violations.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/core.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da20fb6ea25ae..75541e5bb66d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6793,9 +6793,13 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
update_rq_clock(task_rq);
deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
+ if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ } else {
+ target_rq = task_rq;
+ }
clear_task_blocked_on(p, NULL);
}
@@ -6893,6 +6897,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
*/
if (curr_in_chain)
return proxy_resched_idle(rq);
+ /*
+ * Tasks pinned to a single CPU (per-CPU kthreads via
+ * kthread_bind(), tasks under migrate_disable()) cannot
+ * be moved to @owner_cpu. proxy_migrate_task() uses
+ * __set_task_cpu() which would silently violate the
+ * pinning and leave the task to run on a CPU outside
+ * its cpus_ptr once it is unblocked. Stay on this CPU
+ * via force_return; the owner running elsewhere will
+ * wake @p back up when the mutex becomes available.
+ */
+ if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
+ goto force_return;
goto migrate_task;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
@ 2026-05-06 21:09 ` John Stultz
2026-05-07 3:34 ` K Prateek Nayak
0 siblings, 1 reply; 21+ messages in thread
From: John Stultz @ 2026-05-06 21:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Joel Fernandes,
sched-ext, linux-kernel
On Wed, May 6, 2026 at 10:47 AM Andrea Righi <arighi@nvidia.com> wrote:
>
> Never attempt to migrate migration-disabled tasks or tasks that can only
> run on a single CPU when switching donor's execution context, preventing
> task pinning violations.
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/core.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index da20fb6ea25ae..75541e5bb66d1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6793,9 +6793,13 @@ static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>
> update_rq_clock(task_rq);
> deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> - set_task_cpu(p, cpu);
> - target_rq = cpu_rq(cpu);
> + if (p->nr_cpus_allowed > 1 && !is_migration_disabled(p)) {
> + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> + set_task_cpu(p, cpu);
> + target_rq = cpu_rq(cpu);
> + } else {
> + target_rq = task_rq;
> + }
> clear_task_blocked_on(p, NULL);
> }
>
> @@ -6893,6 +6897,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> */
> if (curr_in_chain)
> return proxy_resched_idle(rq);
> + /*
> + * Tasks pinned to a single CPU (per-CPU kthreads via
> + * kthread_bind(), tasks under migrate_disable()) cannot
> + * be moved to @owner_cpu. proxy_migrate_task() uses
> + * __set_task_cpu() which would silently violate the
> + * pinning and leave the task to run on a CPU outside
> + * its cpus_ptr once it is unblocked. Stay on this CPU
> + * via force_return; the owner running elsewhere will
> + * wake @p back up when the mutex becomes available.
> + */
> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> + goto force_return;
> goto migrate_task;
Hey Andrea!
I'm excited to see this series! Thanks for your efforts here!
Though I'm a bit confused on this patch. I see the patch changes it
so we don't proxy-migrate pinned/migration-disabled tasks, but I'm
not sure I understand why.
We only proxy-migrate blocked_on tasks, which don't run on the cpu
they are migrated to (they are only migrated to be used as a donor).
That's why we have the proxy_force_return() function to return-migrate
them back when they do become runnable.
Could you provide some more details about what motivated this change
(i.e. how you tripped over a problem that it resolved)?
thanks
-john
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-06 21:09 ` John Stultz
@ 2026-05-07 3:34 ` K Prateek Nayak
2026-05-07 6:31 ` Andrea Righi
0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-07 3:34 UTC (permalink / raw)
To: John Stultz, Andrea Righi
Cc: Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hello John, Andrea,
(Full disclaimer: I haven't looked at the entire series)
On 5/7/2026 2:39 AM, John Stultz wrote:
>> + /*
>> + * Tasks pinned to a single CPU (per-CPU kthreads via
>> + * kthread_bind(), tasks under migrate_disable()) cannot
>> + * be moved to @owner_cpu. proxy_migrate_task() uses
>> + * __set_task_cpu() which would silently violate the
>> + * pinning and leave the task to run on a CPU outside
>> + * its cpus_ptr once it is unblocked. Stay on this CPU
>> + * via force_return; the owner running elsewhere will
>> + * wake @p back up when the mutex becomes available.
>> + */
>> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
>> + goto force_return;
>> goto migrate_task;
>
> Hey Andrea!
> I'm excited to see this series! Thanks for your efforts here!
>
> Though I'm a bit confused on this patch. I see the patch changes it
> so we don't proxy-migrate pinned/migration-disabled patches, but I'm
> not sure I understand why.
>
> We only proxy-migrate blocked_on tasks, which don't run on the cpu
> they are migrated to (they are only migrated to be used as a donor).
> That's why we have the proxy_force_return() function to return-migrate
> them back when they do become runnable.
I agree this shouldn't be a problem from core perspective but there
are some interesting sched-ext interactions possible. More on that
below:
>
> Could you provide some more details about what motivated this change
> (ie: how you tripped a problem that it resolved?).
I think ops.enqueue() always assumes that the task being enqueued is
runnable on the task_cpu(), and when the sched-ext layer tries to
dispatch this task to the local DSQ, the ext core complains and marks
the sched-ext scheduler as buggy.
With sched-ext, even the lock owner's CPU is slightly complicated,
since the owner might be associated with a CPU while in fact sitting
on a custom DSQ, and after moving the donor to the owner's CPU we need
the sched-ext scheduler to guarantee that the owner runs there, else
there is no point in doing a proxy.
scx flow should look something like (please correct me if I'm
wrong):
CPU0: donor CPU1: owner
=========== ===========
/* Donor is retained on rq*/
put_prev_task_scx()
ops.stopping()
ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
do_pick_task_scx()
next = donor;
find_proxy_task()
proxy_migrate_task()
ops.dequeue()
======================> /*
* Moves to owner CPU (May be outside of affinity list)
* ops.enqueue() still happens on CPU0 but I've shown it
* here to depict the context has moved to owner's CPU.
*/
ops.enqueue()
scx_bpf_dsq_insert()
/*
* !!! Cannot dispatch to local CPU; Outside affinity !!!
*
* We need to allow local dispatch outside affinity iff:
*
* p->is_blocked && cpu == task_cpu(p)
*
* Since enqueue_task_scx() holds the task's rq_lock, the
* is_blocked indicator should be stable during a dispatch.
*/
ops.dispatch()
do_pick_task_scx()
set_next_task_scx()
ops.running(donor)
find_proxy_task()
next = owner
/*
* !!! Owner starts running without any notification. !!!
*
* If owner blocks, dequeue_task_scx() is executed first and
* the sched-ext scheduler sees:
*
* ops.stopping(owner)
*
* which leads to some asymmetry.
*
* XXX: Below is how I imagine the flow should continue.
*/
ops.quiescent(owner) /* Core is taking back control of owner's running */
/* Runs owner */
ops.runnable(owner) /* Core is giving back control to ext layer */
ops.stopping(donor); /* Accounting symmetry for donor */
I think dequeue_task_scx() should see task_current_donor() before
calling ops.stopping(), else we get some asymmetry. The donor will
anyway be placed back via put_prev_task_scx(), and since it hasn't run,
it cannot block itself, so there should be no dependency on
dequeue_task_scx() for donors.
With the quiescent() + runnable() scheme, the sched-ext schedulers need
to be made aware that task can go quiescent() and then back to
runnable() while being SCX_TASK_QUEUED or the ext core has to spoof a
full:
dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
Also since the mutex owner can block, the sched-ext scheduler needs to
be aware of the fact that it can get a dequeue() -> quiescent()
without having stopping() in between if we plan to keep
symmetry.
There might be more issues there that I'm missing.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-07 3:34 ` K Prateek Nayak
@ 2026-05-07 6:31 ` Andrea Righi
2026-05-07 7:45 ` K Prateek Nayak
0 siblings, 1 reply; 21+ messages in thread
From: Andrea Righi @ 2026-05-07 6:31 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hi John, Prateek,
On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
> Hello John, Andrea,
>
> (Full disclaimer: I haven't looked at the entire series)
>
> On 5/7/2026 2:39 AM, John Stultz wrote:
> >> + /*
> >> + * Tasks pinned to a single CPU (per-CPU kthreads via
> >> + * kthread_bind(), tasks under migrate_disable()) cannot
> >> + * be moved to @owner_cpu. proxy_migrate_task() uses
> >> + * __set_task_cpu() which would silently violate the
> >> + * pinning and leave the task to run on a CPU outside
> >> + * its cpus_ptr once it is unblocked. Stay on this CPU
> >> + * via force_return; the owner running elsewhere will
> >> + * wake @p back up when the mutex becomes available.
> >> + */
> >> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> >> + goto force_return;
> >> goto migrate_task;
> >
> > Hey Andrea!
> > I'm excited to see this series! Thanks for your efforts here!
> >
> > Though I'm a bit confused on this patch. I see the patch changes it
> > so we don't proxy-migrate pinned/migration-disabled patches, but I'm
> > not sure I understand why.
> >
> > We only proxy-migrate blocked_on tasks, which don't run on the cpu
> > they are migrated to (they are only migrated to be used as a donor).
> > That's why we have the proxy_force_return() function to return-migrate
> > them back when they do become runnable.
>
> I agree this shouldn't be a problem from core perspective but there
> are some interesting sched-ext interactions possible. More on that
> below:
So, I included this patch because in a previous version of this series it was
preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task" error.
However, I tried this series again without it and everything seems to work. I
guess this was fixed by "sched/ext: Avoid migrating blocked tasks with proxy
execution", which was not present in my previous early implementation. So, let's
ignore this for now...
>
> >
> > Could you provide some more details about what motivated this change
> > (ie: how you tripped a problem that it resolved?).
>
> I think ops.enqueue() always assumes that the task being enqueued is
> runnable on the task_cpu() and when the the sched-ext layer tries to
> dispatch this task to local DSQ, the ext core complains and marks
> the sched-ext scheduler as buggy.
Correct, ops.enqueue() assumes that the task being enqueued is runnable on
task_cpu(), but this should still be true even when the donor is migrated:
proxy-exec should only migrate the donor to the owner's CPU when the placement
is allowed.
>
> With sched-ext, even the lock owner's CPU is slightly complicated
> since the owner might be associated with a CPU but it is in fact on a
> custom DSQ and after moving the donor to owner's CPU, we will need
> sched-ext scheduler to guarantee that the owner runs there else
> there is no point in doing a proxy.
But a donor is always a running task (by definition), so it can't be on a custom
DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
waiting to be dispatched.
The core keeps the donor logically runnable / on_rq and the ext core always
parks blocked donors on the built-in local DSQ:
put_prev_task_scx():
...
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);
if (task_is_blocked(p)) {
dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
goto switch_class;
}
...
>
> scx flow should look something like (please correct me if I'm
> wrong):
>
> CPU0: donor CPU1: owner
> =========== ===========
>
> /* Donor is retained on rq*/
> put_prev_task_scx()
> ops.stopping()
> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> do_pick_task_scx()
> next = donor;
> find_proxy_task()
> proxy_migrate_task()
> ops.dequeue()
> ======================> /*
> * Moves to owner CPU (May be outside of affinity list)
> * ops.enqueue() still happens on CPU0 but I've shown it
> * here to depict the context has moved to owner's CPU.
> */
> ops.enqueue()
> scx_bpf_dsq_insert()
> /*
> * !!! Cannot dispatch to local CPU; Outside affinity !!!
> *
> * We need to allow local dispatch outside affinity iff:
> *
> * p->is_blocked && cpu == task_cpu(p)
> *
> * Since enqueue_task_scx() hold's the task's rq_lock, the
> * is_blocked indicator should be stable during a dispatch.
> */
> ops.dispatch()
> do_pick_task_scx()
> set_next_task_scx()
> ops.running(donor)
> find_proxy_task()
> next = owner
> /*
> * !!! Owner stats running without any notification. !!!
> *
> * If owner blocks, dequeue_task_scx() is executed first and
> * the sched-ext scheduler sees:
> *
> * ops.stopping(owner)
> *
> * which leads to some asymmetry.
> *
> * XXX: Below is how I imagine the flow should continue.
> */
> ops.quiescent(owner) /* Core is taking back control of owner's running */
> /* Runs owner */
> ops.runnable(owner) /* Core is giving back control to ext layer */
> ops.stopping(donor); /* Accounting symmetry for donor */
I think the order of operations should be the following:
ops.runnable(donor)
-> ops.enqueue(donor)
-> donor becomes curr
-> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
-> donor executes
-> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
-> __schedule()
-> pick_next -> proxy-exec selects owner as next
-> put_prev_task_scx(donor)
-> ops.stopping(donor)
-> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
-> set_next_task_scx(owner)
-> ops.running(owner)
-> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */
Later, when the owner is switched away (another schedule)
... owner running ...
-> __schedule() / switch away from owner
-> put_prev_task_scx(owner)
-> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
-> set_next_task_scx() /* whoever is next */
Later, mutex is released - donor can run as itself again
-> mutex released / donor unblocked (!task_is_blocked(donor))
-> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
-> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
-> donor executes as rq->curr
> I think dequeue_task_scx() should see task_current_donor() before
> calling ops.stopping() else we get some asymmetry. The donor will
> anyways be placed back via put_prev_task_scx() and since it hasn't run,
> it cannot block itself and there should be no dependency on
> dequeue_task_scx() for donors.
The ops.running/stopping() pair should always be enforced by
SCX_TASK_IS_RUNNING, so we either see a pair of them or none. In theory,
there shouldn't be any asymmetry.
>
> With the quiescent() + runnable() scheme, the sched-ext schedulers need
> to be made aware that task can go quiescent() and then back to
> runnable() while being SCX_TASK_QUEUED or the ext core has to spoof a
> full:
>
> dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
>
> Also since the mutex owner can block, the sched-ext scheduler needs to
> be aware of the fact that it can get a dequeue() -> quiescent()
> without having stopping() in between if we plan to keep
> symmetry.
We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
proxy-exec: if a task becomes runnable and is then moved to a different sched
class, the BPF scheduler can see ops.runnable/quiescent() without
ops.running/stopping().
As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
think we're fine.
>
> There might be more issues there that I'm missing.
>
Right, I'm still trying to figure out if there's any scenario that can break
some BPF assumptions (kfunc permissions or similar), but considering that the
BPF context is usually associated with a task_struct, I can't see any potential
violation/breakage at the moment.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-07 6:31 ` Andrea Righi
@ 2026-05-07 7:45 ` K Prateek Nayak
2026-05-07 10:13 ` Andrea Righi
0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-07 7:45 UTC (permalink / raw)
To: Andrea Righi
Cc: John Stultz, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
On 5/7/2026 12:01 PM, Andrea Righi wrote:
> Hi John, Prateek,
>
> On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
>> Hello John, Andrea,
>>
>> (Full disclaimer: I haven't looked at the entire series)
>>
>> On 5/7/2026 2:39 AM, John Stultz wrote:
>>>> + /*
>>>> + * Tasks pinned to a single CPU (per-CPU kthreads via
>>>> + * kthread_bind(), tasks under migrate_disable()) cannot
>>>> + * be moved to @owner_cpu. proxy_migrate_task() uses
>>>> + * __set_task_cpu() which would silently violate the
>>>> + * pinning and leave the task to run on a CPU outside
>>>> + * its cpus_ptr once it is unblocked. Stay on this CPU
>>>> + * via force_return; the owner running elsewhere will
>>>> + * wake @p back up when the mutex becomes available.
>>>> + */
>>>> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
>>>> + goto force_return;
>>>> goto migrate_task;
>>>
>>> Hey Andrea!
>>> I'm excited to see this series! Thanks for your efforts here!
>>>
>>> Though I'm a bit confused on this patch. I see the patch changes it
>>> so we don't proxy-migrate pinned/migration-disabled patches, but I'm
>>> not sure I understand why.
>>>
>>> We only proxy-migrate blocked_on tasks, which don't run on the cpu
>>> they are migrated to (they are only migrated to be used as a donor).
>>> That's why we have the proxy_force_return() function to return-migrate
>>> them back when they do become runnable.
>>
>> I agree this shouldn't be a problem from core perspective but there
>> are some interesting sched-ext interactions possible. More on that
>> below:
>
> So, I included this patch, because in a previous version of this series it was
> preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task" error.
>
> However, I tried again this series without this and everything seems to work. I
> guess this was fixed by "sched/ext: Avoid migrating blocked tasks with proxy
> execution", that was not present in my previous early implementation. So, let's
> ignore this for now...
>
>>
>>>
>>> Could you provide some more details about what motivated this change
>>> (ie: how you tripped a problem that it resolved?).
>>
>> I think ops.enqueue() always assumes that the task being enqueued is
>> runnable on the task_cpu() and when the the sched-ext layer tries to
>> dispatch this task to local DSQ, the ext core complains and marks
>> the sched-ext scheduler as buggy.
>
> Correct that ops.enqueue() assumes that the task being enqueued is runnable on
> task_cpu(), but this should still be true even when the donor is migrated:
> proxy-exec should only migrate the donor to the owner's CPU when the placement
> is allowed.
Not really - it'll migrate the task to the owner's CPU even if that is outside
the task's affinity, with the reasoning that the donor will never run
there - it only exists on the runqueue to donate its time to the lock
owner.
But if you mean runnable in the sense that it hasn't blocked, then yes, it is
SCX_TASK_QUEUED + set_task_runnable().
>
>>
>> With sched-ext, even the lock owner's CPU is slightly complicated
>> since the owner might be associated with a CPU but it is in fact on a
>> custom DSQ and after moving the donor to owner's CPU, we will need
>> sched-ext scheduler to guarantee that the owner runs there else
>> there is no point in doing a proxy.
>
> But a donor is always a running task (by definition), so it can't be on a custom
> DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
> waiting to be dispatched.
I was thinking more from a proxy migration standpoint - when the donor
is on a different CPU and the owner is on another one, and the core.c
bits move the donor to the owner's CPU.
>
> The core keeps the donor logically runnable / on_rq and the ext core always
> parks blocked donors on the built-in local DSQ:
>
> put_prev_task_scx():
> ...
> if (p->scx.flags & SCX_TASK_QUEUED) {
> set_task_runnable(rq, p);
>
> if (task_is_blocked(p)) {
> dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
> goto switch_class;
> }
> ...
Ah! This is what I was missing, but then this task gets picked and
is moved by find_proxy_task() in core.c, right?
>
>>
>> scx flow should look something like (please correct me if I'm
>> wrong):
>>
>> CPU0: donor CPU1: owner
>> =========== ===========
>>
>> /* Donor is retained on rq*/
>> put_prev_task_scx()
>> ops.stopping()
>> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
>> do_pick_task_scx()
>> next = donor;
>> find_proxy_task()
>> proxy_migrate_task()
>> ops.dequeue()
>> ======================> /*
At this point I mean ^
>> * Moves to owner CPU (May be outside of affinity list)
>> * ops.enqueue() still happens on CPU0 but I've shown it
>> * here to depict the context has moved to owner's CPU.
>> */
>> ops.enqueue()
>> scx_bpf_dsq_insert()
>> /*
>> * !!! Cannot dispatch to local CPU; Outside affinity !!!
>> *
>> * We need to allow local dispatch outside affinity iff:
>> *
>> * p->is_blocked && cpu == task_cpu(p)
>> *
>> * Since enqueue_task_scx() hold's the task's rq_lock, the
>> * is_blocked indicator should be stable during a dispatch.
>> */
>> ops.dispatch()
>> do_pick_task_scx()
>> set_next_task_scx()
>> ops.running(donor)
>> find_proxy_task()
>> next = owner
>> /*
>> * !!! Owner stats running without any notification. !!!
>> *
>> * If owner blocks, dequeue_task_scx() is executed first and
>> * the sched-ext scheduler sees:
>> *
>> * ops.stopping(owner)
>> *
>> * which leads to some asymmetry.
>> *
>> * XXX: Below is how I imagine the flow should continue.
>> */
>> ops.quiescent(owner) /* Core is taking back control of owner's running */
>> /* Runs owner */
>> ops.runnable(owner) /* Core is giving back control to ext layer */
>> ops.stopping(donor); /* Accounting symmetry for donor */
>
> I think the order of operations should be the following:
>
> ops.runnable(donor)
> -> ops.enqueue(donor)
> -> donor becomes curr
> -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
> -> donor executes
> -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
> -> __schedule()
> -> pick_next -> proxy-exec selects owner as next
> -> put_prev_task_scx(donor)
> -> ops.stopping(donor)
> -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
> -> set_next_task_scx(owner)
> -> ops.running(owner)
So ext will just switch the context back to owner? But how does this
happen with the changes in your series?
Based on my understanding, this happens:
-> pick_next -> sched-ext returns donor as next
/* prev's context is put back */
-> set_next_task_scx(donor)
-> ops.running(donor)
/* In core.c */
/* next = donor */
if (next->blocked_on) /* true since we have blocked donor */
next = find_proxy_task(); /* Returns owner */
/* next = owner; */
/* Starts running owner */
How does the ext core swap back the owner context here? Am I missing
something? find_proxy_task() doesn't call put_prev_set_next_task(), so
I'm at a loss as to how we get to set_next_task_scx(owner).
> -> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */
>
> Later, when the owner is switched away (another schedule)
>
> ... owner running ...
> -> __schedule() / switch away from owner
> -> put_prev_task_scx(owner)
> -> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
> -> set_next_task_scx() /* whoever is next */
>
> Later, mutex is released - donor can run as itself again
>
> -> mutex released / donor unblocked (!task_is_blocked(donor))
> -> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
> -> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
> -> donor executes as rq->curr
>
>> I think dequeue_task_scx() should see task_current_donor() before
>> calling ops.stopping() else we get some asymmetry. The donor will
>> anyways be placed back via put_prev_task_scx() and since it hasn't run,
>> it cannot block itself and there should be no dependency on
>> dequeue_task_scx() for donors.
>
> The ops.running/stopping() pair should be always enforced by
> SCX_TASK_IS_RUNNING, so we either see a pair of them or none. So in theory,
> there shouldn't be any asymmetry.
>
>>
>> With the quiescent() + runnable() scheme, the sched-ext schedulers need
>> to be made aware that task can go quiescent() and then back to
>> runnable() while being SCX_TASK_QUEUED or the ext core has to spoof a
>> full:
>>
>> dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
>>
>> Also since the mutex owner can block, the sched-ext scheduler needs to
>> be aware of the fact that it can get a dequeue() -> quiescent()
>> without having stopping() in between if we plan to keep
>> symmetry.
>
> We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
> proxy-exec: if a task becomes runnable and then it's moved to a different sched
> class, the BPF scheduler can see ops.runnable/quiescent() without
> ops.running/stopping().
Ack!
>
> As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
> think we're fine.
I think it is mostly symmetric, other than that one scenario I'm
confused about above.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-07 7:45 ` K Prateek Nayak
@ 2026-05-07 10:13 ` Andrea Righi
2026-05-07 15:47 ` K Prateek Nayak
0 siblings, 1 reply; 21+ messages in thread
From: Andrea Righi @ 2026-05-07 10:13 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hi Prateek,
On Thu, May 07, 2026 at 01:15:22PM +0530, K Prateek Nayak wrote:
> On 5/7/2026 12:01 PM, Andrea Righi wrote:
> > Hi John, Prateek,
> >
> > On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
> >> Hello John, Andrea,
> >>
> >> (Full disclaimer: I haven't looked at the entire series)
> >>
> >> On 5/7/2026 2:39 AM, John Stultz wrote:
> >>>> + /*
> >>>> + * Tasks pinned to a single CPU (per-CPU kthreads via
> >>>> + * kthread_bind(), tasks under migrate_disable()) cannot
> >>>> + * be moved to @owner_cpu. proxy_migrate_task() uses
> >>>> + * __set_task_cpu() which would silently violate the
> >>>> + * pinning and leave the task to run on a CPU outside
> >>>> + * its cpus_ptr once it is unblocked. Stay on this CPU
> >>>> + * via force_return; the owner running elsewhere will
> >>>> + * wake @p back up when the mutex becomes available.
> >>>> + */
> >>>> + if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> >>>> + goto force_return;
> >>>> goto migrate_task;
> >>>
> >>> Hey Andrea!
> >>> I'm excited to see this series! Thanks for your efforts here!
> >>>
> >>> Though I'm a bit confused on this patch. I see the patch changes it
> >>> so we don't proxy-migrate pinned/migration-disabled patches, but I'm
> >>> not sure I understand why.
> >>>
> >>> We only proxy-migrate blocked_on tasks, which don't run on the cpu
> >>> they are migrated to (they are only migrated to be used as a donor).
> >>> That's why we have the proxy_force_return() function to return-migrate
> >>> them back when they do become runnable.
> >>
> >> I agree this shouldn't be a problem from core perspective but there
> >> are some interesting sched-ext interactions possible. More on that
> >> below:
> >
> > So, I included this patch, because in a previous version of this series it was
> > preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task" error.
> >
> > However, I tried again this series without this and everything seems to work. I
> > guess this was fixed by "sched/ext: Avoid migrating blocked tasks with proxy
> > execution", that was not present in my previous early implementation. So, let's
> > ignore this for now...
> >
> >>
> >>>
> >>> Could you provide some more details about what motivated this change
> >>> (ie: how you tripped a problem that it resolved?).
> >>
> >> I think ops.enqueue() always assumes that the task being enqueued is
> >> runnable on the task_cpu() and when the the sched-ext layer tries to
> >> dispatch this task to local DSQ, the ext core complains and marks
> >> the sched-ext scheduler as buggy.
> >
> > Correct that ops.enqueue() assumes that the task being enqueued is runnable on
> > task_cpu(), but this should still be true even when the donor is migrated:
> > proxy-exec should only migrate the donor to the owner's CPU when the placement
> > is allowed.
>
> Not really - it'll migrate the task to donor's CPU even if it is outside
> the task's affinity with the reasoning that the donor will never run
> there - it only exists on the runqueue to donate its time to the lock
> owner.
>
> But if you mean runnable in the sense it hasn't blocked then yes it is
> SCX_TASK_QUEUED + set_task_runnable().
proxy-exec can migrate the donor to the owner's rq/DSQ, but it never actually
executes it there; the donor is only present to boost the owner on that CPU.
Then, when the donor returns home via proxy_force_return(), we trigger
deactivate_task() -> dequeue_task_scx(), which also unlinks the donor from the
local DSQ. So we shouldn't break affinity.
> >> With sched-ext, even the lock owner's CPU is slightly complicated
> >> since the owner might be associated with a CPU but it is in fact on a
> >> custom DSQ and after moving the donor to owner's CPU, we will need
> >> sched-ext scheduler to guarantee that the owner runs there else
> >> there is no point in doing a proxy.
> >
> > But a donor is always a running task (by definition), so it can't be on a custom
> > DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
> > waiting to be dispatched.
>
> I was thinking more from a proxy migration standpoint - when the donor
> is on a different CPU and the owner is on another one, and the core.c
> bits move the donor to the owner's CPU.
Ah, I see what you mean. So you're saying, for example: if the owner is on
CPU1's rq but also sitting on a global DSQ (one that can be consumed by any
CPU), we'd move the donor to CPU1's rq and park it in CPU1's local DSQ, yet the
owner could still be consumed by any CPU, because it's in a global DSQ.
However, unless I'm missing something, in this case the core scheduler should
select the donor via pick_next_task(); proxy-exec can then replace it with the
owner, removing the owner from the global DSQ and running it.
From set_next_task_scx():
if (p->scx.flags & SCX_TASK_QUEUED) {
ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
dispatch_dequeue(rq, p);
}
That dispatch_dequeue() removes the owner from whatever DSQ it was on
(global/custom/local); the owner then becomes rq->curr and runs.
>
> >
> > The core keeps the donor logically runnable / on_rq and the ext core always
> > parks blocked donors on the built-in local DSQ:
> >
> > put_prev_task_scx():
> > ...
> > if (p->scx.flags & SCX_TASK_QUEUED) {
> > set_task_runnable(rq, p);
> >
> > if (task_is_blocked(p)) {
> > dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
> > goto switch_class;
> > }
> > ...
>
> Ah! This is what I was missing but then, this task gets picked and
> is moved by find_proxy_task() in core.c right?
Yes, that's the intent.
>
> >
> >>
> >> scx flow should look something like (please correct me if I'm
> >> wrong):
> >>
> >> CPU0: donor CPU1: owner
> >> =========== ===========
> >>
> >> /* Donor is retained on rq*/
> >> put_prev_task_scx()
> >> ops.stopping()
> >> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> >> do_pick_task_scx()
> >> next = donor;
> >> find_proxy_task()
> >> proxy_migrate_task()
> >> ops.dequeue()
> >> ======================> /*
>
> At this point I mean ^
>
> >> * Moves to owner CPU (May be outside of affinity list)
> >> * ops.enqueue() still happens on CPU0 but I've shown it
> >> * here to depict the context has moved to owner's CPU.
> >> */
> >> ops.enqueue()
> >> scx_bpf_dsq_insert()
> >> /*
> >> * !!! Cannot dispatch to local CPU; Outside affinity !!!
> >> *
> >> * We need to allow local dispatch outside affinity iff:
> >> *
> >> * p->is_blocked && cpu == task_cpu(p)
> >> *
> >> * Since enqueue_task_scx() holds the task's rq_lock, the
> >> * is_blocked indicator should be stable during a dispatch.
> >> */
> >> ops.dispatch()
> >> do_pick_task_scx()
> >> set_next_task_scx()
> >> ops.running(donor)
> >> find_proxy_task()
> >> next = owner
> >> /*
> >> * !!! Owner starts running without any notification. !!!
> >> *
> >> * If owner blocks, dequeue_task_scx() is executed first and
> >> * the sched-ext scheduler sees:
> >> *
> >> * ops.stopping(owner)
> >> *
> >> * which leads to some asymmetry.
> >> *
> >> * XXX: Below is how I imagine the flow should continue.
> >> */
> >> ops.quiescent(owner) /* Core is taking back control of owner's running */
> >> /* Runs owner */
> >> ops.runnable(owner) /* Core is giving back control to ext layer */
> >> ops.stopping(donor); /* Accounting symmetry for donor */
> >
> > I think the order of operations should be the following:
> >
> > ops.runnable(donor)
> > -> ops.enqueue(donor)
> > -> donor becomes curr
> > -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
> > -> donor executes
> > -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
> > -> __schedule()
> > -> pick_next -> proxy-exec selects owner as next
> > -> put_prev_task_scx(donor)
> > -> ops.stopping(donor)
> > -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
> > -> set_next_task_scx(owner)
> > -> ops.running(owner)
>
> So ext will just switch the context back to owner? But how does this
> happen with the changes in your series?
>
> Based on my understanding, this happens:
>
> -> pick_next -> sched-ext returns donor as next
> /* prev's context is put back */
> -> set_next_task_scx(donor)
> -> ops.running(donor)
>
> /* In core.c */
>
> /* next = donor */
> if (next->blocked_on) /* true since we have blocked donor */
> next = find_proxy_task(); /* Returns owner */
>
> /* next = owner; */
> /* Starts running owner */
>
> How does ext core swap back the owner context here? Am I missing
> something? find_proxy_task() doesn't call put_prev_set_next_task() so
> I'm at a loss how we get to set_next_task_scx(owner).
The sequence should be the following:
- pick_next_task(rq, rq->donor, &rf) returns donor (because we parked it on the local DSQ)
- in __schedule() (still holding rq->lock), proxy sees next->blocked_on and does:
- next = find_proxy_task(rq, next, &rf); -> returns owner (or triggers migration / retries)
- Only after that, __schedule() reaches the point where it performs the switch
(put_prev_set_next_task(rq, prev, next) via the pick path). At that point,
next is already the owner, so:
- put_prev_task_scx(prev=donor) (or whatever prev is)
- set_next_task_scx(next=owner)
And parking the blocked donor on rq->scx.local_dsq makes it the obvious
candidate for pick_next_task_scx() on that CPU.
So the donor isn't "moved" by find_proxy_task() in the DSQ sense; rather:
- SCX picks the donor token
- proxy-exec replaces the picked task with the lock owner (or triggers
migration/return paths)
>
> > -> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */
> >
> > Later, when the owner is switched away (another schedule)
> >
> > ... owner running ...
> > -> __schedule() / switch away from owner
> > -> put_prev_task_scx(owner)
> > -> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
> > -> set_next_task_scx() /* whoever is next */
> >
> > Later, mutex is released - donor can run as itself again
> >
> > -> mutex released / donor unblocked (!task_is_blocked(donor))
> > -> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
> > -> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
> > -> donor executes as rq->curr
> >
> >> I think dequeue_task_scx() should see task_current_donor() before
> >> calling ops.stopping() else we get some asymmetry. The donor will
> >> anyways be placed back via put_prev_task_scx() and since it hasn't run,
> >> it cannot block itself and there should be no dependency on
> >> dequeue_task_scx() for donors.
> >
> > The ops.running/stopping() pair should be always enforced by
> > SCX_TASK_IS_RUNNING, so we either see a pair of them or none. So in theory,
> > there shouldn't be any asymmetry.
> >
> >>
> >> With the quiescent() + runnable() scheme, the sched-ext schedulers need
> >> to be made aware that task can go quiescent() and then back to
> >> runnable() while being SCX_TASK_QUEUED or the ext core has to spoof a
> >> full:
> >>
> >> dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
> >>
> >> Also since the mutex owner can block, the sched-ext scheduler needs to
> >> be aware of the fact that it can get a dequeue() -> quiescent()
> >> without having stopping() in between if we plan to keep
> >> symmetry.
> >
> > We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
> > proxy-exec: if a task becomes runnable and then it's moved to a different sched
> > class, the BPF scheduler can see ops.runnable/quiescent() without
> > ops.running/stopping().
>
> Ack!
>
> >
> > As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
> > think we're fine.
>
> I think it is mostly symmetric other than for that one scenario I'm
> confused about above.
Hope it's clearer now, assuming I didn't miss anything or make it even more
confusing. :)
I'm still not fully convinced about the migration-disabled task scenarios, but
so far I can't find any holes.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-07 10:13 ` Andrea Righi
@ 2026-05-07 15:47 ` K Prateek Nayak
2026-05-08 7:40 ` Andrea Righi
0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-07 15:47 UTC (permalink / raw)
To: Andrea Righi
Cc: John Stultz, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hello Andrea,
On 5/7/2026 3:43 PM, Andrea Righi wrote:
>>>> scx flow should look something like (please correct me if I'm
>>>> wrong):
>>>>
>>>> CPU0: donor CPU1: owner
>>>> =========== ===========
>>>>
>>>> /* Donor is retained on rq*/
>>>> put_prev_task_scx()
>>>> ops.stopping()
>>>> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
>>>> do_pick_task_scx()
>>>> next = donor;
>>>> find_proxy_task()
>>>> proxy_migrate_task()
>>>> ops.dequeue()
>>>> ======================> /*
>>
>> At this point I mean ^
>>
>>>> * Moves to owner CPU (May be outside of affinity list)
>>>> * ops.enqueue() still happens on CPU0 but I've shown it
>>>> * here to depict the context has moved to owner's CPU.
>>>> */
>>>> ops.enqueue()
>>>> scx_bpf_dsq_insert()
>>>> /*
>>>> * !!! Cannot dispatch to local CPU; Outside affinity !!!
>>>> *
>>>> * We need to allow local dispatch outside affinity iff:
>>>> *
>>>> * p->is_blocked && cpu == task_cpu(p)
>>>> *
>>>> * Since enqueue_task_scx() holds the task's rq_lock, the
>>>> * is_blocked indicator should be stable during a dispatch.
>>>> */
>>>> ops.dispatch()
>>>> do_pick_task_scx()
>>>> set_next_task_scx()
>>>> ops.running(donor)
>>>> find_proxy_task()
>>>> next = owner
>>>> /*
>>>> * !!! Owner starts running without any notification. !!!
>>>> *
>>>> * If owner blocks, dequeue_task_scx() is executed first and
>>>> * the sched-ext scheduler sees:
>>>> *
>>>> * ops.stopping(owner)
>>>> *
>>>> * which leads to some asymmetry.
>>>> *
>>>> * XXX: Below is how I imagine the flow should continue.
>>>> */
>>>> ops.quiescent(owner) /* Core is taking back control of owner's running */
>>>> /* Runs owner */
>>>> ops.runnable(owner) /* Core is giving back control to ext layer */
>>>> ops.stopping(donor); /* Accounting symmetry for donor */
>>>
>>> I think the order of operations should be the following:
>>>
>>> ops.runnable(donor)
>>> -> ops.enqueue(donor)
>>> -> donor becomes curr
>>> -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
>>> -> donor executes
>>> -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
>>> -> __schedule()
>>> -> pick_next -> proxy-exec selects owner as next
>>> -> put_prev_task_scx(donor)
>>> -> ops.stopping(donor)
>>> -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
>>> -> set_next_task_scx(owner)
>>> -> ops.running(owner)
>>
>> So ext will just switch the context back to owner? But how does this
>> happen with the changes in your series?
>>
>> Based on my understanding, this happens:
>>
>> -> pick_next -> sched-ext returns donor as next
>> /* prev's context is put back */
>> -> set_next_task_scx(donor)
>> -> ops.running(donor)
>>
>> /* In core.c */
>>
>> /* next = donor */
>> if (next->blocked_on) /* true since we have blocked donor */
>> next = find_proxy_task(); /* Returns owner */
>>
>> /* next = owner; */
>> /* Starts running owner */
>>
>> How does ext core swap back the owner context here? Am I missing
>> something? find_proxy_task() doesn't call put_prev_set_next_task() so
>> I'm at a loss how we get to set_next_task_scx(owner).
>
> The sequence should be the following:
Still a bit confused! Hope you can bear with me for just a little
bit longer :-)
>
> - pick_next_task(rq, rq->donor, &rf) returns donor (because we parked it on the local DSQ)
So put_prev_set_next_task() happens as a part of pick_next_task().
When we pick the donor, we have already called set_next_task(donor)
on it before returning it from pick_next_task().
"owner" is still not known at this point ...
> - in __schedule() (still holding rq->lock), proxy sees next->blocked_on and does:
> - next = find_proxy_task(rq, next, &rf); -> returns owner (or triggers migration / retries)
> - Only after that, __schedule() reaches the point where it performs the switch
> (put_prev_set_next_task(rq, prev, next) via the pick path). At that point,
... and we don't do put_prev_set_next_task(donor, owner) after
(or within) find_proxy_task() as far as I'm aware. The "donor"
remains as the task on which we last called put_prev_task().
> If you are referring to the bits in your Patch 2, the calls to
> put_prev_task() and set_next_task() are done on the same "donor"
> task. It is purely for the sake of adding a balance callback if
> we had skipped migrating away the prev task due to proxy.
> AFAICT, nothing does a set_next_task(owner) after
> pick_next_task() in __schedule() unless I'm grossly mistaken.
> Sorry in case I missed it until now, but could you please
> point out where this happens? I'm not seeing any scx-specific
> calls in the context switch path either, other than
> sched_ext_dead() called for a dead task.
> next is already the owner, so:
> - put_prev_task_scx(prev=donor) (or whatever prev is)
> - set_next_task_scx(next=owner)
>
> And parking the blocked donor on rq->scx.local_dsq makes it the obvious
> candidate for pick_next_task_scx() on that CPU.
>
> So the donor isn't "moved" by find_proxy_task() in the DSQ sense, rather:
> - SCX picks the donor token
> - proxy-exec replaces the picked task with the lock owner (or triggers
> migration/return paths)
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
2026-05-07 15:47 ` K Prateek Nayak
@ 2026-05-08 7:40 ` Andrea Righi
0 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-08 7:40 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, Tejun Heo, David Vernet, Changwoo Min, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Joel Fernandes, sched-ext,
linux-kernel
Hi Prateek,
On Thu, May 07, 2026 at 09:17:34PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 5/7/2026 3:43 PM, Andrea Righi wrote:
> >>>> scx flow should look something like (please correct me if I'm
> >>>> wrong):
> >>>>
> >>>> CPU0: donor CPU1: owner
> >>>> =========== ===========
> >>>>
> >>>> /* Donor is retained on rq*/
> >>>> put_prev_task_scx()
> >>>> ops.stopping()
> >>>> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> >>>> do_pick_task_scx()
> >>>> next = donor;
> >>>> find_proxy_task()
> >>>> proxy_migrate_task()
> >>>> ops.dequeue()
> >>>> ======================> /*
> >>
> >> At this point I mean ^
> >>
> >>>> * Moves to owner CPU (May be outside of affinity list)
> >>>> * ops.enqueue() still happens on CPU0 but I've shown it
> >>>> * here to depict the context has moved to owner's CPU.
> >>>> */
> >>>> ops.enqueue()
> >>>> scx_bpf_dsq_insert()
> >>>> /*
> >>>> * !!! Cannot dispatch to local CPU; Outside affinity !!!
> >>>> *
> >>>> * We need to allow local dispatch outside affinity iff:
> >>>> *
> >>>> * p->is_blocked && cpu == task_cpu(p)
> >>>> *
> >>>> * Since enqueue_task_scx() holds the task's rq_lock, the
> >>>> * is_blocked indicator should be stable during a dispatch.
> >>>> */
> >>>> ops.dispatch()
> >>>> do_pick_task_scx()
> >>>> set_next_task_scx()
> >>>> ops.running(donor)
> >>>> find_proxy_task()
> >>>> next = owner
> >>>> /*
> >>>> * !!! Owner starts running without any notification. !!!
> >>>> *
> >>>> * If owner blocks, dequeue_task_scx() is executed first and
> >>>> * the sched-ext scheduler sees:
> >>>> *
> >>>> * ops.stopping(owner)
> >>>> *
> >>>> * which leads to some asymmetry.
> >>>> *
> >>>> * XXX: Below is how I imagine the flow should continue.
> >>>> */
> >>>> ops.quiescent(owner) /* Core is taking back control of owner's running */
> >>>> /* Runs owner */
> >>>> ops.runnable(owner) /* Core is giving back control to ext layer */
> >>>> ops.stopping(donor); /* Accounting symmetry for donor */
> >>>
> >>> I think the order of operations should be the following:
> >>>
> >>> ops.runnable(donor)
> >>> -> ops.enqueue(donor)
> >>> -> donor becomes curr
> >>> -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
> >>> -> donor executes
> >>> -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
> >>> -> __schedule()
> >>> -> pick_next -> proxy-exec selects owner as next
> >>> -> put_prev_task_scx(donor)
> >>> -> ops.stopping(donor)
> >>> -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
> >>> -> set_next_task_scx(owner)
> >>> -> ops.running(owner)
> >>
> >> So ext will just switch the context back to owner? But how does this
> >> happen with the changes in your series?
> >>
> >> Based on my understanding, this happens:
> >>
> >> -> pick_next -> sched-ext returns donor as next
> >> /* prev's context is put back */
> >> -> set_next_task_scx(donor)
> >> -> ops.running(donor)
> >>
> >> /* In core.c */
> >>
> >> /* next = donor */
> >> if (next->blocked_on) /* true since we have blocked donor */
> >> next = find_proxy_task(); /* Returns owner */
> >>
> >> /* next = owner; */
> >> /* Starts running owner */
> >>
> >> How does ext core swap back the owner context here? Am I missing
> >> something? find_proxy_task() doesn't call put_prev_set_next_task() so
> >> I'm at a loss how we get to set_next_task_scx(owner).
> >
> > The sequence should be the following:
>
> Still a bit confused! Hope you can bear with me for just a little
> bit longer :-)
No, thank you! This is super useful for me! I want to make sure I'm not
missing/misinterpreting anything obvious. :)
>
> >
> > - pick_next_task(rq, rq->donor, &rf) returns donor (because we parked it on the local DSQ)
>
> So put_prev_set_next_task() happens as a part of pick_next_task().
>
> When we pick the donor, we have already called set_next_task(donor)
> on it before returning it from pick_next_task().
>
> "owner" is still not known at this point ...
That seems correct.
>
> > - in __schedule() (still holding rq->lock), proxy sees next->blocked_on and does:
> > - next = find_proxy_task(rq, next, &rf); -> returns owner (or triggers migration / retries)
> > - Only after that, __schedule() reaches the point where it performs the switch
> > (put_prev_set_next_task(rq, prev, next) via the pick path). At that point,
>
> ... and we don't do put_prev_set_next_task(donor, owner) after
> (or within) find_proxy_task() as far as I'm aware. The "donor"
> remains as the task on which we last called put_prev_task().
Also correct.
>
> If you are referring to the bits in your Patch2, the calls to
> put_prev_task() and set_next_task() is done on the same "donor"
> task. It is purely for the sake of adding a balance callback if
> we had skipped migrating away the prev task due to proxy.
>
> AFAIC, nothing does a set_next_task(owner) after
> pick_next_task() in __schedule() unless I'm grossly mistaken.
I think you're right.
Let me try to recap what happens in two different scenarios:
# donor and owner running on the same CPU
The owner runs on CPU0, expires its p->scx.slice, and is de-scheduled and added
to a DSQ; the donor is next, runs, and blocks on a mutex on CPU0. We park the
donor on CPU0's local DSQ, so pick_next_task(rq, rq->donor, &rf) on CPU0 returns
next == donor; we see next->blocked_on is set and trigger find_proxy_task().
Inside find_proxy_task() we see owner_cpu == task_cpu, so find_proxy_task()
returns the owner, replacing next. set_next_task_scx(owner) then triggers
ops_dequeue() + dispatch_dequeue(), removing the owner from its DSQ, followed by
ops.running(owner) (and eventually ops.stopping(owner) when it's switched out).
In this case we don't trigger ops.stopping(donor) + ops.running(donor) during
the proxy switch.
# donor and owner running on different CPUs
The owner runs on CPU0, expires its p->scx.slice, and is de-scheduled and added
to a DSQ; the donor runs on CPU1 and blocks on a mutex there. We park the donor
on CPU1's local DSQ, pick_next_task(rq, rq->donor, &rf) on CPU1 returns the
donor as next, we see next->blocked_on is set, and we trigger find_proxy_task()
on CPU1. find_proxy_task() sees owner_cpu != this_cpu, so it triggers
proxy_migrate_task() to migrate the donor to CPU0, which calls
deactivate_task(donor), unlinking it from CPU1's local DSQ, then
proxy_set_task_cpu(donor, CPU0). But at this point we're not adding the donor to
CPU0's local DSQ. I think this is the missing piece: if we add the donor to
CPU0's local DSQ here, we effectively fall back to the "same CPU" scenario and
(in theory) everything should work.
Something like the following (not tested yet - about to).
Thanks,
-Andrea
kernel/sched/ext.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index af9b10cd82c4a..6125c4cbd6d64 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1915,6 +1915,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
+ /*
+ * Under proxy execution, mutex-blocked donors can be migrated to a
+ * different rq (e.g., towards the mutex owner's CPU). For sched_ext, rq
+ * association alone isn't sufficient for the donor to be picked again
+ * and drive find_proxy_task(); make it immediately visible on the
+ * destination rq by parking it on the built-in local DSQ.
+ *
+ * This task is a scheduling context token and isn't supposed to run as
+ * itself while blocked.
+ */
+ if (unlikely(task_is_blocked(p))) {
+ clear_direct_dispatch(p);
+ dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
+ return;
+ }
+
/* internal movements - rq migration / RESTORE */
if (sticky_cpu == cpu_of(rq))
goto local_norefill;
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
` (8 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
In __schedule(), the proxy-exec donor-stabilization block calls
put_prev_task() and set_next_task() when rq->donor == prev_donor and
prev != next.
For sched_ext tasks, re-entering set_next_task_scx() for a donor that
has already been seen by BPF ops.running via the normal pick path causes
issues. It fires SCX_CALL_OP_TASK(sch, running, rq, donor) a second
time, and sch->ops dispatch can land on a vtable slot in a state that
yields a NULL function pointer or corrupts the stack.
Fix this by skipping the put_prev_task/set_next_task re-entry when the
donor is in the ext_sched_class, since sched_ext tracks curr/donor
itself.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 75541e5bb66d1..1c161dd9d7440 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7147,9 +7147,14 @@ static void __sched notrace __schedule(int sched_mode)
* anything, since B == B. However, A might have
* missed a RT/DL balance opportunity due to being
* on_cpu.
+ *
+ * sched_ext tracks curr/donor itself; re-entering set_next_task_scx
+ * here dispatches through a stale/NULL BPF ops vtable.
*/
- donor->sched_class->put_prev_task(rq, donor, donor);
- donor->sched_class->set_next_task(rq, donor, true);
+ if (donor->sched_class != &ext_sched_class) {
+ donor->sched_class->put_prev_task(rq, donor, donor);
+ donor->sched_class->set_next_task(rq, donor, true);
+ }
}
} else {
rq_set_donor(rq, next);
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 03/10] sched/ext: Split curr|donor references properly
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
` (7 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
From: John Stultz <jstultz@google.com>
With proxy-exec, we want to do the accounting against the donor most of
the time. Without proxy-exec, there should be no difference as the
rq->donor and rq->curr are the same.
So rework the logic to reference the rq->donor where appropriate.
Also add donor info to scx_dump_state().
Since CONFIG_SCHED_PROXY_EXEC currently depends on
!CONFIG_SCHED_CLASS_EXT, this should have no effect (other than the
extra donor output in scx_dump_state), but this is one step needed to
eventually remove that constraint for proxy-exec.
Signed-off-by: John Stultz <jstultz@google.com>
---
kernel/sched/ext.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 7ac7d10a41bef..c410afd28fb6d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1370,17 +1370,17 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
static void update_curr_scx(struct rq *rq)
{
- struct task_struct *curr = rq->curr;
+ struct task_struct *donor = rq->donor;
s64 delta_exec;
delta_exec = update_curr_common(rq);
if (unlikely(delta_exec <= 0))
return;
- if (curr->scx.slice != SCX_SLICE_INF) {
- curr->scx.slice -= min_t(u64, curr->scx.slice, delta_exec);
- if (!curr->scx.slice)
- touch_core_sched(rq, curr);
+ if (donor->scx.slice != SCX_SLICE_INF) {
+ donor->scx.slice -= min_t(u64, donor->scx.slice, delta_exec);
+ if (!donor->scx.slice)
+ touch_core_sched(rq, donor);
}
dl_server_update(&rq->ext_server, delta_exec);
@@ -1504,13 +1504,14 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
if (rq->scx.flags & SCX_RQ_IN_BALANCE)
return;
- if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
- rq->curr->sched_class == &ext_sched_class) {
- rq->curr->scx.slice = 0;
+ if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->donor &&
+ rq->donor->sched_class == &ext_sched_class) {
+ rq->donor->scx.slice = 0;
preempt = true;
}
- if (preempt || sched_class_above(&ext_sched_class, rq->curr->sched_class))
+ if (preempt || sched_class_above(&ext_sched_class,
+ rq->donor->sched_class))
resched_curr(rq);
}
@@ -2634,7 +2635,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
}
/* if the destination CPU is idle, wake it up */
- if (sched_class_above(p->sched_class, dst_rq->curr->sched_class))
+ if (sched_class_above(p->sched_class, dst_rq->donor->sched_class))
resched_curr(dst_rq);
}
@@ -3150,7 +3151,7 @@ static struct task_struct *first_local_task(struct rq *rq)
static struct task_struct *
do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
{
- struct task_struct *prev = rq->curr;
+ struct task_struct *prev = rq->donor;
bool keep_prev;
struct task_struct *p;
@@ -4323,7 +4324,7 @@ static void run_deferred(struct rq *rq)
#ifdef CONFIG_NO_HZ_FULL
bool scx_can_stop_tick(struct rq *rq)
{
- struct task_struct *p = rq->curr;
+ struct task_struct *p = rq->donor;
struct scx_sched *sch = scx_task_sched(p);
if (p->sched_class != &ext_sched_class)
@@ -6355,6 +6356,9 @@ static void scx_dump_cpu(struct scx_sched *sch, struct seq_buf *s,
dump_line(&ns, " curr=%s[%d] class=%ps",
rq->curr->comm, rq->curr->pid,
rq->curr->sched_class);
+ dump_line(&ns, " donor=%s[%d] class=%ps",
+ rq->donor->comm, rq->donor->pid,
+ rq->donor->sched_class);
if (!cpumask_empty(rq->scx.cpus_to_kick))
dump_line(&ns, " cpus_to_kick : %*pb",
cpumask_pr_args(rq->scx.cpus_to_kick));
@@ -7974,7 +7978,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
unsigned long flags;
raw_spin_rq_lock_irqsave(rq, flags);
- cur_class = rq->curr->sched_class;
+ cur_class = rq->donor->sched_class;
/*
* During CPU hotplug, a CPU may depend on kicking itself to make
@@ -7986,7 +7990,7 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
!sched_class_above(cur_class, &ext_sched_class)) {
if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
if (cur_class == &ext_sched_class)
- rq->curr->scx.slice = 0;
+ rq->donor->scx.slice = 0;
cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (2 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
` (6 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
From: John Stultz <jstultz@google.com>
With proxy execution enabled, mutex-blocked tasks stay on the runqueue.
Later, with donor migration, the core scheduler will migrate them when
necessary to boost lock owners.
Don't try to migrate mutex-blocked tasks; the proxy logic will handle
that.
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
kernel/sched/ext.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c410afd28fb6d..d64b1283fa851 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2320,6 +2320,14 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
WARN_ON_ONCE(task_cpu(p) == cpu);
+ /* Make sure tasks aren't on a cpu */
+ if (task_on_cpu(task_rq(p), p))
+ return false;
+
+ /* Don't migrate blocked tasks, proxy-exec will handle this */
+ if (task_is_blocked(p))
+ return false;
+
/*
* If @p has migration disabled, @p->cpus_ptr is updated to contain only
* the pinned CPU in migrate_disable_switch() while @p is being switched
@@ -3063,6 +3071,23 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);
+ /*
+ * Mutex-blocked donors stay queued on the runqueue under proxy
+ * execution, but the donor never runs as itself, proxy-exec
+ * walks the blocked_on chain on the next __schedule() and runs
+ * the lock owner in its place.
+ *
+ * Put the donor on the local DSQ directly, so pick_next_task()
+ * can still see it, find_proxy_task() will be invoked on
+ * next->blocked_on and either run the chain owner here, or call
+ * proxy_force_return() and let BPF make a new dispatch decision
+ * once the task is no longer blocked.
+ */
+ if (task_is_blocked(p)) {
+ dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
+ goto switch_class;
+ }
+
/*
* If @p has slice left and is being put, @p is getting
* preempted by a higher priority scheduler class or core-sched
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task()
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (3 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
` (5 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
When pulling a task from a non-local DSQ, consume_dispatch_q() checks if
the task can run on the destination rq via task_can_run_on_remote_rq().
However, it then drops the destination rq lock and locks the source rq
in consume_remote_task() -> unlink_dsq_and_lock_src_rq(). During this
window, the task might have become migration disabled, making it invalid
to migrate it to the destination rq.
Fix this by re-evaluating task_can_run_on_remote_rq() in
consume_remote_task() after the source rq is locked. If the task can no
longer be migrated, we clear its DSQ association, reset the holding CPU,
and enqueue it to the source rq's local DSQ instead.
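The pattern being fixed here is the classic check/act race: a predicate
is validated under one lock, the lock is dropped, and by the time the
act happens the predicate may no longer hold. A minimal userspace model
of the fix (illustrative only, not kernel code; the helper names are
stand-ins for the functions above) looks like this:

```c
#include <assert.h>
#include <stdbool.h>

struct task { bool migration_disabled; };

/* stand-in for task_can_run_on_remote_rq() */
static bool can_migrate(const struct task *p)
{
	return !p->migration_disabled;
}

static int migrated, requeued_locally;

/*
 * Stand-in for the consume path: check once with the destination rq
 * "locked", let the race window change the task's state while the
 * locks are dropped, then re-check after "relocking" the source rq.
 */
static bool consume(struct task *p, void (*race_window)(struct task *))
{
	if (!can_migrate(p))		/* first check */
		return false;
	race_window(p);			/* locks dropped: state may change */
	if (!can_migrate(p)) {		/* re-check closes the TOCTOU window */
		requeued_locally++;	/* i.e. enqueue on source local DSQ */
		return false;
	}
	migrated++;
	return true;
}

static void no_change(struct task *p) { (void)p; }
static void disable_migration(struct task *p) { p->migration_disabled = true; }
```

With no interference the task migrates; if migration gets disabled
inside the window, the re-check catches it and the task is requeued
locally instead, mirroring the fix above.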
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d64b1283fa851..a70f8693b906f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2418,13 +2418,24 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
!WARN_ON_ONCE(src_rq != task_rq(p));
}
-static bool consume_remote_task(struct rq *this_rq,
+static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
struct task_struct *p, u64 enq_flags,
struct scx_dispatch_q *dsq, struct rq *src_rq)
{
raw_spin_rq_unlock(this_rq);
if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+ if (unlikely(!task_can_run_on_remote_rq(sch, p, this_rq, true))) {
+ p->scx.dsq = NULL;
+ p->scx.holding_cpu = -1;
+ dispatch_enqueue(sch, src_rq, &src_rq->scx.local_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
+ if (sched_class_above(p->sched_class, src_rq->donor->sched_class))
+ resched_curr(src_rq);
+ raw_spin_rq_unlock(src_rq);
+ raw_spin_rq_lock(this_rq);
+ return false;
+ }
move_remote_task_to_local_dsq(p, enq_flags, src_rq, this_rq);
return true;
} else {
@@ -2541,7 +2552,7 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
}
if (task_can_run_on_remote_rq(sch, p, rq, false)) {
- if (likely(consume_remote_task(rq, p, enq_flags, dsq, task_rq)))
+ if (likely(consume_remote_task(sch, rq, p, enq_flags, dsq, task_rq)))
return true;
goto retry;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (4 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
` (4 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
With proxy-exec, pick_next_task() can return a task with blocked_on set
(a proxy donor); put_prev_set_next_task() then calls set_next_task_scx()
on this "ghost" task, which fires ops.running(). However, the task never
actually runs.
If we simply short-circuit set_next_task_scx() for blocked tasks, we
break DSQ bookkeeping. If we only skip ops.running(), we create an
ops.enqueue() -> ops.stopping() sequence with no ops.running() in
between, because ops.stopping() is still called from put_prev_task_scx().
Fix this by introducing a new flag SCX_TASK_IS_RUNNING to track whether
ops.running() was actually called. Skip ops.running() for blocked tasks,
and only call ops.stopping() if SCX_TASK_IS_RUNNING is set. This ensures
that running and stopping callbacks are perfectly paired even when a
blocked task is picked as a proxy donor.
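The pairing invariant can be modeled in a few lines of userspace C
(illustrative only; the two functions are stand-ins for the
set_next_task_scx()/put_prev_task_scx() changes in this patch):

```c
#include <assert.h>
#include <stdbool.h>

#define SCX_TASK_QUEUED     (1 << 0)
#define SCX_TASK_IS_RUNNING (1 << 6)

struct task { unsigned int flags; bool blocked; };

static int running_calls, stopping_calls;

/* stand-in for set_next_task_scx(): skip ops.running() for blocked donors */
static void set_next(struct task *p)
{
	if ((p->flags & SCX_TASK_QUEUED) && !p->blocked) {
		running_calls++;		/* ops.running() */
		p->flags |= SCX_TASK_IS_RUNNING;
	}
}

/* stand-in for put_prev_task_scx(): ops.stopping() only if running fired */
static void put_prev(struct task *p)
{
	if ((p->flags & SCX_TASK_QUEUED) &&
	    (p->flags & SCX_TASK_IS_RUNNING)) {
		stopping_calls++;		/* ops.stopping() */
		p->flags &= ~SCX_TASK_IS_RUNNING;
	}
}
```

A normal task sees one running/stopping pair; a blocked donor sees
neither, so the two counters stay equal no matter how the two kinds of
tasks interleave.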
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/sched/ext.h | 2 ++
kernel/sched/ext.c | 14 +++++++++++---
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d05efcac794d6..5096c05d7a978 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -102,6 +102,8 @@ enum scx_ent_flags {
SCX_TASK_SUB_INIT = 1 << 4, /* task being initialized for a sub sched */
SCX_TASK_IMMED = 1 << 5, /* task is on local DSQ with %SCX_ENQ_IMMED */
+ SCX_TASK_IS_RUNNING = 1 << 6, /* ops.running() has been called */
+
/*
* Bits 8 and 9 are used to carry task state:
*
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a70f8693b906f..b6d29087ec0e8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2164,9 +2164,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
* information meaningful to the BPF scheduler and can be suppressed by
* skipping the callbacks if the task is !QUEUED.
*/
- if (SCX_HAS_OP(sch, stopping) && task_current(rq, p)) {
+ if (SCX_HAS_OP(sch, stopping) && task_current(rq, p) &&
+ (p->scx.flags & SCX_TASK_IS_RUNNING)) {
update_curr_scx(rq);
SCX_CALL_OP_TASK(sch, stopping, rq, p, false);
+ p->scx.flags &= ~SCX_TASK_IS_RUNNING;
}
if (SCX_HAS_OP(sch, quiescent) && !task_on_rq_migrating(p))
@@ -2986,8 +2988,11 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
p->se.exec_start = rq_clock_task(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
- if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED))
+ if (SCX_HAS_OP(sch, running) && (p->scx.flags & SCX_TASK_QUEUED) &&
+ !task_is_blocked(p)) {
SCX_CALL_OP_TASK(sch, running, rq, p);
+ p->scx.flags |= SCX_TASK_IS_RUNNING;
+ }
clr_task_runnable(p, true);
@@ -3076,8 +3081,11 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
update_curr_scx(rq);
/* see dequeue_task_scx() on why we skip when !QUEUED */
- if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED))
+ if (SCX_HAS_OP(sch, stopping) && (p->scx.flags & SCX_TASK_QUEUED) &&
+ (p->scx.flags & SCX_TASK_IS_RUNNING)) {
SCX_CALL_OP_TASK(sch, stopping, rq, p, true);
+ p->scx.flags &= ~SCX_TASK_IS_RUNNING;
+ }
if (p->scx.flags & SCX_TASK_QUEUED) {
set_task_runnable(rq, p);
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (5 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
` (3 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
SCX_CALL_OP_TASK*() stored the subject task in current->scx.kf_tasks[]
and assumed ops would not nest. However, a BPF ops.running() callback
can call kfuncs (e.g. scx_bpf_dsq_insert()) that enqueue work and
trigger enqueue_task_scx() -> ops.runnable(), which invoked
SCX_CALL_OP_TASK again, overwrote kf_tasks[0], and then cleared it,
leaving the outer running context wrong and leading to NULL function
dispatches from BPF helpers.
Save and restore kf_tasks[] (both slots for the two-task variant) around
each invocation so nested task-based ops preserve the outer context.
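The save/restore discipline is the same one used for any re-entrant
context slot. A minimal userspace model (illustrative only; a string
stands in for the kf_tasks[0] pointer):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *kf_task;	/* stand-in for current->scx.kf_tasks[0] */

/* SCX_CALL_OP_TASK-style wrapper: save, set, invoke, restore */
static void call_op(const char *task, void (*op)(void))
{
	const char *saved = kf_task;

	kf_task = task;
	op();
	kf_task = saved;	/* outer context survives nesting */
}

static const char *seen_inner, *seen_after_inner;

static void inner_op(void)
{
	seen_inner = kf_task;	/* e.g. a kfunc checking the subject task */
}

static void outer_op(void)
{
	call_op("inner", inner_op);	/* nested op, e.g. ops.runnable() */
	seen_after_inner = kf_task;	/* must still be the outer task */
}
```

With the old behavior of clearing the slot to NULL on exit, the outer
op would observe a NULL context after the nested call returned; with
save/restore it sees its own task again.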
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b6d29087ec0e8..1ac885eadfa8e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -567,37 +567,50 @@ static s32 scx_cpu_ret(struct scx_sched *sch, s32 cpu_or_cid)
* pi_lock held by try_to_wake_up() with rq tracking via scx_rq.in_select_cpu.
* So if kf_tasks[] is set, @p's scheduler-protected fields are stable.
*
- * kf_tasks[] can not stack, so task-based SCX ops must not nest. The
- * WARN_ON_ONCE() in each macro catches a re-entry of any of the three variants
- * while a previous one is still in progress.
+ * Task-based SCX ops may nest (e.g. ops.running() calling a kfunc that ends up
+ * in enqueue_task_scx() -> ops.runnable()). Save and restore kf_tasks[] around
+ * each invocation so the outer op's context is restored for kfuncs and for
+ * further nested calls. Single-task ops save/restore both slots and clear
+ * kf_tasks[1] while active so a nested call under SCX_CALL_OP_2TASKS_RET does
+ * not leave the outer pair's second task authenticated for kfuncs.
*/
#define SCX_CALL_OP_TASK(sch, op, locked_rq, task, args...) \
do { \
- WARN_ON_ONCE(current->scx.kf_tasks[0]); \
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
+ \
current->scx.kf_tasks[0] = task; \
+ current->scx.kf_tasks[1] = NULL; \
SCX_CALL_OP((sch), op, locked_rq, task, ##args); \
- current->scx.kf_tasks[0] = NULL; \
+ current->scx.kf_tasks[0] = __scx_kf0_sv; \
+ current->scx.kf_tasks[1] = __scx_kf1_sv; \
} while (0)
#define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...) \
({ \
__typeof__((sch)->ops.op(task, ##args)) __ret; \
- WARN_ON_ONCE(current->scx.kf_tasks[0]); \
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
+ \
current->scx.kf_tasks[0] = task; \
+ current->scx.kf_tasks[1] = NULL; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args); \
- current->scx.kf_tasks[0] = NULL; \
+ current->scx.kf_tasks[0] = __scx_kf0_sv; \
+ current->scx.kf_tasks[1] = __scx_kf1_sv; \
__ret; \
})
#define SCX_CALL_OP_2TASKS_RET(sch, op, locked_rq, task0, task1, args...) \
({ \
__typeof__((sch)->ops.op(task0, task1, ##args)) __ret; \
- WARN_ON_ONCE(current->scx.kf_tasks[0]); \
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
+ \
current->scx.kf_tasks[0] = task0; \
current->scx.kf_tasks[1] = task1; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args); \
- current->scx.kf_tasks[0] = NULL; \
- current->scx.kf_tasks[1] = NULL; \
+ current->scx.kf_tasks[0] = __scx_kf0_sv; \
+ current->scx.kf_tasks[1] = __scx_kf1_sv; \
__ret; \
})
@@ -616,8 +629,12 @@ static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
struct task_struct *task,
const struct cpumask *cpumask)
{
- WARN_ON_ONCE(current->scx.kf_tasks[0]);
+ struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0];
+ struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1];
+
+ current->scx.kf_nest++;
current->scx.kf_tasks[0] = task;
+ current->scx.kf_tasks[1] = NULL;
if (rq)
update_locked_rq(rq);
@@ -633,7 +650,9 @@ static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
if (rq)
update_locked_rq(NULL);
- current->scx.kf_tasks[0] = NULL;
+ current->scx.kf_tasks[0] = __scx_kf0_sv;
+ current->scx.kf_tasks[1] = __scx_kf1_sv;
+ current->scx.kf_nest--;
}
/* see SCX_CALL_OP_TASK() */
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (6 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
` (2 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
ops.running() can trigger enqueue_task_scx() -> ops.runnable() on the
same current task, and the kf_tasks[] save/restore alone is still
insufficient for every BPF/kfunc combination, leading to NULL
dispatches and stack corruption.
Track SCX_CALL_OP_TASK nesting in current->scx.kf_nest (incremented by
all SCX_CALL_OP_TASK* macros) and omit the ops.runnable() callback when
non-zero. The full enqueue path, including ops.enqueue(), still runs;
only the runnable hook is skipped in this case.
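The depth counter composes with the save/restore from the previous
patch; a minimal userspace model of just the counter (illustrative
only, stand-ins for the macros and enqueue_task_scx()):

```c
#include <assert.h>

static unsigned int kf_nest;	/* stand-in for current->scx.kf_nest */
static int runnable_calls;

/* SCX_CALL_OP_TASK-style wrapper tracking nesting depth */
static void call_task_op(void (*op)(void))
{
	kf_nest++;
	op();
	kf_nest--;
}

/* stand-in for enqueue_task_scx(): skip ops.runnable() when nested */
static void enqueue(void)
{
	if (!kf_nest)
		runnable_calls++;	/* ops.runnable() */
}

/* an ops.running() callback that re-enters the enqueue path via a kfunc */
static void running_op(void)
{
	enqueue();
}
```

A top-level enqueue fires the runnable hook; the same enqueue reached
from inside a task op does not, and the depth returns to zero when the
outer op completes.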
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
include/linux/sched/ext.h | 7 +++++++
kernel/sched/ext.c | 9 ++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 5096c05d7a978..8c04edf1bc91a 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -197,6 +197,13 @@ struct sched_ext_entity {
s32 holding_cpu;
s32 selected_cpu;
struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */
+ /*
+ * Nesting depth of SCX_CALL_OP_TASK() on this task as %current (e.g.
+ * during schedule() %current is still the previous task). Used to skip
+ * ops.runnable() when invoked from inside another task op such as
+ * ops.running() to avoid breaking BPF re-entrance guarantees.
+ */
+ u32 kf_nest;
struct list_head runnable_node; /* rq->scx.runnable_list */
unsigned long runnable_at;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1ac885eadfa8e..af9b10cd82c4a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -579,11 +579,13 @@ do { \
struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
\
+ current->scx.kf_nest++; \
current->scx.kf_tasks[0] = task; \
current->scx.kf_tasks[1] = NULL; \
SCX_CALL_OP((sch), op, locked_rq, task, ##args); \
current->scx.kf_tasks[0] = __scx_kf0_sv; \
current->scx.kf_tasks[1] = __scx_kf1_sv; \
+ current->scx.kf_nest--; \
} while (0)
#define SCX_CALL_OP_TASK_RET(sch, op, locked_rq, task, args...) \
@@ -592,11 +594,13 @@ do { \
struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
\
+ current->scx.kf_nest++; \
current->scx.kf_tasks[0] = task; \
current->scx.kf_tasks[1] = NULL; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task, ##args); \
current->scx.kf_tasks[0] = __scx_kf0_sv; \
current->scx.kf_tasks[1] = __scx_kf1_sv; \
+ current->scx.kf_nest--; \
__ret; \
})
@@ -606,11 +610,13 @@ do { \
struct task_struct *__scx_kf0_sv = current->scx.kf_tasks[0]; \
struct task_struct *__scx_kf1_sv = current->scx.kf_tasks[1]; \
\
+ current->scx.kf_nest++; \
current->scx.kf_tasks[0] = task0; \
current->scx.kf_tasks[1] = task1; \
__ret = SCX_CALL_OP_RET((sch), op, locked_rq, task0, task1, ##args); \
current->scx.kf_tasks[0] = __scx_kf0_sv; \
current->scx.kf_tasks[1] = __scx_kf1_sv; \
+ current->scx.kf_nest--; \
__ret; \
})
@@ -2067,7 +2073,8 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
rq->scx.nr_running++;
add_nr_running(rq, 1);
- if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p))
+ if (SCX_HAS_OP(sch, runnable) && !task_on_rq_migrating(p) &&
+ !READ_ONCE(current->scx.kf_nest))
SCX_CALL_OP_TASK(sch, runnable, rq, p, enq_flags);
if (enq_flags & SCX_ENQ_WAKEUP)
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (7 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
2026-05-09 1:00 ` [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible " Tejun Heo
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
Proxy execution switches a donor's execution context to the mutex owner,
so the owner can make progress while the donor remains on the runqueue.
This logic might be incompatible with some sched_ext schedulers: the BPF
scheduler picks tasks through its own dispatch interface, and a
proxy-exec switch may end up running a task the BPF scheduler never
dispatched. This mismatch can break BPF context: sched_ext callbacks
fire against a task that isn't the one the BPF scheduler tracks as
running, so any kfunc they invoke operates on an inconsistent view of
the current task.
Therefore, when sched_ext is enabled, disable proxy-exec context
donation by default:
- Force try_to_block_task() to actually block a mutex-blocked prev
instead of keeping it on the rq as a donor.
- Skip find_proxy_task() in the pick path. Clear any leftover
PROXY_WAKING marker set by the mutex handoff, since
find_proxy_task() is no longer there to do it; otherwise the task
trips the blocked_on mismatch WARN in __set_task_blocked_on() when
it resumes the mutex_lock() retry loop.
However, some schedulers may not consider proxy execution a real
"task switch" but rather a "function call": the donor effectively
executes the lock owner's critical section, so the switch does not
represent a true change in scheduling ownership.
To handle both semantics, add a boot-time knob to enable proxy execution
under sched_ext when explicitly desired:
sched_proxy_exec_scx=0|1
The default is 0, keeping proxy-exec disabled for the reasons described
above. Setting it to 1 allows donor->owner context switch even with
sched_ext enabled.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
.../admin-guide/kernel-parameters.txt | 6 +++
kernel/sched/core.c | 47 ++++++++++++++++++-
2 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4510b4b3c4165..f73c12e9645de 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6821,6 +6821,12 @@ Kernel parameters
solution to mutex-based priority inversion.
Format: <bool>
+ sched_proxy_exec_scx= [KNL]
+ Enables or disables proxy execution when sched_ext is
+ enabled. The default is disabled, meaning proxy-exec
+ context donation is suppressed while sched_ext is active.
+ Format: <bool>
+
sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages.
schedstats= [KNL,X86] Enable or disable scheduled statistics.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1c161dd9d7440..0f714c6613771 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -151,14 +151,52 @@ static int __init setup_proxy_exec(char *str)
}
return 1;
}
+
+DEFINE_STATIC_KEY_FALSE(__sched_proxy_exec_scx);
+static __always_inline bool sched_proxy_exec_scx(void)
+{
+ return static_branch_unlikely(&__sched_proxy_exec_scx);
+}
+
+static int __init setup_proxy_exec_scx(char *str)
+{
+ bool proxy_scx_enable = false;
+
+ if (*str && kstrtobool(str + 1, &proxy_scx_enable)) {
+ pr_warn("Unable to parse sched_proxy_exec_scx=\n");
+ return 0;
+ }
+
+ if (proxy_scx_enable) {
+ pr_info("sched_proxy_exec_scx enabled via boot arg\n");
+ static_branch_enable(&__sched_proxy_exec_scx);
+ } else {
+ pr_info("sched_proxy_exec_scx disabled via boot arg\n");
+ static_branch_disable(&__sched_proxy_exec_scx);
+ }
+
+ return 1;
+}
#else
static int __init setup_proxy_exec(char *str)
{
pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so it cannot be enabled or disabled at boot time\n");
return 0;
}
+
+static __always_inline bool sched_proxy_exec_scx(void)
+{
+ return false;
+}
+
+static int __init setup_proxy_exec_scx(char *str)
+{
+ pr_warn("CONFIG_SCHED_PROXY_EXEC=n, so sched_proxy_exec_scx= is ignored\n");
+ return 0;
+}
#endif
__setup("sched_proxy_exec", setup_proxy_exec);
+__setup("sched_proxy_exec_scx", setup_proxy_exec_scx);
/*
* Debugging: various feature bits
@@ -7111,7 +7149,8 @@ static void __sched notrace __schedule(int sched_mode)
* task_is_blocked() will always be false).
*/
try_to_block_task(rq, prev, &prev_state,
- !task_is_blocked(prev));
+ !task_is_blocked(prev) ||
+ (scx_enabled() && !sched_proxy_exec_scx()));
switch_count = &prev->nvcsw;
}
@@ -7123,6 +7162,12 @@ static void __sched notrace __schedule(int sched_mode)
struct task_struct *prev_donor = rq->donor;
rq_set_donor(rq, next);
+ if (scx_enabled() && !sched_proxy_exec_scx()) {
+ if (unlikely(next->blocked_on))
+ clear_task_blocked_on(next, PROXY_WAKING);
+ goto picked;
+ }
+
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
if (!next) {
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (8 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
@ 2026-05-06 17:45 ` Andrea Righi
2026-05-09 1:00 ` [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible " Tejun Heo
10 siblings, 0 replies; 21+ messages in thread
From: Andrea Righi @ 2026-05-06 17:45 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min, John Stultz
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Joel Fernandes, sched-ext, linux-kernel
Now that sched_ext supports proxy execution, allow enabling both options
together.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
init/Kconfig | 2 --
1 file changed, 2 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308aec..6b18ba7263f0b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -934,8 +934,6 @@ config SCHED_PROXY_EXEC
bool "Proxy Execution"
# Avoid some build failures w/ PREEMPT_RT until it can be fixed
depends on !PREEMPT_RT
- # Need to investigate how to inform sched_ext of split contexts
- depends on !SCHED_CLASS_EXT
# Not particularly useful until we get to multi-rq proxying
depends on EXPERT
help
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
` (9 preceding siblings ...)
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
@ 2026-05-09 1:00 ` Tejun Heo
2026-05-10 15:06 ` Andrea Righi
10 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2026-05-09 1:00 UTC (permalink / raw)
To: arighi, void, changwoo, jstultz
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, kprateek.nayak,
christian.loehle, kobak, joelagnelf, emil, sched-ext,
linux-kernel
Hello,
I'm a bit worried this is more invasive than what it buys. Even with
the full series, the cross-CPU gap Prateek raised stays open -
find_proxy_task() doesn't go through put_prev_set_next_task(), so owner
runs without ops.running(owner). Closing that seems to need yet another
protocol on top, either synthetic running/stopping events or scx core
taking over dispatch_dequeue for substitutions. The BPF scheduler ends
up dispatching tasks it didn't pick and observing callbacks for tasks
it didn't enqueue, which feels too magical and error-prone.
Maybe worth considering an alternative where, when scx is loaded, we
just turn proxy-exec off entirely and expose blocked_on to the BPF
scheduler. Schedulers that want PI can implement it themselves on top
of the relationship; ones that don't pay nothing.
scx_enable could flip the proxy_exec static branch off, after which the
existing gates in __schedule keep blocked tasks off the runqueue and
skip find_proxy_task on their own. The remaining concern is in-flight
donors at the moment of the flip - the existing scx_bypass walk already
visits every rq's runnable list during enable, and could force-block
any task it sees with blocked_on set. Mutex unlock would re-wake them
through wake_q normally after that. blocked_on itself is set and
cleared in mutex.c regardless of proxy_exec, so the signal we'd want
to surface is already there.
For the BPF side, the natural shape seems to be tagging the existing
ops.quiescent and ops.runnable callbacks with a bit indicating "this
sleep/wake was a mutex transition," plus a small kfunc that returns
the owner of the mutex p is blocked on. A scheduler that wants PI then
records the owner in its own task storage on the quiescent side, boosts
it via the existing vtime / slice / dsq_move / kick primitives, and
drops the boost when the runnable side fires. No new dispatch protocol,
the BPF scheduler stays in charge of who runs.
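In rough strokes, such a scheduler might look like the sketch below.
Everything here is hypothetical: the SCX_DEQ_MUTEX_BLOCKED /
SCX_ENQ_MUTEX_WAKEUP flags, the scx_bpf_blocked_on_owner() kfunc, and
the record_and_boost()/drop_boost() helpers do not exist today and are
named only to illustrate the shape of the proposal.

```c
/*
 * Hypothetical BPF-side sketch of scheduler-implemented PI; none of
 * the flag, kfunc, or helper names below exist in current sched_ext.
 */
void BPF_STRUCT_OPS(pi_quiescent, struct task_struct *p, u64 deq_flags)
{
	if (deq_flags & SCX_DEQ_MUTEX_BLOCKED) {
		/* hypothetical kfunc: owner of the mutex @p blocks on */
		struct task_struct *owner = scx_bpf_blocked_on_owner(p);

		if (owner)
			/*
			 * Record the owner in task storage and boost it
			 * with existing primitives, e.g. shrink its
			 * vtime, extend its slice, or kick its CPU.
			 */
			record_and_boost(p, owner);
	}
}

void BPF_STRUCT_OPS(pi_runnable, struct task_struct *p, u64 enq_flags)
{
	if (enq_flags & SCX_ENQ_MUTEX_WAKEUP)
		drop_boost(p);	/* the blocker woke: undo the boost */
}
```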
Does that direction seem reasonable, or am I missing something that
makes it not work?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext
2026-05-09 1:00 ` [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible " Tejun Heo
@ 2026-05-10 15:06 ` Andrea Righi
2026-05-10 19:41 ` Tejun Heo
0 siblings, 1 reply; 21+ messages in thread
From: Andrea Righi @ 2026-05-10 15:06 UTC (permalink / raw)
To: Tejun Heo
Cc: void, changwoo, jstultz, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, kprateek.nayak, christian.loehle, kobak, joelagnelf,
emil, sched-ext, linux-kernel
Hi Tejun,
On Fri, May 08, 2026 at 03:00:59PM -1000, Tejun Heo wrote:
> Hello,
>
> I'm a bit worried this is more invasive than what it buys. Even with
> the full series, the cross-CPU gap Prateek raised stays open -
> find_proxy_task() doesn't go through put_prev_set_next_task(), so owner
> runs without ops.running(owner). Closing that seems to need yet another
> protocol on top, either synthetic running/stopping events or scx core
> taking over dispatch_dequeue for substitutions. The BPF scheduler ends
> up dispatching tasks it didn't pick and observing callbacks for tasks
> it didn't enqueue, which feels too magical and error-prone.
>
> Maybe worth considering an alternative where, when scx is loaded, we
> just turn proxy-exec off entirely and expose blocked_on to the BPF
> scheduler. Schedulers that want PI can implement it themselves on top
> of the relationship; ones that don't pay nothing.
>
> scx_enable could flip the proxy_exec static branch off, after which the
> existing gates in __schedule keep blocked tasks off the runqueue and
> skip find_proxy_task on their own. The remaining concern is in-flight
> donors at the moment of the flip - the existing scx_bypass walk already
> visits every rq's runnable list during enable, and could force-block
> any task it sees with blocked_on set. Mutex unlock would re-wake them
> through wake_q normally after that. blocked_on itself is set and
> cleared in mutex.c regardless of proxy_exec, so the signal we'd want
> to surface is already there.
>
> For the BPF side, the natural shape seems to be tagging the existing
> ops.quiescent and ops.runnable callbacks with a bit indicating "this
> sleep/wake was a mutex transition," plus a small kfunc that returns
> the owner of the mutex p is blocked on. A scheduler that wants PI then
> records the owner in its own task storage on the quiescent side, boosts
> it via the existing vtime / slice / dsq_move / kick primitives, and
> drops the boost when the runnable side fires. No new dispatch protocol,
> the BPF scheduler stays in charge of who runs.
>
> Does that direction seem reasonable, or am I missing something that
> makes it not work?
Thanks for looking at this and laying it out. Let me try to address
your concerns and the alternative approach you're proposing in more
detail.
On the cross-CPU gap Prateek raised: you're right that find_proxy_task()
substitutes the owner without going through put_prev_set_next_task(), so neither
ops.stopping(donor) nor ops.running(owner) fires for that substitution. But I'd
argue this is less critical than it looks:
1) For the ops.running(owner) side specifically, I don't think skipping it is
   actually a correctness problem. With proxy-exec the owner is not really
   "the task that is running" in any scheduling sense: what runs is the donor,
   the donor's slice is what gets consumed, and the donor is what BPF
   dispatched. The owner just happens to be the execution context the kernel
   uses to make the critical section progress, more like a function call inside
   the donor's quantum than a real task switch. If we frame it that way,
   ops.running(donor) + ops.stopping(donor) is the pairing the BPF scheduler
   should observe.
2) The cases where the owner is on a different CPU don't go through the
   substitution path at all: find_proxy_task() either migrates the donor over
   (proxy_migrate_task()) or sends it back via proxy_force_return(). In both cases the
receiving CPU's __schedule() does pick again, so ops.running() fires
normally on that CPU for whatever gets picked next. The "ghost owner runs
without ops.running()" only happens when the chain resolves locally, i.e.,
when the owner was already on the same rq's runnable list. That should
narrow the surface considerably.
About dispatching tasks BPF didn't pick / observing callbacks for tasks BPF
didn't enqueue: point 1 above is essentially an answer to that. If we treat the
donor as the running task and the owner substitution as an internal kernel
detail (a "function call" in the donor's context), then BPF only ever sees
callbacks for tasks it actually dispatched.
That said, your alternative proposal is also appealing in that it gets sched_ext
out of the proxy-exec dispatch protocol entirely, which is the part that is
genuinely invasive. But I think there are some gaps to close before the "BPF
rolls its own proxy-exec" model is workable.
Let's say we expose blocked_on (and a kfunc returning the mutex owner) via
tagged ops.quiescent/runnable(). The BPF scheduler now wants to boost the owner.
What's the actual way to do so? Some mechanisms that we have right now:
- slice extension: scx_bpf_task_set_slice() works in place, but it affects
  only a running owner.
- dsq_vtime: scx_bpf_task_set_dsq_vtime() updates the value, but for a task
already enqueued in a PRIQ DSQ the position in the rbtree doesn't move, so
this doesn't actually boost an already-queued owner.
- DSQ move: scx_bpf_dsq_move() requires an iterator and the task to have been
queued before iteration started. We don't have a kfunc today that takes a
task pointer and atomically yanks it from wherever it is to a higher-priority
DSQ. We also have no API exposing which DSQ a task is currently sitting in.
- scx_bpf_dsq_insert(SCX_DSQ_LOCAL) + SCX_ENQ_HEAD|SCX_ENQ_PREEMPT: it probably
works to run the owner immediately on its CPU, if we have a way to
re-enqueue it.
So, to make the BPF-side proxy-exec model real, I think we'd need at least:
1) A kfunc that returns the DSQ id a task is currently enqueued on (or
NULL/SCX_DSQ_INVALID if running), so the BPF scheduler can locate the owner.
2) A kfunc that removes a task by pointer from its current DSQ and triggers a
re-enqueue (or inserts the task into another DSQ).
Without these kfuncs, a BPF scheduler that wants to support proxy-exec has no
concrete way to actually boost the owner.
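To make the second primitive concrete, here's a minimal userspace C mock of the
"remove by task pointer and reinsert" semantics such a kfunc would need. All
names here are made up for illustration (this is not the actual scx/kernel API),
and a DSQ is modeled as a plain doubly-linked list:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mock structures; a DSQ is just a doubly-linked list here. */
struct mock_dsq;

struct mock_task {
	int pid;
	struct mock_dsq *dsq;		/* DSQ the task currently sits on */
	struct mock_task *prev, *next;
};

struct mock_dsq {
	unsigned long long id;
	struct mock_task *head, *tail;
};

/* Append @p to the tail of @dsq and record the membership. */
static void dsq_insert(struct mock_dsq *dsq, struct mock_task *p)
{
	p->dsq = dsq;
	p->prev = dsq->tail;
	p->next = NULL;
	if (dsq->tail)
		dsq->tail->next = p;
	else
		dsq->head = p;
	dsq->tail = p;
}

/*
 * Hypothetical "move by task pointer" primitive: unlink @p from
 * whatever DSQ it is on and append it to @to. Returns 0 on success,
 * -1 if @p is not queued anywhere (e.g. currently running).
 */
static int dsq_move_task(struct mock_task *p, struct mock_dsq *to)
{
	struct mock_dsq *from = p->dsq;

	if (!from)
		return -1;
	if (p->prev)
		p->prev->next = p->next;
	else
		from->head = p->next;
	if (p->next)
		p->next->prev = p->prev;
	else
		from->tail = p->prev;
	dsq_insert(to, p);
	return 0;
}
```

The real kfunc would of course also need the locking dance across rq/DSQ locks,
which is exactly the part the mock glosses over.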
If we add those primitives, the alternative seems reasonable: scx disables
proxy-exec, the bypass-style walk you described handles in-flight donors at flip
time, and proxy-exec with sched_ext becomes a BPF-side policy. I'm willing to
experiment in that direction if we think the primitives above are acceptable to
add.
Thanks,
-Andrea
^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext
2026-05-10 15:06 ` Andrea Righi
@ 2026-05-10 19:41 ` Tejun Heo
0 siblings, 0 replies; 21+ messages in thread
From: Tejun Heo @ 2026-05-10 19:41 UTC (permalink / raw)
To: Andrea Righi
Cc: void, changwoo, jstultz, mingo, peterz, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, kprateek.nayak, christian.loehle, kobak, joelagnelf,
emil, sched-ext, linux-kernel
Hello,
I'll think more on enabling proxy execution as-is for sched_ext. Response on
scx_bpf_dsq_move():
On Sun, May 10, 2026 at 05:06:41PM +0200, Andrea Righi wrote:
...
> Let's say we expose blocked_on (and a kfunc returning the mutex owner) via
> tagged ops.quiescent/runnable(). The BPF scheduler now wants to boost the owner.
> What's the actual way to do so? Some mechanisms that we have right now:
> - slice extension: scx_bpf_task_set_slice() works in place, but it affects
> only a running owner,
> - dsq_vtime: scx_bpf_task_set_dsq_vtime() updates the value, but for a task
> already enqueued in a PRIQ DSQ the position in the rbtree doesn't move, so
> this doesn't actually boost an already-queued owner.
> - DSQ move: scx_bpf_dsq_move() requires an iterator and the task to have been
> queued before iteration started. We don't have a kfunc today that takes a
> task pointer and atomically yanks it from wherever it is to a higher-priority
> DSQ. We also have no API exposing which DSQ a task is currently sitting in.
Assuming ->blocked_on() is triggered without rq lock held (if not, we just
need to tell scx_bpf_dsq_move() that it can do lock dancing in this context
too), we should already be able to move the task directly:
p->scx.dsq->id should already be accessible through BPF_CORE_READ(). Maybe
we can make it a bit nicer.
scx_bpf_dsq_move() doesn't actually need the task to come from iteration.
It's a bit odd but we're overloading the iterator for two purposes -
iteration and transaction scope definition. If a task is dequeued and
reenqueued after iteration is opened, scx_bpf_dsq_move() ignores the move as
the visit is considered stale. scx_bpf_dsq_move() only depends on this part.
The following is an excerpt from the function comment:
* For the transfer to be successful, @p must still be on the DSQ and have been
* queued before the DSQ iteration started. This function doesn't care whether
* @p was obtained from the DSQ iteration. @p just has to be on the DSQ and have
* been queued before the iteration started.
So, for an example, ->blocked_on() can do:
void my_sched_blocked_on(struct task_struct *p, struct task_struct *blocker)
{
	u64 dsq_id = BPF_CORE_READ(p, scx.dsq, id);
	struct bpf_iter_scx_dsq it;

	if (!bpf_iter_scx_dsq_new(&it, dsq_id, 0)) {
		scx_bpf_dsq_move(&it, p, SCX_DSQ_LOCAL_ON | WHATEVER_CPU_WE_PICK,
				 SCX_ENQ_PREEMPT);
		bpf_iter_scx_dsq_destroy(&it);
	}
}
This is not the prettiest but the above should do all that's needed and
nothing else. It just looks up the dsq and remembers the insert sequence. No
actual iteration happens. Again, it'd be trivial to add BPF helpers or extra
kfuncs to make this nicer.
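To spell out the staleness rule in isolation: the transaction-scope behavior
(moves of tasks dequeued and re-enqueued after the iterator was opened are
ignored) can be modeled with an insertion sequence counter. A userspace sketch,
with all names made up and none of the real locking:

```c
#include <assert.h>

/* Each DSQ keeps a monotonically increasing insertion sequence. */
struct mock_dsq {
	unsigned long long seq;		/* bumped on every insert */
};

struct mock_task {
	unsigned long long enq_seq;	/* dsq->seq at enqueue time */
	int queued;
};

struct mock_iter {
	struct mock_dsq *dsq;
	unsigned long long open_seq;	/* snapshot when iterator opened */
};

static void dsq_enqueue(struct mock_dsq *dsq, struct mock_task *p)
{
	p->enq_seq = ++dsq->seq;
	p->queued = 1;
}

static void iter_open(struct mock_iter *it, struct mock_dsq *dsq)
{
	it->dsq = dsq;
	it->open_seq = dsq->seq;
}

/*
 * Mimic of the scx_bpf_dsq_move() rule: the move only succeeds if @p
 * is still queued and was enqueued before the iterator was opened,
 * i.e. the visit is not stale. No actual iteration is needed.
 */
static int dsq_move(struct mock_iter *it, struct mock_task *p)
{
	if (!p->queued || p->enq_seq > it->open_seq)
		return -1;	/* stale or gone: move ignored */
	p->queued = 0;		/* "moved" off this DSQ */
	return 0;
}
```

That's the whole contract the ->blocked_on() snippet above relies on: open the
iterator just to pin the sequence, move, destroy.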
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread