Sched_ext development
From: K Prateek Nayak <kprateek.nayak@amd.com>
To: Andrea Righi <arighi@nvidia.com>
Cc: John Stultz <jstultz@google.com>, Tejun Heo <tj@kernel.org>,
	David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Koba Ko <kobak@nvidia.com>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	<sched-ext@lists.linux.dev>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
Date: Thu, 7 May 2026 13:15:22 +0530	[thread overview]
Message-ID: <bfb4c8ca-e155-41b8-ab00-1607d9823051@amd.com> (raw)
In-Reply-To: <afwxsJiNJcNQrcHN@gpd4>



On 5/7/2026 12:01 PM, Andrea Righi wrote:
> Hi John, Prateek,
> 
> On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
>> Hello John, Andrea,
>>
>> (Full disclaimer: I haven't looked at the entire series)
>>
>> On 5/7/2026 2:39 AM, John Stultz wrote:
>>>> +                       /*
>>>> +                        * Tasks pinned to a single CPU (per-CPU kthreads via
>>>> +                        * kthread_bind(), tasks under migrate_disable()) cannot
>>>> +                        * be moved to @owner_cpu. proxy_migrate_task() uses
>>>> +                        * __set_task_cpu() which would silently violate the
>>>> +                        * pinning and leave the task to run on a CPU outside
>>>> +                        * its cpus_ptr once it is unblocked. Stay on this CPU
>>>> +                        * via force_return; the owner running elsewhere will
>>>> +                        * wake @p back up when the mutex becomes available.
>>>> +                        */
>>>> +                       if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
>>>> +                               goto force_return;
>>>>                         goto migrate_task;
>>>
>>> Hey Andrea!
>>>   I'm excited to see this series! Thanks for your efforts here!
>>>
>>> Though I'm a bit confused on this patch.  I see the patch changes it
>>> so we don't proxy-migrate pinned/migration-disabled tasks, but I'm
>>> not sure I understand why.
>>>
>>> We only proxy-migrate blocked_on tasks, which don't run on the cpu
>>> they are migrated to (they are only migrated to be used as a donor).
>>> That's why we have the proxy_force_return() function to return-migrate
>>> them back when they do become runnable.
>>
>> I agree this shouldn't be a problem from core perspective but there
>> are some interesting sched-ext interactions possible. More on that
>> below:
> 
> So, I included this patch because in a previous version of this series it
> was preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task"
> error.
> 
> However, I tried this series again without this patch and everything seems
> to work. I guess this was fixed by "sched/ext: Avoid migrating blocked
> tasks with proxy execution", which was not present in my previous early
> implementation. So, let's ignore this for now...
> 
>>
>>>
>>> Could you provide some more details about what motivated this change
>>> (ie: how you tripped a problem that it resolved?).
>>
>> I think ops.enqueue() always assumes that the task being enqueued is
>> runnable on task_cpu(), and when the sched-ext layer tries to
>> dispatch this task to a local DSQ, the ext core complains and marks
>> the sched-ext scheduler as buggy.
> 
> Correct that ops.enqueue() assumes that the task being enqueued is runnable on
> task_cpu(), but this should still be true even when the donor is migrated:
> proxy-exec should only migrate the donor to the owner's CPU when the placement
> is allowed.

Not really - it'll migrate the donor to the owner's CPU even if that
CPU is outside the donor's affinity mask, with the reasoning that the
donor will never run there - it only exists on the runqueue to donate
its time to the lock owner.

But if you mean runnable in the sense that it hasn't blocked, then yes,
it remains SCX_TASK_QUEUED and goes through set_task_runnable().
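For my own clarity, here is a tiny userspace model of the two rules we
are discussing - the patch's force-return guard, and the existing
proxy-migration reasoning that a blocked donor may sit outside its
affinity mask. This is illustrative only, not kernel code; the names
`task_model`, `must_force_return()` and `donor_placement_ok()` are
made up:

```c
#include <stdbool.h>

/* Userspace sketch; field names loosely mirror struct task_struct
 * but this is an illustrative model, not kernel code. */
struct task_model {
    int nr_cpus_allowed;      /* size of the affinity mask */
    bool migration_disabled;  /* under migrate_disable() */
    bool blocked_on_mutex;    /* a donor blocked on a mutex */
};

/* The guard from the patch hunk: strictly pinned or
 * migration-disabled tasks are never proxy-migrated. */
static bool must_force_return(const struct task_model *p)
{
    return p->nr_cpus_allowed == 1 || p->migration_disabled;
}

/* The point above: placing a blocked donor on a CPU outside its
 * affinity mask is tolerable because the donor never runs there -
 * it only sits on the runqueue to donate its time to the owner. */
static bool donor_placement_ok(const struct task_model *p, bool cpu_in_mask)
{
    return cpu_in_mask || p->blocked_on_mutex;
}
```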

> 
>>
>> With sched-ext, even the lock owner's CPU is slightly complicated:
>> the owner might be associated with a CPU while actually sitting on a
>> custom DSQ, and after moving the donor to the owner's CPU, we need
>> the sched-ext scheduler to guarantee that the owner runs there, or
>> else there is no point in doing a proxy.
> 
> But a donor is always a running task (by definition), so it can't be on a custom
> DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
> waiting to be dispatched.

I was thinking more from a proxy migration standpoint - when the donor
and the owner are on different CPUs and the core.c bits move the donor
to the owner's CPU.

> 
> The core keeps the donor logically runnable / on_rq and the ext core always
> parks blocked donors on the built-in local DSQ:
> 
> put_prev_task_scx():
> 	...
>         if (p->scx.flags & SCX_TASK_QUEUED) {
> 		set_task_runnable(rq, p);
> 
> 		if (task_is_blocked(p)) {
> 			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
> 			goto switch_class;
> 		}
> 	...

Ah! This is what I was missing. But then, this task gets picked and
is moved by find_proxy_task() in core.c, right?

> 
>>
>> scx flow should look something like (please correct me if I'm
>> wrong):
>>
>>      CPU0: donor                    CPU1: owner
>>      ===========                    ===========
>>
>>   /* Donor is retained on rq */
>>   put_prev_task_scx()
>>     ops.stopping()
>>   ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
>>   do_pick_task_scx()
>>     next = donor;
>>   find_proxy_task()
>>     proxy_migrate_task()
>>       ops.dequeue()
>>         ======================> /*

At this point I mean ^

>>                                  * Moves to owner CPU (May be outside of affinity list)
>>                                  * ops.enqueue() still happens on CPU0 but I've shown it
>>                                  * here to depict the context has moved to owner's CPU.
>>                                  */
>>                                 ops.enqueue()
>>                                   scx_bpf_dsq_insert()
>>                                     /*
>>                                      * !!! Cannot dispatch to local CPU; Outside affinity !!!
>>                                      *
>>                                      * We need to allow local dispatch outside affinity iff:
>>                                      *
>>                                      *   p->is_blocked && cpu == task_cpu(p)
>>                                      *
>>                                      * Since enqueue_task_scx() holds the task's rq_lock, the
>>                                      * is_blocked indicator should be stable during a dispatch.
>>                                      */
>>                                 ops.dispatch()
>>                                 do_pick_task_scx()
>>                                   set_next_task_scx()
>>                                     ops.running(donor)
>>                                 find_proxy_task()
>>                                   next = owner
>>                                 /*
>>                                  * !!! Owner starts running without any notification. !!!
>>                                  * 
>>                                  * If owner blocks, dequeue_task_scx() is executed first and
>>                                  * the sched-ext scheduler sees:
>>                                  *
>>                                  *   ops.stopping(owner)
>>                                  *
>>                                  * which leads to some asymmetry.
>>                                  *
>>                                  * XXX: Below is how I imagine the flow should continue.
>>                                  */
>>                                 ops.quiescent(owner) /* Core is taking back control of owner's running */
>>                                 /* Runs owner */
>>                                 ops.runnable(owner) /* Core is giving back control to ext layer */
>>                                 ops.stopping(donor); /* Accounting symmetry for donor */
> 
> I think the order of operations should be the following:
> 
> ops.runnable(donor)
>    -> ops.enqueue(donor)
>    -> donor becomes curr
>    -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
>    -> donor executes
>    -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
>    -> __schedule()
>    -> pick_next -> proxy-exec selects owner as next
>    -> put_prev_task_scx(donor)
>         -> ops.stopping(donor)
>         -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
>    -> set_next_task_scx(owner)
>         -> ops.running(owner)

So ext will just switch the context back to owner? But how does this
happen with the changes in your series?

Based on my understanding, this happens:

    -> pick_next -> sched-ext returns donor as next
    /* prev's context is put back */
    -> set_next_task_scx(donor)
      -> ops.running(donor)

    /* In core.c */

    /* next = donor */
    if (next->blocked_on) /* true since we have blocked donor */
       next = find_proxy_task(); /* Returns owner */

    /* next = owner; */
    /* Starts running owner */

How does ext core swap back the owner context here? Am I missing
something? find_proxy_task() doesn't call put_prev_set_next_task() so
I'm at a loss how we get to set_next_task_scx(owner).
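To make the asymmetry I'm worried about concrete, here is a toy model
(userspace only, made-up names) that just counts running/stopping
callbacks for the sequence I described above; if my reading is right,
both the donor and the owner end up with unbalanced pairs:

```c
#include <stdbool.h>

/* Illustrative model (not kernel code) of the callback sequence in
 * question: counts of ops.running()/ops.stopping() per task, with
 * task 0 = donor and task 1 = owner. */
struct oplog {
    int running[2];
    int stopping[2];
};

/* The sequence I believe happens with the current series:
 * set_next_task_scx() fires ops.running(donor), then core.c swaps in
 * the owner via find_proxy_task() with no further scx callbacks, so
 * when the owner later blocks we see ops.stopping(owner) that was
 * never preceded by ops.running(owner). */
static void feared_sequence(struct oplog *log)
{
    log->running[0]++;   /* set_next_task_scx(donor) -> ops.running(donor) */
    /* find_proxy_task() returns owner; no scx callback fires here */
    log->stopping[1]++;  /* owner blocks -> ops.stopping(owner) */
}

static bool pairs_balanced(const struct oplog *log, int task)
{
    return log->running[task] == log->stopping[task];
}
```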

>    -> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */
> 
>   Later, when the owner is switched away (another schedule)
> 
>   ... owner running ...
>    -> __schedule() / switch away from owner
>    -> put_prev_task_scx(owner)
>         -> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
>    -> set_next_task_scx() /* whoever is next */
> 
>    Later, mutex is released - donor can run as itself again
> 
>    -> mutex released / donor unblocked (!task_is_blocked(donor))
>    -> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
>    -> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
>    -> donor executes as rq->curr
> 
>> I think dequeue_task_scx() should check task_current_donor() before
>> calling ops.stopping(), or else we get some asymmetry. The donor will
>> be placed back via put_prev_task_scx() anyway, and since it hasn't
>> run, it cannot block itself, so there should be no dependency on
>> dequeue_task_scx() for donors.
> 
> The ops.running/stopping() pair should always be enforced by
> SCX_TASK_IS_RUNNING, so we either see both of them or neither. So in
> theory, there shouldn't be any asymmetry.
> 
>>
>> With the quiescent() + runnable() scheme, the sched-ext schedulers need
>> to be made aware that a task can go quiescent() and then back to
>> runnable() while remaining SCX_TASK_QUEUED, or the ext core has to
>> spoof a full:
>>
>>   dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
>>
>> Also, since the mutex owner can block, the sched-ext scheduler needs
>> to be aware that it can see a dequeue() -> quiescent() without a
>> stopping() in between, if we plan to keep symmetry.
> 
> We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
> proxy-exec: if a task becomes runnable and then it's moved to a different sched
> class, the BPF scheduler can see ops.runnable/quiescent() without
> ops.running/stopping().

Ack!
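So the invariant a BPF scheduler can rely on reduces to pair-wise
symmetry, which I'd summarize with this trivial sketch (made-up
userspace names, not the kernel's):

```c
#include <stdbool.h>

/* Sketch of the invariant: ops.runnable/quiescent() and
 * ops.running/stopping() must each pair up, but a runnable/quiescent
 * pair with no running/stopping in between is legal - e.g. a task
 * becomes runnable and is moved to a different sched class before it
 * ever gets to run. */
struct cb_counts {
    int runnable, quiescent, running, stopping;
};

static bool callbacks_symmetric(const struct cb_counts *c)
{
    return c->runnable == c->quiescent && c->running == c->stopping;
}
```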

> 
> As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
> think we're fine.

I think it is mostly symmetric, apart from the one scenario above that
I'm still confused about.

-- 
Thanks and Regards,
Prateek



Thread overview: 21+ messages
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-05-06 21:09   ` John Stultz
2026-05-07  3:34     ` K Prateek Nayak
2026-05-07  6:31       ` Andrea Righi
2026-05-07  7:45         ` K Prateek Nayak [this message]
2026-05-07 10:13           ` Andrea Righi
2026-05-07 15:47             ` K Prateek Nayak
2026-05-08  7:40               ` Andrea Righi
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
2026-05-09  1:00 ` [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible " Tejun Heo
2026-05-10 15:06   ` Andrea Righi
2026-05-10 19:41     ` Tejun Heo
