From: Andrea Righi <arighi@nvidia.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: John Stultz <jstultz@google.com>, Tejun Heo <tj@kernel.org>,
David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Christian Loehle <christian.loehle@arm.com>,
Koba Ko <kobak@nvidia.com>,
Joel Fernandes <joelagnelf@nvidia.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
Date: Thu, 7 May 2026 12:13:19 +0200
Message-ID: <afxlvypRoa0qCHhs@gpd4>
In-Reply-To: <bfb4c8ca-e155-41b8-ab00-1607d9823051@amd.com>

Hi Prateek,
On Thu, May 07, 2026 at 01:15:22PM +0530, K Prateek Nayak wrote:
> On 5/7/2026 12:01 PM, Andrea Righi wrote:
> > Hi John, Prateek,
> >
> > On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
> >> Hello John, Andrea,
> >>
> >> (Full disclaimer: I haven't looked at the entire series)
> >>
> >> On 5/7/2026 2:39 AM, John Stultz wrote:
> >>>> +	/*
> >>>> +	 * Tasks pinned to a single CPU (per-CPU kthreads via
> >>>> +	 * kthread_bind(), tasks under migrate_disable()) cannot
> >>>> +	 * be moved to @owner_cpu. proxy_migrate_task() uses
> >>>> +	 * __set_task_cpu() which would silently violate the
> >>>> +	 * pinning and leave the task to run on a CPU outside
> >>>> +	 * its cpus_ptr once it is unblocked. Stay on this CPU
> >>>> +	 * via force_return; the owner running elsewhere will
> >>>> +	 * wake @p back up when the mutex becomes available.
> >>>> +	 */
> >>>> +	if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> >>>> +		goto force_return;
> >>>>  	goto migrate_task;
> >>>
> >>> Hey Andrea!
> >>> I'm excited to see this series! Thanks for your efforts here!
> >>>
> >>> Though I'm a bit confused on this patch. I see the patch changes it
> >>> so we don't proxy-migrate pinned/migration-disabled tasks, but I'm
> >>> not sure I understand why.
> >>>
> >>> We only proxy-migrate blocked_on tasks, which don't run on the cpu
> >>> they are migrated to (they are only migrated to be used as a donor).
> >>> That's why we have the proxy_force_return() function to return-migrate
> >>> them back when they do become runnable.
> >>
> >> I agree this shouldn't be a problem from the core's perspective, but there
> >> are some interesting sched-ext interactions possible. More on that
> >> below:
> >
> > So, I included this patch because in a previous version of this series it
> > was preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task"
> > error.
> >
> > However, I tried this series again without it and everything seems to work.
> > I guess this was fixed by "sched/ext: Avoid migrating blocked tasks with
> > proxy execution", which was not present in my earlier implementation. So,
> > let's ignore this for now...
> >
> >>
> >>>
> >>> Could you provide some more details about what motivated this change
> >>> (ie: how you tripped a problem that it resolved?).
> >>
> >> I think ops.enqueue() always assumes that the task being enqueued is
> >> runnable on the task_cpu() and when the sched-ext layer tries to
> >> dispatch this task to local DSQ, the ext core complains and marks
> >> the sched-ext scheduler as buggy.
> >
> > Correct, ops.enqueue() assumes that the task being enqueued is runnable on
> > task_cpu(), but this should still be true even when the donor is migrated:
> > proxy-exec should only migrate the donor to the owner's CPU when the placement
> > is allowed.
>
> Not really - it'll migrate the task to the owner's CPU even if that is
> outside the task's affinity, with the reasoning that the donor will never
> run there - it only exists on the runqueue to donate its time to the lock
> owner.
>
> But if you mean runnable in the sense that it hasn't blocked, then yes, it
> is SCX_TASK_QUEUED + set_task_runnable().
proxy-exec can migrate the donor to the owner's rq/DSQ, but it never actually
executes it there; the donor is only parked there to boost the owner on that
CPU.
Then, when the donor returns home via proxy_force_return(), we trigger
deactivate_task() -> dequeue_task_scx(), which also unlinks the donor from the
local DSQ. So we shouldn't break affinity.
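In call-chain form (a simplified sketch - I'm eliding flags and locking, and
the exact entry point depends on the series, so treat it as illustrative):

	/* donor unblocks while parked on the owner CPU's local DSQ */
	proxy_force_return()
	  deactivate_task(rq, p, DEQUEUE_SLEEP)
	    dequeue_task_scx(rq, p, deq_flags)
	      dispatch_dequeue(rq, p)	/* unlinks p from rq->scx.local_dsq */
	/* p is then woken on a CPU in p->cpus_ptr when the mutex is released */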
> >> With sched-ext, even the lock owner's CPU is slightly complicated,
> >> since the owner might be associated with a CPU while in fact sitting
> >> on a custom DSQ. After moving the donor to the owner's CPU, we need
> >> the sched-ext scheduler to guarantee that the owner runs there, else
> >> there is no point in doing a proxy.
> >
> > But a donor is always a running task (by definition), so it can't be on a custom
> > DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
> > waiting to be dispatched.
>
> I was thinking more from a proxy migration standpoint - when the donor
> is on a different CPU and the owner is on another one, and the core.c
> bits move the donor to the owner's CPU.
Ah, I see what you mean. So you're saying, for example: if the owner is on
CPU1's rq but also sitting in a global DSQ (one that can be consumed by any
CPU), we'd move the donor to CPU1's rq and park it in CPU1's local DSQ, but
then the owner could still be consumed by any CPU, because it's in a global
DSQ.
However, if I'm not missing anything, I think in this case the core scheduler
should select the donor via pick_next_task(), then proxy-exec can replace it
with the owner, removing the owner from the global DSQ and running it.
From set_next_task_scx():
	if (p->scx.flags & SCX_TASK_QUEUED) {
		ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
		dispatch_dequeue(rq, p);
	}
That dispatch_dequeue() removes the owner from whatever DSQ it was on
(global/custom/local), then the owner becomes rq->curr and runs.
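For completeness: set_next_task_scx() is reached via put_prev_set_next_task()
in the pick path. Paraphrasing from memory of kernel/sched/sched.h (details
elided, so double-check the exact shape):

	static inline void put_prev_set_next_task(struct rq *rq,
						  struct task_struct *prev,
						  struct task_struct *next)
	{
		...
		prev->sched_class->put_prev_task(rq, prev, next);
		next->sched_class->set_next_task(rq, next, true);
	}

So once proxy-exec has swapped next to the owner, set_next_task_scx() runs on
the owner, which is where the dispatch_dequeue() above kicks in.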
>
> >
> > The core keeps the donor logically runnable / on_rq and the ext core always
> > parks blocked donors on the built-in local DSQ:
> >
> > put_prev_task_scx():
> > 	...
> > 	if (p->scx.flags & SCX_TASK_QUEUED) {
> > 		set_task_runnable(rq, p);
> >
> > 		if (task_is_blocked(p)) {
> > 			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
> > 			goto switch_class;
> > 		}
> > 	...
>
> Ah! This is what I was missing - but then, this task gets picked and
> is moved by find_proxy_task() in core.c, right?
Yes, that's the intent.
>
> >
> >>
> >> scx flow should look something like (please correct me if I'm
> >> wrong):
> >>
> >> CPU0: donor CPU1: owner
> >> =========== ===========
> >>
> >> /* Donor is retained on rq*/
> >> put_prev_task_scx()
> >> ops.stopping()
> >> ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> >> do_pick_task_scx()
> >> next = donor;
> >> find_proxy_task()
> >> proxy_migrate_task()
> >> ops.dequeue()
> >> ======================> /*
>
> At this point I mean ^
>
> >> * Moves to owner CPU (May be outside of affinity list)
> >> * ops.enqueue() still happens on CPU0 but I've shown it
> >> * here to depict the context has moved to owner's CPU.
> >> */
> >> ops.enqueue()
> >> scx_bpf_dsq_insert()
> >> /*
> >> * !!! Cannot dispatch to local CPU; Outside affinity !!!
> >> *
> >> * We need to allow local dispatch outside affinity iff:
> >> *
> >> * p->is_blocked && cpu == task_cpu(p)
> >> *
> >> * Since enqueue_task_scx() holds the task's rq_lock, the
> >> * is_blocked indicator should be stable during a dispatch.
> >> */
> >> ops.dispatch()
> >> do_pick_task_scx()
> >> set_next_task_scx()
> >> ops.running(donor)
> >> find_proxy_task()
> >> next = owner
> >> /*
> >> * !!! Owner starts running without any notification. !!!
> >> *
> >> * If owner blocks, dequeue_task_scx() is executed first and
> >> * the sched-ext scheduler sees:
> >> *
> >> * ops.stopping(owner)
> >> *
> >> * which leads to some asymmetry.
> >> *
> >> * XXX: Below is how I imagine the flow should continue.
> >> */
> >> ops.quiescent(owner) /* Core is taking back control of owner's running */
> >> /* Runs owner */
> >> ops.runnable(owner) /* Core is giving back control to ext layer */
> >> ops.stopping(donor); /* Accounting symmetry for donor */
> >
> > I think the order of operations should be the following:
> >
> > ops.runnable(donor)
> > -> ops.enqueue(donor)
> > -> donor becomes curr
> > -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
> > -> donor executes
> > -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
> > -> __schedule()
> > -> pick_next -> proxy-exec selects owner as next
> > -> put_prev_task_scx(donor)
> > -> ops.stopping(donor)
> > -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
> > -> set_next_task_scx(owner)
> > -> ops.running(owner)
>
> So ext will just switch the context back to owner? But how does this
> happen with the changes in your series?
>
> Based on my understanding, this happens:
>
> -> pick_next -> sched-ext returns donor as next
> /* prev's context is put back */
> -> set_next_task_scx(donor)
> -> ops.running(donor)
>
> /* In core.c */
>
> /* next = donor */
> if (next->blocked_on) /* true since we have blocked donor */
> next = find_proxy_task(); /* Returns owner */
>
> /* next = owner; */
> /* Starts running owner */
>
> How does ext core swap back the owner context here? Am I missing
> something? find_proxy_task() doesn't call put_prev_set_next_task() so
> I'm at a loss how we get to set_next_task_scx(owner).
The sequence should be the following:
- pick_next_task(rq, rq->donor, &rf) returns donor (because we parked it on the local DSQ)
- in __schedule() (still holding rq->lock), proxy sees next->blocked_on and does:
- next = find_proxy_task(rq, next, &rf); -> returns owner (or triggers migration / retries)
- Only after that, __schedule() reaches the point where it performs the switch
(put_prev_set_next_task(rq, prev, next) via the pick path). At that point,
next is already the owner, so:
- put_prev_task_scx(prev=donor) (or whatever prev is)
- set_next_task_scx(next=owner)
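Or, in (paraphrased, not verbatim) code, the relevant part of __schedule():

	next = pick_next_task(rq, rq->donor, &rf);
	if (unlikely(task_is_blocked(next)))
		/*
		 * Swap the picked donor for the mutex owner, or
		 * trigger a migration / force-return and retry.
		 */
		next = find_proxy_task(rq, next, &rf);
	...
	/* next is the owner by now */
	put_prev_set_next_task(rq, prev, next);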
And parking the blocked donor on rq->scx.local_dsq makes it the obvious
candidate for pick_next_task_scx() on that CPU.
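(That works because the scx pick path just peeks at the head of the local
DSQ - quoting first_local_task() from ext.c from memory, so double-check the
exact shape:)

	static struct task_struct *first_local_task(struct rq *rq)
	{
		return list_first_entry_or_null(&rq->scx.local_dsq.list,
						struct task_struct, scx.dsq_list.node);
	}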
So the donor isn't "moved" by find_proxy_task() in the DSQ sense, rather:
- SCX picks the donor token
- proxy-exec replaces the picked task with the lock owner (or triggers
migration/return paths)
>
> > -> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */
> >
> > Later, when the owner is switched away (another schedule)
> >
> > ... owner running ...
> > -> __schedule() / switch away from owner
> > -> put_prev_task_scx(owner)
> > -> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
> > -> set_next_task_scx() /* whoever is next */
> >
> > Later, mutex is released - donor can run as itself again
> >
> > -> mutex released / donor unblocked (!task_is_blocked(donor))
> > -> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
> > -> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
> > -> donor executes as rq->curr
> >
> >> I think dequeue_task_scx() should check task_current_donor() before
> >> calling ops.stopping(), else we get some asymmetry. The donor will
> >> anyway be placed back via put_prev_task_scx() and, since it hasn't
> >> run, it cannot block itself, so there should be no dependency on
> >> dequeue_task_scx() for donors.
> >
> > The ops.running/stopping() pair should always be enforced by
> > SCX_TASK_IS_RUNNING, so we either see a pair of them or none. So, in
> > theory, there shouldn't be any asymmetry.
> >
> >>
> >> With the quiescent() + runnable() scheme, the sched-ext schedulers need
> >> to be made aware that a task can go quiescent() and then back to
> >> runnable() while being SCX_TASK_QUEUED, or the ext core has to spoof a
> >> full:
> >>
> >> dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
> >>
> >> Also since the mutex owner can block, the sched-ext scheduler needs to
> >> be aware of the fact that it can get a dequeue() -> quiescent()
> >> without having stopping() in between if we plan to keep
> >> symmetry.
> >
> > We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
> > proxy-exec: if a task becomes runnable and then it's moved to a different sched
> > class, the BPF scheduler can see ops.runnable/quiescent() without
> > ops.running/stopping().
>
> Ack!
>
> >
> > As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
> > think we're fine.
>
> I think it is mostly symmetric other than for that one scenario I'm
> confused about above.
Hope it's clearer now, assuming I didn't miss anything or make it even more
confusing. :)
I'm still not fully convinced about the migration-disabled task scenarios, but
so far I can't find any holes.
Thanks,
-Andrea
Thread overview: 21+ messages
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-05-06 21:09 ` John Stultz
2026-05-07 3:34 ` K Prateek Nayak
2026-05-07 6:31 ` Andrea Righi
2026-05-07 7:45 ` K Prateek Nayak
2026-05-07 10:13 ` Andrea Righi [this message]
2026-05-07 15:47 ` K Prateek Nayak
2026-05-08 7:40 ` Andrea Righi
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
2026-05-09 1:00 ` [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible " Tejun Heo
2026-05-10 15:06 ` Andrea Righi
2026-05-10 19:41 ` Tejun Heo