All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: John Stultz <jstultz@google.com>, Tejun Heo <tj@kernel.org>,
	David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Koba Ko <kobak@nvidia.com>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
Date: Thu, 7 May 2026 08:31:12 +0200	[thread overview]
Message-ID: <afwxsJiNJcNQrcHN@gpd4> (raw)
In-Reply-To: <427e64df-2d3c-47a5-925f-ef9a751f1ca3@amd.com>

Hi John, Prateek,

On Thu, May 07, 2026 at 09:04:57AM +0530, K Prateek Nayak wrote:
> Hello John, Andrea,
> 
> (Full disclaimer: I haven't looked at the entire series)
> 
> On 5/7/2026 2:39 AM, John Stultz wrote:
> >> +                       /*
> >> +                        * Tasks pinned to a single CPU (per-CPU kthreads via
> >> +                        * kthread_bind(), tasks under migrate_disable()) cannot
> >> +                        * be moved to @owner_cpu. proxy_migrate_task() uses
> >> +                        * __set_task_cpu() which would silently violate the
> >> +                        * pinning and leave the task to run on a CPU outside
> >> +                        * its cpus_ptr once it is unblocked. Stay on this CPU
> >> +                        * via force_return; the owner running elsewhere will
> >> +                        * wake @p back up when the mutex becomes available.
> >> +                        */
> >> +                       if (p->nr_cpus_allowed == 1 || is_migration_disabled(p))
> >> +                               goto force_return;
> >>                         goto migrate_task;
> > 
> > Hey Andrea!
> >   I'm excited to see this series! Thanks for your efforts here!
> > 
> > Though I'm a bit confused on this patch.  I see the patch changes it
> > so we don't proxy-migrate pinned/migration-disabled patches, but I'm
> > not sure I understand why.
> > 
> > We only proxy-migrate blocked_on tasks, which don't run on the cpu
> > they are migrated to (they are only migrated to be used as a donor).
> > That's why we have the proxy_force_return() function to return-migrate
> > them back when they do become runnable.
> 
> I agree this shouldn't be a problem from core perspective but there
> are some interesting sched-ext interactions possible. More on that
> below:

So, I included this patch, because in a previous version of this series it was
preventing a "SCX_DSQ_LOCAL[_ON] cannot move migration disabled task" error.

However, I tried again this series without this and everything seems to work. I
guess this was fixed by "sched/ext: Avoid migrating blocked tasks with proxy
execution", that was not present in my previous early implementation. So, let's
ignore this for now...

> 
> > 
> > Could you provide some more details about what motivated this change
> > (ie: how you tripped a problem that it resolved?).
> 
> I think ops.enqueue() always assumes that the task being enqueued is
> runnable on the task_cpu() and when the the sched-ext layer tries to
> dispatch this task to local DSQ, the ext core complains and marks
> the sched-ext scheduler as buggy.

Correct that ops.enqueue() assumes that the task being enqueued is runnable on
task_cpu(), but this should still be true even when the donor is migrated:
proxy-exec should only migrate the donor to the owner's CPU when the placement
is allowed.

> 
> With sched-ext, even the lock owner's CPU is slightly complicated
> since the owner might be associated with a CPU but it is in fact on a
> custom DSQ and after moving the donor to owner's CPU, we will need
> sched-ext scheduler to guarantee that the owner runs there else
> there is no point in doing a proxy.

But a donor is always a running task (by definition), so it can't be on a custom
DSQ. Custom DSQs only hold tasks that are in the BPF scheduler's custody,
waiting to be dispatched.

The core keeps the donor logically runnable / on_rq and the ext core always
parks blocked donors on the built-in local DSQ:

put_prev_task_scx():
	...
        if (p->scx.flags & SCX_TASK_QUEUED) {
		set_task_runnable(rq, p);

		if (task_is_blocked(p)) {
			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
			goto switch_class;
		}
	...

> 
> scx flow should look something like (please correct me if I'm
> wrong):
> 
>      CPU0: donor                    CPU1: owner
>      ===========                    ===========
> 
>   /* Donor is retained on rq*/
>   put_prev_task_scx()
>     ops.stopping()
>   ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
>   do_pick_task_scx()
>     next = donor;
>   find_proxy_task()
>     proxy_migrate_task()
>       ops.dequeue()
>         ======================> /*
>                                  * Moves to owner CPU (May be outside of affinity list)
>                                  * ops.enqueue() still happens on CPU0 but I've shown it
>                                  * here to depict the context has moved to owner's CPU.
>                                  */
>                                 ops.enqueue()
>                                   scx_bpf_dsq_insert()
>                                     /*
>                                      * !!! Cannot dispatch to local CPU; Outside affinity !!!
>                                      *
>                                      * We need to allow local dispatch outside affinity iff:
>                                      *
>                                      *   p->is_blocked && cpu == task_cpu(p)
>                                      *
>                                      * Since enqueue_task_scx() hold's the task's rq_lock, the
>                                      * is_blocked indicator should be stable during a dispatch.
>                                      */
>                                 ops.dispatch()
>                                 do_pick_task_scx()
>                                   set_next_task_scx()
>                                     ops.running(donor)
>                                 find_proxy_task()
>                                   next = owner
>                                 /*
>                                  * !!! Owner stats running without any notification. !!!
>                                  * 
>                                  * If owner blocks, dequeue_task_scx() is executed first and
>                                  * the sched-ext scheduler sees:
>                                  *
>                                  *   ops.stopping(owner)
>                                  *
>                                  * which leads to some asymmetry.
>                                  *
>                                  * XXX: Below is how I imagine the flow should continue.
>                                  */
>                                 ops.quiescent(owner) /* Core is taking back control of owner's running */
>                                 /* Runs owner */
>                                 ops.runnable(owner) /* Core is giving back control to ext layer */
>                                 ops.stopping(donor); /* Accounting symmetry for donor */

I think the order of operations should be the following:

ops.runnable(donor)
   -> ops.enqueue(donor)
   -> donor becomes curr
   -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
   -> donor executes
   -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
   -> __schedule()
   -> pick_next -> proxy-exec selects owner as next
   -> put_prev_task_scx(donor)
        -> ops.stopping(donor)
        -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
   -> set_next_task_scx(owner)
        -> ops.running(owner)
   -> donor runs as rq->donor, owner runs as rq->curr /* execution / accounting split */

  Later, when the owner is switched away (another schedule)

  ... owner running ...
   -> __schedule() / switch away from owner
   -> put_prev_task_scx(owner)
        -> ops.stopping(owner) /* if QUEUED && IS_RUNNING */
   -> set_next_task_scx() /* whoever is next */

   Later, mutex is released - donor can run as itself again

   -> mutex released / donor unblocked (!task_is_blocked(donor))
   -> donor selected as next /* becomes rq->curr as donor; not superseded by proxy */
   -> ops.running(donor) /* set_next_task_scx(donor); QUEUED && !task_is_blocked(donor) */
   -> donor executes as rq->curr

> I think dequeue_task_scx() should see task_current_donor() before
> calling ops.stopping() else we get some asymmetry. The donor will
> anyways be placed back via put_prev_task_scx() and since it hasn't run,
> it cannot block itself and there should be no dependency on
> dequeue_task_scx() for donors.

The ops.running/stopping() pair should be always enforced by
SCX_TASK_IS_RUNNING, so we either see a pair of them or none. So in theory,
there shouldn't be any asymmetry.

> 
> With the quiescent() + runnable() scheme, the sched-ext schedulers need
> to be made aware that task can go quiescent() and then back to
> runnable() while being SCX_TASK_QUEUED or the ext core has to spoof a
> full:
> 
>   dequeue(SLEEP) -> quiescent() -> /* Run owner */ -> runnable() -> select_cpu() -> enqueue()
> 
> Also since the mutex owner can block, the sched-ext scheduler needs to
> be aware of the fact that it can get a dequeue() -> quiescent()
> without having stopping() in between if we plan to keep
> symmetry.

We can see ops.dequeue() -> ops.quiescent() without ops.stopping() even without
proxy-exec: if a task becomes runnable and then it's moved to a different sched
class, the BPF scheduler can see ops.runnable/quiescent() without
ops.running/stopping().

As long as ops.runnable/quiescent() and ops.running/stopping() are symmetric I
think we're fine.

> 
> There might be more issues there that I'm missing.
> 

Right, I'm still trying to figure out if there's any scenario that can break
some BPF assumptions (kfunc permissions or similar), but considering that the
BPF context is usually associated to task_struct I can't see any potential
violation/breakage at the moment.

Thanks,
-Andrea

  reply	other threads:[~2026-05-07  6:31 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-06 17:45 [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext Andrea Righi
2026-05-06 17:45 ` [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution Andrea Righi
2026-05-06 21:09   ` John Stultz
2026-05-07  3:34     ` K Prateek Nayak
2026-05-07  6:31       ` Andrea Righi [this message]
2026-05-07  7:45         ` K Prateek Nayak
2026-05-07 10:13           ` Andrea Righi
2026-05-07 15:47             ` K Prateek Nayak
2026-05-08  7:40               ` Andrea Righi
2026-05-06 17:45 ` [PATCH 02/10] sched/core: Skip put_prev_task/set_next_task re-entry for sched_ext donors Andrea Righi
2026-05-06 17:45 ` [PATCH 03/10] sched/ext: Split curr|donor references properly Andrea Righi
2026-05-06 17:45 ` [PATCH 04/10] sched/ext: Avoid migrating blocked tasks with proxy execution Andrea Righi
2026-05-06 17:45 ` [PATCH 05/10] sched_ext: Fix TOCTOU race in consume_remote_task() Andrea Righi
2026-05-06 17:45 ` [PATCH 06/10] sched_ext: Fix ops.running/stopping() pairing for proxy-exec donors Andrea Righi
2026-05-06 17:45 ` [PATCH 07/10] sched_ext: Save/restore kf_tasks[] when task ops nest Andrea Righi
2026-05-06 17:45 ` [PATCH 08/10] sched_ext: Skip ops.runnable() when nested in SCX_CALL_OP_TASK Andrea Righi
2026-05-06 17:45 ` [PATCH 09/10] sched/core: Disable proxy-exec context switch under sched_ext by default Andrea Righi
2026-05-06 17:45 ` [PATCH 10/10] sched: Allow enabling proxy exec with sched_ext Andrea Righi
2026-05-09  1:00 ` [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible " Tejun Heo
2026-05-10 15:06   ` Andrea Righi
2026-05-10 19:41     ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=afwxsJiNJcNQrcHN@gpd4 \
    --to=arighi@nvidia.com \
    --cc=bsegall@google.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=joelagnelf@nvidia.com \
    --cc=jstultz@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kobak@nvidia.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=void@manifault.com \
    --cc=vschneid@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.