The Linux Kernel Mailing List
From: Andrea Righi <arighi@nvidia.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: John Stultz <jstultz@google.com>, Tejun Heo <tj@kernel.org>,
	David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Koba Ko <kobak@nvidia.com>,
	Joel Fernandes <joelagnelf@nvidia.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 01/10] sched/core: Skip migration disabled tasks in proxy execution
Date: Fri, 8 May 2026 09:40:04 +0200
Message-ID: <af2TVI7ZAw41ZO3C@gpd4>
In-Reply-To: <24ffc508-a806-4be0-9b33-fbe8c02d1742@amd.com>

Hi Prateek,

On Thu, May 07, 2026 at 09:17:34PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
> 
> On 5/7/2026 3:43 PM, Andrea Righi wrote:
> >>>> scx flow should look something like (please correct me if I'm
> >>>> wrong):
> >>>>
> >>>>      CPU0: donor                    CPU1: owner
> >>>>      ===========                    ===========
> >>>>
> >>>>   /* Donor is retained on rq*/
> >>>>   put_prev_task_scx()
> >>>>     ops.stopping()
> >>>>   ops.dispatch() /* May be skipped if SCX_OPS_ENQ_LAST is not set */
> >>>>   do_pick_task_scx()
> >>>>     next = donor;
> >>>>   find_proxy_task()
> >>>>     proxy_migrate_task()
> >>>>       ops.dequeue()
> >>>>         ======================> /*
> >>
> >> At this point I mean ^
> >>
> >>>>                                  * Moves to owner CPU (May be outside of affinity list)
> >>>>                                  * ops.enqueue() still happens on CPU0 but I've shown it
> >>>>                                  * here to depict the context has moved to owner's CPU.
> >>>>                                  */
> >>>>                                 ops.enqueue()
> >>>>                                   scx_bpf_dsq_insert()
> >>>>                                     /*
> >>>>                                      * !!! Cannot dispatch to local CPU; Outside affinity !!!
> >>>>                                      *
> >>>>                                      * We need to allow local dispatch outside affinity iff:
> >>>>                                      *
> >>>>                                      *   p->is_blocked && cpu == task_cpu(p)
> >>>>                                      *
> >>>>                                      * Since enqueue_task_scx() holds the task's rq_lock, the
> >>>>                                      * is_blocked indicator should be stable during a dispatch.
> >>>>                                      */
> >>>>                                 ops.dispatch()
> >>>>                                 do_pick_task_scx()
> >>>>                                   set_next_task_scx()
> >>>>                                     ops.running(donor)
> >>>>                                 find_proxy_task()
> >>>>                                   next = owner
> >>>>                                 /*
> >>>>                                  * !!! Owner starts running without any notification. !!!
> >>>>                                  * 
> >>>>                                  * If owner blocks, dequeue_task_scx() is executed first and
> >>>>                                  * the sched-ext scheduler sees:
> >>>>                                  *
> >>>>                                  *   ops.stopping(owner)
> >>>>                                  *
> >>>>                                  * which leads to some asymmetry.
> >>>>                                  *
> >>>>                                  * XXX: Below is how I imagine the flow should continue.
> >>>>                                  */
> >>>>                                 ops.quiescent(owner) /* Core is taking back control of owner's running */
> >>>>                                 /* Runs owner */
> >>>>                                 ops.runnable(owner) /* Core is giving back control to ext layer */
> >>>>                                 ops.stopping(donor); /* Accounting symmetry for donor */
> >>>
> >>> I think the order of operations should be the following:
> >>>
> >>> ops.runnable(donor)
> >>>    -> ops.enqueue(donor)
> >>>    -> donor becomes curr
> >>>    -> ops.running(donor) /* set_next_task_scx(donor); !task_is_blocked(donor) */
> >>>    -> donor executes
> >>>    -> donor blocks on mutex (proxy: stays on_rq; task_is_blocked(donor) true)
> >>>    -> __schedule()
> >>>    -> pick_next -> proxy-exec selects owner as next
> >>>    -> put_prev_task_scx(donor)
> >>>         -> ops.stopping(donor)
> >>>         -> dispatch_enqueue(local_dsq) /* blocked donor: ext core parks on local DSQ */
> >>>    -> set_next_task_scx(owner)
> >>>         -> ops.running(owner)
> >>
> >> So ext will just switch the context back to owner? But how does this
> >> happen with the changes in your series?
> >>
> >> Based on my understanding, this happens:
> >>
> >>     -> pick_next -> sched-ext returns donor as next
> >>     /* prev's context is put back */
> >>     -> set_next_task_scx(donor)
> >>       -> ops.running(donor)
> >>
> >>     /* In core.c */
> >>
> >>     /* next = donor */
> >>     if (next->blocked_on) /* true since we have blocked donor */
> >>        next = find_proxy_task(); /* Returns owner */
> >>
> >>     /* next = owner; */
> >>     /* Starts running owner */
> >>
> >> How does ext core swap back the owner context here? Am I missing
> >> something? find_proxy_task() doesn't call put_prev_set_next_task() so
> >> I'm at a loss how we get to set_next_task_scx(owner).
> > 
> > The sequence should be the following:
> 
> Still a bit confused! Hope you can bear with me for just a little
> bit longer :-)

No, thank you! This is super useful for me! I want to make sure I'm not
missing/misinterpreting anything obvious. :)

> 
> > 
> >  - pick_next_task(rq, rq->donor, &rf) returns donor (because we parked it on the local DSQ)
> 
> So put_prev_set_next_task() happens as a part of pick_next_task().
> 
> When we pick the donor, we have already called set_next_task(donor)
> on it before returning it from pick_next_task().
> 
> "owner" is still not known at this point ...

That seems correct.

> 
> >  - in __schedule() (still holding rq->lock), proxy sees next->blocked_on and does:
> >     - next = find_proxy_task(rq, next, &rf);  -> returns owner (or triggers migration / retries)
> >  - Only after that, __schedule() reaches the point where it performs the switch
> >    (put_prev_set_next_task(rq, prev, next) via the pick path). At that point,
> 
> ... and we don't do put_prev_set_next_task(donor, owner) after
> (or within) find_proxy_task() as far as I'm aware. The "donor"
> remains as the task on which we last called put_prev_task().

Also correct.

> 
> If you are referring to the bits in your Patch2, the calls to
> put_prev_task() and set_next_task() is done on the same "donor"
> task. It is purely for the sake of adding a balance callback if
> we had skipped migrating away the prev task due to proxy.
> 
> AFAIC, nothing does a set_next_task(owner) after
> pick_next_task() in __schedule() unless I'm grossly mistaken.

I think you're right.
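
For reference, here's a simplified sketch of the pick path in __schedule()
with proxy execution, as I understand it from the discussion above (names
and details approximate, for illustration only):

	next = pick_next_task(rq, rq->donor, &rf);
	/* put_prev_set_next_task() has already run on the donor here */
	if (next->blocked_on) {
		/* returns owner, or triggers migration / retries */
		next = find_proxy_task(rq, next, &rf);
	}
	/*
	 * No further set_next_task(owner) here: the donor remains the
	 * task on which we last called set_next_task().
	 */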

Let me try to recap what happens in two different scenarios:

# donor and owner running on the same CPU

Owner runs on CPU0 and expires its p->scx.slice, so it's de-scheduled and added
to a DSQ. Donor is next: it runs and blocks on a mutex on CPU0, so we park the
donor on CPU0's local DSQ. pick_next_task(rq, rq->donor, &rf) on CPU0 returns
next == donor, we see next->blocked_on == true, so we trigger find_proxy_task().
Inside find_proxy_task() we see owner_cpu == task_cpu, so find_proxy_task()
returns owner, replacing next. set_next_task_scx(owner) triggers ops_dequeue() +
dispatch_dequeue(), removing the owner from the DSQ, and then ops.running(owner),
followed eventually by ops.stopping(owner). In this case we don't trigger
ops.stopping(donor) + ops.running(donor) during the proxy switch.
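
In trace form (roughly, as I understand it):

     CPU0
     ====
  /* owner expires p->scx.slice, de-scheduled, queued on a DSQ */
  /* donor runs, blocks on a mutex, parked on CPU0's local DSQ */
  pick_next_task(rq, rq->donor, &rf) -> donor
  /* next->blocked_on == true */
  find_proxy_task() -> owner            /* owner_cpu == task_cpu */
  set_next_task_scx(owner)
    ops_dequeue() + dispatch_dequeue()  /* owner removed from the DSQ */
    ops.running(owner)
  /* no ops.stopping(donor) / ops.running(donor) for the proxy switch */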

# donor and owner running on different CPUs

Owner runs on CPU0 and expires its p->scx.slice, so it's de-scheduled and added
to a DSQ. Donor runs on CPU1 and blocks on a mutex on CPU1, so we park donor on
CPU1's local DSQ. pick_next_task(rq, rq->donor, &rf) on CPU1 returns donor as
next, we see next->blocked_on == true, and we trigger find_proxy_task() on
CPU1. find_proxy_task() sees owner_cpu != this_cpu, so it triggers
proxy_migrate_task() to migrate donor to CPU0, which triggers
deactivate_task(donor), unlinking it from CPU1's local DSQ, then
proxy_set_task_cpu(donor, CPU0). But at this point we're not adding donor to
CPU0's local DSQ. I think this is the missing part: if we add donor to CPU0's
local DSQ at this point, we effectively fall back to the "same CPU" scenario
and (in theory) everything should work.
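
Again in trace form (roughly):

     CPU1: donor                          CPU0: owner
     ===========                          ===========
  /* donor blocks on a mutex,
     parked on CPU1's local DSQ */
  pick_next_task(rq, rq->donor, &rf) -> donor
  /* next->blocked_on == true */
  find_proxy_task()
    /* owner_cpu != this_cpu */
    proxy_migrate_task()
      deactivate_task(donor)    /* unlinked from CPU1's local DSQ */
      proxy_set_task_cpu(donor, CPU0)
                                      /* missing step: park donor on
                                         CPU0's local DSQ, falling back
                                         to the "same CPU" scenario */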

Something like the following (not tested yet - about to).

Thanks,
-Andrea

 kernel/sched/ext.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index af9b10cd82c4a..6125c4cbd6d64 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1915,6 +1915,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
 
+	/*
+	 * Under proxy execution, mutex-blocked donors can be migrated to a
+	 * different rq (e.g., towards the mutex owner's CPU). For sched_ext, rq
+	 * association alone isn't sufficient for the donor to be picked again
+	 * and drive find_proxy_task(); make it immediately visible on the
+	 * destination rq by parking it on the built-in local DSQ.
+	 *
+	 * This task is a scheduling context token and isn't supposed to run as
+	 * itself while blocked.
+	 */
+	if (unlikely(task_is_blocked(p))) {
+		clear_direct_dispatch(p);
+		dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, 0);
+		return;
+	}
+
 	/* internal movements - rq migration / RESTORE */
 	if (sticky_cpu == cpu_of(rq))
 		goto local_norefill;
