From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: void@manifault.com, changwoo@igalia.com, jstultz@google.com,
mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kprateek.nayak@amd.com,
christian.loehle@arm.com, kobak@nvidia.com,
joelagnelf@nvidia.com, emil@etsalapatis.com,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH sched_ext/for-7.2 0/10] sched: Make proxy execution compatible with sched_ext
Date: Sun, 10 May 2026 17:06:41 +0200 [thread overview]
Message-ID: <agCfAY9zqrQiKAJ4@gpd4> (raw)
In-Reply-To: <20260509010059.345908-1-tj@kernel.org>
Hi Tejun,
On Fri, May 08, 2026 at 03:00:59PM -1000, Tejun Heo wrote:
> Hello,
>
> I'm a bit worried this is more invasive than what it buys. Even with
> the full series, the cross-CPU gap Prateek raised stays open -
> find_proxy_task() doesn't go through put_prev_set_next_task(), so owner
> runs without ops.running(owner). Closing that seems to need yet another
> protocol on top, either synthetic running/stopping events or scx core
> taking over dispatch_dequeue for substitutions. The BPF scheduler ends
> up dispatching tasks it didn't pick and observing callbacks for tasks
> it didn't enqueue, which feels too magical and error-prone.
>
> Maybe worth considering an alternative where, when scx is loaded, we
> just turn proxy-exec off entirely and expose blocked_on to the BPF
> scheduler. Schedulers that want PI can implement it themselves on top
> of the relationship; ones that don't pay nothing.
>
> scx_enable could flip the proxy_exec static branch off, after which the
> existing gates in __schedule keep blocked tasks off the runqueue and
> skip find_proxy_task on their own. The remaining concern is in-flight
> donors at the moment of the flip - the existing scx_bypass walk already
> visits every rq's runnable list during enable, and could force-block
> any task it sees with blocked_on set. Mutex unlock would re-wake them
> through wake_q normally after that. blocked_on itself is set and
> cleared in mutex.c regardless of proxy_exec, so the signal we'd want
> to surface is already there.
>
> For the BPF side, the natural shape seems to be tagging the existing
> ops.quiescent and ops.runnable callbacks with a bit indicating "this
> sleep/wake was a mutex transition," plus a small kfunc that returns
> the owner of the mutex p is blocked on. A scheduler that wants PI then
> records the owner in its own task storage on the quiescent side, boosts
> it via the existing vtime / slice / dsq_move / kick primitives, and
> drops the boost when the runnable side fires. No new dispatch protocol,
> the BPF scheduler stays in charge of who runs.
>
> Does that direction seem reasonable, or am I missing something that
> makes it not work?
Thanks for looking at this and laying it out. Let me try to address your
concerns and comment on the alternative approach you're proposing.
On the cross-CPU gap Prateek raised: you're right that find_proxy_task()
substitutes the owner without going through put_prev_set_next_task(), so neither
ops.stopping(donor) nor ops.running(owner) fires for that substitution. But I'd
argue this is less critical than it looks:
1) For the ops.running(owner) side specifically, I don't think skipping it is
   actually a correctness problem. With proxy-exec, the owner is not really
   "the task that is running" in any scheduling sense: what runs is the donor,
   the donor's slice is what gets consumed, and the donor is what BPF
   dispatched. The owner just happens to be the execution context the kernel
   uses to make progress on the critical section, more like a function call
   inside the donor's quantum than a real task switch. If we frame it that
   way, ops.running(donor) + ops.stopping(donor) is the pairing the BPF
   scheduler should observe.
2) The cases where the owner is on a different CPU don't go through the
   substitution path at all: find_proxy_task() either migrates the donor over
   (proxy_migrate_task()) or proxy_force_returns() it. In both cases the
   receiving CPU's __schedule() does pick again, so ops.running() fires
   normally on that CPU for whatever gets picked next. The "ghost owner runs
   without ops.running()" case only happens when the chain resolves locally,
   i.e., when the owner was already on the same rq's runnable list. That
   should narrow the surface considerably.
About dispatching tasks BPF didn't pick / observing callbacks for tasks BPF
didn't enqueue: point 1 above is essentially an answer to that. If we treat the
donor as the running task and the owner substitution as an internal kernel
detail (a "function call" in the donor's context), then BPF only ever sees
callbacks for tasks it actually dispatched.
That said, your alternative proposal is also appealing in that it gets
sched_ext out of the proxy-exec dispatch protocol entirely, which is the part
that is genuinely invasive. But I think there are some gaps to close before
the "BPF rolls its own proxy-exec" model is workable.
Let's say we expose blocked_on (and a kfunc returning the mutex owner) via
tagged ops.quiescent/runnable(). The BPF scheduler now wants to boost the
owner. What's the concrete mechanism to do that? Looking at what we have
right now:
- slice extension: scx_bpf_task_set_slice() works in place, but it only
  affects an owner that is already running.
- dsq_vtime: scx_bpf_task_set_dsq_vtime() updates the value, but for a task
  already enqueued in a PRIQ DSQ the position in the rbtree doesn't change,
  so this doesn't actually boost an already-queued owner.
- DSQ move: scx_bpf_dsq_move() requires an iterator, and the task must have
  been queued before the iteration started. We don't have a kfunc today that
  takes a task pointer and atomically yanks it from wherever it is into a
  higher-priority DSQ, and we have no API exposing which DSQ a task is
  currently sitting in.
- scx_bpf_dsq_insert(SCX_DSQ_LOCAL) + SCX_ENQ_HEAD|SCX_ENQ_PREEMPT: this
  would probably work to run the owner immediately on its CPU, but only if
  we had a way to re-enqueue it first.
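To make the gap concrete, here's roughly what a quiescent-side boost would
look like with only today's primitives. This is just a sketch:
scx_bpf_blocked_owner(), SCX_DEQ_MUTEX and SLICE_BOOST_NS are invented names
standing in for the kfunc/flag you proposed; the slice extension is the only
boost that actually lands:

```c
/*
 * Sketch only: scx_bpf_blocked_owner() and SCX_DEQ_MUTEX are
 * hypothetical names for the proposed kfunc/flag, everything else
 * exists today.
 */
void BPF_STRUCT_OPS(pi_quiescent, struct task_struct *p, u64 deq_flags)
{
	struct task_struct *owner;

	if (!(deq_flags & SCX_DEQ_MUTEX))	/* hypothetical flag */
		return;

	owner = scx_bpf_blocked_owner(p);	/* hypothetical kfunc */
	if (!owner)
		return;

	/*
	 * Only effective if the owner is currently running: extend its
	 * slice so the critical section can finish.
	 */
	scx_bpf_task_set_slice(owner, SLICE_BOOST_NS);

	/*
	 * If the owner is instead sitting in some DSQ, none of the
	 * existing kfuncs can locate it or move it from here.
	 */
}
```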
So, to make the BPF-side proxy-exec model real, I think we'd need at least:
1) A kfunc that returns the DSQ id a task is currently enqueued on (or
   SCX_DSQ_INVALID if the task is running or not queued anywhere), so the
   BPF scheduler can locate the owner.
2) A kfunc that removes a task by pointer from its current DSQ and triggers a
re-enqueue (or inserts the task into another DSQ).
Without these kfuncs a BPF scheduler that wants to support proxy-exec has no
concrete way to actually boost the owner.
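For illustration, the two primitives could look something like this (names
and signatures are invented here, nothing below exists today):

```c
/*
 * Hypothetical kfuncs, names and signatures invented for illustration.
 */

/*
 * Return the DSQ id @p is currently enqueued on, or SCX_DSQ_INVALID if
 * @p is running or not queued anywhere.
 */
__bpf_kfunc u64 scx_bpf_task_dsq_id(struct task_struct *p);

/*
 * Atomically remove @p from whatever DSQ it currently sits in and
 * insert it into @dsq_id with @slice and @enq_flags. Returns -EINVAL
 * if @p is not queued.
 */
__bpf_kfunc s32 scx_bpf_dsq_move_task(struct task_struct *p, u64 dsq_id,
				      u64 slice, u64 enq_flags);
```

With these, a quiescent-side handler could yank a queued owner straight into
SCX_DSQ_LOCAL with SCX_ENQ_PREEMPT.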
If we add those primitives, the alternative seems reasonable: scx disables
proxy-exec, the bypass-style walk you described handles in-flight donors at flip
time, and proxy-exec with sched_ext becomes a BPF-side policy. I'm willing to
experiment in that direction if we think the primitives above are acceptable to
add.
Thanks,
-Andrea