From: Cheng-Yang Chou <yphbchou0911@gmail.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, Andrea Righi <arighi@nvidia.com>,
	David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org,
	Ching-Chun Huang <jserv@ccns.ncku.edu.tw>,
	Chia-Ping Tsai <chia7712@gmail.com>
Subject: Re: [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes
Date: Sat, 2 May 2026 00:19:52 +0800
Message-ID: <20260502000039.Ga94c@cchengyang.duckdns.org>
In-Reply-To: <DI3TFV6PNXZ7.3OR8GY5SBIEZ7@google.com>

Hi Kuba,

On Mon, Apr 27, 2026 at 09:06:21AM +0000, Kuba Piecuch wrote:
> > On Thu, Apr 23, 2026 at 01:32:20PM +0000, Kuba Piecuch wrote:
> >> > On Mon, Mar 23, 2026 at 01:13:20PM -1000, Tejun Heo wrote:
> >> >> > The simple way to do this is to call scx_bpf_dsq_insert() at the very beginning,
> >> >> > once we know which task we would like to dispatch, and cancel the pending
> >> >> > dispatch via scx_bpf_dispatch_cancel() if any of the pre-dispatch checks fail
> >> >> > on the BPF side. This way, the "critical section" includes BPF-side checks, and
> >> >> > SCX will ignore the dispatch if there was a dequeue/enqueue racing with the
> >> >> > critical section.
> >> >> > 
> >> >> > With this solution, we can throw an error if task_can_run_on_remote_rq() is
> >> >> > false, because we know that there was no racing cpumask change (if there was,
> >> >> > it would have been caught earlier, in finish_dispatch()).
> >> >> 
> >> >> Yeah, I think this makes more sense. qseq is already there to provide
> >> >> protection against these events. It's just that the capturing of qseq is too
> >> >> late. If insert/cancel is too ugly, we can introduce another kfunc to
> >> >> capture the qseq - scx_bpf_dsq_insert_begin() or something like that - and
> >> >> stash it in a per-cpu variable. That way, qseq would cover the "current"
> >> >> queued instance and the existing qseq mechanism would be able to reliably
> >> >> ignore the ones that lost the race to dequeue.
> >> >
> >> > Since this has been stale for a while, I prepared a patch to implement
> >> > scx_bpf_dsq_insert_begin() as suggested.
> >> 
> >> Thanks for creating the patch. A couple of thoughts:
> >> 
> >> 1. Do we have a use case that requires dsq_insert_begin() that isn't
> >>    satisfied by the "insert and then cancel if needed" approach?
> >
> > IIUC, yes. scx_bpf_dispatch_cancel() is only registered in 
> > scx_kfunc_ids_dispatch, so it is only callable from ops.dispatch().
> > dsq_insert_begin(), on the other hand, is available from both
> > ops.enqueue() and ops.dispatch() (SCX_KF_ENQUEUE | SCX_KF_DISPATCH).
> > Since there is nothing to cancel in ops.enqueue(), the insert-and-cancel
> > approach simply doesn't work there.
> 
> Wouldn't the natural thing then be to extend scx_bpf_dispatch_cancel() to
> work for direct dispatch? Instead of introducing a whole new mechanism, let's
> extend the one we have with functionality that it (arguably) should have had
> from the beginning.

I see. You're right that dispatch_cancel() could be extended to work in
the enqueue context.
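
To make the comparison concrete, here is roughly what that extension
would allow (a sketch only: it assumes scx_bpf_dispatch_cancel() is
also registered for the enqueue kfunc group, and pick_target_cpu() is
a made-up helper standing in for the scheduler's placement logic):

	void BPF_STRUCT_OPS(sched_enqueue, struct task_struct *p, u64 enq_flags)
	{
		s32 cpu = pick_target_cpu(p);	/* hypothetical helper */

		/* Insert first so the qseq snapshot covers the checks below. */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL,
				   enq_flags);

		/*
		 * The BPF-side checks now sit inside the critical section:
		 * a racing dequeue/enqueue bumps qseq and the core ignores
		 * this insert in finish_dispatch().
		 */
		if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
			scx_bpf_dispatch_cancel();	/* dispatch-only today */
	}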

I'm happy to go either direction, your approach or Tejun's suggestion.
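
For reference, the flow with Tejun's suggestion would look roughly
like this from the BPF side (again only a sketch: the kfunc is the one
proposed in this thread, and pick_next_task()/checks_pass() are
made-up helpers):

	void BPF_STRUCT_OPS(sched_dispatch, s32 cpu, struct task_struct *prev)
	{
		struct task_struct *p = pick_next_task();	/* hypothetical */

		if (!p)
			return;

		/*
		 * Snapshot p's qseq into the per-CPU slot; everything from
		 * here to the insert is covered against racing
		 * dequeues/enqueues.
		 */
		scx_bpf_dsq_insert_begin(p);

		if (!checks_pass(p))	/* hypothetical BPF-side checks */
			return;		/* nothing queued, nothing to cancel */

		/* Consumes the stashed qseq; dropped if it went stale. */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
	}
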
Tejun, Andrea, sched-ext folks, thoughts?

> 
> >
> >> 
> >> 2. Do we want to restrict ourselves to the one qseq slot provided by
> >>    dsq_insert_begin()? The most flexible approach IMO would be to simply
> >>    allow BPF to read the qseq directly via a kfunc and then supply it to
> >>    dsq_insert() later. With this, we can have multiple qseqs saved at the
> >>    same time, and we can even pass them between CPUs, e.g. if one CPU
> >>    dequeues a task for a sibling CPU, but we want the checks to be made inside
> >>    the sibling's ops.dispatch() (I just made this use case up; it may not
> >>    be practical.)
> >>    That said, exposing an internal thing like qseq to BPF may be a step too far.
> >
> > In Tejun's earlier reply [1], he suggested dsq_insert_begin() precisely
> > to avoid promoting qseq into the BPF ABI, which matches your own concern.
> > The single per-CPU slot is sufficient for the one-task-per-iteration
> > dispatch loops used by existing schedulers (e.g., scx_central).
> > If a concrete cross-CPU use case materializes later, we can always extend
> > dsq_insert() to accept an explicit qseq without breaking the current,
> > simpler path.
> >
> > [1]: https://lore.kernel.org/all/acHJED4iAeytdC2l@slm.duckdns.org/
> >
> 
> Well, Tejun doesn't explicitly say there that he's against exposing qseq, but
> I wouldn't be surprised if he is.
> 
> FWIW, ghOSt (our Google-internal BPF scheduling solution) uses exactly this
> approach to guard the dispatch path against racing dequeues/enqueues.
> Every task has a seqnum that gets incremented on each "event" pertaining to
> the task. In the dispatch path, the BPF scheduler reads the task seqnum,
> does whatever checks it needs to do, and passes the seqnum to ghOSt at the end.
> 
> Admittedly, what works downstream doesn't have to work upstream, but I still
> wanted to provide this data point :-)

The ghOSt data point is appreciated. If a concrete use case emerges where
the single-slot approach falls short, extending dsq_insert() to accept an
explicit qseq seems like a natural next step.
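
Concretely, that extension could look something like this (names and
signatures purely illustrative; nothing like this exists today):

	/* Hypothetical kfuncs, sketching the explicit-qseq direction: */
	u64 qseq = scx_bpf_task_qseq(p);	/* snapshot the sequence */

	/* ... BPF-side checks, possibly handed off to another CPU ... */

	scx_bpf_dsq_insert_qseq(p, dsq_id, slice, enq_flags, qseq);
	/* ignored by the core if qseq went stale in the meantime */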

Tejun, Andrea, sched-ext folks, any preferences?

-- 
Cheers,
Cheng-Yang


Thread overview: 23+ messages
2026-03-19  8:35 [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes Andrea Righi
2026-03-19 10:31 ` Kuba Piecuch
2026-03-19 13:54   ` Kuba Piecuch
2026-03-19 21:09   ` Andrea Righi
2026-03-20  9:18     ` Kuba Piecuch
2026-03-23 23:13       ` Tejun Heo
2026-04-22  6:33         ` Cheng-Yang Chou
2026-04-22 11:02           ` Andrea Righi
2026-04-23 13:32           ` Kuba Piecuch
2026-04-26  1:47             ` Cheng-Yang Chou
2026-04-27  9:06               ` Kuba Piecuch
2026-05-01 16:19                 ` Cheng-Yang Chou [this message]
2026-05-04  8:00                   ` Kuba Piecuch
2026-05-04 21:24                     ` Tejun Heo
2026-05-04 21:58                       ` Andrea Righi
2026-05-05  8:35                         ` Cheng-Yang Chou
2026-05-05  8:01                       ` Kuba Piecuch
2026-05-05  8:31                         ` Tejun Heo
2026-05-05  9:13                           ` Kuba Piecuch
2026-05-05 15:14                             ` Tejun Heo
2026-05-05 15:58                           ` Cheng-Yang Chou
2026-03-19 15:18 ` Kuba Piecuch
2026-03-19 19:01   ` Andrea Righi
