All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Kuba Piecuch <jpiecuch@google.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() semantics
Date: Mon, 9 Feb 2026 16:43:20 +0100	[thread overview]
Message-ID: <aYoAmES8tKNZJkdx@gpd4> (raw)
In-Reply-To: <DGAIRREHTBTV.E8XTJPHQUAN8@etsalapatis.com>

On Mon, Feb 09, 2026 at 10:00:40AM -0500, Emil Tsalapatis wrote:
> On Mon Feb 9, 2026 at 5:20 AM EST, Andrea Righi wrote:
> > On Sun, Feb 08, 2026 at 09:08:38PM +0100, Andrea Righi wrote:
> >> On Sun, Feb 08, 2026 at 12:59:36PM -0500, Emil Tsalapatis wrote:
> >> > On Sun Feb 8, 2026 at 8:55 AM EST, Andrea Righi wrote:
> >> > > On Sun, Feb 08, 2026 at 11:26:13AM +0100, Andrea Righi wrote:
> >> > >> On Sun, Feb 08, 2026 at 10:02:41AM +0100, Andrea Righi wrote:
> >> > >> ...
> >> > >> > > >> >  - From ops.select_cpu():
> >> > >> > > >> >      - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
> >> > >> > > >> >        the BPF scheduler entirely; they never enter BPF custody, so
> >> > >> > > >> >        ops.dequeue() is not called,
> >> > >> > > >> >      - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
> >> > >> > > >> >        bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
> >> > >> > > >> >        not called,
> >> > >> > > >> >      - scenario 2 (user DSQ): tasks enter BPF scheduler custody with full
> >> > >> > > >> >        enqueue/dequeue lifecycle tracking and state machine validation
> >> > >> > > >> >        (expects 1:1 enqueue/dequeue pairing).
> >> > >> > > >> 
> >> > >> > > >> Could you add a note here about why there's no equivalent to scenario 6?
> >> > >> > > >> The differentiating factor between that and scenario 2 (nonterminal queue) is 
> >> > >> > > >> that scx_dsq_insert_commit() is called regardless of whether the queue is terminal.
> >> > >> > > >> And this makes sense since for non-DSQ queues the BPF scheduler can do its
> >> > >> > > >> own tracking of enqueue/dequeue (plus it does not make too much sense to
> >> > >> > > >> do BPF-internal enqueueing in select_cpu).
> >> > >> > > >> 
> >> > >> > > >> What do you think? If the above makes sense, maybe we should spell it out 
> >> > >> > > >> in the documentation too. Maybe also add it makes no sense to enqueue
> >> > >> > > >> in an internal BPF structure from select_cpu - the task is not yet
> >> > >> > > >> enqueued, and would have to go through enqueue anyway.
> >> > >> > > >
> >> > >> > > > Oh, I just didn't think about it, we can definitely add to ops.select_cpu()
> >> > >> > > > a scenario equivalent to scenario 6 (push task to the BPF queue).
> >> > >> > > >
> >> > >> > > > From a practical standpoint the benefits are questionable, but in the scope
> >> > >> > > > of the kselftest I think it makes sense to better validate the entire state
> >> > >> > > > machine in all cases. I'll add this scenario as well.
> >> > >> > > >
> >> > >> > > 
> >> > >> > > That makes sense! Let's add it for completeness. Even if it doesn't make
> >> > >> > > sense right now that may change in the future. For example, if we end
> >> > >> > > up finding a good reason to add the task into an internal structure from
> >> > >> > > .select_cpu(), we may allow the task to be explicitly marked as being in
> >> > >> > > the BPF scheduler's custody from a kfunc. Right now we can't do that
> >> > >> > > from select_cpu() unless we direct dispatch IIUC.
> >> > >> > 
> >> > >> > Ok, I'll send a new patch later with the new scenario included. It should
> >> > >> > work already (if done properly in the test case), I think we don't need to
> >> > >> > change anything in the kernel.
> >> > >> 
> >> > >> Actually I take that back. The internal BPF queue from ops.select_cpu()
> >> > >> scenario is a bit tricky, because when we return from ops.select_cpu()
> >> > >> without p->scx.ddsp_dsq_id being set, we don't know if the scheduler added
> >> > >> the task to an internal BPF queue or simply did nothing.
> >> > >> 
> >> > >> We need to add some special logic here, preferably without introducing
> >> > >> overhead just to handle this particular (really uncommon) case. I'll take a
> >> > >> look.
> >> > >
> >> > > The more I think about this, the more it feels wrong to consider a task as
> >> > > being "in BPF scheduler custody" if it is stored in a BPF internal data
> >> > > structure from ops.select_cpu().
> >> > >
> >> > > At the point where ops.select_cpu() runs, the task has not yet entered the
> >> > > BPF scheduler's queues. While it is technically possible to stash the task
> >> > > in some BPF-managed structure from there, doing so should not imply full
> >> > > scheduler custody.
> >> > >
> >> > > In particular, we should not trigger ops.dequeue(), because the task has
> >> > > not reached the "enqueue" stage of its lifecycle. ops.select_cpu() is
> >> > > effectively a pre-enqueue hook, primarily intended as a fast path to bypass
> >> > > the scheduler altogether. As such, triggering ops.dequeue() in this case
> >> > > would not make sense IMHO.
> >> > >
> >> > > I think it would make more sense to document this behavior explicitly and
> >> > > leave the kselftest as is.
> >> > >
> >> > > Thoughts?
> >> > 
> >> > I am going back and forth on this but I think the problem is that the enqueue() 
> >> > and dequeue() BPF callbacks we have are not actually symmetrical? 
> >> > 
> >> > 1) ops.enqueue() is "sched-ext specific work for the scheduler core's enqueue
> >> > method". This is independent on whether the task ends up in BPF custody or not.
> >> > It could be in a terminal DSQ, a non-terminal DSQ, or a BPF data structure.
> >> > 
> >> > 2) ops.dequeue() is "remove task from BPF custody". E.g., it is used by the
> >> > BPF scheduler to signal whether it should keep a task within its
> >> > internal tracking structures.
> >> > 
> >> > So the edge case of ops.select_cpu() placing the task in BPF custody is
> >> > currently valid. The way I see it, we have two choices in terms of
> >> > semantics:
> >> > 
> >> > 1) ops.dequeue() must be the equivalent of ops.enqueue(). If the BPF
> >> > scheduler writer decides to place a task into BPF custody during the
> >> > ops.select_cpu() that's on them. ops.select_cpu() is supposed to be a
> >> > pure function providing a hint, anyway. Using it to place a task into
> >> > BPF is a bit of an abuse even if allowed.
> >> > 
> >> > 2) We interpret ops.dequeue() to mean "dequeue from the BPF scheduler".
> >> > In that case we allow the edge case and interpret ops.dequeue() as "the
> >> > function that must be called to clear the NEEDS_DEQ/IN_BPF flag", not as
> >> > the complement of ops.enqueue(). In most cases both will be true, and in
> >> > the cases where not then it's up to the scheduler writer to understand
> >> > the nuance.
> >> > 
> >> > I think while 2) is cleaner, it is more involved and honestly kinda
> >> > speculative. However, I think it's fair game since once we settle on
> >> > the semantics it will be more difficult to change them. Which one do you 
> >> > think makes more sense?
> >> 
> >> Yeah, I'm also going back and forth on this.
> >> 
> >> Honestly from a pure theoretical perspective, option (1) feels cleaner to
> >> me: when ops.select_cpu() runs, the task has not entered the BPF scheduler
> >> yet. If we trigger ops.dequeue() in this case, we end up with tasks that
> >> are "leaving" the scheduler without ever having entered it, which feels
> >> like a violation of the lifecycle model.
> >> 
> >> However, from a practical perspective, it's probably more convenient to
> >> trigger ops.dequeue() also for tasks that are stored in BPF data structures
> >> or user DSQs from ops.select_cpu() as well. If we don't allow that, we
> >> can't just silently ignore the behavior and it's also pretty hard to
> >> reliably detect and trigger an error for this kind of "abuse" at runtime.
> >> That means it could easily turn into a source of subtle bugs in the future,
> >> and I don't think documentation alone would be sufficient to prevent that
> >> (the "don't do that" rules are always fragile).
> >> 
> >> Therefore, at the moment I'm more inclined to go with option (2), as it
> >> provides better robustness and gives schedulers more flexibility.
> >
> > I'm running into a number of headaches and corner cases if we go with
> > option (2)... One of them is the following.
> >
> > Assume we push tasks into a BPF queue from ops.select_cpu() and pop them
> > from ops.dispatch(). The following scenario can happen:
> >
> >   CPU0                                         CPU1
> >   ----                                         ----
> >   ops.select_cpu()
> >     bpf_map_push_elem(&queue, &pid, 0)
> >                                                ops.dispatch()
> > 					         bpf_map_pop_elem(&queue, &pid)
> > 						 scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | dst_cpu)
> > 						   ==> ops.dequeue() is not triggered!
> >     p->scx.flags |= SCX_TASK_IN_BPF
> >
> > To fix this, we would need to always set SCX_TASK_IN_BPF before calling
> > ops.select_cpu(), and then clear it again if the task is directly
> > dispatched to a terminal DSQ from ops.select_cpu().
> >
> > However, doing so introduces further problems. In particular, we may end up
> > triggering spurious ops.dequeue() callbacks, which means we would then need
> > to distinguish whether a task entered BPF custody via ops.select_cpu() or
> > via ops.enqueue(), and handle the two cases differently. Which is also racy
> > and leads to additional locking and complexity.
> >
> > At that point, it starts to feel like we're over-complicating the design to
> > support a scenario that is both uncommon and of questionable practical
> > value.
> >
> > Given that, I'd suggest proceeding incrementally: for now, we go with
> > option (1), which looks doable without major changes and it probably fixes
> > the ops.dequeue() semantics for the majority of use cases (which is already
> > a significant improvement over the current state). Once that is in place,
> > we can revisit the "store tasks in internal BPF data structures from
> > ops.select_cpu()" scenario and see if it's worth supporting it in a cleaner
> > way. WDYT?
> >
> 
> I agree with going with option 1. 
> 
> For the select_cpu() edge case, how about introducing an explicit 
> kfunc scx_place_in_bpf_custody() later? Placing a task in BPF custody 
> during select_cpu() is already pretty niche, so we can assume the 
> scheduler writer knows what they're doing. In that case, let's let 
> _them_ decide when in select_cpu() the task is considered "in BPF". 
> They can also do their own locking to avoid races with locking on 
> the task context. This keeps the state machine clean for the average
> scheduler while still handling the edge case. DYT that would work?

Yeah, I was also considering introducing dedicated kfuncs so that the BPF
scheduler can explicitly manage the "in BPF custody" state, decoupling the
notion of BPF custody from ops.enqueue(). With such interface, a scheduler
could do something like:

ops.select_cpu()
{
        s32 pid = p->pid;

        scx_bpf_enter_custody(p);
        if (!bpf_map_push_elem(&bpf_queue, &pid, 0)) {
                set_task_state(TASK_ENQUEUED);
        } else {
                scx_bpf_exit_custody(p);
                set_task_state(TASK_NONE);
        }

        return prev_cpu;
}

On the implementation side, entering / leaving BPF custody is essentially
setting / clearing SCX_TASK_IN_BPF, with the scheduler taking full
responsibility for ensuring the flag is managed consistently: you set the
flag => ops.dequeue() is called when the task leaves custody, you clear the
flag => fallback to the default custody behavior.

But I think this is something to explore in the future, for now I'd go with
the easier way first. :)

Thanks,
-Andrea

  reply	other threads:[~2026-02-09 15:43 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-06 13:54 [PATCHSET v7] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-06 13:54 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-06 20:10   ` Emil Tsalapatis
2026-02-07  9:16     ` Andrea Righi
2026-02-08  5:11       ` Emil Tsalapatis
2026-02-08  9:02         ` Andrea Righi
2026-02-08 10:26           ` Andrea Righi
2026-02-08 13:55             ` Andrea Righi
2026-02-08 17:59               ` Emil Tsalapatis
2026-02-08 20:08                 ` Andrea Righi
2026-02-09 10:20                   ` Andrea Righi
2026-02-09 15:00                     ` Emil Tsalapatis
2026-02-09 15:43                       ` Andrea Righi [this message]
2026-02-09 17:23                         ` Tejun Heo
2026-02-09 19:17                           ` Andrea Righi
2026-02-09 20:10                             ` Tejun Heo
2026-02-09 22:22                               ` Andrea Righi
2026-02-10  0:42                                 ` Tejun Heo
2026-02-10  7:29                                   ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-12 17:15   ` Christian Loehle
2026-02-12 18:25     ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] sched_ext: Fix " Andrea Righi
2026-02-05 15:32 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-04 16:05 [PATCHSET v5] sched_ext: Fix " Andrea Righi
2026-02-04 16:05 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix " Andrea Righi
2026-02-01  9:08 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] sched_ext: Fix " Andrea Righi
2026-01-26  8:41 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-01-27 16:53   ` Emil Tsalapatis
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] sched_ext: Fix " Andrea Righi
2026-01-21 12:25 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYoAmES8tKNZJkdx@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.