From: Andrea Righi <arighi@nvidia.com>
To: Kuba Piecuch <jpiecuch@google.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Christian Loehle <christian.loehle@arm.com>,
Daniel Hodges <hodgesd@meta.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Wed, 4 Feb 2026 11:11:49 +0100 [thread overview]
Message-ID: <aYMbZcRNR5AUNiUt@gpd4> (raw)
In-Reply-To: <DG4GH47JCNH5.36CX78QFOYVS6@google.com>
On Mon, Feb 02, 2026 at 11:56:43AM +0000, Kuba Piecuch wrote:
> Hi Andrea,
>
> Looks good overall, but we need to settle on the global DSQ semantics, plus
> some edge cases that need clearing up.
On this one I think we settled on the assumption that SCX_DSQ_GLOBAL can be
considered a "terminal DSQ", so we won't trigger ops.dequeue().
>
> On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..6d9e82e6ca9d4 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
> >
> > * Queue the task on the BPF side.
> >
> > + **Task State Tracking and ops.dequeue() Semantics**
> > +
> > + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> > + enter the "BPF scheduler's custody" depending on where it's dispatched:
> > +
> > + * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> > + ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> > + entirely and goes straight to the CPU's local run queue. The task
> > + never enters BPF custody, and ``ops.dequeue()`` will not be called.
> > +
> > + * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> > + the task enters the BPF scheduler's custody. When the task later
> > + leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> > + or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> > + called exactly once.
> > +
> > + * **Queued on BPF side**: The task is in BPF data structures and in BPF
> > + custody, ``ops.dequeue()`` will be called when it leaves.
> > +
> > + The key principle: **ops.dequeue() is called when a task leaves the BPF
> > + scheduler's custody**. A task is in BPF custody if it's on a non-local
> > + DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> > + ops.dequeue() is called, the task is out of BPF custody and the BPF
> > + scheduler no longer needs to track it.
> > +
> > + This works correctly with the ``ops.select_cpu()`` direct dispatch
> > + optimization: even though it skips ``ops.enqueue()`` invocation, if the
> > + task is dispatched to a non-local DSQ, it enters BPF custody and will
> > + get ``ops.dequeue()`` when it leaves. This provides the performance
> > + benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> > + correct state tracking.
> > +
> > + The dequeue can happen for different reasons, distinguished by flags:
> > +
> > + 1. **Regular dispatch workflow**: when the task is dispatched from a
> > + non-local DSQ to a local DSQ (leaving BPF custody for execution),
> > + ``ops.dequeue()`` is triggered without any special flags.
>
> Maybe add a note that this can happen asynchronously, without the BPF
> scheduler explicitly dispatching the task to a local DSQ, when the task
> is on a global DSQ? Or maybe make that case into a separate dequeue reason
> with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?
And I guess we don't need this if we consider SCX_DSQ_GLOBAL as a terminal
DSQ, because we won't trigger ops.dequeue().
>
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..0d003d2845393 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > /* scx_entity.flags */
> > enum scx_ent_flags {
> > SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> > + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* under ext scheduler's custody */
>
> Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
> "ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
> as a whole.
Ack. Will change that.
>
> > @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> > switch (opss & SCX_OPSS_STATE_MASK) {
> > case SCX_OPSS_NONE:
> > + /*
> > + * Task is not in BPF data structures (either dispatched to
> > + * a DSQ or running). Only call ops.dequeue() if the task
> > + * is still in BPF scheduler's custody
> > + * (%SCX_TASK_OPS_ENQUEUED is set).
> > + *
> > + * If the task has already been dispatched to a local DSQ
> > + * (left BPF custody), the flag will be clear and we skip
> > + * ops.dequeue()
> > + *
> > + * If this is a property change (not sleep/core-sched) and
> > + * the task is still in BPF custody, set the
> > + * %SCX_DEQ_SCHED_CHANGE flag.
> > + */
> > + if (SCX_HAS_OP(sch, dequeue) &&
> > + p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> > + u64 flags = deq_flags;
> > +
> > + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > + flags |= SCX_DEQ_SCHED_CHANGE;
>
> I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
> tasks being picked from a global DSQ being migrated from a remote rq to the
> local rq, which, while technically correct since the task is migrating rqs,
> may be confusing, since it fits two cases in the documentation:
>
> * Since the task is leaving BPF custody for execution, ops.dequeue() should be
> called without any special flags.
> * Since the task is being migrated between rqs, ops.dequeue() should be called
> with SCX_DEQ_SCHED_CHANGE.
This also should be fixed with the new logic, because a task disptched to a
global DSQ is considered outside of the BPF scheduler's custody, so
ops.dequeue() is not invoked at all.
I'll post a new patch set later today, so we can better discuss if all
these assumptions have been addressed properly. :)
>
> > +
> > + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > + }
> > break;
> > case SCX_OPSS_QUEUEING:
> > /*
>
> Thanks,
> Kuba
Thanks,
-Andrea
next prev parent reply other threads:[~2026-02-04 10:12 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-01 9:08 [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-01 9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47 ` Christian Loehle
2026-02-02 7:45 ` Andrea Righi
2026-02-02 9:26 ` Andrea Righi
2026-02-02 10:02 ` Christian Loehle
2026-02-02 15:32 ` Andrea Righi
2026-02-02 10:09 ` Christian Loehle
2026-02-02 13:59 ` Kuba Piecuch
2026-02-04 9:36 ` Andrea Righi
2026-02-04 9:51 ` Kuba Piecuch
2026-02-02 11:56 ` Kuba Piecuch
2026-02-04 10:11 ` Andrea Righi [this message]
2026-02-04 10:33 ` Kuba Piecuch
2026-02-01 9:08 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
-- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20 ` Tejun Heo
2026-02-11 16:06 ` Andrea Righi
2026-02-11 19:47 ` Tejun Heo
2026-02-11 22:34 ` Andrea Righi
2026-02-11 22:37 ` Tejun Heo
2026-02-11 22:48 ` Andrea Righi
2026-02-12 10:16 ` Andrea Righi
2026-02-12 14:32 ` Christian Loehle
2026-02-12 15:45 ` Andrea Righi
2026-02-12 17:07 ` Tejun Heo
2026-02-12 18:14 ` Andrea Righi
2026-02-12 18:35 ` Tejun Heo
2026-02-12 22:30 ` Andrea Righi
2026-02-14 10:16 ` Andrea Righi
2026-02-14 17:56 ` Tejun Heo
2026-02-14 19:32 ` Andrea Righi
2026-02-10 23:54 ` Tejun Heo
2026-02-11 16:07 ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35 ` Emil Tsalapatis
2026-02-07 9:26 ` Andrea Righi
2026-02-09 17:28 ` Tejun Heo
2026-02-09 19:06 ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29 ` Kuba Piecuch
2026-02-05 21:32 ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14 ` Tejun Heo
2026-02-05 9:26 ` Andrea Righi
2026-01-26 8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26 8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38 ` Emil Tsalapatis
2026-01-27 16:41 ` Kuba Piecuch
2026-01-30 7:34 ` Andrea Righi
2026-01-30 13:14 ` Kuba Piecuch
2026-01-31 6:54 ` Andrea Righi
2026-01-31 16:45 ` Kuba Piecuch
2026-01-31 17:24 ` Andrea Righi
2026-01-28 21:21 ` Tejun Heo
2026-01-30 11:54 ` Kuba Piecuch
2026-01-31 9:02 ` Andrea Righi
2026-01-31 17:53 ` Kuba Piecuch
2026-01-31 20:26 ` Andrea Righi
2026-02-02 15:19 ` Tejun Heo
2026-02-02 15:30 ` Andrea Righi
2026-02-01 17:43 ` Tejun Heo
2026-02-02 15:52 ` Andrea Righi
2026-02-02 16:23 ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54 ` Christian Loehle
2026-01-21 12:57 ` Andrea Righi
2026-01-22 9:28 ` Kuba Piecuch
2026-01-23 13:32 ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28 3:20 ` Emil Tsalapatis
2025-12-29 16:36 ` Andrea Righi
2025-12-29 18:35 ` Emil Tsalapatis
2025-12-28 17:19 ` Tejun Heo
2025-12-28 23:28 ` Tejun Heo
2025-12-28 23:38 ` Tejun Heo
2025-12-29 17:07 ` Andrea Righi
2025-12-29 18:55 ` Emil Tsalapatis
2025-12-28 23:42 ` Tejun Heo
2025-12-29 17:17 ` Andrea Righi
2025-12-29 0:06 ` Tejun Heo
2025-12-29 18:56 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aYMbZcRNR5AUNiUt@gpd4 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=christian.loehle@arm.com \
--cc=emil@etsalapatis.com \
--cc=hodgesd@meta.com \
--cc=jpiecuch@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.