public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Tejun Heo <tj@kernel.org>,
	 David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>
Cc: Kuba Piecuch <jpiecuch@google.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	 Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
	 <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Mon, 02 Feb 2026 11:56:43 +0000	[thread overview]
Message-ID: <DG4GH47JCNH5.36CX78QFOYVS6@google.com> (raw)
In-Reply-To: <20260201091318.178710-2-arighi@nvidia.com>

Hi Andrea,

Looks good overall, but we need to settle on the global DSQ semantics, plus
some edge cases that need clearing up.

On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..6d9e82e6ca9d4 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> +     entirely and goes straight to the CPU's local run queue. The task
> +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> +     the task enters the BPF scheduler's custody. When the task later
> +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> +     called exactly once.
> +
> +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> +     custody, ``ops.dequeue()`` will be called when it leaves.
> +
> +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> +   scheduler no longer needs to track it.
> +
> +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> +   get ``ops.dequeue()`` when it leaves. This provides the performance
> +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> +   correct state tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch workflow**: when the task is dispatched from a
> +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> +      ``ops.dequeue()`` is triggered without any special flags.

Maybe add a note that this can happen asynchronously, without the BPF
scheduler explicitly dispatching the task to a local DSQ, when the task
is on a global DSQ? Or maybe make that case into a separate dequeue reason
with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..0d003d2845393 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */

Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
"ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
as a whole.

> @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody
> +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> +		 *
> +		 * If the task has already been dispatched to a local DSQ
> +		 * (left BPF custody), the flag will be clear and we skip
> +		 * ops.dequeue()
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;

I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
tasks being picked from a global DSQ being migrated from a remote rq to the
local rq, which, while technically correct since the task is migrating rqs,
may be confusing, since it fits two cases in the documentation:

* Since the task is leaving BPF custody for execution, ops.dequeue() should be
  called without any special flags.
* Since the task is being migrated between rqs, ops.dequeue() should be called
  with SCX_DEQ_SCHED_CHANGE.

> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*

Thanks,
Kuba


  parent reply	other threads:[~2026-02-02 11:56 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch [this message]
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-02-01  9:08 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DG4GH47JCNH5.36CX78QFOYVS6@google.com \
    --to=jpiecuch@google.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox