All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kuba Piecuch <jpiecuch@google.com>
To: Andrea Righi <arighi@nvidia.com>, Tejun Heo <tj@kernel.org>,
	 David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>
Cc: Kuba Piecuch <jpiecuch@google.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	 Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
	 <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Mon, 02 Feb 2026 11:56:43 +0000	[thread overview]
Message-ID: <DG4GH47JCNH5.36CX78QFOYVS6@google.com> (raw)
In-Reply-To: <20260201091318.178710-2-arighi@nvidia.com>

Hi Andrea,

Looks good overall, but we need to settle on the global DSQ semantics, plus
some edge cases that need clearing up.

On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..6d9e82e6ca9d4 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> +     entirely and goes straight to the CPU's local run queue. The task
> +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> +     the task enters the BPF scheduler's custody. When the task later
> +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> +     called exactly once.
> +
> +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> +     custody, ``ops.dequeue()`` will be called when it leaves.
> +
> +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> +   scheduler no longer needs to track it.
> +
> +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> +   get ``ops.dequeue()`` when it leaves. This provides the performance
> +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> +   correct state tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch workflow**: when the task is dispatched from a
> +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> +      ``ops.dequeue()`` is triggered without any special flags.

Maybe add a note that this can happen asynchronously, without the BPF
scheduler explicitly dispatching the task to a local DSQ, when the task
is on a global DSQ? Or maybe make that case into a separate dequeue reason
with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..0d003d2845393 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */

Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
"ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
as a whole.

> @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody
> +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> +		 *
> +		 * If the task has already been dispatched to a local DSQ
> +		 * (left BPF custody), the flag will be clear and we skip
> +		 * ops.dequeue()
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;

I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
tasks being picked from a global DSQ being migrated from a remote rq to the
local rq, which, while technically correct since the task is migrating rqs,
may be confusing, since it fits two cases in the documentation:

* Since the task is leaving BPF custody for execution, ops.dequeue() should be
  called without any special flags.
* Since the task is being migrated between rqs, ops.dequeue() should be called
  with SCX_DEQ_SCHED_CHANGE.

> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*

Thanks,
Kuba


  parent reply	other threads:[~2026-02-02 11:56 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch [this message]
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-02-01  9:08 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DG4GH47JCNH5.36CX78QFOYVS6@google.com \
    --to=jpiecuch@google.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.