All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Emil Tsalapatis" <emil@etsalapatis.com>
To: "Andrea Righi" <arighi@nvidia.com>, "Tejun Heo" <tj@kernel.org>,
	"David Vernet" <void@manifault.com>,
	"Changwoo Min" <changwoo@igalia.com>
Cc: "Daniel Hodges" <hodgesd@meta.com>, <sched-ext@lists.linux.dev>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Sat, 27 Dec 2025 22:20:06 -0500	[thread overview]
Message-ID: <DF9IXXXOIXP7.LILBQBLDYVM7@etsalapatis.com> (raw)
In-Reply-To: <20251219224450.2537941-2-arighi@nvidia.com>

On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> Properly implement ops.dequeue() to ensure every ops.enqueue() is
> balanced by a corresponding ops.dequeue() call, regardless of whether
> the task is on a BPF data structure or already dispatched to a DSQ.
>
> A task is considered enqueued when it is owned by the BPF scheduler.
> This ownership persists until the task is either dispatched (moved to a
> local DSQ for execution) or removed from the BPF scheduler, such as when
> it blocks waiting for an event or when its properties (for example, CPU
> affinity or priority) are updated.
>
> When the task enters the BPF scheduler ops.enqueue() is invoked, when it
> leaves BPF scheduler ownership, ops.dequeue() is invoked.
>
> This allows BPF schedulers to reliably track task ownership and maintain
> accurate accounting.
>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---


Hi Andrea,

	This change looks reasonable to me. Some comments inline:

>  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
>  include/linux/sched/ext.h             |  1 +
>  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
>  3 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..3ed4be53f97da 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> +   is owned by the BPF scheduler. Ownership is retained until the task is
> +   either dispatched (moved to a local DSQ for execution) or dequeued
> +   (removed from the scheduler due to a blocking event, or to modify a
> +   property, like CPU affinity, priority, etc.). When the task leaves the
> +   BPF scheduler ``ops.dequeue()`` is invoked.
> +

Can we say "leaves the scx class" instead? On direct dispatch we
technically never insert the task into the BPF scheduler.

> +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +   regardless of whether the task is still on a BPF data structure, or it
> +   is already dispatched to a DSQ (global, local, or user DSQ)
> +
> +   This guarantees that every ``ops.enqueue()`` will eventually be followed
> +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> +   track task ownership and maintain accurate accounting, such as per-DSQ
> +   queued runtime sums.
> +
> +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> +   don't need to track these transitions. The sched_ext core will safely
> +   handle all dequeue operations regardless.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +339,8 @@ by a sched_ext scheduler:
>                  /* Any usable CPU becomes available */
>  
>                  ops.dispatch(); /* Task is moved to a local DSQ */
> +
> +                ops.dequeue(); /* Exiting BPF scheduler */
>              }
>              ops.running();      /* Task starts running on its assigned CPU */
>              while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..334c3692a9c62 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* ops.enqueue() was called */

Can we rename this flag? For direct dispatch we never got enqueued.
Something like "DEQ_ON_DISPATCH" would show the purpose of the
flag more clearly.

>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 94164f2dec6dc..985d75d374385 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>  
> +	/* Mark that ops.enqueue() is being called for this task */
> +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +

Can we avoid setting this flag when we have no .dequeue() method?
Otherwise it stays set forever AFAICT, even after the task has been
sent to the runqueues.

>  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>  	WARN_ON_ONCE(*ddsp_taskp);
>  	*ddsp_taskp = p;
> @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not currently being enqueued or queued on the BPF
> +		 * scheduler. Check if ops.enqueue() was called for this task.
> +		 */
> +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> +		    SCX_HAS_OP(sch, dequeue)) {
> +			/*
> +			 * ops.enqueue() was called and the task was dispatched.
> +			 * Call ops.dequeue() to notify the BPF scheduler that
> +			 * the task is leaving.
> +			 */
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +					 p, deq_flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> +		/*
> +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
> +		 * to notify the BPF scheduler that the task is being
> +		 * removed.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {

Edge case, but if we have a .dequeue() method but not an .enqueue() we
still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an 
extra condition to be consistent with the SCX_OPSS_NONE case above?

>  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>  					 p, deq_flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))


  reply	other threads:[~2025-12-28  3:20 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper ops.dequeue() semantics Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis [this message]
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi
2025-12-19 22:43 ` [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() Andrea Righi
2025-12-28  3:28   ` Emil Tsalapatis
2025-12-29 16:11     ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-10 21:26 [PATCHSET v8] " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DF9IXXXOIXP7.LILBQBLDYVM7@etsalapatis.com \
    --to=emil@etsalapatis.com \
    --cc=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=hodgesd@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.