public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
@ 2025-12-19 22:43 ` Andrea Righi
  2025-12-28  3:20   ` Emil Tsalapatis
                     ` (3 more replies)
  0 siblings, 4 replies; 83+ messages in thread
From: Andrea Righi @ 2025-12-19 22:43 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Properly implement ops.dequeue() to ensure every ops.enqueue() is
balanced by a corresponding ops.dequeue() call, regardless of whether
the task is on a BPF data structure or already dispatched to a DSQ.

A task is considered enqueued when it is owned by the BPF scheduler.
This ownership persists until the task is either dispatched (moved to a
local DSQ for execution) or removed from the BPF scheduler, such as when
it blocks waiting for an event or when its properties (for example, CPU
affinity or priority) are updated.

When the task enters the BPF scheduler, ops.enqueue() is invoked; when it
leaves BPF scheduler ownership, ops.dequeue() is invoked.

This allows BPF schedulers to reliably track task ownership and maintain
accurate accounting.

Cc: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
 include/linux/sched/ext.h             |  1 +
 kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..3ed4be53f97da 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
+   is owned by the BPF scheduler. Ownership is retained until the task is
+   either dispatched (moved to a local DSQ for execution) or dequeued
+   (removed from the scheduler due to a blocking event, or to modify a
+   property, like CPU affinity, priority, etc.). When the task leaves the
+   BPF scheduler, ``ops.dequeue()`` is invoked.
+
+   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+   regardless of whether the task is still on a BPF data structure, or it
+   is already dispatched to a DSQ (global, local, or user DSQ).
+
+   This guarantees that every ``ops.enqueue()`` will eventually be followed
+   by an ``ops.dequeue()``. This makes it reliable for BPF schedulers to
+   track task ownership and maintain accurate accounting, such as per-DSQ
+   queued runtime sums.
+
+   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+   don't need to track these transitions. The sched_ext core will safely
+   handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +339,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..334c3692a9c62 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* ops.enqueue() was called */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 94164f2dec6dc..985d75d374385 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
+	/* Mark that ops.enqueue() is being called for this task */
+	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
@@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not currently being enqueued or queued on the BPF
+		 * scheduler. Check if ops.enqueue() was called for this task.
+		 */
+		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
+		    SCX_HAS_OP(sch, dequeue)) {
+			/*
+			 * ops.enqueue() was called and the task was dispatched.
+			 * Call ops.dequeue() to notify the BPF scheduler that
+			 * the task is leaving.
+			 */
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+					 p, deq_flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
+		/*
+		 * Task is owned by the BPF scheduler. Call ops.dequeue()
+		 * to notify the BPF scheduler that the task is being
+		 * removed.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
 			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
 					 p, deq_flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
@ 2025-12-28  3:20   ` Emil Tsalapatis
  2025-12-29 16:36     ` Andrea Righi
  2025-12-28 17:19   ` Tejun Heo
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 83+ messages in thread
From: Emil Tsalapatis @ 2025-12-28  3:20 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Daniel Hodges, sched-ext, linux-kernel

On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> Properly implement ops.dequeue() to ensure every ops.enqueue() is
> balanced by a corresponding ops.dequeue() call, regardless of whether
> the task is on a BPF data structure or already dispatched to a DSQ.
>
> A task is considered enqueued when it is owned by the BPF scheduler.
> This ownership persists until the task is either dispatched (moved to a
> local DSQ for execution) or removed from the BPF scheduler, such as when
> it blocks waiting for an event or when its properties (for example, CPU
> affinity or priority) are updated.
>
> When the task enters the BPF scheduler ops.enqueue() is invoked, when it
> leaves BPF scheduler ownership, ops.dequeue() is invoked.
>
> This allows BPF schedulers to reliably track task ownership and maintain
> accurate accounting.
>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---


Hi Andrea,

	This change looks reasonable to me. Some comments inline:

>  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
>  include/linux/sched/ext.h             |  1 +
>  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
>  3 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..3ed4be53f97da 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> +   is owned by the BPF scheduler. Ownership is retained until the task is
> +   either dispatched (moved to a local DSQ for execution) or dequeued
> +   (removed from the scheduler due to a blocking event, or to modify a
> +   property, like CPU affinity, priority, etc.). When the task leaves the
> +   BPF scheduler ``ops.dequeue()`` is invoked.
> +

Can we say "leaves the scx class" instead? On direct dispatch we
technically never insert the task into the BPF scheduler.

> +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +   regardless of whether the task is still on a BPF data structure, or it
> +   is already dispatched to a DSQ (global, local, or user DSQ)
> +
> +   This guarantees that every ``ops.enqueue()`` will eventually be followed
> +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> +   track task ownership and maintain accurate accounting, such as per-DSQ
> +   queued runtime sums.
> +
> +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> +   don't need to track these transitions. The sched_ext core will safely
> +   handle all dequeue operations regardless.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +339,8 @@ by a sched_ext scheduler:
>                  /* Any usable CPU becomes available */
>  
>                  ops.dispatch(); /* Task is moved to a local DSQ */
> +
> +                ops.dequeue(); /* Exiting BPF scheduler */
>              }
>              ops.running();      /* Task starts running on its assigned CPU */
>              while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..334c3692a9c62 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* ops.enqueue() was called */

Can we rename this flag? For direct dispatch we never got enqueued.
Something like "DEQ_ON_DISPATCH" would show the purpose of the
flag more clearly.

>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 94164f2dec6dc..985d75d374385 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>  
> +	/* Mark that ops.enqueue() is being called for this task */
> +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +

Can we avoid setting this flag when we have no .dequeue() method?
Otherwise it stays set forever AFAICT, even after the task has been
sent to the runqueues.

>  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>  	WARN_ON_ONCE(*ddsp_taskp);
>  	*ddsp_taskp = p;
> @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not currently being enqueued or queued on the BPF
> +		 * scheduler. Check if ops.enqueue() was called for this task.
> +		 */
> +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> +		    SCX_HAS_OP(sch, dequeue)) {
> +			/*
> +			 * ops.enqueue() was called and the task was dispatched.
> +			 * Call ops.dequeue() to notify the BPF scheduler that
> +			 * the task is leaving.
> +			 */
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +					 p, deq_flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> +		/*
> +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
> +		 * to notify the BPF scheduler that the task is being
> +		 * removed.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {

Edge case, but if we have a .dequeue() method but not an .enqueue() we
still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an 
extra condition to be consistent with the SCX_OPSS_NONE case above?

>  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>  					 p, deq_flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
  2025-12-28  3:20   ` Emil Tsalapatis
@ 2025-12-28 17:19   ` Tejun Heo
  2025-12-28 23:28     ` Tejun Heo
  2025-12-28 23:42   ` Tejun Heo
  2025-12-29  0:06   ` Tejun Heo
  3 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-12-28 17:19 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Hello, Andrea.

On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote:
...
> +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> +   is owned by the BPF scheduler. Ownership is retained until the task is
> +   either dispatched (moved to a local DSQ for execution) or dequeued
> +   (removed from the scheduler due to a blocking event, or to modify a
> +   property, like CPU affinity, priority, etc.). When the task leaves the
> +   BPF scheduler ``ops.dequeue()`` is invoked.
> +
> +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +   regardless of whether the task is still on a BPF data structure, or it
> +   is already dispatched to a DSQ (global, local, or user DSQ)
> +
> +   This guarantees that every ``ops.enqueue()`` will eventually be followed
> +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> +   track task ownership and maintain accurate accounting, such as per-DSQ
> +   queued runtime sums.

While this works, from the BPF sched's POV, there's no way to tell whether
an ops.dequeue() call is from the task being actually dequeued or the
follow-up to the dispatch operation it just did. This won't make much
difference if ops.dequeue() is just used for accounting purposes, but, a
scheduler which uses an arena data structure for queueing would likely need
to perform extra tests to tell whether the task needs to be dequeued from
the arena side. I *think* the hot path (ops.dequeue() following the task's
dispatch) can be a simple lockless test, so this may be okay, but from an
API POV, it can probably be better.
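
For illustration only, such a lockless hot-path test might look like the
following userspace sketch; struct task_ctx, the on_arena flag, and all
function names here are hypothetical, not sched_ext or arena API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-task state a BPF scheduler might keep alongside its
 * arena queue. */
struct task_ctx {
	atomic_bool on_arena;	/* set while the task is linked into the arena */
};

/* Enqueue path: link the task into the arena queue (under the BPF-side
 * lock), then publish that fact with a release store. */
static void arena_enqueue(struct task_ctx *tctx)
{
	atomic_store_explicit(&tctx->on_arena, true, memory_order_release);
}

/* Dispatch path: unlink the task and clear the mark. */
static void arena_dispatch(struct task_ctx *tctx)
{
	atomic_store_explicit(&tctx->on_arena, false, memory_order_release);
}

/* ops.dequeue() analogue: the common case (a dequeue notification right
 * after the task was dispatched) is a single lockless load; only a task
 * still sitting on the arena takes the locked removal path. Returns
 * whether an arena-side removal was needed. */
static bool arena_dequeue(struct task_ctx *tctx)
{
	if (!atomic_load_explicit(&tctx->on_arena, memory_order_acquire))
		return false;	/* already dispatched, nothing to do */
	/* ...take the BPF-side lock and unlink the task here... */
	atomic_store_explicit(&tctx->on_arena, false, memory_order_relaxed);
	return true;
}
```

The point being that only genuinely queued tasks pay for the lock; the
post-dispatch notification stays a single load.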

The counter interlocking point is scx_bpf_dsq_insert(). If we can
synchronize scx_bpf_dsq_insert() and dequeue so that ops.dequeue() is not
called for a successfully inserted task, I think the semantics would be
neater - an enqueued task is either dispatched or dequeued. Due to the async
dispatch operation, this likely is difficult to do without adding extra sync
operations in scx_bpf_dsq_insert(). However, I *think* we may be able to get
rid of dspc and async inserting if we call ops.dispatch() w/ rq lock
dropped. That may make the whole dispatch path simpler and the behavior
neater too. What do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-28 17:19   ` Tejun Heo
@ 2025-12-28 23:28     ` Tejun Heo
  2025-12-28 23:38       ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-12-28 23:28 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Hello,

On Sun, Dec 28, 2025 at 07:19:46AM -1000, Tejun Heo wrote:
> While this works, from the BPF sched's POV, there's no way to tell whether
> an ops.dequeue() call is from the task being actually dequeued or the
> follow-up to the dispatch operation it just did. This won't make much
> difference if ops.dequeue() is just used for accounting purposes, but, a
> scheduler which uses an arena data structure for queueing would likely need
> to perform extra tests to tell whether the task needs to be dequeued from
> the arena side. I *think* hot path (ops.dequeue() following the task's
> dispatch) can be a simple lockless test, so this may be okay, but from API
> POV, it can probably be better.
> 
> The counter interlocking point is scx_bpf_dsq_insert(). If we can
> synchronize scx_bpf_dsq_insert() and dequeue so that ops.dequeue() is not
> called for a successfully inserted task, I think the semantics would be
> neater - an enqueued task is either dispatched or dequeued. Due to the async
> dispatch operation, this likely is difficult to do without adding extra sync
> operations in scx_bpf_dsq_insert(). However, I *think* we may be able to get
> rid of dspc and async inserting if we call ops.dispatch() w/ rq lock
> dropped. That may make the whole dispatch path simpler and the behavior
> neater too. What do you think?

I sat down and went through the code to see whether I was actually making
sense, and I wasn't:

The async dispatch buffering is necessary to avoid lock inversion between rq
lock and whatever locks the BPF scheduler might be using internally. This is
necessary because the enqueue path runs with the rq lock held. Thus, any lock
that the BPF sched uses in the enqueue path has to nest inside the rq lock.

In dispatch, scx_bpf_dsq_insert() is likely to be called with the same BPF
sched side lock held. If we try to do rq lock dancing synchronously, we can
end up trying to grab the rq lock while holding the BPF-side lock, leading to
deadlock.

Kernel side has no control over BPF side locking, so the asynchronous
operation is there to side-step the issue. I don't see a good way to make
this synchronous.

So, please ignore that part. That's nonsense. I still wonder whether we can
create some interlocking between scx_bpf_dsq_insert() and ops.dequeue()
without making hot path slower. I'll think more about it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-28 23:28     ` Tejun Heo
@ 2025-12-28 23:38       ` Tejun Heo
  2025-12-29 17:07         ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-12-28 23:38 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Hello again, again.

On Sun, Dec 28, 2025 at 01:28:04PM -1000, Tejun Heo wrote:
...
> So, please ignore that part. That's non-sense. I still wonder whether we can
> create some interlocking between scx_bpf_dsq_insert() and ops.dequeue()
> without making hot path slower. I'll think more about it.

And we can't create an interlocking between scx_bpf_dsq_insert() and
ops.dequeue() without adding extra atomic operations in hot paths. The only
thing shared is the task's rq lock, and the dispatch path can't take that
synchronously. So, yeah, it looks like the best we can do is always letting
the BPF sched know and letting it figure out locking and whether the task
needs to be dequeued from the BPF side.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
  2025-12-28  3:20   ` Emil Tsalapatis
  2025-12-28 17:19   ` Tejun Heo
@ 2025-12-28 23:42   ` Tejun Heo
  2025-12-29 17:17     ` Andrea Righi
  2025-12-29  0:06   ` Tejun Heo
  3 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-12-28 23:42 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Hello,

On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote:
> +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> +   is owned by the BPF scheduler. Ownership is retained until the task is

Can we avoid using "ownership" for this? From user's POV, this is fine but
kernel side internally uses the word for different purposes - e.g. we say
the BPF side owns the task if the task's SCX_OPSS_QUEUED is set (ie. it's on
BPF data structure, not on a DSQ). Here, the ownership encompasses both
kernel-side and BPF-side queueing, so the term becomes rather confusing.
Maybe we can stick with "queued" or "enqueued"?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
                     ` (2 preceding siblings ...)
  2025-12-28 23:42   ` Tejun Heo
@ 2025-12-29  0:06   ` Tejun Heo
  2025-12-29 18:56     ` Andrea Righi
  3 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-12-29  0:06 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Sorry about the million replies. Pretty squirrel brained right now.

On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote:
> @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>  
> +	/* Mark that ops.enqueue() is being called for this task */
> +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;

Is this guaranteed to be cleared after dispatch? ops_dequeue() is called
from dequeue_task_scx() and set_next_task_scx(). It looks like the call from
set_next_task_scx() may end up calling ops.dequeue() when the task starts
running; this seems mostly accidental.

- The BPF sched probably expects an ops.dequeue() call immediately after
  dispatch rather than on the running transition. e.g. imagine a scenario
  where a BPF sched dispatches multiple tasks to a local DSQ. Wouldn't the
  expectation be that ops.dequeue() is called as soon as a task is
  dispatched into a local DSQ?

- If this depends on the ops_dequeue() call from set_next_task_scx(), it'd
  also be using the wrong DEQ flag - SCX_DEQ_CORE_SCHED_EXEC - for regular
  ops.dequeue() following a dispatch. That call site is written that way only
  because ops_dequeue() didn't do anything in the OPSS_NONE case.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-28  3:20   ` Emil Tsalapatis
@ 2025-12-29 16:36     ` Andrea Righi
  2025-12-29 18:35       ` Emil Tsalapatis
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2025-12-29 16:36 UTC (permalink / raw)
  To: Emil Tsalapatis
  Cc: Tejun Heo, David Vernet, Changwoo Min, Daniel Hodges, sched-ext,
	linux-kernel

Hi Emil,

On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote:
> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
> > Properly implement ops.dequeue() to ensure every ops.enqueue() is
> > balanced by a corresponding ops.dequeue() call, regardless of whether
> > the task is on a BPF data structure or already dispatched to a DSQ.
> >
> > A task is considered enqueued when it is owned by the BPF scheduler.
> > This ownership persists until the task is either dispatched (moved to a
> > local DSQ for execution) or removed from the BPF scheduler, such as when
> > it blocks waiting for an event or when its properties (for example, CPU
> > affinity or priority) are updated.
> >
> > When the task enters the BPF scheduler ops.enqueue() is invoked, when it
> > leaves BPF scheduler ownership, ops.dequeue() is invoked.
> >
> > This allows BPF schedulers to reliably track task ownership and maintain
> > accurate accounting.
> >
> > Cc: Emil Tsalapatis <emil@etsalapatis.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> 
> 
> Hi Andrea,
> 
> 	This change looks reasonable to me. Some comments inline:
> 
> >  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
> >  include/linux/sched/ext.h             |  1 +
> >  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
> >  3 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..3ed4be53f97da 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
> >  
> >     * Queue the task on the BPF side.
> >  
> > +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> > +   is owned by the BPF scheduler. Ownership is retained until the task is
> > +   either dispatched (moved to a local DSQ for execution) or dequeued
> > +   (removed from the scheduler due to a blocking event, or to modify a
> > +   property, like CPU affinity, priority, etc.). When the task leaves the
> > +   BPF scheduler ``ops.dequeue()`` is invoked.
> > +
> 
> Can we say "leaves the scx class" instead? On direct dispatch we
> technically never insert the task into the BPF scheduler.

Hm.. I agree that'd be more accurate, but it might also be slightly
misleading, as it could be interpreted as the task being moved to a
different scheduling class. How about saying "leaves the enqueued state"
instead, where enqueued means ops.enqueue() has been called... I can't find a
better name for this state; something like "ops_enqueued" would be even more
confusing. :)

> 
> > +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> > +   regardless of whether the task is still on a BPF data structure, or it
> > +   is already dispatched to a DSQ (global, local, or user DSQ)
> > +
> > +   This guarantees that every ``ops.enqueue()`` will eventually be followed
> > +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
> > +   track task ownership and maintain accurate accounting, such as per-DSQ
> > +   queued runtime sums.
> > +
> > +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> > +   don't need to track these transitions. The sched_ext core will safely
> > +   handle all dequeue operations regardless.
> > +
> >  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> >     empty, it then looks at the global DSQ. If there still isn't a task to
> >     run, ``ops.dispatch()`` is invoked which can use the following two
> > @@ -319,6 +339,8 @@ by a sched_ext scheduler:
> >                  /* Any usable CPU becomes available */
> >  
> >                  ops.dispatch(); /* Task is moved to a local DSQ */
> > +
> > +                ops.dequeue(); /* Exiting BPF scheduler */
> >              }
> >              ops.running();      /* Task starts running on its assigned CPU */
> >              while (task->scx.slice > 0 && task is runnable)
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..334c3692a9c62 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* ops.enqueue() was called */
> 
> Can we rename this flag? For direct dispatch we never got enqueued.
> Something like "DEQ_ON_DISPATCH" would show the purpose of the
> flag more clearly.

Good point. However, ops.dequeue() isn't only called on dispatch; it can
also be triggered when a task property is changed.

So the flag should represent the "enqueued state" in the sense that
ops.enqueue() has been called and a corresponding ops.dequeue() is
expected. This is a lifecycle state, not an indication that the task is in
any queue.

Would a more descriptive comment clarify this? Something like:

  SCX_TASK_OPS_ENQUEUED = 1 << 1, /* Task in enqueued state: ops.enqueue()
                                     called, ops.dequeue() will be called
                                     when task leaves this state. */
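
To make the intended lifecycle concrete, here is a minimal userspace model
of that state (the flag value mirrors the patch, everything else is
simplified and hypothetical):

```c
#include <assert.h>

#define SCX_TASK_OPS_ENQUEUED	(1 << 1)	/* value from the patch */

struct task {
	unsigned int flags;
};

static int nr_enqueue_calls, nr_dequeue_calls;

/* Model of ops.enqueue(): the task enters the "enqueued" lifecycle state. */
static void model_enqueue(struct task *p)
{
	p->flags |= SCX_TASK_OPS_ENQUEUED;
	nr_enqueue_calls++;
}

/* Model of the core's dequeue path: ops.dequeue() fires exactly once per
 * ops.enqueue(), whether the task was dispatched or truly dequeued. */
static void model_dequeue(struct task *p)
{
	if (!(p->flags & SCX_TASK_OPS_ENQUEUED))
		return;	/* never enqueued, or already notified */
	p->flags &= ~SCX_TASK_OPS_ENQUEUED;
	nr_dequeue_calls++;
}
```

Note the flag check also covers your edge case below (a .dequeue() call
without a prior .enqueue() is suppressed).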

> 
> >  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> >  
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 94164f2dec6dc..985d75d374385 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> >  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> >  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >  
> > +	/* Mark that ops.enqueue() is being called for this task */
> > +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +
> 
> Can we avoid setting this flag when we have no .dequeue() method?
> Otherwise it stays set forever AFAICT, even after the task has been
> sent to the runqueues.

Good catch! Definitely we don't need to set this for schedulers that don't
implement ops.dequeue().

> 
> >  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> >  	WARN_ON_ONCE(*ddsp_taskp);
> >  	*ddsp_taskp = p;
> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * Task is not currently being enqueued or queued on the BPF
> > +		 * scheduler. Check if ops.enqueue() was called for this task.
> > +		 */
> > +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
> > +		    SCX_HAS_OP(sch, dequeue)) {
> > +			/*
> > +			 * ops.enqueue() was called and the task was dispatched.
> > +			 * Call ops.dequeue() to notify the BPF scheduler that
> > +			 * the task is leaving.
> > +			 */
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > +					 p, deq_flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  		 */
> >  		BUG();
> >  	case SCX_OPSS_QUEUED:
> > -		if (SCX_HAS_OP(sch, dequeue))
> > +		/*
> > +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
> > +		 * to notify the BPF scheduler that the task is being
> > +		 * removed.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue)) {
> 
> Edge case, but if we have a .dequeue() method but not an .enqueue() we
> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an 
> extra condition to be consistent with the SCX_OPSS_NONE case above?

Also good catch. Will add that.

> 
> >  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> >  					 p, deq_flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  
> >  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> >  					    SCX_OPSS_NONE))
> 

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-28 23:38       ` Tejun Heo
@ 2025-12-29 17:07         ` Andrea Righi
  2025-12-29 18:55           ` Emil Tsalapatis
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2025-12-29 17:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Hi Tejun,

On Sun, Dec 28, 2025 at 01:38:01PM -1000, Tejun Heo wrote:
> Hello again, again.
> 
> On Sun, Dec 28, 2025 at 01:28:04PM -1000, Tejun Heo wrote:
> ...
> > So, please ignore that part. That's non-sense. I still wonder whether we can
> > create some interlocking between scx_bpf_dsq_insert() and ops.dequeue()
> > without making hot path slower. I'll think more about it.
> 
> And we can't create an interlocking between scx_bpf_dsq_insert() and
> ops.dequeue() without adding extra atomic operations in hot paths. The only
> thing shared is task rq lock and dispatch path can't do that synchronously.
> So, yeah, it looks like the best we can do is always letting the BPF sched
> know and let it figure out locking and whether the task needs to be
> dequeued from BPF side.

How about setting a flag in deq_flags to distinguish between a "dispatch"
dequeue vs a real dequeue (due to property changes or other reasons)?

We should be able to pass this information in a reliable way without any
additional synchronization in the hot paths. This would let schedulers that
use arena data structures check the flag instead of doing their own
internal lookups.

And it would also allow us to provide both semantics:
1) Catch real dequeues that need special BPF-side actions (check the flag)
2) Track all ops.enqueue()/ops.dequeue() pairs for accounting purposes
   (ignore the flag)

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-28 23:42   ` Tejun Heo
@ 2025-12-29 17:17     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2025-12-29 17:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

Hi,

On Sun, Dec 28, 2025 at 01:42:28PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote:
> > +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
> > +   is owned by the BPF scheduler. Ownership is retained until the task is
> 
> Can we avoid using "ownership" for this? From user's POV, this is fine but
> kernel side internally uses the word for different purposes - e.g. we say
> the BPF side owns the task if the task's SCX_OPSS_QUEUED is set (ie. it's on
> BPF data structure, not on a DSQ). Here, the ownership encompasses both
> kernel-side and BPF-side queueing, so the term becomes rather confusing.
> Maybe we can stick with "queued" or "enqueued"?

Agreed. I can't find a better term to describe this phase of the lifecycle,
where ops.enqueue() has been called and the task remains in that state
until the corresponding ops.dequeue() occurs (either due to a "dispatch"
dequeue or "real" dequeue).

So maybe we should stick with "enqueued" and clarify exactly what this
state means.

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-29 16:36     ` Andrea Righi
@ 2025-12-29 18:35       ` Emil Tsalapatis
  0 siblings, 0 replies; 83+ messages in thread
From: Emil Tsalapatis @ 2025-12-29 18:35 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, Daniel Hodges, sched-ext,
	linux-kernel

On Mon Dec 29, 2025 at 11:36 AM EST, Andrea Righi wrote:
> Hi Emil,
>
> On Sat, Dec 27, 2025 at 10:20:06PM -0500, Emil Tsalapatis wrote:
>> On Fri Dec 19, 2025 at 5:43 PM EST, Andrea Righi wrote:
>> > Properly implement ops.dequeue() to ensure every ops.enqueue() is
>> > balanced by a corresponding ops.dequeue() call, regardless of whether
>> > the task is on a BPF data structure or already dispatched to a DSQ.
>> >
>> > A task is considered enqueued when it is owned by the BPF scheduler.
>> > This ownership persists until the task is either dispatched (moved to a
>> > local DSQ for execution) or removed from the BPF scheduler, such as when
>> > it blocks waiting for an event or when its properties (for example, CPU
>> > affinity or priority) are updated.
>> >
>> > When the task enters the BPF scheduler ops.enqueue() is invoked, when it
>> > leaves BPF scheduler ownership, ops.dequeue() is invoked.
>> >
>> > This allows BPF schedulers to reliably track task ownership and maintain
>> > accurate accounting.
>> >
>> > Cc: Emil Tsalapatis <emil@etsalapatis.com>
>> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
>> > ---
>> 
>> 
>> Hi Andrea,
>> 
>> 	This change looks reasonable to me. Some comments inline:
>> 
>> >  Documentation/scheduler/sched-ext.rst | 22 ++++++++++++++++++++++
>> >  include/linux/sched/ext.h             |  1 +
>> >  kernel/sched/ext.c                    | 27 ++++++++++++++++++++++++++-
>> >  3 files changed, 49 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
>> > index 404fe6126a769..3ed4be53f97da 100644
>> > --- a/Documentation/scheduler/sched-ext.rst
>> > +++ b/Documentation/scheduler/sched-ext.rst
>> > @@ -252,6 +252,26 @@ The following briefly shows how a waking task is scheduled and executed.
>> >  
>> >     * Queue the task on the BPF side.
>> >  
>> > +   Once ``ops.enqueue()`` is called, the task is considered "enqueued" and
>> > +   is owned by the BPF scheduler. Ownership is retained until the task is
>> > +   either dispatched (moved to a local DSQ for execution) or dequeued
>> > +   (removed from the scheduler due to a blocking event, or to modify a
>> > +   property, like CPU affinity, priority, etc.). When the task leaves the
>> > +   BPF scheduler ``ops.dequeue()`` is invoked.
>> > +
>> 
>> Can we say "leaves the scx class" instead? On direct dispatch we
>> technically never insert the task into the BPF scheduler.
>
> Hm.. I agree that'd be more accurate, but it might also be slightly
> misleading, as it could be interpreted as the task being moved to a
> different scheduling class. How about saying "leaves the enqueued state"
> instead, where enqueued means ops.enqueue() being called... I can't find a
> better name for this state, like "ops_enqueued", but that's be even more
> confusing. :)
>

I like "leaves the enqueued state", it implies that the task has no
state in the scx scheduler.

>> 
>> > +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
>> > +   regardless of whether the task is still on a BPF data structure, or it
>> > +   is already dispatched to a DSQ (global, local, or user DSQ)
>> > +
>> > +   This guarantees that every ``ops.enqueue()`` will eventually be followed
>> > +   by a ``ops.dequeue()``. This makes it reliable for BPF schedulers to
>> > +   track task ownership and maintain accurate accounting, such as per-DSQ
>> > +   queued runtime sums.
>> > +
>> > +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
>> > +   don't need to track these transitions. The sched_ext core will safely
>> > +   handle all dequeue operations regardless.
>> > +
>> >  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>> >     empty, it then looks at the global DSQ. If there still isn't a task to
>> >     run, ``ops.dispatch()`` is invoked which can use the following two
>> > @@ -319,6 +339,8 @@ by a sched_ext scheduler:
>> >                  /* Any usable CPU becomes available */
>> >  
>> >                  ops.dispatch(); /* Task is moved to a local DSQ */
>> > +
>> > +                ops.dequeue(); /* Exiting BPF scheduler */
>> >              }
>> >              ops.running();      /* Task starts running on its assigned CPU */
>> >              while (task->scx.slice > 0 && task is runnable)
>> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
>> > index bcb962d5ee7d8..334c3692a9c62 100644
>> > --- a/include/linux/sched/ext.h
>> > +++ b/include/linux/sched/ext.h
>> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>> >  /* scx_entity.flags */
>> >  enum scx_ent_flags {
>> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
>> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* ops.enqueue() was called */
>> 
>> Can we rename this flag? For direct dispatch we never got enqueued.
>> Something like "DEQ_ON_DISPATCH" would show the purpose of the
>> flag more clearly.
>
> Good point. However, ops.dequeue() isn't only called on dispatch, it can
> also be triggered when a task property is changed.
>
> So the flag should represent the "enqueued state" in the sense that
> ops.enqueue() has been called and a corresponding ops.dequeue() is
> expected. This is a lifecycle state, not an indication that the task is in
> any queue.
>
> Would a more descriptive comment clarify this? Something like:
>
>   SCX_TASK_OPS_ENQUEUED = 1 << 1, /* Task in enqueued state: ops.enqueue()
>                                      called, ops.dequeue() will be called
>                                      when task leaves this state. */
>

That makes sense. My reasoning was that what we actually use the flag
for is not whether the task is enqueued, but rather whether we need to
call the dequeue callback when dequeueing from the SCX_OPSS_NONE state.
Can the comment maybe explain this more concretely?

As an aside, I think this change makes it so we can remove the _OPSS_ state 
machine with some more refactoring. 

>> 
>> >  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>> >  
>> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> > index 94164f2dec6dc..985d75d374385 100644
>> > --- a/kernel/sched/ext.c
>> > +++ b/kernel/sched/ext.c
>> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>> >  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>> >  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>> >  
>> > +	/* Mark that ops.enqueue() is being called for this task */
>> > +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>> > +
>> 
>> Can we avoid setting this flag when we have no .dequeue() method?
>> Otherwise it stays set forever AFAICT, even after the task has been
>> sent to the runqueues.
>
> Good catch! Definitely we don't need to set this for schedulers that don't
> implement ops.dequeue().
>
>> 
>> >  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>> >  	WARN_ON_ONCE(*ddsp_taskp);
>> >  	*ddsp_taskp = p;
>> > @@ -1522,6 +1525,21 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>> >  
>> >  	switch (opss & SCX_OPSS_STATE_MASK) {
>> >  	case SCX_OPSS_NONE:
>> > +		/*
>> > +		 * Task is not currently being enqueued or queued on the BPF
>> > +		 * scheduler. Check if ops.enqueue() was called for this task.
>> > +		 */
>> > +		if ((p->scx.flags & SCX_TASK_OPS_ENQUEUED) &&
>> > +		    SCX_HAS_OP(sch, dequeue)) {
>> > +			/*
>> > +			 * ops.enqueue() was called and the task was dispatched.
>> > +			 * Call ops.dequeue() to notify the BPF scheduler that
>> > +			 * the task is leaving.
>> > +			 */
>> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>> > +					 p, deq_flags);
>> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>> > +		}
>> >  		break;
>> >  	case SCX_OPSS_QUEUEING:
>> >  		/*
>> > @@ -1530,9 +1548,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>> >  		 */
>> >  		BUG();
>> >  	case SCX_OPSS_QUEUED:
>> > -		if (SCX_HAS_OP(sch, dequeue))
>> > +		/*
>> > +		 * Task is owned by the BPF scheduler. Call ops.dequeue()
>> > +		 * to notify the BPF scheduler that the task is being
>> > +		 * removed.
>> > +		 */
>> > +		if (SCX_HAS_OP(sch, dequeue)) {
>> 
>> Edge case, but if we have a .dequeue() method but not an .enqueue() we
>> still make this call. Can we add flags & SCX_TASK_OPS_ENQUEUED as an 
>> extra condition to be consistent with the SCX_OPSS_NONE case above?
>
> Also good catch. Will add that.
>
>> 
>> >  			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
>> >  					 p, deq_flags);
>> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>> > +		}
>> >  
>> >  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>> >  					    SCX_OPSS_NONE))
>> 
>
> Thanks,
> -Andrea



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-29 17:07         ` Andrea Righi
@ 2025-12-29 18:55           ` Emil Tsalapatis
  0 siblings, 0 replies; 83+ messages in thread
From: Emil Tsalapatis @ 2025-12-29 18:55 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo
  Cc: David Vernet, Changwoo Min, Daniel Hodges, sched-ext,
	linux-kernel

On Mon Dec 29, 2025 at 12:07 PM EST, Andrea Righi wrote:
> Hi Tejun,
>
> On Sun, Dec 28, 2025 at 01:38:01PM -1000, Tejun Heo wrote:
>> Hello again, again.
>> 
>> On Sun, Dec 28, 2025 at 01:28:04PM -1000, Tejun Heo wrote:
>> ...
>> > So, please ignore that part. That's non-sense. I still wonder whether we can
>> > create some interlocking between scx_bpf_dsq_insert() and ops.dequeue()
>> > without making hot path slower. I'll think more about it.
>> 
>> And we can't create an interlocking between scx_bpf_dsq_insert() and
>> ops.dequeue() without adding extra atomic operations in hot paths. The only
>> thing shared is task rq lock and dispatch path can't do that synchronously.
>> So, yeah, it looks like the best we can do is always letting the BPF sched
>> know and let it figure out locking and whether the task needs to be
>> dequeued from BPF side.
>
> How about setting a flag in deq_flags to distinguish between a "dispatch"
> dequeue vs a real dequeue (due to property changes or other reasons)?
>
> We should be able to pass this information in a reliable way without any
> additional synchronization in the hot paths. This would let schedulers that
> use arena data structures check the flag instead of doing their own
> internal lookups.
>
> And it would also allow us to provide both semantics:
> 1) Catch real dequeues that need special BPF-side actions (check the flag)
> 2) Track all ops.enqueue()/ops.dequeue() pairs for accounting purposes
>    (ignore the flag)
>

IMO the extra flag suffices for arena-based queueing; the arena data
structures already have to track the state of the task:

Even without the flag it should be possible to infer the state the task
is in from inside the BPF code. For example, calling .dequeue() while
the task is no longer in an arena queue means the task got dequeued _after_
being dispatched, while calling .dequeue() on a still-queued task means we
are removing it because of a true dequeue event (e.g. sched_setaffinity()
was called). The only edge case in the logic is if a true dequeue event
happens between .dispatch() and .dequeue(), but a new flag would take
care of that.


> Thanks,
> -Andrea



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2025-12-29  0:06   ` Tejun Heo
@ 2025-12-29 18:56     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2025-12-29 18:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Emil Tsalapatis, Daniel Hodges,
	sched-ext, linux-kernel

On Sun, Dec 28, 2025 at 02:06:19PM -1000, Tejun Heo wrote:
> Sorry about the million replies. Pretty squirrel brained right now.
> 
> On Fri, Dec 19, 2025 at 11:43:14PM +0100, Andrea Righi wrote:
> > @@ -1390,6 +1390,9 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> >  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> >  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >  
> > +	/* Mark that ops.enqueue() is being called for this task */
> > +	p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> 
> Is this guaranteed to be cleared after dispatch? ops_dequeue() is called
> from dequeue_task_scx() and set_next_task_scx(). It looks like the call from
> set_next_task_scx() may end up calling ops.dequeue() when the task starts
> running, this seems mostly accidental.
> 
> - The BPF sched probably expects ops.dequeue() call immediately after
>   dispatch rather than on the running transition. e.g. imagine a scenario
>   where a BPF sched dispatches multiple tasks to a local DSQ. Wouldn't the
>   expectation be that ops.dequeue() is called as soon as a task is
>   dispatched into a local DSQ?
> 
> - If this depends on the ops_dequeue() call from set_next_task_scx(), it'd
>   also be using the wrong DEQ flag - SCX_DEQ_CORE_SCHED_EXEC - for regular
>   ops.dequeue() following a dispatch. That call there is that way only
>   because ops_dequeue() didn't do anything when OPSS_NONE.

You're right, the flag should be cleared and ops.dequeue() should be called
immediately when the async dispatch completes and the task is inserted into
the DSQ. I'll add an explicit ops.dequeue() call in the dispatch completion
path.

Thanks,
-Andrea


* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
@ 2026-01-21 12:25 ` Andrea Righi
  2026-01-21 12:54   ` Christian Loehle
  2026-01-22  9:28   ` Kuba Piecuch
  0 siblings, 2 replies; 83+ messages in thread
From: Andrea Righi @ 2026-01-21 12:25 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change scenarios. As a result, BPF schedulers
cannot reliably track task state.

In addition, some ops.dequeue() callbacks can be skipped (e.g., during
direct dispatch), so ops.enqueue() calls are not always paired with a
corresponding ops.dequeue(), potentially breaking accounting logic.

Fix this by guaranteeing that every ops.enqueue() is matched with a
corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
distinguish dequeues triggered by scheduling property changes from those
occurring in the normal dispatch workflow.

New semantics:
1. ops.enqueue() is called when a task enters the BPF scheduler
2. ops.dequeue() is called when the task leaves the BPF scheduler,
   because it is dispatched to a DSQ (regular workflow)
3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
   scheduler, because a task property is changed (sched_change)

The SCX_DEQ_ASYNC flag allows BPF schedulers to distinguish between a
regular dispatch workflow and a task property change (e.g.,
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, CPU migrations, etc.).

This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of enqueue/dequeue pairs,
- update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         | 33 ++++++++++
 include/linux/sched/ext.h                     | 11 ++++
 kernel/sched/ext.c                            | 63 ++++++++++++++++++-
 kernel/sched/ext_internal.h                   |  6 ++
 .../sched_ext/include/scx/enum_defs.autogen.h |  2 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |  2 +
 tools/sched_ext/include/scx/enums.autogen.h   |  1 +
 7 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..960125c1439ab 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
+   The task remains in this state until ``ops.dequeue()`` is called, which
+   happens in two cases:
+
+   1. **Regular dispatch workflow**: when the task is successfully
+      dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
+      is triggered immediately to notify the BPF scheduler.
+
+   2. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
+      with the ``SCX_DEQ_ASYNC`` flag set in ``deq_flags``.
+
+   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+   regardless of whether the task is still on a BPF data structure, or it
+   has already been dispatched to a DSQ. This guarantees that every
+   ``ops.enqueue()`` will eventually be followed by a corresponding
+   ``ops.dequeue()``.
+
+   The ``SCX_DEQ_ASYNC`` flag allows BPF schedulers to distinguish between:
+   - normal dispatch workflow (task successfully dispatched to a DSQ),
+   - asynchronous dequeues (``SCX_DEQ_ASYNC``): task property changes that
+     require the scheduler to update its internal state.
+
+   This makes it reliable for BPF schedulers to track the enqueued state
+   and maintain accurate accounting.
+
+   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+   don't need to track these transitions. The sched_ext core will safely
+   handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +350,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..f3094b4a72a56 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,8 +84,19 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	/*
+	 * Set when ops.enqueue() is called; used to determine if ops.dequeue()
+	 * should be invoked when transitioning out of SCX_OPSS_NONE state.
+	 */
+	SCX_TASK_OPS_ENQUEUED	= 1 << 1,
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
+	/*
+	 * Set when ops.dequeue() is called after successful dispatch; used to
+	 * distinguish dispatch dequeues from async dequeues (property changes)
+	 * and to prevent duplicate dequeue calls.
+	 */
+	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
 
 	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
 	SCX_TASK_STATE_BITS	= 2,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 809f774183202..ac13115c463d2 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1289,6 +1289,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 
 	p->scx.ddsp_enq_flags |= enq_flags;
 
+	/*
+	 * The task is about to be dispatched. If ops.enqueue() was called,
+	 * notify the BPF scheduler by calling ops.dequeue().
+	 *
+	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+	 * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that
+	 * the dispatch dequeue has been called to distinguish from
+	 * property change dequeues.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	/*
 	 * We are in the enqueue path with @rq locked and pinned, and thus can't
 	 * double lock a remote rq and enqueue to its local DSQ. For
@@ -1393,6 +1407,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
+	/*
+	 * Mark that ops.enqueue() is being called for this task.
+	 * Clear the dispatch dequeue flag for the new enqueue cycle.
+	 * Only track these flags if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
@@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+			bool is_async_dequeue =
+				!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
+
+			if (is_async_dequeue)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+						 p, deq_flags | SCX_DEQ_ASYNC);
+			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+					  SCX_TASK_DISPATCH_DEQUEUED);
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
+		/*
+		 * Task is in the enqueued state. This is a property change
+		 * dequeue before dispatch completes. Notify the BPF scheduler
+		 * with SCX_DEQ_ASYNC flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
 			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+					 p, deq_flags | SCX_DEQ_ASYNC);
+			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+					  SCX_TASK_DISPATCH_DEQUEUED);
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -2113,6 +2156,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 
 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
 
+	/*
+	 * The task is about to be dispatched. If ops.enqueue() was called,
+	 * notify the BPF scheduler by calling ops.dequeue().
+	 *
+	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+	 * changes can trigger ops.dequeue() with %SCX_DEQ_ASYNC. Mark that
+	 * the dispatch dequeue has been called to distinguish from
+	 * property change dequeues.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		struct rq *task_rq = task_rq(p);
+
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
+		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
 
 	if (dsq->id == SCX_DSQ_LOCAL)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..068c7c2892a16 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,12 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to an asynchronous event (e.g.,
+	 * property change via sched_setaffinity(), priority change, etc.).
+	 */
+	SCX_DEQ_ASYNC		= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..17d8f4324b856 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_ASYNC
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
@@ -48,6 +49,7 @@
 #define HAVE_SCX_TASK_QUEUED
 #define HAVE_SCX_TASK_RESET_RUNNABLE_AT
 #define HAVE_SCX_TASK_DEQD_FOR_SLEEP
+#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
 #define HAVE_SCX_TASK_STATE_SHIFT
 #define HAVE_SCX_TASK_STATE_BITS
 #define HAVE_SCX_TASK_STATE_MASK
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..b3ecd6783d1e5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_ASYNC __weak;
+#define SCX_DEQ_ASYNC __SCX_DEQ_ASYNC
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..89359ab65cd3c 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_ASYNC); \
 } while (0)
-- 
2.52.0



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
@ 2026-01-21 12:54   ` Christian Loehle
  2026-01-21 12:57     ` Andrea Righi
  2026-01-22  9:28   ` Kuba Piecuch
  1 sibling, 1 reply; 83+ messages in thread
From: Christian Loehle @ 2026-01-21 12:54 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On 1/21/26 12:25, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
> 
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
> 
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> distinguish dequeues triggered by scheduling property changes from those
> occurring in the normal dispatch workflow.
> 
> New semantics:
> 1. ops.enqueue() is called when a task enters the BPF scheduler
> 2. ops.dequeue() is called when the task leaves the BPF scheduler,
>    because it is dispatched to a DSQ (regular workflow)
> 3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
>    scheduler, because a task property is changed (sched_change)
> 
> The SCX_DEQ_ASYNC flag allows BPF schedulers to distinguish between a
> regular dispatch workflow and a task property change (e.g.,
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, CPU migrations, etc.).
> 
> This allows BPF schedulers to:
> - reliably track task ownership and lifecycle,
> - maintain accurate accounting of enqueue/dequeue pairs,
> - update internal state when tasks change properties.
> [snip]

Cool, so with this patch I should be able to fix my scx_storm BPF
scheduler doing local inserts, as long as I track the status of all
tasks that are not in a DSQ?
https://github.com/cloehle/scx/commit/25ea91d8f7fea1f31cf426561b432180fb9cf76a
mentioned in
https://github.com/sched-ext/scx/issues/2825

Let me give that a go and report back!


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:54   ` Christian Loehle
@ 2026-01-21 12:57     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-01-21 12:57 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Daniel Hodges, sched-ext, linux-kernel

On Wed, Jan 21, 2026 at 12:54:42PM +0000, Christian Loehle wrote:
...
> > This allows BPF schedulers to:
> > - reliably track task ownership and lifecycle,
> > - maintain accurate accounting of enqueue/dequeue pairs,
> > - update internal state when tasks change properties.
> > [snip]
> 
> Cool, so with this patch I should be able to fix my scx_storm BPF
> scheduler doing local inserts, as long as I track the status of all
> tasks that are not in a DSQ?
> https://github.com/cloehle/scx/commit/25ea91d8f7fea1f31cf426561b432180fb9cf76a
> mentioned in
> https://github.com/sched-ext/scx/issues/2825

In theory, yes...

> 
> Let me give that a go and report back!

Let me know how it goes.

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
  2026-01-21 12:54   ` Christian Loehle
@ 2026-01-22  9:28   ` Kuba Piecuch
  2026-01-23 13:32     ` Andrea Righi
  1 sibling, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-01-22  9:28 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

[Resending with reply-all, messed up the first time, apologies.]

Hi Andrea,

On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> distinguish dequeues triggered by scheduling property changes from those
> occurring in the normal dispatch workflow.
>
> New semantics:
> 1. ops.enqueue() is called when a task enters the BPF scheduler
> 2. ops.dequeue() is called when the task leaves the BPF scheduler,
>    because it is dispatched to a DSQ (regular workflow)
> 3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
>    scheduler, because a task property is changed (sched_change)

What about the case where ops.dequeue() is called due to core-sched picking the
task through sched_core_find()? If I understand core-sched correctly, it can
happen without prior dispatch, so it doesn't fit case 2, and we're not changing
task properties, so it doesn't fit case 3 either.

> +     /*
> +      * Set when ops.dequeue() is called after successful dispatch; used to
> +      * distinguish dispatch dequeues from async dequeues (property changes)
> +      * and to prevent duplicate dequeue calls.
> +      */
> +     SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,

I see this flag being set and cleared in several places, but I don't see it
actually being read, is that intentional?

> @@ -1529,6 +1553,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> 
>       switch (opss & SCX_OPSS_STATE_MASK) {
>       case SCX_OPSS_NONE:
> +             if (SCX_HAS_OP(sch, dequeue) &&
> +                 p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +                     bool is_async_dequeue =
> +                             !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC));
> +
> +                     if (is_async_dequeue)
> +                             SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +                                              p, deq_flags | SCX_DEQ_ASYNC);
> +                     p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +                                       SCX_TASK_DISPATCH_DEQUEUED);
> +             }
>               break;
>       case SCX_OPSS_QUEUEING:
>               /*
> @@ -1537,9 +1572,17 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>                */
>               BUG();
>       case SCX_OPSS_QUEUED:
> -             if (SCX_HAS_OP(sch, dequeue))
> +             /*
> +              * Task is in the enqueued state. This is a property change
> +              * dequeue before dispatch completes. Notify the BPF scheduler
> +              * with SCX_DEQ_ASYNC flag.
> +              */
> +             if (SCX_HAS_OP(sch, dequeue)) {
>                       SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -                                      p, deq_flags);
> +                                      p, deq_flags | SCX_DEQ_ASYNC);
> +                     p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +                                       SCX_TASK_DISPATCH_DEQUEUED);
> +             }
> 

A core-sched pick of a task queued on the BPF scheduler will result in entering
the SCX_OPSS_QUEUED case, which in turn will call ops.dequeue(SCX_DEQ_ASYNC).
This seems to conflict with the is_async_dequeue check above, which treats
SCX_DEQ_CORE_SCHED_EXEC as a synchronous dequeue.

Thanks,
Kuba



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-22  9:28   ` Kuba Piecuch
@ 2026-01-23 13:32     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-01-23 13:32 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Daniel Hodges, sched-ext, linux-kernel

On Thu, Jan 22, 2026 at 09:28:39AM +0000, Kuba Piecuch wrote:
> [Resending with reply-all, messed up the first time, apologies.]

Re-sending my reply as well, just for the record. :)

> 
> Hi Andrea,
> 
> On Wed Jan 21, 2026 at 12:25 PM UTC, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change scenarios. As a result, BPF schedulers
> > cannot reliably track task state.
> >
> > In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> > direct dispatch), so ops.enqueue() calls are not always paired with a
> > corresponding ops.dequeue(), potentially breaking accounting logic.
> >
> > Fix this by guaranteeing that every ops.enqueue() is matched with a
> > corresponding ops.dequeue(), and introduce the SCX_DEQ_ASYNC flag to
> > distinguish dequeues triggered by scheduling property changes from those
> > occurring in the normal dispatch workflow.
> >
> > New semantics:
> > 1. ops.enqueue() is called when a task enters the BPF scheduler
> > 2. ops.dequeue() is called when the task leaves the BPF scheduler,
> >    because it is dispatched to a DSQ (regular workflow)
> > 3. ops.dequeue(SCX_DEQ_ASYNC) is called when the task leaves the BPF
> >    scheduler, because a task property is changed (sched_change)
> 
> What about the case where ops.dequeue() is called due to core-sched picking the
> task through sched_core_find()? If I understand core-sched correctly, it can
> happen without prior dispatch, so it doesn't fit case 2, and we're not changing
> task properties, so it doesn't fit case 3 either.

You're absolutely right: core-sched picks are handled inconsistently.
They're treated as property-change dequeues in the SCX_OPSS_QUEUED case and
as dispatch dequeues in SCX_OPSS_NONE.

Core-sched picks should be treated consistently as regular dequeues, since
they're not property changes. I'll fix this in the next version (adding an
SCX_DEQ_CORE_SCHED_EXEC check in the SCX_OPSS_QUEUED case should make the
core-sched handling consistent).

> 
> > +     /*
> > +      * Set when ops.dequeue() is called after successful dispatch; used to
> > +      * distinguish dispatch dequeues from async dequeues (property changes)
> > +      * and to prevent duplicate dequeue calls.
> > +      */
> > +     SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
> 
> I see this flag being set and cleared in several places, but I don't see it
> actually being read, is that intentional?

And you're right here as well. At some point this was used to distinguish
dispatch dequeues from async dequeues, but it isn't actually read anymore.
I'll clean this up in the next version.

Thanks,
-Andrea


* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
@ 2026-01-26  8:41 ` Andrea Righi
  2026-01-27 16:38   ` Emil Tsalapatis
                     ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Andrea Righi @ 2026-01-26  8:41 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext,
	linux-kernel, Emil Tsalapatis

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change scenarios. As a result, BPF schedulers
cannot reliably track task state.

In addition, some ops.dequeue() callbacks can be skipped (e.g., during
direct dispatch), so ops.enqueue() calls are not always paired with a
corresponding ops.dequeue(), potentially breaking accounting logic.

Fix this by guaranteeing that every ops.enqueue() is matched with a
corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE
flag to distinguish dequeues triggered by scheduling property changes
from those occurring in the normal dispatch/execution workflow.

New semantics:
1. ops.enqueue() is called when a task enters the BPF scheduler
2. ops.dequeue() is called when the task leaves the BPF scheduler in
   the following cases:
   a) regular dispatch workflow: task dispatched to a DSQ,
   b) core scheduling pick: core-sched picks task before dispatch,
   c) property change: task properties modified.

A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF
schedulers to distinguish between:
- normal dispatch/execution workflow (dispatch, core-sched pick),
- property changes that require state updates (e.g.,
  sched_setaffinity(), sched_setscheduler(), set_user_nice(),
  NUMA balancing, CPU migrations, etc.).

With this, BPF schedulers can:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of enqueue/dequeue pairs,
- distinguish between execution events and property changes,
- update internal state appropriately for each dequeue type.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         | 33 +++++++
 include/linux/sched/ext.h                     | 11 +++
 kernel/sched/ext.c                            | 89 ++++++++++++++++++-
 kernel/sched/ext_internal.h                   |  7 ++
 .../sched_ext/include/scx/enum_defs.autogen.h |  2 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |  2 +
 tools/sched_ext/include/scx/enums.autogen.h   |  1 +
 7 files changed, 142 insertions(+), 3 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..ed6bf7d9e6e8c 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
+   The task remains in this state until ``ops.dequeue()`` is called, which
+   happens in the following cases:
+
+   1. **Regular dispatch workflow**: when the task is successfully
+      dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
+      is triggered immediately to notify the BPF scheduler.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution before it has been
+      dispatched, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
+      with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
+   regardless of whether the task is still on a BPF data structure, or it
+   has already been dispatched to a DSQ. This guarantees that every
+   ``ops.enqueue()`` will eventually be followed by a corresponding
+   ``ops.dequeue()``.
+
+   This makes it reliable for BPF schedulers to track the enqueued state
+   and maintain accurate accounting.
+
+   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+   don't need to track these transitions. The sched_ext core will safely
+   handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +350,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..59446cd0373fa 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,8 +84,19 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	/*
+	 * Set when ops.enqueue() is called; used to determine if ops.dequeue()
+	 * should be invoked when transitioning out of SCX_OPSS_NONE state.
+	 */
+	SCX_TASK_OPS_ENQUEUED	= 1 << 1,
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
+	/*
+	 * Set when ops.dequeue() is called after successful dispatch; used to
+	 * distinguish dispatch dequeues from property change dequeues and
+	 * prevent duplicate dequeue calls.
+	 */
+	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
 
 	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
 	SCX_TASK_STATE_BITS	= 2,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..18bca2b83f5c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 
 	p->scx.ddsp_enq_flags |= enq_flags;
 
+	/*
+	 * The task is about to be dispatched. If ops.enqueue() was called,
+	 * notify the BPF scheduler by calling ops.dequeue().
+	 *
+	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
+	 * Mark that the dispatch dequeue has been called to distinguish
+	 * from property change dequeues.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	/*
 	 * We are in the enqueue path with @rq locked and pinned, and thus can't
 	 * double lock a remote rq and enqueue to its local DSQ. For
@@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
+	/*
+	 * Mark that ops.enqueue() is being called for this task.
+	 * Clear the dispatch dequeue flag for the new enqueue cycle.
+	 * Only track these flags if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
@@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+			/*
+			 * Task was already dispatched. Only call ops.dequeue()
+			 * if it hasn't been called yet (check DISPATCH_DEQUEUED).
+			 * This can happen when:
+			 * 1. Core-sched picks a task that was dispatched
+			 * 2. Property changes occur after dispatch
+			 */
+			if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
+				/*
+				 * ops.dequeue() wasn't called during dispatch.
+				 * This shouldn't normally happen, but call it now.
+				 */
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+						 p, deq_flags);
+			} else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
+				/*
+				 * This is a property change after
+				 * dispatch. Call ops.dequeue() again with
+				 * %SCX_DEQ_SCHED_CHANGE.
+				 */
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
+						 p, deq_flags | SCX_DEQ_SCHED_CHANGE);
+			}
+			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+					  SCX_TASK_DISPATCH_DEQUEUED);
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+		 * only for property changes, not for core-sched picks.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
+			u64 flags = deq_flags;
+			/*
+			 * Add %SCX_DEQ_SCHED_CHANGE for property changes,
+			 * but not for core-sched picks or sleep.
+			 */
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
+					  SCX_TASK_DISPATCH_DEQUEUED);
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 
 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
 
+	/*
+	 * The task is about to be dispatched. If ops.enqueue() was called,
+	 * notify the BPF scheduler by calling ops.dequeue().
+	 *
+	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
+	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
+	 * Mark that the dispatch dequeue has been called to distinguish
+	 * from property change dequeues.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		struct rq *task_rq = task_rq(p);
+
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
+		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
+	}
+
 	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
 
 	if (dsq->id == SCX_DSQ_LOCAL)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..8284f717ff05e 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
@@ -48,6 +49,7 @@
 #define HAVE_SCX_TASK_QUEUED
 #define HAVE_SCX_TASK_RESET_RUNNABLE_AT
 #define HAVE_SCX_TASK_DEQD_FOR_SLEEP
+#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
 #define HAVE_SCX_TASK_STATE_SHIFT
 #define HAVE_SCX_TASK_STATE_BITS
 #define HAVE_SCX_TASK_STATE_MASK
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.52.0



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
@ 2026-01-27 16:38   ` Emil Tsalapatis
  2026-01-27 16:41   ` Kuba Piecuch
  2026-01-28 21:21   ` Tejun Heo
  2 siblings, 0 replies; 83+ messages in thread
From: Emil Tsalapatis @ 2026-01-27 16:38 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext,
	linux-kernel

On Mon Jan 26, 2026 at 3:41 AM EST, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE
> flag to distinguish dequeues triggered by scheduling property changes
> from those occurring in the normal dispatch/execution workflow.
>
> New semantics:
> 1. ops.enqueue() is called when a task enters the BPF scheduler
> 2. ops.dequeue() is called when the task leaves the BPF scheduler in
>    the following cases:
>    a) regular dispatch workflow: task dispatched to a DSQ,
>    b) core scheduling pick: core-sched picks task before dispatch,
>    c) property change: task properties modified.
>
> A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF
> schedulers to distinguish between:
> - normal dispatch/execution workflow (dispatch, core-sched pick),
> - property changes that require state updates (e.g.,
>   sched_setaffinity(), sched_setscheduler(), set_user_nice(),
>   NUMA balancing, CPU migrations, etc.).
>
> With this, BPF schedulers can:
> - reliably track task ownership and lifecycle,
> - maintain accurate accounting of enqueue/dequeue pairs,
> - distinguish between execution events and property changes,
> - update internal state appropriately for each dequeue type.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Cc: Kuba Piecuch <jpiecuch@google.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>

Looks great overall. Following up on our off-list chat about whether
SCX_TASK_DISPATCH_DEQUEUED is necessary: we need it for the new
SCX_DEQ_SCHED_CHANGE flag, so no need to consider removing it, imo.

> ---
>  Documentation/scheduler/sched-ext.rst         | 33 +++++++
>  include/linux/sched/ext.h                     | 11 +++
>  kernel/sched/ext.c                            | 89 ++++++++++++++++++-
>  kernel/sched/ext_internal.h                   |  7 ++
>  .../sched_ext/include/scx/enum_defs.autogen.h |  2 +
>  .../sched_ext/include/scx/enums.autogen.bpf.h |  2 +
>  tools/sched_ext/include/scx/enums.autogen.h   |  1 +
>  7 files changed, 142 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..ed6bf7d9e6e8c 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
> +   The task remains in this state until ``ops.dequeue()`` is called, which
> +   happens in the following cases:
> +
> +   1. **Regular dispatch workflow**: when the task is successfully
> +      dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
> +      is triggered immediately to notify the BPF scheduler.
> +
> +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> +      core scheduling picks a task for execution before it has been
> +      dispatched, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> +   3. **Scheduling property change**: when a task property changes (via
> +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> +      priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
> +      with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +   regardless of whether the task is still on a BPF data structure, or it
> +   has already been dispatched to a DSQ. This guarantees that every
> +   ``ops.enqueue()`` will eventually be followed by a corresponding
> +   ``ops.dequeue()``.
> +
> +   This makes it reliable for BPF schedulers to track the enqueued state
> +   and maintain accurate accounting.
> +
> +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> +   don't need to track these transitions. The sched_ext core will safely
> +   handle all dequeue operations regardless.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +350,8 @@ by a sched_ext scheduler:
>                  /* Any usable CPU becomes available */
>  
>                  ops.dispatch(); /* Task is moved to a local DSQ */
> +
> +                ops.dequeue(); /* Exiting BPF scheduler */
>              }
>              ops.running();      /* Task starts running on its assigned CPU */
>              while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..59446cd0373fa 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,8 +84,19 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	/*
> +	 * Set when ops.enqueue() is called; used to determine if ops.dequeue()
> +	 * should be invoked when transitioning out of SCX_OPSS_NONE state.
> +	 */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1,
>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> +	/*
> +	 * Set when ops.dequeue() is called after successful dispatch; used to
> +	 * distinguish dispatch dequeues from property change dequeues and
> +	 * prevent duplicate dequeue calls.
> +	 */
> +	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
>  
>  	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
>  	SCX_TASK_STATE_BITS	= 2,
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index afe28c04d5aa7..18bca2b83f5c5 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>  
>  	p->scx.ddsp_enq_flags |= enq_flags;
>  
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}
> +
>  	/*
>  	 * We are in the enqueue path with @rq locked and pinned, and thus can't
>  	 * double lock a remote rq and enqueue to its local DSQ. For
> @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>  
> +	/*
> +	 * Mark that ops.enqueue() is being called for this task.
> +	 * Clear the dispatch dequeue flag for the new enqueue cycle.
> +	 * Only track these flags if ops.dequeue() is implemented.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
> +	}
> +
>  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>  	WARN_ON_ONCE(*ddsp_taskp);
>  	*ddsp_taskp = p;
> @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			/*
> +			 * Task was already dispatched. Only call ops.dequeue()
> +			 * if it hasn't been called yet (check DISPATCH_DEQUEUED).
> +			 * This can happen when:
> +			 * 1. Core-sched picks a task that was dispatched
> +			 * 2. Property changes occur after dispatch
> +			 */
> +			if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
> +				/*
> +				 * ops.dequeue() wasn't called during dispatch.
> +				 * This shouldn't normally happen, but call it now.
> +				 */
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +						 p, deq_flags);
> +			} else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
> +				/*
> +				 * This is a property change after
> +				 * dispatch. Call ops.dequeue() again with
> +				 * %SCX_DEQ_SCHED_CHANGE.
> +				 */
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +						 p, deq_flags | SCX_DEQ_SCHED_CHANGE);
> +			}
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> +		 * only for property changes, not for core-sched picks.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
> +			u64 flags = deq_flags;
> +			/*
> +			 * Add %SCX_DEQ_SCHED_CHANGE for property changes,
> +			 * but not for core-sched picks or sleep.
> +			 */
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))
> @@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  
>  	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
>  
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		struct rq *task_rq = task_rq(p);
> +
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}
> +
>  	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
>  
>  	if (dsq->id == SCX_DSQ_LOCAL)
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 386c677e4c9a0..befa9a5d6e53f 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -982,6 +982,13 @@ enum scx_deq_flags {
>  	 * it hasn't been dispatched yet. Dequeue from the BPF side.
>  	 */
>  	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
> +
> +	/*
> +	 * The task is being dequeued due to a property change (e.g.,
> +	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
> +	 * etc.).
> +	 */
> +	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
>  };
>  
>  enum scx_pick_idle_cpu_flags {
> diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
> index c2c33df9292c2..8284f717ff05e 100644
> --- a/tools/sched_ext/include/scx/enum_defs.autogen.h
> +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
> @@ -21,6 +21,7 @@
>  #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
>  #define HAVE_SCX_DEQ_SLEEP
>  #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
> +#define HAVE_SCX_DEQ_SCHED_CHANGE
>  #define HAVE_SCX_DSQ_FLAG_BUILTIN
>  #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
>  #define HAVE_SCX_DSQ_INVALID
> @@ -48,6 +49,7 @@
>  #define HAVE_SCX_TASK_QUEUED
>  #define HAVE_SCX_TASK_RESET_RUNNABLE_AT
>  #define HAVE_SCX_TASK_DEQD_FOR_SLEEP
> +#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
>  #define HAVE_SCX_TASK_STATE_SHIFT
>  #define HAVE_SCX_TASK_STATE_BITS
>  #define HAVE_SCX_TASK_STATE_MASK
> diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> index 2f8002bcc19ad..5da50f9376844 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
>  const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
>  #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
>  
> +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
> +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
> diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
> index fedec938584be..fc9a7a4d9dea5 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.h
> @@ -46,4 +46,5 @@
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
> +	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
>  } while (0)


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
  2026-01-27 16:38   ` Emil Tsalapatis
@ 2026-01-27 16:41   ` Kuba Piecuch
  2026-01-30  7:34     ` Andrea Righi
  2026-01-28 21:21   ` Tejun Heo
  2 siblings, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-01-27 16:41 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext,
	linux-kernel, Emil Tsalapatis

Hi Andrea,

On Mon Jan 26, 2026 at 8:41 AM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change scenarios. As a result, BPF schedulers
> cannot reliably track task state.
>
> In addition, some ops.dequeue() callbacks can be skipped (e.g., during
> direct dispatch), so ops.enqueue() calls are not always paired with a
> corresponding ops.dequeue(), potentially breaking accounting logic.
>
> Fix this by guaranteeing that every ops.enqueue() is matched with a
> corresponding ops.dequeue(), and introduce the %SCX_DEQ_SCHED_CHANGE
> flag to distinguish dequeues triggered by scheduling property changes
> from those occurring in the normal dispatch/execution workflow.
>
> New semantics:
> 1. ops.enqueue() is called when a task enters the BPF scheduler
> 2. ops.dequeue() is called when the task leaves the BPF scheduler in
>    the following cases:
>    a) regular dispatch workflow: task dispatched to a DSQ,
>    b) core scheduling pick: core-sched picks task before dispatch,
>    c) property change: task properties modified.
>
> A new %SCX_DEQ_SCHED_CHANGE flag is also introduced, allowing BPF
> schedulers to distinguish between:
> - normal dispatch/execution workflow (dispatch, core-sched pick),
> - property changes that require state updates (e.g.,
>   sched_setaffinity(), sched_setscheduler(), set_user_nice(),
>   NUMA balancing, CPU migrations, etc.).
>
> With this, BPF schedulers can:
> - reliably track task ownership and lifecycle,
> - maintain accurate accounting of enqueue/dequeue pairs,
> - distinguish between execution events and property changes,
> - update internal state appropriately for each dequeue type.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Cc: Kuba Piecuch <jpiecuch@google.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  Documentation/scheduler/sched-ext.rst         | 33 +++++++
>  include/linux/sched/ext.h                     | 11 +++
>  kernel/sched/ext.c                            | 89 ++++++++++++++++++-
>  kernel/sched/ext_internal.h                   |  7 ++
>  .../sched_ext/include/scx/enum_defs.autogen.h |  2 +
>  .../sched_ext/include/scx/enums.autogen.bpf.h |  2 +
>  tools/sched_ext/include/scx/enums.autogen.h   |  1 +
>  7 files changed, 142 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..ed6bf7d9e6e8c 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
> +   The task remains in this state until ``ops.dequeue()`` is called, which
> +   happens in the following cases:
> +
> +   1. **Regular dispatch workflow**: when the task is successfully
> +      dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
> +      is triggered immediately to notify the BPF scheduler.
> +
> +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> +      core scheduling picks a task for execution before it has been
> +      dispatched, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> +   3. **Scheduling property change**: when a task property changes (via
> +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> +      priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
> +      with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> +   regardless of whether the task is still on a BPF data structure, or it
> +   has already been dispatched to a DSQ. This guarantees that every
> +   ``ops.enqueue()`` will eventually be followed by a corresponding
> +   ``ops.dequeue()``.

Not sure I follow this paragraph, specifically the first sentence
(starting with ``ops.dequeue()`` is called ...).
It seems to imply that a task that has already been dispatched to a DSQ still
counts as enqueued, but the preceding text contradicts that by saying that
a task is in an "enqueued state" from the time ops.enqueue() is called until
(among other things) it's successfully dispatched to a DSQ.

This would make sense if this paragraph used "enqueued" in the SCX_TASK_QUEUED
sense, while the first paragraph used the SCX_OPSS_QUEUED sense, but if that's
the case, it's quite confusing and should be clarified IMO.

> +
> +   This makes it reliable for BPF schedulers to track the enqueued state
> +   and maintain accurate accounting.
> +
> +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> +   don't need to track these transitions. The sched_ext core will safely
> +   handle all dequeue operations regardless.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +350,8 @@ by a sched_ext scheduler:
>                  /* Any usable CPU becomes available */
>  
>                  ops.dispatch(); /* Task is moved to a local DSQ */
> +
> +                ops.dequeue(); /* Exiting BPF scheduler */
>              }
>              ops.running();      /* Task starts running on its assigned CPU */
>              while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..59446cd0373fa 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,8 +84,19 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	/*
> +	 * Set when ops.enqueue() is called; used to determine if ops.dequeue()
> +	 * should be invoked when transitioning out of SCX_OPSS_NONE state.
> +	 */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1,
>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> +	/*
> +	 * Set when ops.dequeue() is called after successful dispatch; used to
> +	 * distinguish dispatch dequeues from property change dequeues and
> +	 * prevent duplicate dequeue calls.
> +	 */

What counts as a duplicate dequeue call? Looking at the code, we can clearly
have ops.dequeue(SCHED_CHANGE) called after ops.dequeue(0) without an
intervening call to ops.enqueue().

> +	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
>  
>  	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
>  	SCX_TASK_STATE_BITS	= 2,
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index afe28c04d5aa7..18bca2b83f5c5 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>  
>  	p->scx.ddsp_enq_flags |= enq_flags;
>  
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}
> +
>  	/*
>  	 * We are in the enqueue path with @rq locked and pinned, and thus can't
>  	 * double lock a remote rq and enqueue to its local DSQ. For
> @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
>  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
>  
> +	/*
> +	 * Mark that ops.enqueue() is being called for this task.
> +	 * Clear the dispatch dequeue flag for the new enqueue cycle.
> +	 * Only track these flags if ops.dequeue() is implemented.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
> +	}
> +
>  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
>  	WARN_ON_ONCE(*ddsp_taskp);
>  	*ddsp_taskp = p;
> @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			/*
> +			 * Task was already dispatched. Only call ops.dequeue()
> +			 * if it hasn't been called yet (check DISPATCH_DEQUEUED).
> +			 * This can happen when:
> +			 * 1. Core-sched picks a task that was dispatched
> +			 * 2. Property changes occur after dispatch
> +			 */
> +			if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
> +				/*
> +				 * ops.dequeue() wasn't called during dispatch.
> +				 * This shouldn't normally happen, but call it now.
> +				 */

Should we add a warning here?

> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +						 p, deq_flags);
> +			} else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
> +				/*
> +				 * This is a property change after
> +				 * dispatch. Call ops.dequeue() again with
> +				 * %SCX_DEQ_SCHED_CHANGE.
> +				 */
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> +						 p, deq_flags | SCX_DEQ_SCHED_CHANGE);
> +			}
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}

If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
for a task at most once between it being dispatched and taken off the CPU,
even if its properties are changed multiple times while it's on CPU.
Is that intentional? I don't see it documented.

To illustrate, assume we have a task p that has been enqueued, dispatched, and
is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.

When a property of p is changed while it runs on the CPU,
the sequence of calls is:
  dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
  (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
  set_next_task_scx(p).

dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
ops.dequeue(p, ... | SCHED_CHANGE) and clears
SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.

put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
dequeue_task_scx().

enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.

set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
this is not a core-sched pick, but it won't do much because the ops_state is
SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
dispatch_dequeue(p) which the removes the task from the local DSQ it was just
inserted into.


So, we end up in a state where any subsequent property change while the task is
still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
unset in p->scx.flags.

I really hope I didn't mess anything up when tracing the code, but of course
I'm happy to be corrected.

>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1531,9 +1583,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> +		 * only for property changes, not for core-sched picks.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
> +			u64 flags = deq_flags;
> +			/*
> +			 * Add %SCX_DEQ_SCHED_CHANGE for property changes,
> +			 * but not for core-sched picks or sleep.
> +			 */
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> +					  SCX_TASK_DISPATCH_DEQUEUED);
> +		}
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))
> @@ -2107,6 +2174,22 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  
>  	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
>  
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		struct rq *task_rq = task_rq(p);
> +
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}
> +
>  	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
>  
>  	if (dsq->id == SCX_DSQ_LOCAL)
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 386c677e4c9a0..befa9a5d6e53f 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -982,6 +982,13 @@ enum scx_deq_flags {
>  	 * it hasn't been dispatched yet. Dequeue from the BPF side.
>  	 */
>  	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
> +
> +	/*
> +	 * The task is being dequeued due to a property change (e.g.,
> +	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
> +	 * etc.).
> +	 */
> +	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
>  };
>  
>  enum scx_pick_idle_cpu_flags {
> diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
> index c2c33df9292c2..8284f717ff05e 100644
> --- a/tools/sched_ext/include/scx/enum_defs.autogen.h
> +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
> @@ -21,6 +21,7 @@
>  #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
>  #define HAVE_SCX_DEQ_SLEEP
>  #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
> +#define HAVE_SCX_DEQ_SCHED_CHANGE
>  #define HAVE_SCX_DSQ_FLAG_BUILTIN
>  #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
>  #define HAVE_SCX_DSQ_INVALID
> @@ -48,6 +49,7 @@
>  #define HAVE_SCX_TASK_QUEUED
>  #define HAVE_SCX_TASK_RESET_RUNNABLE_AT
>  #define HAVE_SCX_TASK_DEQD_FOR_SLEEP
> +#define HAVE_SCX_TASK_DISPATCH_DEQUEUED
>  #define HAVE_SCX_TASK_STATE_SHIFT
>  #define HAVE_SCX_TASK_STATE_BITS
>  #define HAVE_SCX_TASK_STATE_MASK
> diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> index 2f8002bcc19ad..5da50f9376844 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
>  const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
>  #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
>  
> +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
> +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
> diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
> index fedec938584be..fc9a7a4d9dea5 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.h
> @@ -46,4 +46,5 @@
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
> +	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
>  } while (0)

Thanks,
Kuba



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
  2026-01-27 16:38   ` Emil Tsalapatis
  2026-01-27 16:41   ` Kuba Piecuch
@ 2026-01-28 21:21   ` Tejun Heo
  2026-01-30 11:54     ` Kuba Piecuch
  2 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-01-28 21:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hello,

On Mon, Jan 26, 2026 at 09:41:49AM +0100, Andrea Righi wrote:
> @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>  
>  	p->scx.ddsp_enq_flags |= enq_flags;
>  
> +	/*
> +	 * The task is about to be dispatched. If ops.enqueue() was called,
> +	 * notify the BPF scheduler by calling ops.dequeue().
> +	 *
> +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> +	 * Mark that the dispatch dequeue has been called to distinguish
> +	 * from property change dequeues.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> +	}

1. When to call ops.dequeue()?

I'm not sure about deciding whether to call ops.dequeue() based solely on
whether ops.enqueue() was called. Direct dispatch has been expanded to include other
DSQs but was originally added as a way to shortcut the dispatch path and
"dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie.
When a task is dispatched directly to a local DSQ, the BPF scheduler is done
with that task - the task is now in the same state with tasks that get
dispatched to a local DSQ from ops.dispatch().

ie. What effectively decides whether a task left the BPF scheduler is
whether the task reached a local DSQ or not, and direct dispatching into a
local DSQ shouldn't trigger ops.dequeue() - the task never really "queues"
on the BPF scheduler.

This creates another discrepancy - from ops.enqueue(), direct dispatching
into a non-local DSQ clearly makes the task enter the BPF scheduler and thus
its departure should trigger ops.dequeue(). What about a task which is
direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially,
the right thing to do seems to skip ops.dequeue(). After all, the task has
never been ops.enqueue()'d. However, I think this is another case where
what's obvious doesn't agree with what's happening underneath.

ops.select_cpu() cannot actually queue anything. It's too early. Direct
dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch
once the enqueue path is invoked so that the BPF scheudler can avoid
invocation of ops.enqueue() when the decision has already been made. While
this shortcut was added for convenience (so that e.g. the BPF scheduler
doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has
real performance implications as it does save a roundtrip through
ops.enqueue() and we know that such overheads do matter for some use cases
(e.g. maximizing FPS on certain games).

So, while more subtle on the surface, I think the right thing to do is
to base the decision to call ops.dequeue() on the task's actual state -
ops.dequeue() should be called if the task is "on" the BPF scheduler - ie.
if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local
DSQ or on the BPF side.

The subtlety would need clear documentation and we probably want to allow
ops.dequeue() to distinguish different cases. If you boil it down to the
actual task state, I don't think it's that subtle - if a task is in the
custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not.
Note that, this way, whether ops.dequeue() needs to be called agrees with
whether the task needs to be dispatched to run.

2. Why keep %SCX_TASK_OPS_ENQUEUED for %SCX_DEQ_SCHED_CHANGE?

Wouldn't that lead to calling ops.dequeue() more than once for the same
enqueue event? If the BPF scheduler is told that the task has left it
already, why does it matter whether the task gets dequeued for sched change
afterwards? e.g. from the BPF sched's POV, it shouldn't matter whether the
task is still on the local DSQ or already running, in which case the sched
class's dequeue() wouldn't be called in the first place, no?

Thanks.

-- 
tejun


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-27 16:41   ` Kuba Piecuch
@ 2026-01-30  7:34     ` Andrea Righi
  2026-01-30 13:14       ` Kuba Piecuch
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-01-30  7:34 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Kuba,

On Tue, Jan 27, 2026 at 04:41:43PM +0000, Kuba Piecuch wrote:
...
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..ed6bf7d9e6e8c 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,37 @@ The following briefly shows how a waking task is scheduled and executed.
> >  
> >     * Queue the task on the BPF side.
> >  
> > +   Once ``ops.enqueue()`` is called, the task enters the "enqueued state".
> > +   The task remains in this state until ``ops.dequeue()`` is called, which
> > +   happens in the following cases:
> > +
> > +   1. **Regular dispatch workflow**: when the task is successfully
> > +      dispatched to a DSQ (local, global, or user DSQ), ``ops.dequeue()``
> > +      is triggered immediately to notify the BPF scheduler.
> > +
> > +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> > +      core scheduling picks a task for execution before it has been
> > +      dispatched, ``ops.dequeue()`` is called with the
> > +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> > +
> > +   3. **Scheduling property change**: when a task property changes (via
> > +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> > +      priority changes, CPU migrations, etc.), ``ops.dequeue()`` is called
> > +      with the ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> > +
> > +   **Important**: ``ops.dequeue()`` is called for *any* enqueued task,
> > +   regardless of whether the task is still on a BPF data structure, or it
> > +   has already been dispatched to a DSQ. This guarantees that every
> > +   ``ops.enqueue()`` will eventually be followed by a corresponding
> > +   ``ops.dequeue()``.
> 
> Not sure I follow this paragraph, specifically the first sentence
> (starting with ``ops.dequeue()`` is called ...).
> It seems to imply that a task that has already been dispatched to a DSQ still
> counts as enqueued, but the preceding text contradicts that by saying that
> a task is in an "enqueued state" from the time ops.enqueue() is called until
> (among other things) it's successfully dispatched to a DSQ.
> 
> This would make sense if this paragraph used "enqueued" in the SCX_TASK_QUEUED
> sense, while the first paragraph used the SCX_OPSS_QUEUED sense, but if that's
> the case, it's quite confusing and should be clarified IMO.

Good point, the confusion is on my side: the documentation overloads the
term "enqueued" and doesn't clearly distinguish the different contexts.

In that paragraph, "enqueued" refers to the ops lifecycle (i.e., a task for
which ops.enqueue() has been called and whose scheduler-visible state is
being tracked), not to the task being queued on a DSQ or having
SCX_TASK_QUEUED set.

The intent is to treat ops.enqueue() and ops.dequeue() as the boundaries of
a scheduler-visible lifecycle, regardless of whether the task is eventually
queued on a DSQ or dispatched directly.

And as noted by Tejun in his last email, skipping ops.dequeue() for direct
dispatches also makes sense, since in that case no new ops lifecycle is
established (direct dispatch in ops.select_cpu() or ops.enqueue() can be
seen as a shortcut to bypass the scheduler).

I'll update the patch and documentation accordingly to make this
distinction more explicit.

> 
> > +
> > +   This makes it reliable for BPF schedulers to track the enqueued state
> > +   and maintain accurate accounting.
> > +
> > +   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
> > +   don't need to track these transitions. The sched_ext core will safely
> > +   handle all dequeue operations regardless.
> > +
> >  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> >     empty, it then looks at the global DSQ. If there still isn't a task to
> >     run, ``ops.dispatch()`` is invoked which can use the following two
> > @@ -319,6 +350,8 @@ by a sched_ext scheduler:
> >                  /* Any usable CPU becomes available */
> >  
> >                  ops.dispatch(); /* Task is moved to a local DSQ */
> > +
> > +                ops.dequeue(); /* Exiting BPF scheduler */
> >              }
> >              ops.running();      /* Task starts running on its assigned CPU */
> >              while (task->scx.slice > 0 && task is runnable)
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..59446cd0373fa 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,8 +84,19 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	/*
> > +	 * Set when ops.enqueue() is called; used to determine if ops.dequeue()
> > +	 * should be invoked when transitioning out of SCX_OPSS_NONE state.
> > +	 */
> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1,
> >  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> > +	/*
> > +	 * Set when ops.dequeue() is called after successful dispatch; used to
> > +	 * distinguish dispatch dequeues from property change dequeues and
> > +	 * prevent duplicate dequeue calls.
> > +	 */
> 
> What counts as a duplicate dequeue call? Looking at the code, we can clearly
> have ops.dequeue(SCHED_CHANGE) called after ops.dequeue(0) without an
> intervening call to ops.enqueue().

Yeah SCHED_CHANGE dequeues are the exception, and it's acceptable to have
ops.dequeue(0) + ops.dequeue(SCHED_CHANGE). The idea is to catch potential
duplicate dispatch dequeues. I'll clarify this.

> 
> > +	SCX_TASK_DISPATCH_DEQUEUED = 1 << 4,
> >  
> >  	SCX_TASK_STATE_SHIFT	= 8,	  /* bit 8 and 9 are used to carry scx_task_state */
> >  	SCX_TASK_STATE_BITS	= 2,
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index afe28c04d5aa7..18bca2b83f5c5 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -1287,6 +1287,20 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> >  
> >  	p->scx.ddsp_enq_flags |= enq_flags;
> >  
> > +	/*
> > +	 * The task is about to be dispatched. If ops.enqueue() was called,
> > +	 * notify the BPF scheduler by calling ops.dequeue().
> > +	 *
> > +	 * Keep %SCX_TASK_OPS_ENQUEUED set so that subsequent property
> > +	 * changes can trigger ops.dequeue() with %SCX_DEQ_SCHED_CHANGE.
> > +	 * Mark that the dispatch dequeue has been called to distinguish
> > +	 * from property change dequeues.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +		p->scx.flags |= SCX_TASK_DISPATCH_DEQUEUED;
> > +	}
> > +
> >  	/*
> >  	 * We are in the enqueue path with @rq locked and pinned, and thus can't
> >  	 * double lock a remote rq and enqueue to its local DSQ. For
> > @@ -1391,6 +1405,16 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> >  	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> >  	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> >  
> > +	/*
> > +	 * Mark that ops.enqueue() is being called for this task.
> > +	 * Clear the dispatch dequeue flag for the new enqueue cycle.
> > +	 * Only track these flags if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		p->scx.flags &= ~SCX_TASK_DISPATCH_DEQUEUED;
> > +	}
> > +
> >  	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> >  	WARN_ON_ONCE(*ddsp_taskp);
> >  	*ddsp_taskp = p;
> > @@ -1523,6 +1547,34 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> > +			/*
> > +			 * Task was already dispatched. Only call ops.dequeue()
> > +			 * if it hasn't been called yet (check DISPATCH_DEQUEUED).
> > +			 * This can happen when:
> > +			 * 1. Core-sched picks a task that was dispatched
> > +			 * 2. Property changes occur after dispatch
> > +			 */
> > +			if (!(p->scx.flags & SCX_TASK_DISPATCH_DEQUEUED)) {
> > +				/*
> > +				 * ops.dequeue() wasn't called during dispatch.
> > +				 * This shouldn't normally happen, but call it now.
> > +				 */
> 
> Should we add a warning here?

Good idea, I'll add a WARN_ON_ONCE().

> 
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > +						 p, deq_flags);
> > +			} else if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC))) {
> > +				/*
> > +				 * This is a property change after
> > +				 * dispatch. Call ops.dequeue() again with
> > +				 * %SCX_DEQ_SCHED_CHANGE.
> > +				 */
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > +						 p, deq_flags | SCX_DEQ_SCHED_CHANGE);
> > +			}
> > +			p->scx.flags &= ~(SCX_TASK_OPS_ENQUEUED |
> > +					  SCX_TASK_DISPATCH_DEQUEUED);
> > +		}
> 
> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
> for a task at most once between it being dispatched and taken off the CPU,
> even if its properties are changed multiple times while it's on CPU.
> Is that intentional? I don't see it documented.
> 
> To illustrate, assume we have a task p that has been enqueued, dispatched, and
> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
> 
> When a property of p is changed while it runs on the CPU,
> the sequence of calls is:
>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>   set_next_task_scx(p).
> 
> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
> ops.dequeue(p, ... | SCHED_CHANGE) and clears
> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
> 
> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
> dequeue_task_scx().
> 
> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
> 
> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
> this is not a core-sched pick, but it won't do much because the ops_state is
> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
> inserted into.
> 
> 
> So, we end up in a state where any subsequent property change while the task is
> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
> unset in p->scx.flags.
> 
> I really hope I didn't mess anything up when tracing the code, but of course
> I'm happy to be corrected.

Correct. And the enqueue/dequeue balancing is preserved here. In the
scenario you describe, subsequent property changes while the task remains
running go through ENQUEUE_RESTORE, which intentionally skips
ops.enqueue(). Since no new enqueue cycle is started, there is no
corresponding ops.dequeue() to deliver either.

In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
scheduler state established by the last ops.enqueue(), not with every
individual property change. Multiple property changes while the task stays
on CPU are coalesced and the enqueue/dequeue pairing remains balanced.

I agree this distinction isn't obvious from the current documentation; I'll
clarify that SCX_DEQ_SCHED_CHANGE is edge-triggered per enqueue/run cycle,
not per property change.

Do you see any practical use case where it'd be beneficial to tie
individual ops.dequeue() calls to every property change, as opposed to the
current coalesced behavior?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-28 21:21   ` Tejun Heo
@ 2026-01-30 11:54     ` Kuba Piecuch
  2026-01-31  9:02       ` Andrea Righi
  2026-02-01 17:43       ` Tejun Heo
  0 siblings, 2 replies; 83+ messages in thread
From: Kuba Piecuch @ 2026-01-30 11:54 UTC (permalink / raw)
  To: Tejun Heo, Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Tejun,

On Wed Jan 28, 2026 at 9:21 PM UTC, Tejun Heo wrote:
...
> 1. When to call ops.dequeue()?
>
> > I'm not sure about deciding whether to call ops.dequeue() solely on whether
> ops.enqueue() was called. Direct dispatch has been expanded to include other
> DSQs but was originally added as a way to shortcut the dispatch path and
> "dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie.
> When a task is dispatched directly to a local DSQ, the BPF scheduler is done
> with that task - the task is now in the same state with tasks that get
> dispatched to a local DSQ from ops.dispatch().
>
> ie. What effectively decides whether a task left the BPF scheduler is
> whether the task reached a local DSQ or not, and direct dispatching into a
> local DSQ shouldn't trigger ops.dequeue() - the task never really "queues"
> on the BPF scheduler.

Is "local" short for "local or global", i.e. not user-created?
Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(),
since dispatch isn't necessary for the task to run. This follows from the last
paragraph:

  Note that, this way, whether ops.dequeue() needs to be called agrees with
  whether the task needs to be dispatched to run.

I agree with your points, just wanted to clarify this one thing.

>
> This creates another discrepancy - From ops.enqueue(), direct dispatching
> into a non-local DSQ clearly makes the task enter the BPF scheduler and thus
> its departure should trigger ops.dequeue(). What about a task which is
> direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially,
> the right thing to do seems to skip ops.dequeue(). After all, the task has
> never been ops.enqueue()'d. However, I think this is another case where
> what's obvious doesn't agree with what's happening underneath.
>
> ops.select_cpu() cannot actually queue anything. It's too early. Direct
> dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch
> > once the enqueue path is invoked so that the BPF scheduler can avoid
> invocation of ops.enqueue() when the decision has already been made. While
> this shortcut was added for convenience (so that e.g. the BPF scheduler
> doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has
> real performance implications as it does save a roundtrip through
> ops.enqueue() and we know that such overheads do matter for some use cases
> (e.g. maximizing FPS on certain games).
>
> So, while more subtle on the surface, I think the right thing to do is
> basing the decision to call ops.dequeue() on the task's actual state -
> ops.dequeue() should be called if the task is "on" the BPF scheduler - ie.
> if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local
> DSQ or on the BPF side.
>
> The subtlety would need clear documentation and we probably want to allow
> ops.dequeue() to distinguish different cases. If you boil it down to the
> actual task state, I don't think it's that subtle - if a task is in the
> custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not.
> Note that, this way, whether ops.dequeue() needs to be called agrees with
> whether the task needs to be dispatched to run.

Here's my attempt at documenting this behavior:

After ops.enqueue() is called on a task, the task is owned by the BPF
scheduler, provided the task wasn't direct-dispatched to a local/global DSQ.
When a task is owned by the BPF scheduler, the scheduler needs to dispatch the
task to a local/global DSQ in order for it to run.
When the BPF scheduler loses ownership of the task, either due to dispatching it
to a local/global DSQ or due to external events (core-sched pick, CPU
migration, scheduling property changes), the BPF scheduler is notified through
ops.dequeue() with appropriate flags (TBD).

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-30  7:34     ` Andrea Righi
@ 2026-01-30 13:14       ` Kuba Piecuch
  2026-01-31  6:54         ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-01-30 13:14 UTC (permalink / raw)
  To: Andrea Righi, Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Andrea,

On Fri Jan 30, 2026 at 7:34 AM UTC, Andrea Righi wrote:
...
> Good point, the confusion is on my side, the documentation overloads the
> term "enqueued" and doesn't clearly distinguish the different contexts.
>
> In that paragraph, "enqueued" refers to the ops lifecycle (i.e., a task for
> which ops.enqueue() has been called and whose scheduler-visible state is
> being tracked), not to the task being queued on a DSQ or having
> SCX_TASK_QUEUED set.
>
> The intent is to treat ops.enqueue() and ops.dequeue() as the boundaries of
> a scheduler-visible lifecycle, regardless of whether the task is eventually
> queued on a DSQ or dispatched directly.
>
> And as noted by Tejun in his last email, skipping ops.dequeue() for direct
> dispatches also makes sense, since in that case no new ops lifecycle is
> established (direct dispatch in ops.select_cpu() or ops.enqueue() can be
> seen as a shortcut to bypass the scheduler).

Right, skipping ops.dequeue() for direct dispatches makes sense, provided
the task is being dispatched to a local/global DSQ. Or at least that's my
takeaway after reading Tejun's email.

...
>> 
>> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> for a task at most once between it being dispatched and taken off the CPU,
>> even if its properties are changed multiple times while it's on CPU.
>> Is that intentional? I don't see it documented.
>> 
>> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
>> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> 
>> When a property of p is changed while it runs on the CPU,
>> the sequence of calls is:
>>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>>   set_next_task_scx(p).
>> 
>> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> 
>> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> dequeue_task_scx().
>> 
>> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> 
>> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> this is not a core-sched pick, but it won't do much because the ops_state is
>> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
>> inserted into.
>> 
>> 
>> So, we end up in a state where any subsequent property change while the task is
>> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> unset in p->scx.flags.
>> 
>> I really hope I didn't mess anything up when tracing the code, but of course
>> I'm happy to be corrected.
>
> Correct. And the enqueue/dequeue balancing is preserved here. In the
> scenario you describe, subsequent property changes while the task remains
> running go through ENQUEUE_RESTORE, which intentionally skips
> ops.enqueue(). Since no new enqueue cycle is started, there is no
> corresponding ops.dequeue() to deliver either.
>
> In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
> scheduler state established by the last ops.enqueue(), not with every
> individual property change. Multiple property changes while the task stays
> on CPU are coalesced and the enqueue/dequeue pairing remains balanced.

Ok, I think I understand the logic behind this, here's how I understand it:

The BPF scheduler is naturally going to have some internal per-task state.
That state may be expensive to compute from scratch, so we don't want to
completely discard it when the BPF scheduler loses ownership of the task.

ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
"Hey, some scheduling properties of the task are about to change, so you
probably should invalidate whatever state you have for that task which depends
on these properties."

That way, the BPF scheduler will know to recompute the invalidated state on
the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
BPF scheduler knows that none of the task's fundamental scheduling properties
(priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
the state. Of course, the potential for savings depends on the particular
scheduler's policy.

This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
a task is running: for subsequent calls, the BPF scheduler had already been
notified to invalidate its state, so there's no use in notifying it again.

However, I feel like there's a hidden assumption here that the BPF scheduler
doesn't recompute its state for the task before the next ops.enqueue().
What if the scheduler wanted to immediately react to the priority of a task
being decreased by preempting it? You might say "hook into
ops.set_weight()", but then doesn't that obviate the need for
ops.dequeue(SCHED_CHANGE)?

I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
a dedicated ops callback, but I wasn't able to find any such properties which
would be relevant to SCX.

Another thought on the design: currently, the exact meaning of
ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
scheduler:

* When it's owned, it combines two notifications: BPF scheduler losing
  ownership AND that it should invalidate task state.
* When it's not owned, it only serves as an "invalidate" notification,
  the ownership status doesn't change.

Wouldn't it be more elegant to have another callback, say
ops.property_change(), which would only serve as the "invalidate" notification,
and leave ops.dequeue() only for tracking ownership?
That would mean calling ops.dequeue() followed by ops.property_change() when
changing properties of a task owned by the BPF scheduler, as opposed to a
single call to ops.dequeue(SCHED_CHANGE).

But honestly, when I put it like this, it gets harder to justify having this
callback over just using ops.set_weight() etc.

>
> I agree this distinction isn't obvious from the current documentation, I'll
> clarify that SCX_DEQ_SCHED_CHANGE is edge-triggered per enqueue/run cycle,
> not per property change.
>
> Do you see any practical use case where it'd be beneficial to tie
> individual ops.dequeue() calls to every property change, as opposed to the
> current coalesced behavior?

I don't know how practical it is, but in my comment above I mention a BPF
scheduler wanting to immediately preempt a running task on priority decrease,
but in that case we need to hook into ops.set_weight() anyway to find out
whether the priority was decreased.

Thanks,
Kuba


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-30 13:14       ` Kuba Piecuch
@ 2026-01-31  6:54         ` Andrea Righi
  2026-01-31 16:45           ` Kuba Piecuch
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-01-31  6:54 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Kuba,

On Fri, Jan 30, 2026 at 01:14:23PM +0000, Kuba Piecuch wrote:
...
> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
> >> for a task at most once between it being dispatched and taken off the CPU,
> >> even if its properties are changed multiple times while it's on CPU.
> >> Is that intentional? I don't see it documented.
> >> 
> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUED and
> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
> >> 
> >> When a property of p is changed while it runs on the CPU,
> >> the sequence of calls is:
> >>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
> >>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
> >>   set_next_task_scx(p).
> >> 
> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
> >> 
> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
> >> dequeue_task_scx().
> >> 
> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
> >> 
> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
> >> this is not a core-sched pick, but it won't do much because the ops_state is
> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
> >> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
> >> inserted into.
> >> 
> >> 
> >> So, we end up in a state where any subsequent property change while the task is
> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
> >> unset in p->scx.flags.
> >> 
> >> I really hope I didn't mess anything up when tracing the code, but of course
> >> I'm happy to be corrected.
> >
> > Correct. And the enqueue/dequeue balancing is preserved here. In the
> > scenario you describe, subsequent property changes while the task remains
> > running go through ENQUEUE_RESTORE, which intentionally skips
> > ops.enqueue(). Since no new enqueue cycle is started, there is no
> > corresponding ops.dequeue() to deliver either.
> >
> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
> > scheduler state established by the last ops.enqueue(), not with every
> > individual property change. Multiple property changes while the task stays
> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
> 
> Ok, I think I understand the logic behind this, here's how I understand it:
> 
> The BPF scheduler is naturally going to have some internal per-task state.
> That state may be expensive to compute from scratch, so we don't want to
> completely discard it when the BPF scheduler loses ownership of the task.
> 
> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> "Hey, some scheduling properties of the task are about to change, so you
> probably should invalidate whatever state you have for that task which depends
> on these properties."

Correct. And it's also a way to notify that the task has left the BPF
scheduler, so if the task is stored in any internal queue it can/should be
removed.

> 
> That way, the BPF scheduler will know to recompute the invalidated state on
> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> BPF scheduler knows that none of the task's fundamental scheduling properties
> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> the state. Of course, the potential for savings depends on the particular
> scheduler's policy.
> 
> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> a task is running: for subsequent calls, the BPF scheduler had already been
> notified to invalidate its state, so there's no use in notifying it again.

Actually I think the proper behavior would be to trigger
ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
scheduler. While running, tasks are outside the BPF scheduler ownership, so
ops.dequeue() shouldn't be triggered at all.

> 
> However, I feel like there's a hidden assumption here that the BPF scheduler
> doesn't recompute its state for the task before the next ops.enqueue().

And that should be the proper behavior. BPF scheduler should recompute a
task state only when the task is re-enqueued after a property change.

> What if the scheduler wanted to immediately react to the priority of a task
> being decreased by preempting it? You might say "hook into
> ops.set_weight()", but then doesn't that obviate the need for
> ops.dequeue(SCHED_CHANGE)?

If a scheduler wants to implement preemption on property change, it can do
so in ops.enqueue(): after a property change, the task is re-enqueued,
triggering ops.enqueue(), at which point the BPF scheduler can decide
whether and how to preempt currently running tasks.

If a property change does not result in an ops.enqueue() call, it means the
task is not runnable yet (or does not intend to run), so attempting to
trigger a preemption at that point would be pointless.

> 
> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
> a dedicated ops callback, but I wasn't able to find any such properties which
> would be relevant to SCX.
> 
> Another thought on the design: currently, the exact meaning of
> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
> scheduler:
> 
> * When it's owned, it combines two notifications: BPF scheduler losing
>   ownership AND that it should invalidate task state.
> * When it's not owned, it only serves as an "invalidate" notification,
>   the ownership status doesn't change.

When it's not owned I think ops.dequeue() shouldn't be triggered at all.

> 
> Wouldn't it be more elegant to have another callback, say
> ops.property_change(), which would only serve as the "invalidate" notification,
> and leave ops.dequeue() only for tracking ownership?
> That would mean calling ops.dequeue() followed by ops.property_change() when
> changing properties of a task owned by the BPF scheduler, as opposed to a
> single call to ops.dequeue(SCHED_CHANGE).

We could provide an ops.property_change(), but honestly I don't see any
practical usage of this callback.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-30 11:54     ` Kuba Piecuch
@ 2026-01-31  9:02       ` Andrea Righi
  2026-01-31 17:53         ` Kuba Piecuch
  2026-02-01 17:43       ` Tejun Heo
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-01-31  9:02 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
> Hi Tejun,
> 
> On Wed Jan 28, 2026 at 9:21 PM UTC, Tejun Heo wrote:
> ...
> > 1. When to call ops.dequeue()?
> >
> > I'm not sure about deciding whether to call ops.dequeue() solely on whether
> > ops.enqueue() was called. Direct dispatch has been expanded to include other
> > DSQs but was originally added as a way to shortcut the dispatch path and
> > "dispatch directly" for execution from ops.select_cpu/enqueue() paths. ie.
> > When a task is dispatched directly to a local DSQ, the BPF scheduler is done
> > with that task - the task is now in the same state with tasks that get
> > dispatched to a local DSQ from ops.dispatch().
> >
> > ie. What effectively decides whether a task left the BPF scheduler is
> > whether the task reached a local DSQ or not, and direct dispatching into a
> > local DSQ shouldn't trigger ops.dequeue() - the task never really "queues"
> > on the BPF scheduler.
> 
> Is "local" short for "local or global", i.e. not user-created?
> Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(),
> since dispatch isn't necessary for the task to run. This follows from the last
> paragraph:
> 
>   Note that, this way, whether ops.dequeue() needs to be called agrees with
>   whether the task needs to be dispatched to run.
> 
> I agree with your points, just wanted to clarify this one thing.

I think this should be interpreted as local DSQs only
(SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. SCX_DSQ_GLOBAL is
essentially a built-in user DSQ provided for convenience; it's not really
a "direct dispatch" DSQ.

> 
> >
> > This creates another discrepancy - From ops.enqueue(), direct dispatching
> > into a non-local DSQ clearly makes the task enter the BPF scheduler and thus
> > its departure should trigger ops.dequeue(). What about a task which is
> > direct dispatched to a non-local DSQ from ops.select_cpu()? Superficially,
> > the right thing to do seems to skip ops.dequeue(). After all, the task has
> > never been ops.enqueue()'d. However, I think this is another case where
> > what's obvious doesn't agree with what's happening underneath.
> >
> > ops.select_cpu() cannot actually queue anything. It's too early. Direct
> > dispatch from ops.select_cpu() is a shortcut to schedule direct dispatch
> > once the enqueue path is invoked so that the BPF scheudler can avoid
> > invocation of ops.enqueue() when the decision has already been made. While
> > this shortcut was added for convenience (so that e.g. the BPF scheduler
> > doesn't have to pass a note from ops.select_cpu() to ops.enqueue()), it has
> > real performance implications as it does save a roundtrip through
> > ops.enqueue() and we know that such overheads do matter for some use cases
> > (e.g. maximizing FPS on certain games).
> >
> > So, while more subtle on the surface, I think the right thing to do is
> > basing the decision to call ops.dequeue() on the task's actual state -
> > ops.dequeue() should be called if the task is "on" the BPF scheduler - ie.
> > if the task ran ops.select_cpu/enqueue() paths and ended up in a non-local
> > DSQ or on the BPF side.
> >
> > The subtlety would need clear documentation and we probably want to allow
> > ops.dequeue() to distinguish different cases. If you boil it down to the
> > actual task state, I don't think it's that subtle - if a task is in the
> > custody of the BPF scheduler, ops.dequeue() will be called. Otherwise, not.
> > Note that, this way, whether ops.dequeue() needs to be called agrees with
> > whether the task needs to be dispatched to run.
> 
> Here's my attempt at documenting this behavior:
> 
> After ops.enqueue() is called on a task, the task is owned by the BPF
> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ.
> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the
> task to a local/global DSQ in order for it to run.
> When the BPF scheduler loses ownership of the task, either due to dispatching it
> to a local/global DSQ or due to external events (core-sched pick, CPU
> migration, scheduling property changes), the BPF scheduler is notified through
> ops.dequeue() with appropriate flags (TBD).

This looks good overall, except for the global DSQ part. Also, it might be
better to avoid the term “owned”; internally the kernel already uses the
concept of "task ownership" with a different meaning (see
https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing
it here could be misleading.

With that in mind, I'd probably rephrase your documentation along these
lines:

After ops.enqueue() is called, the task is considered *enqueued* by the BPF
scheduler, unless it is directly dispatched to a local DSQ (via
SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON).

While a task is enqueued, the BPF scheduler must explicitly dispatch it to
a DSQ in order for it to run.

When a task leaves the enqueued state (either because it is dispatched to a
non-local DSQ, or due to external events such as a core-sched pick, CPU
migration, or scheduling property changes), ops.dequeue() is invoked to
notify the BPF scheduler, with flags indicating the reason for the dequeue:
regular dispatch dequeues have no flags set, whereas dequeues triggered by
scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE.

What do you think?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-31  6:54         ` Andrea Righi
@ 2026-01-31 16:45           ` Kuba Piecuch
  2026-01-31 17:24             ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-01-31 16:45 UTC (permalink / raw)
  To: Andrea Righi, Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Andrea,

On Sat Jan 31, 2026 at 6:54 AM UTC, Andrea Righi wrote:
>> >> If I understand the logic correctly, ops.dequeue(SCHED_CHANGE) will be called
>> >> for a task at most once between it being dispatched and taken off the CPU,
>> >> even if its properties are changed multiple times while it's on CPU.
>> >> Is that intentional? I don't see it documented.
>> >> 
>> >> To illustrate, assume we have a task p that has been enqueued, dispatched, and
>> >> is currently running on the CPU, so we have both SCX_TASK_OPS_ENQUEUE and
>> >> SCX_TASK_DISPATCH_DEQUEUED set in p->scx.flags.
>> >> 
>> >> When a property of p is changed while it runs on the CPU,
>> >> the sequence of calls is:
>> >>   dequeue_task_scx(p, DEQUEUE_SAVE) => put_prev_task_scx(p) =>
>> >>   (change property) => enqueue_task_scx(p, ENQUEUE_RESTORE) =>
>> >>   set_next_task_scx(p).
>> >> 
>> >> dequeue_task_scx(p, DEQUEUE_SAVE) calls ops_dequeue() which calls
>> >> ops.dequeue(p, ... | SCHED_CHANGE) and clears
>> >> SCX_TASK_{OPS_ENQUEUED,DISPATCH_DEQUEUED} from p->scx.flags.
>> >> 
>> >> put_prev_task_scx(p) doesn't do much because SCX_TASK_QUEUED was cleared by
>> >> dequeue_task_scx().
>> >> 
>> >> enqueue_task_scx(p, ENQUEUE_RESTORE) sets sticky_cpu because the task is
>> >> currently running and ENQUEUE_RESTORE is set. This causes do_enqueue_task() to
>> >> jump straight to local_norefill, skipping the call to ops.enqueue(), leaving
>> >> SCX_TASK_OPS_ENQUEUED unset, and then enqueueing the task on the local DSQ.
>> >> 
>> >> set_next_task_scx(p) calls ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC) even though
>> >> this is not a core-sched pick, but it won't do much because the ops_state is
>> >> SCX_OPSS_NONE and SCX_TASK_OPS_ENQUEUED is unset. It also calls
>> >> dispatch_dequeue(p) which then removes the task from the local DSQ it was just
>> >> inserted into.
>> >> 
>> >> 
>> >> So, we end up in a state where any subsequent property change while the task is
>> >> still on CPU will not result in ops.dequeue(p, ... | SCHED_CHANGE) being
>> >> called, because both SCX_TASK_OPS_ENQUEUED and SCX_TASK_DISPATCH_DEQUEUED are
>> >> unset in p->scx.flags.
>> >> 
>> >> I really hope I didn't mess anything up when tracing the code, but of course
>> >> I'm happy to be corrected.
>> >
>> > Correct. And the enqueue/dequeue balancing is preserved here. In the
>> > scenario you describe, subsequent property changes while the task remains
>> > running go through ENQUEUE_RESTORE, which intentionally skips
>> > ops.enqueue(). Since no new enqueue cycle is started, there is no
>> > corresponding ops.dequeue() to deliver either.
>> >
>> > In other words, SCX_DEQ_SCHED_CHANGE is associated with invalidating the
>> > scheduler state established by the last ops.enqueue(), not with every
>> > individual property change. Multiple property changes while the task stays
>> > on CPU are coalesced and the enqueue/dequeue pairing remains balanced.
>> 
>> Ok, I think I understand the logic behind this, here's how I understand it:
>> 
>> The BPF scheduler is naturally going to have some internal per-task state.
>> That state may be expensive to compute from scratch, so we don't want to
>> completely discard it when the BPF scheduler loses ownership of the task.
>> 
>> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
>> "Hey, some scheduling properties of the task are about to change, so you
>> probably should invalidate whatever state you have for that task which depends
>> on these properties."
>
> Correct. And it's also a way to notify that the task has left the BPF
> scheduler, so if the task is stored in any internal queue it can/should be
> removed.

Right, unless the task has already been dispatched, in which case it's just
an invalidation notification.

>
>> 
>> That way, the BPF scheduler will know to recompute the invalidated state on
>> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
>> BPF scheduler knows that none of the task's fundamental scheduling properties
>> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
>> the state. Of course, the potential for savings depends on the particular
>> scheduler's policy.
>> 
>> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
>> a task is running: for subsequent calls, the BPF scheduler had already been
>> notified to invalidate its state, so there's no use in notifying it again.
>
> Actually I think the proper behavior would be to trigger
> ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> scheduler. While running, tasks are outside the BPF scheduler ownership, so
> ops.dequeue() shouldn't be triggered at all.
>

I don't think this is what the current implementation does, right?

>> 
>> However, I feel like there's a hidden assumption here that the BPF scheduler
>> doesn't recompute its state for the task before the next ops.enqueue().
>
> And that should be the proper behavior. BPF scheduler should recompute a
> task state only when the task is re-enqueued after a property change.
>

That would make sense if ops.enqueue() was called immediately after a property
change when a task is running, but I believe that's currently not the case,
see my attempt at tracing the enqueue-dequeue cycle on property change in my
first reply.

>> What if the scheduler wanted to immediately react to the priority of a task
>> being decreased by preempting it? You might say "hook into
>> ops.set_weight()", but then doesn't that obviate the need for
>> ops.dequeue(SCHED_CHANGE)?
>
> If a scheduler wants to implement preemption on property change, it can do
> so in ops.enqueue(): after a property change, the task is re-enqueued,
> triggering ops.enqueue(), at which point the BPF scheduler can decide
> whether and how to preempt currently running tasks.
>
> If a property change does not result in an ops.enqueue() call, it means the
> task is not runnable yet (or does not intend to run), so attempting to
> trigger a preemption at that point would be pointless.
>

IIUC a dequeue-enqueue cycle on a running task during property change doesn't
result in a call to ops.enqueue(), so if the BPF scheduler recomputed its state
only in ops.enqueue(), then it wouldn't be able to react immediately.

>> 
>> I guess it could be argued that ops.dequeue(SCHED_CHANGE) covers property
>> changes that happen under ``scoped_guard (sched_change, ...)`` which don't have
>> a dedicated ops callback, but I wasn't able to find any such properties which
>> would be relevant to SCX.
>> 
>> Another thought on the design: currently, the exact meaning of
>> ops.dequeue(SCHED_CHANGE) depends on whether the task is owned by the BPF
>> scheduler:
>> 
>> * When it's owned, it combines two notifications: BPF scheduler losing
>>   ownership AND that it should invalidate task state.
>> * When it's not owned, it only serves as an "invalidate" notification,
>>   the ownership status doesn't change.
>
> When it's not owned I think ops.dequeue() shouldn't be triggered at all.
>
>> 
>> Wouldn't it be more elegant to have another callback, say
>> ops.property_change(), which would only serve as the "invalidate" notification,
>> and leave ops.dequeue() only for tracking ownership?
>> That would mean calling ops.dequeue() followed by ops.property_change() when
>> changing properties of a task owned by the BPF scheduler, as opposed to a
>> single call to ops.dequeue(SCHED_CHANGE).
>
> We could provide an ops.property_change(), but honestly I don't see any
> practical usage of this callback.
>

Neither do I, I just made it up for the sake of argument :-)

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-31 16:45           ` Kuba Piecuch
@ 2026-01-31 17:24             ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-01-31 17:24 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hi Kuba,

On Sat, Jan 31, 2026 at 04:45:59PM +0000, Kuba Piecuch wrote:
...
> >> The BPF scheduler is naturally going to have some internal per-task state.
> >> That state may be expensive to compute from scratch, so we don't want to
> >> completely discard it when the BPF scheduler loses ownership of the task.
> >> 
> >> ops.dequeue(SCHED_CHANGE) serves as a notification to the BPF scheduler:
> >> "Hey, some scheduling properties of the task are about to change, so you
> >> probably should invalidate whatever state you have for that task which depends
> >> on these properties."
> >
> > Correct. And it's also a way to notify that the task has left the BPF
> > scheduler, so if the task is stored in any internal queue it can/should be
> > removed.
> 
> Right, unless the task has already been dispatched, in which case it's just
> an invalidation notification.

Right, but if the task has already been dispatched I don't think we should
trigger ops.dequeue(SCHED_CHANGE), because it's no longer under the BPF
scheduler's custody (not the way it's implemented right now; I'm just
trying to define the proper semantics based on the latest discussions).

> >> That way, the BPF scheduler will know to recompute the invalidated state on
> >> the next ops.enqueue(). If there was no call to ops.dequeue(SCHED_CHANGE), the
> >> BPF scheduler knows that none of the task's fundamental scheduling properties
> >> (priority, cpu, cpumask, etc.) changed, so it can potentially skip recomputing
> >> the state. Of course, the potential for savings depends on the particular
> >> scheduler's policy.
> >> 
> >> This also explains why we only get one call to ops.dequeue(SCHED_CHANGE) while
> >> a task is running: for subsequent calls, the BPF scheduler had already been
> >> notified to invalidate its state, so there's no use in notifying it again.
> >
> > Actually I think the proper behavior would be to trigger
> > ops.dequeue(SCHED_CHANGE) only when the task is "owned" by the BPF
> > scheduler. While running, tasks are outside the BPF scheduler ownership, so
> > ops.dequeue() shouldn't be triggered at all.
> >
> 
> I don't think this is what the current implementation does, right?

Right, sorry, I wasn't clear. I'm just trying to define the behavior that
makes more sense (see below).

> >> However, I feel like there's a hidden assumption here that the BPF scheduler
> >> doesn't recompute its state for the task before the next ops.enqueue().
> >
> > And that should be the proper behavior. BPF scheduler should recompute a
> > task state only when the task is re-enqueued after a property change.
> >
> 
> That would make sense if ops.enqueue() was called immediately after a property
> change when a task is running, but I believe that's currently not the case,
> see my attempt at tracing the enqueue-dequeue cycle on property change in my
> first reply.

Yeah, that's right.

I have a new patch set where I've implemented the following semantics
(which should also match Tejun's requirements).

With the new semantics:
 - for running tasks: property changes do NOT trigger ops.dequeue(SCHED_CHANGE)
 - once a task leaves BPF custody (dispatched to local DSQ), the BPF
   scheduler no longer manages it
 - property changes on running tasks don't affect the BPF scheduler

Key principle: ops.dequeue() is only called when a task leaves BPF
scheduler's custody. A running task has already left BPF custody, so
property changes don't trigger ops.dequeue().

Therefore, `ops.dequeue(SCHED_CHANGE)` gets called only when:
 - task is in BPF data structures (QUEUED state), or
 - task is on a non-local DSQ (still in BPF custody)

In this case (BPF scheduler custody), if a property change happens,
ops.dequeue(SCHED_CHANGE) is called to notify the BPF scheduler.

Then if you want to react immediately on priority changes for running tasks
we have:
 - ops.set_cpumask(): CPU affinity changes
 - ops.set_weight(): priority/nice changes
 - ops.cgroup_*(): cgroup changes

In conclusion, we don't need ops.dequeue(SCHED_CHANGE) for running tasks:
the dedicated callbacks (ops.set_cpumask(), ops.set_weight(), ...) already
provide comprehensive coverage for property changes on all tasks,
regardless of whether they're running or in BPF custody. And the new
ops.dequeue(SCHED_CHANGE) semantics only notify the BPF scheduler of
property changes while tasks are actively managed by it (in QUEUED state
or on non-local DSQs).

Do you think it's reasonable enough / do you see any flaws?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-31  9:02       ` Andrea Righi
@ 2026-01-31 17:53         ` Kuba Piecuch
  2026-01-31 20:26           ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-01-31 17:53 UTC (permalink / raw)
  To: Andrea Righi, Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

On Sat Jan 31, 2026 at 9:02 AM UTC, Andrea Righi wrote:
> On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
>> Is "local" short for "local or global", i.e. not user-created?
>> Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(),
>> since dispatch isn't necessary for the task to run. This follows from the last
>> paragraph:
>> 
>>   Note that, this way, whether ops.dequeue() needs to be called agrees with
>>   whether the task needs to be dispatched to run.
>> 
>> I agree with your points, just wanted to clarify this one thing.
>
> I think this should be interpreted as local DSQs only
> (SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. SCX_DSQ_GLOBAL is
> essentially a built-in user DSQ, provided for convenience, it's not really
> a "direct dispatch" DSQ.

SCX_DSQ_GLOBAL is significantly different from user DSQs, because balance_one()
can pull tasks directly from SCX_DSQ_GLOBAL, while it cannot pull tasks from
user-created DSQs.

If a BPF scheduler puts a task onto SCX_DSQ_GLOBAL, then it _must_ be ok with
balance_one() coming along and pulling that task without the BPF scheduler's
intervention, so in that way I believe SCX_DSQ_GLOBAL is semantically quite
similar to local DSQs.

>> Here's my attempt at documenting this behavior:
>> 
>> After ops.enqueue() is called on a task, the task is owned by the BPF
>> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ.
>> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the
>> task to a local/global DSQ in order for it to run.
>> When the BPF scheduler loses ownership of the task, either due to dispatching it
>> to a local/global DSQ or due to external events (core-sched pick, CPU
>> migration, scheduling property changes), the BPF scheduler is notified through
>> ops.dequeue() with appropriate flags (TBD).
>
> This looks good overall, except for the global DSQ part. Also, it might be
> better to avoid the term “owned”, internally the kernel already uses the
> concept of "task ownership" with a different meaning (see
> https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing
> it here could be misleading.
>
> With that in mind, I'd probably rephrase your documentation along these
> lines:
>
> After ops.enqueue() is called, the task is considered *enqueued* by the BPF
> scheduler, unless it is directly dispatched to a local DSQ (via
> SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON).
>
> While a task is enqueued, the BPF scheduler must explicitly dispatch it to
> a DSQ in order for it to run.
>
> When a task leaves the enqueued state (either because it is dispatched to a
> non-local DSQ, or due to external events such as a core-sched pick, CPU

Shouldn't it be "dispatched to a local DSQ"?

> migration, or scheduling property changes), ops.dequeue() is invoked to
> notify the BPF scheduler, with flags indicating the reason for the dequeue:
> regular dispatch dequeues have no flags set, whereas dequeues triggered by
> scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE.

Core-sched dequeues also have a dedicated flag; it should probably be included
here.

>
> What do you think?

I think using the term "enqueued" isn't very good either since it results in
two ways in which a task can be considered enqueued:

1. Between ops.enqueue() and ops.dequeue()
2. Between enqueue_task_scx() and dequeue_task_scx()

The two are not equivalent, since a task that's running is not enqueued
according to 1. but is enqueued according to 2.

I would be ok with it if we change it to something unambiguous, e.g.
"BPF-enqueued", although that poses a risk of people getting lazy and using
"enqueued" anyway.

Some potential alternative terms: "resident"/"BPF-resident",
"managed"/"BPF-managed", "dispatchable", "pending dispatch",
or simply "pending".

Thanks,
Kuba


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-31 17:53         ` Kuba Piecuch
@ 2026-01-31 20:26           ` Andrea Righi
  2026-02-02 15:19             ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-01-31 20:26 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

On Sat, Jan 31, 2026 at 05:53:27PM +0000, Kuba Piecuch wrote:
> On Sat Jan 31, 2026 at 9:02 AM UTC, Andrea Righi wrote:
> > On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
> >> Is "local" short for "local or global", i.e. not user-created?
> >> Direct dispatching into the global DSQ also shouldn't trigger ops.dequeue(),
> >> since dispatch isn't necessary for the task to run. This follows from the last
> >> paragraph:
> >> 
> >>   Note that, this way, whether ops.dequeue() needs to be called agrees with
> >>   whether the task needs to be dispatched to run.
> >> 
> >> I agree with your points, just wanted to clarify this one thing.
> >
> > I think this should be interpreted as local DSQs only
> > (SCX_DSQ_LOCAL / SCX_DSQ_LOCAL_ON), not any built-in DSQ. SCX_DSQ_GLOBAL is
> > essentially a built-in user DSQ, provided for convenience, it's not really
> > a "direct dispatch" DSQ.
> 
> SCX_DSQ_GLOBAL is significantly different from user DSQs, because balance_one()
> can pull tasks directly from SCX_DSQ_GLOBAL, while it cannot pull tasks from
> user-created DSQs.
> 
> If a BPF scheduler puts a task onto SCX_DSQ_GLOBAL, then it _must_ be ok with
> balance_one() coming along and pulling that task without the BPF scheduler's
> intervention, so in that way I believe SCX_DSQ_GLOBAL is semantically quite
> similar to local DSQs.

I agree that SCX_DSQ_GLOBAL behaves differently from user-created DSQs at
the implementation level, but I think that difference shouldn't leak into
the logical model.

From a semantic point of view, dispatching a task to SCX_DSQ_GLOBAL does
not mean that the task leaves the "enqueued by BPF" state. The task is
still under the BPF scheduler's custody, not directly dispatched to a
specific CPU, and remains sched_ext-managed. The scheduler has queued the
task and it hasn't relinquished control over it.

That said, I don't have a strong opinion here. If we prefer to treat
SCX_DSQ_GLOBAL as a "direct dispatch" DSQ for the purposes of ops.dequeue()
semantics, then I'm fine with adjusting the logic accordingly (with proper
documentation).

Tejun, thoughts?

> 
> >> Here's my attempt at documenting this behavior:
> >> 
> >> After ops.enqueue() is called on a task, the task is owned by the BPF
> >> scheduler, provided the task wasn't direct-dispatched to a local/global DSQ.
> >> When a task is owned by the BPF scheduler, the scheduler needs to dispatch the
> >> task to a local/global DSQ in order for it to run.
> >> When the BPF scheduler loses ownership of the task, either due to dispatching it
> >> to a local/global DSQ or due to external events (core-sched pick, CPU
> >> migration, scheduling property changes), the BPF scheduler is notified through
> >> ops.dequeue() with appropriate flags (TBD).
> >
> > This looks good overall, except for the global DSQ part. Also, it might be
> > better to avoid the term “owned”, internally the kernel already uses the
> > concept of "task ownership" with a different meaning (see
> > https://lore.kernel.org/all/aVHAZNbIJLLBHEXY@slm.duckdns.org), and reusing
> > it here could be misleading.
> >
> > With that in mind, I'd probably rephrase your documentation along these
> > lines:
> >
> > After ops.enqueue() is called, the task is considered *enqueued* by the BPF
> > scheduler, unless it is directly dispatched to a local DSQ (via
> > SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON).
> >
> > While a task is enqueued, the BPF scheduler must explicitly dispatch it to
> > a DSQ in order for it to run.
> >
> > When a task leaves the enqueued state (either because it is dispatched to a
> > non-local DSQ, or due to external events such as a core-sched pick, CPU
> 
> Shouldn't it be "dispatched to a local DSQ"?

Oh yes, sorry, it should be "dispatched to a local DSQ, ...".

> 
> > migration, or scheduling property changes), ops.dequeue() is invoked to
> > notify the BPF scheduler, with flags indicating the reason for the dequeue:
> > regular dispatch dequeues have no flags set, whereas dequeues triggered by
> > scheduling property changes are reported with SCX_DEQ_SCHED_CHANGE.
> 
> Core-sched dequeues also have a dedicated flag, it should probably be included
> here.

Right, core-sched dequeues should be mentioned as well.

> 
> >
> > What do you think?
> 
> I think using the term "enqueued" isn't very good either since it results in
> two ways in which a task can be considered enqueued:
> 
> 1. Between ops.enqueue() and ops.dequeue()
> 2. Between enqueue_task_scx() and dequeue_task_scx()
> 
> The two are not equivalent, since a task that's running is not enqueued
> according to 1. but is enqueued according to 2.
> 
> I would be ok with it if we change it to something unambiguous, e.g.
> "BPF-enqueued", although that poses a risk of people getting lazy and using
> "enqueued" anyway.
> 
> Some potential alternative terms: "resident"/"BPF-resident",
> "managed"/"BPF-managed", "dispatchable", "pending dispatch",
> or simply "pending".

I agree that "enqueued" is a very ambiguous term and we probably need
something more BPF-specific. How about a task "under BPF custody"?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
@ 2026-02-01  9:08 ` Andrea Righi
  2026-02-01 22:47   ` Christian Loehle
  2026-02-02 11:56   ` Kuba Piecuch
  0 siblings, 2 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-01  9:08 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in "BPF
scheduler's custody" when it has been queued in BPF-managed data
structures and the BPF scheduler is responsible for its lifecycle.
Custody ends when the task is dispatched to a local DSQ, selected by
core scheduling, or removed due to a property change.

Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
%SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
custody. As a result, ops.dequeue() is not invoked for these tasks.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: task was dispatched to a non-local DSQ (global
      or user DSQ), ops.dequeue() called without any special flags set
   b) core scheduling dispatch: core-sched picks task before dispatch,
      dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set
   c) property change: task properties modified before dispatch,
      dequeue called with %SCX_DEQ_SCHED_CHANGE flag set

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         |  76 ++++++++
 include/linux/sched/ext.h                     |   1 +
 kernel/sched/ext.c                            | 168 +++++++++++++++++-
 kernel/sched/ext_internal.h                   |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h   |   1 +
 7 files changed, 253 insertions(+), 3 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..6d9e82e6ca9d4 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
+   enter the "BPF scheduler's custody" depending on where it's dispatched:
+
+   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
+     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
+     entirely and goes straight to the CPU's local run queue. The task
+     never enters BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
+     the task enters the BPF scheduler's custody. When the task later
+     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
+     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
+     called exactly once.
+
+   * **Queued on BPF side**: The task is in BPF data structures and in BPF
+     custody, ``ops.dequeue()`` will be called when it leaves.
+
+   The key principle: **ops.dequeue() is called when a task leaves the BPF
+   scheduler's custody**. A task is in BPF custody if it's on a non-local
+   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
+   ops.dequeue() is called, the task is out of BPF custody and the BPF
+   scheduler no longer needs to track it.
+
+   This works correctly with the ``ops.select_cpu()`` direct dispatch
+   optimization: even though it skips ``ops.enqueue()`` invocation, if the
+   task is dispatched to a non-local DSQ, it enters BPF custody and will
+   get ``ops.dequeue()`` when it leaves. This provides the performance
+   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
+   correct state tracking.
+
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch workflow**: when the task is dispatched from a
+      non-local DSQ to a local DSQ (leaving BPF custody for execution),
+      ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (dispatched to local
+   DSQ), property changes will not trigger ``ops.dequeue()``, since the
+   task is no longer being managed by the BPF scheduler.
+
+   **Property Change Notifications for Running Tasks**:
+
+   For tasks that have left BPF custody (running or on local DSQs),
+   property changes can be intercepted through the dedicated callbacks:
+
+   * ``ops.set_cpumask()``: Called when a task's CPU affinity changes
+     (e.g., via ``sched_setaffinity()``). This callback is invoked for
+     all tasks regardless of their state or BPF custody.
+
+   * ``ops.set_weight()``: Called when a task's scheduling weight/priority
+     changes (e.g., via ``sched_setscheduler()`` or ``set_user_nice()``).
+     This callback is also invoked for all tasks.
+
+   These callbacks provide complete coverage for property changes,
+   complementing ``ops.dequeue()``, which applies only to tasks in BPF
+   custody.
+
+   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+   don't need to track these transitions. The sched_ext core will safely
+   handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +393,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..0d003d2845393 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..6d6f1253039d8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -924,6 +924,19 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_local_dsq - Check if a DSQ ID represents a local DSQ
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a local DSQ, false otherwise. Local DSQs are
+ * per-CPU queues where tasks go directly to execution.
+ */
+static inline bool is_local_dsq(u64 dsq_id)
+{
+	return dsq_id == SCX_DSQ_LOCAL ||
+	       (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1274,6 +1287,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
 
 	p->scx.ddsp_dsq_id = dsq_id;
 	p->scx.ddsp_enq_flags = enq_flags;
+
+	/*
+	 * Mark the task as entering BPF scheduler's custody if it's being
+	 * dispatched to a non-local DSQ. This handles the case where
+	 * ops.select_cpu() directly dispatches to a non-local DSQ - even
+	 * though ops.enqueue() won't be called, the task enters BPF
+	 * custody and should get ops.dequeue() when it leaves.
+	 *
+	 * For local DSQs, clear the flag, since the task bypasses the BPF
+	 * scheduler entirely. This also clears any flag that was set by
+	 * do_enqueue_task() before we knew the dispatch destination.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_local_dsq(dsq_id))
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		else
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+	}
 }
 
 static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
@@ -1287,6 +1318,40 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 
 	p->scx.ddsp_enq_flags |= enq_flags;
 
+	/*
+	 * The task is about to be dispatched, handle ops.dequeue() based
+	 * on where the task is going.
+	 *
+	 * Key principle: ops.dequeue() is called when a task leaves the
+	 * BPF scheduler's custody. A task is in BPF custody if it's on a
+	 * non-local DSQ or in BPF data structures. Once dispatched to a
+	 * local DSQ, it's out of BPF custody.
+	 *
+	 * Direct dispatch to local DSQs: task never enters BPF scheduler's
+	 * custody, it goes straight to the CPU. Don't call ops.dequeue()
+	 * and clear the flag so future property changes also won't trigger
+	 * it.
+	 *
+	 * Direct dispatch to non-local DSQs: task enters BPF scheduler's
+	 * custody. Mark the task as in BPF custody so that when it's later
+	 * dispatched to a local DSQ or dequeued for property changes,
+	 * ops.dequeue() will be called.
+	 *
+	 * This also handles the ops.select_cpu() direct dispatch to
+	 * non-local DSQs: the shortcut skips ops.enqueue() invocation but
+	 * the task still enters BPF custody if dispatched to a non-local
+	 * DSQ, and thus needs ops.dequeue() when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_local_dsq(dsq->id)) {
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		} else {
+			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
+	}
+
 	/*
 	 * We are in the enqueue path with @rq locked and pinned, and thus can't
 	 * double lock a remote rq and enqueue to its local DSQ. For
@@ -1391,6 +1456,21 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
 	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
 
+	/*
+	 * Mark that ops.enqueue() is being called for this task. This
+	 * indicates the task is entering the BPF scheduler's data
+	 * structures (QUEUED state).
+	 *
+	 * However, if the task was already marked as in BPF custody by
+	 * mark_direct_dispatch() (ops.select_cpu() direct dispatch to
+	 * non-local DSQ), don't clear that - keep the flag set so
+	 * ops.dequeue() will be called when appropriate.
+	 *
+	 * Only track this flag if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+
 	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
 	WARN_ON_ONCE(*ddsp_taskp);
 	*ddsp_taskp = p;
@@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not in BPF data structures (either dispatched to
+		 * a DSQ or running). Only call ops.dequeue() if the task
+		 * is still in BPF scheduler's custody
+		 * (%SCX_TASK_OPS_ENQUEUED is set).
+		 *
+		 * If the task has already been dispatched to a local DSQ
+		 * (left BPF custody), the flag will be clear and we skip
+		 * ops.dequeue()
+		 *
+		 * If this is a property change (not sleep/core-sched) and
+		 * the task is still in BPF custody, set the
+		 * %SCX_DEQ_SCHED_CHANGE flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+			u64 flags = deq_flags;
+
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1531,9 +1635,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+		 * only for property changes, not for core-sched picks or
+		 * sleep.
+		 *
+		 * Clear the flag after calling ops.dequeue(): the task is
+		 * leaving BPF scheduler's custody.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
+			u64 flags = deq_flags;
+
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1630,6 +1749,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1638,6 +1758,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local DSQ. Call
+	 * ops.dequeue() if the task was in BPF custody.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -2107,6 +2236,24 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 
 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
 
+	/*
+	 * Direct dispatch to local DSQs: call ops.dequeue() if task was in
+	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
+	 *
+	 * Dispatch to non-local DSQs: task is in BPF scheduler's custody.
+	 * Mark it so ops.dequeue() will be called when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_local_dsq(dsq_id)) {
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		} else {
+			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
+
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
+	}
+
 	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
 
 	if (dsq->id == SCX_DSQ_LOCAL)
@@ -2894,6 +3041,14 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Clear enqueue/dequeue tracking flags when enabling the task.
+	 * This ensures a clean state when the task enters SCX. Only needed
+	 * if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2925,6 +3080,13 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Clear enqueue/dequeue tracking flags when disabling the task.
+	 * Only needed if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
 }
 
 static void scx_exit_task(struct task_struct *p)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-30 11:54     ` Kuba Piecuch
  2026-01-31  9:02       ` Andrea Righi
@ 2026-02-01 17:43       ` Tejun Heo
  2026-02-02 15:52         ` Andrea Righi
  1 sibling, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-01 17:43 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Andrea Righi, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hello,

Sorry about the tardiness.

On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
> Is "local" short for "local or global", i.e. not user-created?

Yes, maybe it'd be useful to come up with a terminology for them. e.g.
terminal - once a task reaches a terminal DSQ, the only way that the BPF
scheduler can affect the task is by triggering re-enqueue (although we don't
yet support reenqueueing global DSQs).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-01 22:47   ` Christian Loehle
  2026-02-02  7:45     ` Andrea Righi
  2026-02-02 11:56   ` Kuba Piecuch
  1 sibling, 1 reply; 83+ messages in thread
From: Christian Loehle @ 2026-02-01 22:47 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext,
	linux-kernel

On 2/1/26 09:08, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
> 
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
> 
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in BPF-managed data
> structures and the BPF scheduler is responsible for its lifecycle.
> Custody ends when the task is dispatched to a local DSQ, selected by
> core scheduling, or removed due to a property change.
> 
> Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
> %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
> custody. As a result, ops.dequeue() is not invoked for these tasks.
> 
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
> 
> New ops.dequeue() semantics:
>  - ops.dequeue() is invoked exactly once when the task leaves the BPF
>    scheduler's custody, in one of the following cases:
>    a) regular dispatch: task was dispatched to a non-local DSQ (global
>       or user DSQ), ops.dequeue() called without any special flags set
>    b) core scheduling dispatch: core-sched picks task before dispatch,
>       dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set
>    c) property change: task properties modified before dispatch,
>       dequeue called with %SCX_DEQ_SCHED_CHANGE flag set
> 
> This allows BPF schedulers to:
>  - reliably track task ownership and lifecycle,
>  - maintain accurate accounting of managed tasks,
>  - update internal state when tasks change properties.
> 

So I have finally gotten around to updating scx_storm to the new
semantics, see:
https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics

I don't think the new ops.dequeue() semantics are enough to make inserts
to local-on from anywhere safe, because they still race with a dequeue
from another CPU?

Furthermore I can reproduce the following with this patch applied quite easily
with something like

hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm

[   44.356878] sched_ext: BPF scheduler "simple" enabled
[   59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)
[   85.366747] sched_ext: BPF scheduler "storm" enabled
[   85.371324] ------------[ cut here ]------------
[   85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111
[   85.373392] Modules linked in: qrtr
[   85.380088] ------------[ cut here ]------------
[   85.380719] ------------[ cut here ]------------
[   85.380722] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#10: kworker/u48:1/82
[   85.380728] Modules linked in: qrtr 8021q garp mrp stp llc binfmt_misc sm3_ce r8169 cdns3_pci_wrap nf_tables nfnetlink fuse dm_mod ipv6
[   85.380745] CPU: 10 UID: 0 PID: 82 Comm: kworker/u48:1 Tainted: G S                  6.19.0-rc7-cix-build+ #256 PREEMPT 
[   85.380749] Tainted: [S]=CPU_OUT_OF_SPEC
[   85.380750] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.1.0-1 2025-12-25T02:55:53+00:00
[   85.380754] Workqueue:  0x0 (events_unbound)
[   85.380760] Sched_ext: storm (enabled+all), task: runnable_at=+0ms
[   85.380762] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   85.380764] pc : update_locked_rq+0x64/0x6c
[   85.380767] lr : update_locked_rq+0x60/0x6c
[   85.380769] sp : ffff8000803a3bd0
[   85.380770] x29: ffff8000803a3bd0 x28: fffffdffbf622dc0 x27: ffff0000911e5040
[   85.380773] x26: 0000000000000000 x25: ffffd204426cad80 x24: ffffd20442ba5bb8
[   85.380776] x23: c00000000000000a x22: 0000000000000000 x21: ffffd20442ba4830
[   85.380778] x20: ffff00009af0b000 x19: ffff0001fef2ed80 x18: 0000000000000000
[   85.380781] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaaadd996940
[   85.380783] x14: 0000000000000000 x13: 00000000000a0000 x12: 0000000000000000
[   85.380786] x11: 0000000000000040 x10: ffffd204402e7ca0 x9 : ffffd2044324b000
[   85.380788] x8 : ffff0000810e0000 x7 : 0000d00202cc2dc0 x6 : 0000000000000050
[   85.380790] x5 : ffffd204426b5648 x4 : fffffdffbf622dc0 x3 : ffff0000810e0000
[   85.380793] x2 : 0000000000000002 x1 : ffff2dfdbc960000 x0 : 0000000000000000
[   85.380795] Call trace:
[   85.380796]  update_locked_rq+0x64/0x6c (P)
[   85.380799]  flush_dispatch_buf+0x2a8/0x2dc
[   85.380801]  pick_task_scx+0x2b0/0x6d4
[   85.380804]  __schedule+0x62c/0x1060
[   85.380811]  schedule+0x48/0x15c
[   85.380813]  worker_thread+0xdc/0x358
[   85.380824]  kthread+0x134/0x1fc
[   85.380831]  ret_from_fork+0x10/0x20
[   85.380839] irq event stamp: 34386
[   85.380840] hardirqs last  enabled at (34385): [<ffffd20441511408>] _raw_spin_unlock_irq+0x30/0x6c
[   85.380850] hardirqs last disabled at (34386): [<ffffd20441507100>] __schedule+0x510/0x1060
[   85.380852] softirqs last  enabled at (34014): [<ffffd204400c7280>] handle_softirqs+0x514/0x52c
[   85.380865] softirqs last disabled at (34007): [<ffffd204400105c4>] __do_softirq+0x14/0x20
[   85.380867] ---[ end trace 0000000000000000 ]---
[   85.380969] ------------[ cut here ]------------
[   85.380970] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#10: kworker/u48:1/82
[   85.380974] Modules linked in: qrtr 8021q garp mrp stp llc binfmt_misc sm3_ce r8169 cdns3_pci_wrap nf_tables nfnetlink fuse dm_mod ipv6
[   85.380984] CPU: 10 UID: 0 PID: 82 Comm: kworker/u48:1 Tainted: G S      W           6.19.0-rc7-cix-build+ #256 PREEMPT 
[   85.380987] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[   85.380988] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.1.0-1 2025-12-25T02:55:53+00:00
[   85.380990] Workqueue:  0x0 (events_unbound)
[   85.380993] Sched_ext: storm (enabled+all), task: runnable_at=+0ms
[   85.380994] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[   85.380996] pc : update_locked_rq+0x64/0x6c
[   85.380997] lr : update_locked_rq+0x60/0x6c
[   85.380999] sp : ffff8000803a3bd0
[   85.381000] x29: ffff8000803a3bd0 x28: fffffdffbf622dc0 x27: ffff00009151b580
[   85.381002] x26: 0000000000000000 x25: ffffd204426cad80 x24: ffffd20442ba5bb8
[   85.381005] x23: c00000000000000a x22: 0000000000000000 x21: ffffd20442ba4830
[   85.381007] x20: ffff00009af0b000 x19: ffff0001fef52d80 x18: 0000000000000000
[   85.381009] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaaae6917960
[   85.381012] x14: 0000000000000000 x13: 00000000000a0000 x12: 0000000000000000
[   85.381014] x11: 0000000000000040 x10: ffffd204402e7ca0 x9 : ffffd2044324b000
[   85.381016] x8 : ffff0000810e0000 x7 : 0000d00202cc2dc0 x6 : 0000000000000050
[   85.381019] x5 : ffffd204426b5648 x4 : fffffdffbf622dc0 x3 : ffff0000810e0000
[   85.381021] x2 : 0000000000000002 x1 : ffff2dfdbc960000 x0 : 0000000000000000
[   85.381023] Call trace:
[   85.381024]  update_locked_rq+0x64/0x6c (P)
[   85.381026]  flush_dispatch_buf+0x2a8/0x2dc
[   85.381028]  pick_task_scx+0x2b0/0x6d4
[   85.381030]  __schedule+0x62c/0x1060
[   85.381032]  schedule+0x48/0x15c
[   85.381034]  worker_thread+0xdc/0x358
[   85.381036]  kthread+0x134/0x1fc
[   85.381039]  ret_from_fork+0x10/0x20
[   85.381041] irq event stamp: 34394
[   85.381042] hardirqs last  enabled at (34393): [<ffffd20441511408>] _raw_spin_unlock_irq+0x30/0x6c
[   85.381044] hardirqs last disabled at (34394): [<ffffd20441507100>] __schedule+0x510/0x1060
[   85.381046] softirqs last  enabled at (34014): [<ffffd204400c7280>] handle_softirqs+0x514/0x52c
[   85.381049] softirqs last disabled at (34007): [<ffffd204400105c4>] __do_softirq+0x14/0x20
[   85.381050] ---[ end trace 0000000000000000 ]---
[   85.381199] ------------[ cut here ]------------
[   85.381201] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#10: kworker/u48:1/82

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-01 22:47   ` Christian Loehle
@ 2026-02-02  7:45     ` Andrea Righi
  2026-02-02  9:26       ` Andrea Righi
                         ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-02  7:45 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hi Christian,

On Sun, Feb 01, 2026 at 10:47:22PM +0000, Christian Loehle wrote:
> On 2/1/26 09:08, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> > 
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> > 
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in BPF-managed data
> > structures and the BPF scheduler is responsible for its lifecycle.
> > Custody ends when the task is dispatched to a local DSQ, selected by
> > core scheduling, or removed due to a property change.
> > 
> > Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
> > %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
> > custody. As a result, ops.dequeue() is not invoked for these tasks.
> > 
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> > 
> > New ops.dequeue() semantics:
> >  - ops.dequeue() is invoked exactly once when the task leaves the BPF
> >    scheduler's custody, in one of the following cases:
> >    a) regular dispatch: task was dispatched to a non-local DSQ (global
> >       or user DSQ), ops.dequeue() called without any special flags set
> >    b) core scheduling dispatch: core-sched picks task before dispatch,
> >       dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set
> >    c) property change: task properties modified before dispatch,
> >       dequeue called with %SCX_DEQ_SCHED_CHANGE flag set
> > 
> > This allows BPF schedulers to:
> >  - reliably track task ownership and lifecycle,
> >  - maintain accurate accounting of managed tasks,
> >  - update internal state when tasks change properties.
> > 
> 
> So I have finally gotten around updating scx_storm to the new semantics,
> see:
> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
> 
> I don't think the new ops.dequeue() are enough to make inserts to local-on
> from anywhere safe, because it's still racing with dequeue from another CPU?

Yeah, with this patch set BPF schedulers get proper ops.dequeue()
callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
ops.dispatch().

When task properties change between scx_bpf_dsq_insert() and the actual
dispatch, task_can_run_on_remote_rq() can still trigger a fatal
scx_error().

The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notification happens after the
property change, so it can't prevent already-queued dispatches from
failing. The race window is between ops.dispatch() returning and
dispatch_to_local_dsq() executing.

We can address this in a separate patch set. One thing at a time. :)

> 
> Furthermore I can reproduce the following with this patch applied quite easily
> with something like
> 
> hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm
> 
> [   44.356878] sched_ext: BPF scheduler "simple" enabled
> [   59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)
> [   85.366747] sched_ext: BPF scheduler "storm" enabled
> [   85.371324] ------------[ cut here ]------------
> [   85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111

Ah yes! I think I see it, can you try this on top?

Thanks,
-Andrea

 kernel/sched/ext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6d6f1253039d8..d8fed4a49195d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
 		} else {
 			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
-				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
 
 			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
 		}

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  7:45     ` Andrea Righi
@ 2026-02-02  9:26       ` Andrea Righi
  2026-02-02 10:02         ` Christian Loehle
  2026-02-02 10:09       ` Christian Loehle
  2026-02-02 13:59       ` Kuba Piecuch
  2 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-02  9:26 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
...
> > So I have finally gotten around updating scx_storm to the new semantics,
> > see:
> > https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
> > 
> > I don't think the new ops.dequeue() are enough to make inserts to local-on
> > from anywhere safe, because it's still racing with dequeue from another CPU?
> 
> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> ops.dispatch().
> 
> When task properties change between scx_bpf_dsq_insert() and the actual
> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> scx_error().
> 
> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the
> property change, so it can't prevent already-queued dispatches from
> failing. The race window is between ops.dispatch() returning and
> dispatch_to_local_dsq() executing.
> 
> We can address this in a separate patch set. One thing at a time. :)

Thinking more about this, the problem is that we're passing enforce=true
to task_can_run_on_remote_rq(), triggering a critical failure -
scx_error(). There's logic in task_can_run_on_remote_rq() to fall back to
the global DSQ, but that doesn't happen when we pass enforce=true,
because of the scx_error().

However, instead of the global DSQ fallback, I was wondering if it'd be
better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
target local DSQ isn't valid anymore when the dispatch is finalized.

In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
affinity / migration disabled changes) and the BPF scheduler can handle
that in another ops.enqueue().

What do you think?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  9:26       ` Andrea Righi
@ 2026-02-02 10:02         ` Christian Loehle
  2026-02-02 15:32           ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Christian Loehle @ 2026-02-02 10:02 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On 2/2/26 09:26, Andrea Righi wrote:
> On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
> ...
>>> So I have finally gotten around updating scx_storm to the new semantics,
>>> see:
>>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
>>>
>>> I don't think the new ops.dequeue() are enough to make inserts to local-on
>>> from anywhere safe, because it's still racing with dequeue from another CPU?
>>
>> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
>> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
>> ops.dispatch().
>>
>> When task properties change between scx_bpf_dsq_insert() and the actual
>> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
>> scx_error().
>>
>> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notifications happens after the
>> property change, so it can't prevent already-queued dispatches from
>> failing. The race window is between ops.dispatch() returning and
>> dispatch_to_local_dsq() executing.
>>
>> We can address this in a separate patch set. One thing at a time. :)
> 
> Thinking more on this, the problem is that we're passing enforce=true to
> task_can_run_on_remote_rq(), triggering a critical failure - scx_error().
> There's a logic in task_can_run_on_remote_rq() to fallback to the global
> DSQ, that doesn't happen if we pass enforce=true, due to scx_error().
> 
> However, instead of the global DSQ fallback, I was wondering if it'd be
> better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
> target local DSQ isn't valid anymore when the dispatch is finalized.
> 
> In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
> trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
> affinity / migration disabled changes) and the BPF scheduler can handle
> that in another ops.enqueue().
> 
> What do you think?

I think that's a lot more versatile for the BPF scheduler than using the
global DSQ as fallback in that case, so yeah I'm all for it!


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  7:45     ` Andrea Righi
  2026-02-02  9:26       ` Andrea Righi
@ 2026-02-02 10:09       ` Christian Loehle
  2026-02-02 13:59       ` Kuba Piecuch
  2 siblings, 0 replies; 83+ messages in thread
From: Christian Loehle @ 2026-02-02 10:09 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On 2/2/26 07:45, Andrea Righi wrote:
> Hi Christian,
> 
> On Sun, Feb 01, 2026 at 10:47:22PM +0000, Christian Loehle wrote:
>> On 2/1/26 09:08, Andrea Righi wrote:
>>> Currently, ops.dequeue() is only invoked when the sched_ext core knows
>>> that a task resides in BPF-managed data structures, which causes it to
>>> miss scheduling property change events. In addition, ops.dequeue()
>>> callbacks are completely skipped when tasks are dispatched to non-local
>>> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
>>> track task state.
>>>
>>> Fix this by guaranteeing that each task entering the BPF scheduler's
>>> custody triggers exactly one ops.dequeue() call when it leaves that
>>> custody, whether the exit is due to a dispatch (regular or via a core
>>> scheduling pick) or to a scheduling property change (e.g.
>>> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
>>> balancing, etc.).
>>>
>>> BPF scheduler custody concept: a task is considered to be in "BPF
>>> scheduler's custody" when it has been queued in BPF-managed data
>>> structures and the BPF scheduler is responsible for its lifecycle.
>>> Custody ends when the task is dispatched to a local DSQ, selected by
>>> core scheduling, or removed due to a property change.
>>>
>>> Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
>>> %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
>>> custody. As a result, ops.dequeue() is not invoked for these tasks.
>>>
>>> To identify dequeues triggered by scheduling property changes, introduce
>>> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
>>> the dequeue was caused by a scheduling property change.
>>>
>>> New ops.dequeue() semantics:
>>>  - ops.dequeue() is invoked exactly once when the task leaves the BPF
>>>    scheduler's custody, in one of the following cases:
>>>    a) regular dispatch: task was dispatched to a non-local DSQ (global
>>>       or user DSQ), ops.dequeue() called without any special flags set
>>>    b) core scheduling dispatch: core-sched picks task before dispatch,
>>>       dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set
>>>    c) property change: task properties modified before dispatch,
>>>       dequeue called with %SCX_DEQ_SCHED_CHANGE flag set
>>>
>>> This allows BPF schedulers to:
>>>  - reliably track task ownership and lifecycle,
>>>  - maintain accurate accounting of managed tasks,
>>>  - update internal state when tasks change properties.
>>>
>>
>> So I have finally gotten around updating scx_storm to the new semantics,
>> see:
>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
>>
>> I don't think the new ops.dequeue() semantics are enough to make inserts to local-on
>> from anywhere safe, because they're still racing with dequeues from another CPU?
> 
> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> ops.dispatch().
> 
> When task properties change between scx_bpf_dsq_insert() and the actual
> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> scx_error().
> 
> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notification happens after the
> property change, so it can't prevent already-queued dispatches from
> failing. The race window is between ops.dispatch() returning and
> dispatch_to_local_dsq() executing.
> 
> We can address this in a separate patch set. One thing at a time. :)
> 
>>
>> Furthermore I can reproduce the following with this patch applied quite easily
>> with something like
>>
>> hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm
>>
>> [   44.356878] sched_ext: BPF scheduler "simple" enabled
>> [   59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)
>> [   85.366747] sched_ext: BPF scheduler "storm" enabled
>> [   85.371324] ------------[ cut here ]------------
>> [   85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111
> 
> Ah yes! I think I see it, can you try this on top?
> 
> Thanks,
> -Andrea
> 
>  kernel/sched/ext.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6d6f1253039d8..d8fed4a49195d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>  		} else {
>  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> -				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
>  
>  			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>  		}

Yup, that fixes it!

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
  2026-02-01 22:47   ` Christian Loehle
@ 2026-02-02 11:56   ` Kuba Piecuch
  2026-02-04 10:11     ` Andrea Righi
  1 sibling, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-02-02 11:56 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Hi Andrea,

Looks good overall, but we need to settle on the global DSQ semantics, plus
some edge cases that need clearing up.

On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..6d9e82e6ca9d4 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> +     entirely and goes straight to the CPU's local run queue. The task
> +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> +     the task enters the BPF scheduler's custody. When the task later
> +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> +     called exactly once.
> +
> +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> +     custody, ``ops.dequeue()`` will be called when it leaves.
> +
> +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> +   scheduler no longer needs to track it.
> +
> +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> +   get ``ops.dequeue()`` when it leaves. This provides the performance
> +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> +   correct state tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch workflow**: when the task is dispatched from a
> +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> +      ``ops.dequeue()`` is triggered without any special flags.

Maybe add a note that this can happen asynchronously, without the BPF
scheduler explicitly dispatching the task to a local DSQ, when the task
is on a global DSQ? Or maybe make that case into a separate dequeue reason
with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..0d003d2845393 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */

Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
"ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
as a whole.

> @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody
> +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> +		 *
> +		 * If the task has already been dispatched to a local DSQ
> +		 * (left BPF custody), the flag will be clear and we skip
> +		 * ops.dequeue()
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;

I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
tasks picked from a global DSQ and migrated from a remote rq to the local
rq, which, while technically correct since the task is migrating rqs, may
be confusing, since it fits two cases in the documentation:

* Since the task is leaving BPF custody for execution, ops.dequeue() should be
  called without any special flags.
* Since the task is being migrated between rqs, ops.dequeue() should be called
  with SCX_DEQ_SCHED_CHANGE.

> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*

Thanks,
Kuba


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02  7:45     ` Andrea Righi
  2026-02-02  9:26       ` Andrea Righi
  2026-02-02 10:09       ` Christian Loehle
@ 2026-02-02 13:59       ` Kuba Piecuch
  2026-02-04  9:36         ` Andrea Righi
  2 siblings, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-02-02 13:59 UTC (permalink / raw)
  To: Andrea Righi, Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hi Andrea,

On Mon Feb 2, 2026 at 7:45 AM UTC, Andrea Righi wrote:
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6d6f1253039d8..d8fed4a49195d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>  		} else {
>  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> -				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
>  
>  			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
>  		}

This looks risky from a locking perspective. Are we relying on
SCX_OPSS_DISPATCHING to protect against racing dequeues? If so, it might
be worth calling out in a comment.

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-01-31 20:26           ` Andrea Righi
@ 2026-02-02 15:19             ` Tejun Heo
  2026-02-02 15:30               ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-02 15:19 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

Hello,

On Sat, Jan 31, 2026 at 09:26:56PM +0100, Andrea Righi wrote:
> I agree that SCX_DSQ_GLOBAL behaves differently from user-created DSQs at
> the implementation level, but I think that difference shouldn't leak into
> the logical model.
> 
> From a semantic point of view, dispatching a task to SCX_DSQ_GLOBAL does
> not mean that the task leaves the "enqueued by BPF" state. The task is
> still under the BPF scheduler's custody, not directly dispatched to a
> specific CPU, and remains sched_ext-managed. The scheduler has queued the
> task and it hasn't relinquished control over it.
> 
> That said, I don't have a strong opinion here. If we prefer to treat
> SCX_DSQ_GLOBAL as a "direct dispatch" DSQ for the purposes of ops.dequeue()
> semantics, then I'm fine with adjusting the logic accordingly (with proper
> documentation).
> 
> Tejun, thoughts?

I think putting a task into GLOBAL means that the BPF scheduler is done with
it. Another data point in this direction is that when insertion into a local
DSQ can't be done, the task falls back to the global DSQ although all the
current ones also trigger an error.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 15:19             ` Tejun Heo
@ 2026-02-02 15:30               ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-02 15:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

On Mon, Feb 02, 2026 at 05:19:51AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Sat, Jan 31, 2026 at 09:26:56PM +0100, Andrea Righi wrote:
> > I agree that SCX_DSQ_GLOBAL behaves differently from user-created DSQs at
> > the implementation level, but I think that difference shouldn't leak into
> > the logical model.
> > 
> > From a semantic point of view, dispatching a task to SCX_DSQ_GLOBAL does
> > not mean that the task leaves the "enqueued by BPF" state. The task is
> > still under the BPF scheduler's custody, not directly dispatched to a
> > specific CPU, and remains sched_ext-managed. The scheduler has queued the
> > task and it hasn't relinquished control over it.
> > 
> > That said, I don't have a strong opinion here. If we prefer to treat
> > SCX_DSQ_GLOBAL as a "direct dispatch" DSQ for the purposes of ops.dequeue()
> > semantics, then I'm fine with adjusting the logic accordingly (with proper
> > documentation).
> > 
> > Tejun, thoughts?
> 
> I think putting a task into GLOBAL means that the BPF scheduler is done with
> it. Another data point in this direction is that when insertion into a local
> DSQ can't be done, the task falls back to the global DSQ although all the
> current ones also trigger an error.

Alright, it seems that the general consensus, based on your feedback and
Kuba's, is to treat SCX_DSQ_GLOBAL as a "terminal" DSQ for the purpose of
triggering ops.dequeue().

I'll update the logic to do the following:
 - When a task is dispatched to SCX_DSQ_GLOBAL, the BPF scheduler is
   considered done with it (similar to local DSQ dispatches).
 - ops.dequeue() will not be called for SCX_DSQ_GLOBAL dispatches.
 - This aligns with the fallback behavior where tasks that fail local DSQ
   insertion end up in the global DSQ as a terminal destination.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 10:02         ` Christian Loehle
@ 2026-02-02 15:32           ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-02 15:32 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 02, 2026 at 10:02:30AM +0000, Christian Loehle wrote:
> On 2/2/26 09:26, Andrea Righi wrote:
> > On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
> > ...
> >>> So I have finally gotten around updating scx_storm to the new semantics,
> >>> see:
> >>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
> >>>
> >>> I don't think the new ops.dequeue() semantics are enough to make inserts to local-on
> >>> from anywhere safe, because they're still racing with dequeues from another CPU?
> >>
> >> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> >> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> >> ops.dispatch().
> >>
> >> When task properties change between scx_bpf_dsq_insert() and the actual
> >> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> >> scx_error().
> >>
> >> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notification happens after the
> >> property change, so it can't prevent already-queued dispatches from
> >> failing. The race window is between ops.dispatch() returning and
> >> dispatch_to_local_dsq() executing.
> >>
> >> We can address this in a separate patch set. One thing at a time. :)
> > 
> > Thinking more on this, the problem is that we're passing enforce=true to
> > task_can_run_on_remote_rq(), triggering a critical failure - scx_error().
> > There's a logic in task_can_run_on_remote_rq() to fallback to the global
> > DSQ, that doesn't happen if we pass enforce=true, due to scx_error().
> > 
> > However, instead of the global DSQ fallback, I was wondering if it'd be
> > better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
> > target local DSQ isn't valid anymore when the dispatch is finalized.
> > 
> > In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
> > trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
> > affinity / migration disabled changes) and the BPF scheduler can handle
> > that in another ops.enqueue().
> > 
> > What do you think?
> 
> I think that's a lot more versatile for the BPF scheduler than using the
> global DSQ as fallback in that case, so yeah I'm all for it!
> 

Ack, I already have a working patch to do this, I'll post it as a separate
patch set.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-01 17:43       ` Tejun Heo
@ 2026-02-02 15:52         ` Andrea Righi
  2026-02-02 16:23           ` Kuba Piecuch
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-02 15:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

On Sun, Feb 01, 2026 at 07:43:33AM -1000, Tejun Heo wrote:
> Hello,
> 
> Sorry about tardiness.
> 
> On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
> > Is "local" short for "local or global", i.e. not user-created?
> 
> Yes, maybe it'd be useful to come up with a terminology for them. e.g.
> terminal - once a task reaches a terminal DSQ, the only way that the BPF
> scheduler can affect the task is by triggering re-enqueue (although we don't
> yet support reenqueueing global DSQs).

I like "terminal DSQ", if there's no objection I'll update the
documentation using this terminology.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 15:52         ` Andrea Righi
@ 2026-02-02 16:23           ` Kuba Piecuch
  0 siblings, 0 replies; 83+ messages in thread
From: Kuba Piecuch @ 2026-02-02 16:23 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo
  Cc: Kuba Piecuch, David Vernet, Changwoo Min, Christian Loehle,
	Daniel Hodges, sched-ext, linux-kernel, Emil Tsalapatis

On Mon Feb 2, 2026 at 3:52 PM UTC, Andrea Righi wrote:
> On Sun, Feb 01, 2026 at 07:43:33AM -1000, Tejun Heo wrote:
>> On Fri, Jan 30, 2026 at 11:54:00AM +0000, Kuba Piecuch wrote:
>> > Is "local" short for "local or global", i.e. not user-created?
>> 
>> Yes, maybe it'd be useful to come up with a terminology for them. e.g.
>> terminal - once a task reaches a terminal DSQ, the only way that the BPF
>> scheduler can affect the task is by triggering re-enqueue (although we don't
>> yet support reenqueueing global DSQs).
>
> I like "terminal DSQ", if there's no objection I'll update the
> documentation using this terminology.

"Built-in" would also work and avoids introducing new terminology, but it
doesn't provide any insight into why these DSQs are special, whereas
"terminal" suggests there's some finality to inserting a task there.

I'm slightly leaning towards "terminal".

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 13:59       ` Kuba Piecuch
@ 2026-02-04  9:36         ` Andrea Righi
  2026-02-04  9:51           ` Kuba Piecuch
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-04  9:36 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Christian Loehle, Tejun Heo, David Vernet, Changwoo Min,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hi Kuba,

sorry for the late response.

On Mon, Feb 02, 2026 at 01:59:24PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> On Mon Feb 2, 2026 at 7:45 AM UTC, Andrea Righi wrote:
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 6d6f1253039d8..d8fed4a49195d 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> >  		} else {
> >  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > -				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> >  
> >  			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> >  		}
> 
> This looks risky from a locking perspective. Are we relying on
> SCX_OPSS_DISPATCHING to protect against racing dequeues? If so, it might
> be worth calling out in a comment.

You're right, we're relying on SCX_OPSS_DISPATCHING to protect against
racing dequeues and this definitely deserves a comment. How about something
like the following?

Thanks,
-Andrea

---
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 292adf10fee1b..b189339e74101 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2260,6 +2260,15 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 		if (!is_terminal_dsq(dsq_id)) {
 			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
 		} else {
+			/*
+			 * Locking: we're holding the @rq lock (the
+			 * dispatch CPU's rq), but not necessarily
+			 * task_rq(p), since @p may be from a remote CPU.
+			 *
+			 * This is safe because SCX_OPSS_DISPATCHING state
+			 * prevents racing dequeues, any concurrent
+			 * ops_dequeue() will wait for this state to clear.
+			 */
 			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
 				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
 

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04  9:36         ` Andrea Righi
@ 2026-02-04  9:51           ` Kuba Piecuch
  0 siblings, 0 replies; 83+ messages in thread
From: Kuba Piecuch @ 2026-02-04  9:51 UTC (permalink / raw)
  To: Andrea Righi, Kuba Piecuch
  Cc: Christian Loehle, Tejun Heo, David Vernet, Changwoo Min,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Wed Feb 4, 2026 at 9:36 AM UTC, Andrea Righi wrote:
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 292adf10fee1b..b189339e74101 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2260,6 +2260,15 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  		if (!is_terminal_dsq(dsq_id)) {
>  			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
>  		} else {
> +			/*
> +			 * Locking: we're holding the @rq lock (the
> +			 * dispatch CPU's rq), but not necessarily
> +			 * task_rq(p), since @p may be from a remote CPU.
> +			 *
> +			 * This is safe because SCX_OPSS_DISPATCHING state
> +			 * prevents racing dequeues, any concurrent
> +			 * ops_dequeue() will wait for this state to clear.
> +			 */
>  			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
>  				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);

Looks good, thanks :)


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-02 11:56   ` Kuba Piecuch
@ 2026-02-04 10:11     ` Andrea Righi
  2026-02-04 10:33       ` Kuba Piecuch
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-04 10:11 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 02, 2026 at 11:56:43AM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> Looks good overall, but we need to settle on the global DSQ semantics, plus
> some edge cases that need clearing up.

On this one I think we settled on the assumption that SCX_DSQ_GLOBAL can be
considered a "terminal DSQ", so we won't trigger ops.dequeue().

> 
> On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..6d9e82e6ca9d4 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
> >  
> >     * Queue the task on the BPF side.
> >  
> > +   **Task State Tracking and ops.dequeue() Semantics**
> > +
> > +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> > +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> > +
> > +   * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> > +     ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> > +     entirely and goes straight to the CPU's local run queue. The task
> > +     never enters BPF custody, and ``ops.dequeue()`` will not be called.
> > +
> > +   * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> > +     the task enters the BPF scheduler's custody. When the task later
> > +     leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> > +     or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> > +     called exactly once.
> > +
> > +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> > +     custody, ``ops.dequeue()`` will be called when it leaves.
> > +
> > +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> > +   scheduler's custody**. A task is in BPF custody if it's on a non-local
> > +   DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> > +   ops.dequeue() is called, the task is out of BPF custody and the BPF
> > +   scheduler no longer needs to track it.
> > +
> > +   This works correctly with the ``ops.select_cpu()`` direct dispatch
> > +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> > +   task is dispatched to a non-local DSQ, it enters BPF custody and will
> > +   get ``ops.dequeue()`` when it leaves. This provides the performance
> > +   benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> > +   correct state tracking.
> > +
> > +   The dequeue can happen for different reasons, distinguished by flags:
> > +
> > +   1. **Regular dispatch workflow**: when the task is dispatched from a
> > +      non-local DSQ to a local DSQ (leaving BPF custody for execution),
> > +      ``ops.dequeue()`` is triggered without any special flags.
> 
> Maybe add a note that this can happen asynchronously, without the BPF
> scheduler explicitly dispatching the task to a local DSQ, when the task
> is on a global DSQ? Or maybe make that case into a separate dequeue reason
> with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?

And I guess we don't need this if we consider SCX_DSQ_GLOBAL as a terminal
DSQ, because we won't trigger ops.dequeue().

> 
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..0d003d2845393 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* under ext scheduler's custody */
> 
> Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
> "ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
> as a whole.

Ack. Will change that.

> 
> > @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * Task is not in BPF data structures (either dispatched to
> > +		 * a DSQ or running). Only call ops.dequeue() if the task
> > +		 * is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_OPS_ENQUEUED is set).
> > +		 *
> > +		 * If the task has already been dispatched to a local DSQ
> > +		 * (left BPF custody), the flag will be clear and we skip
> > +		 * ops.dequeue()
> > +		 *
> > +		 * If this is a property change (not sleep/core-sched) and
> > +		 * the task is still in BPF custody, set the
> > +		 * %SCX_DEQ_SCHED_CHANGE flag.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> 
> I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
> tasks picked from a global DSQ and migrated from a remote rq to the local
> rq, which, while technically correct since the task is migrating rqs, may
> be confusing, since it fits two cases in the documentation:
> 
> * Since the task is leaving BPF custody for execution, ops.dequeue() should be
>   called without any special flags.
> * Since the task is being migrated between rqs, ops.dequeue() should be called
>   with SCX_DEQ_SCHED_CHANGE.

This should also be fixed with the new logic, because a task dispatched to a
global DSQ is considered outside the BPF scheduler's custody, so
ops.dequeue() is not invoked at all.

I'll post a new patch set later today, so we can better discuss if all
these assumptions have been addressed properly. :)

> 
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> 
> Thanks,
> Kuba

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 10:11     ` Andrea Righi
@ 2026-02-04 10:33       ` Kuba Piecuch
  0 siblings, 0 replies; 83+ messages in thread
From: Kuba Piecuch @ 2026-02-04 10:33 UTC (permalink / raw)
  To: Andrea Righi, Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed Feb 4, 2026 at 10:11 AM UTC, Andrea Righi wrote:
> On Mon, Feb 02, 2026 at 11:56:43AM +0000, Kuba Piecuch wrote:
>> Hi Andrea,
>> 
>> Looks good overall, but we need to settle on the global DSQ semantics, plus
>> some edge cases that need clearing up.
>
> On this one I think we settled on the assumption that SCX_DSQ_GLOBAL can be
> considered a "terminal DSQ", so we won't trigger ops.dequeue().

Correct, I made this comment before we settled it.

Thanks,
Kuba


^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
@ 2026-02-04 16:05 ` Andrea Righi
  2026-02-04 22:14   ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-04 16:05 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in "BPF
scheduler's custody" when it has been queued in user-created DSQs and
the BPF scheduler is responsible for its lifecycle. Custody ends when
the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
selected by core scheduling, or removed due to a property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are not in its custody. Terminal DSQs include:
 - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
   where tasks go directly to execution.
 - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
   BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks dispatched to
terminal DSQs, as the BPF scheduler no longer retains custody of them.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: a task dispatched to a user DSQ is moved to a
      terminal DSQ (ops.dequeue() called without any special flags set),
   b) core scheduling dispatch: core-sched picks the task before dispatch,
      ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
   c) property change: task properties modified before dispatch,
      ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.
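
To illustrate the intended semantics, here is a minimal user-space model of
the custody state machine described above. The DSQ ID encoding and the flag
values below are illustrative stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative DSQ ID encoding: built-in DSQs carry the top bit. */
#define DSQ_FLAG_BUILTIN	(1ULL << 63)
#define DSQ_FLAG_LOCAL_ON	(1ULL << 62)
#define DSQ_GLOBAL		(DSQ_FLAG_BUILTIN | 1)
#define DSQ_LOCAL		(DSQ_FLAG_BUILTIN | 2)
#define DSQ_LOCAL_ON		(DSQ_FLAG_BUILTIN | DSQ_FLAG_LOCAL_ON)
#define DEQ_SCHED_CHANGE	(1ULL << 33)

struct task {
	bool in_custody;	/* models SCX_TASK_OPS_ENQUEUED */
	int deq_calls;		/* ops.dequeue() invocation count */
	uint64_t last_flags;	/* deq_flags seen by the last call */
};

static bool is_terminal_dsq(uint64_t dsq_id)
{
	return dsq_id == DSQ_LOCAL ||
	       (dsq_id & DSQ_LOCAL_ON) == DSQ_LOCAL_ON ||
	       dsq_id == DSQ_GLOBAL;
}

/* Dispatch @p to @dsq_id: custody begins or ends accordingly. */
static void dispatch(struct task *p, uint64_t dsq_id)
{
	if (is_terminal_dsq(dsq_id)) {
		if (p->in_custody) {	/* leaving custody: exactly one call */
			p->deq_calls++;
			p->last_flags = 0;
		}
		p->in_custody = false;
	} else {
		p->in_custody = true;	/* user DSQ: custody begins */
	}
}

/* Scheduling property change, e.g. sched_setaffinity(). */
static void sched_change(struct task *p)
{
	if (p->in_custody) {
		p->deq_calls++;
		p->last_flags = DEQ_SCHED_CHANGE;
		p->in_custody = false;
	}
}
```

A task direct-dispatched to a terminal DSQ never triggers the dequeue
notification, while a task parked on a user DSQ gets exactly one call
whether it leaves via dispatch or via a property change.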

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         |  74 +++++++
 include/linux/sched/ext.h                     |   1 +
 kernel/sched/ext.c                            | 186 +++++++++++++++++-
 kernel/sched/ext_internal.h                   |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h   |   1 +
 7 files changed, 269 insertions(+), 3 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..1457f2aefa93e 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,78 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
+   enter the "BPF scheduler's custody" depending on where it's dispatched:
+
+   * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
+
+   * **Queued on BPF side**: The task is in BPF data structures and in BPF
+     custody; ``ops.dequeue()`` will be called when it leaves.
+
+   The key principle: **ops.dequeue() is called when a task leaves the BPF
+   scheduler's custody**.
+
+   This also works with the ``ops.select_cpu()`` direct dispatch
+   optimization: even though it skips ``ops.enqueue()`` invocation, if the
+   task is dispatched to a user-created DSQ, it enters BPF custody and will
+   get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
+   the BPF scheduler is done with it immediately. This provides the
+   performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
+   maintaining correct state tracking.
+
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch workflow**: when the task is dispatched from a
+      user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
+      ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (dispatched to a
+   terminal DSQ), property changes will not trigger ``ops.dequeue()``,
+   since the task is no longer being managed by the BPF scheduler.
+
+   **Property Change Notifications for Running Tasks**:
+
+   For tasks that have left BPF custody (running or on terminal DSQs),
+   property changes can be intercepted through the dedicated callbacks:
+
+   * ``ops.set_cpumask()``: Called when a task's CPU affinity changes
+     (e.g., via ``sched_setaffinity()``). This callback is invoked for
+     all tasks regardless of their state or BPF custody.
+
+   * ``ops.set_weight()``: Called when a task's scheduling weight/priority
+     changes (e.g., via ``sched_setscheduler()`` or ``set_user_nice()``).
+     This callback is also invoked for all tasks.
+
+   These callbacks provide complete coverage for property changes,
+   complementing ``ops.dequeue()`` which only applies to tasks in BPF
+   custody.
+
+   BPF schedulers can choose not to implement ``ops.dequeue()`` if they
+   don't need to track these transitions. The sched_ext core will safely
+   handle all dequeue operations regardless.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +391,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..8d7c13e75efec 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_OPS_ENQUEUED	= 1 << 1, /* in BPF scheduler's custody */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index afe28c04d5aa7..34ba6870d2abf 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -924,6 +924,26 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal DSQ where the BPF scheduler is
+ * considered "done" with the task. Terminal DSQs include:
+ *  - Local DSQs (SCX_DSQ_LOCAL or SCX_DSQ_LOCAL_ON): per-CPU queues where
+ *    tasks go directly to execution
+ *  - Global DSQ (SCX_DSQ_GLOBAL): the built-in fallback queue
+ *
+ * Tasks dispatched to terminal DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+	return dsq_id == SCX_DSQ_LOCAL ||
+	       (dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON ||
+	       dsq_id == SCX_DSQ_GLOBAL;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 	dsq_mod_nr(dsq, 1);
 	p->scx.dsq = dsq;
 
+	/*
+	 * Mark task as in BPF scheduler's custody if being queued to a
+	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
+	 * terminal: tasks on them have left BPF custody.
+	 *
+	 * Don't touch the flag if already set (e.g., by
+	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
+	 * for user DSQs).
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
+		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+
 	/*
 	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
 	 * direct dispatch path, but we clear them here because the direct
@@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
 
 	p->scx.ddsp_dsq_id = dsq_id;
 	p->scx.ddsp_enq_flags = enq_flags;
+
+	/*
+	 * Mark the task as entering BPF scheduler's custody if it's being
+	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
+	 * handles the case where ops.select_cpu() directly dispatches - even
+	 * though ops.enqueue() won't be called, the task enters BPF custody
+	 * if dispatched to a user DSQ and should get ops.dequeue() when it
+	 * leaves.
+	 *
+	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
+	 * is clear since the BPF scheduler is done with the task.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_terminal_dsq(dsq_id))
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		else
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+	}
 }
 
 static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
@@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 
 	p->scx.ddsp_enq_flags |= enq_flags;
 
+	/*
+	 * The task is about to be dispatched, handle ops.dequeue() based
+	 * on where the task is going.
+	 *
+	 * Key principle: ops.dequeue() is called when a task leaves the
+	 * BPF scheduler's custody. A task is in BPF custody if it's on a
+	 * user-created DSQ or in BPF data structures. Once dispatched to a
+	 * terminal DSQ (local DSQ or SCX_DSQ_GLOBAL), the BPF scheduler is
+	 * done with it.
+	 *
+	 * Direct dispatch to terminal DSQs: task never enters (or exits)
+	 * BPF scheduler's custody. If it was in custody, call ops.dequeue()
+	 * to notify the BPF scheduler. Clear the flag so future property
+	 * changes also won't trigger ops.dequeue().
+	 *
+	 * Direct dispatch to user DSQs: task enters BPF scheduler's custody.
+	 * Mark the task as in BPF custody so that when it's later dispatched
+	 * to a terminal DSQ or dequeued for property changes, ops.dequeue()
+	 * will be called.
+	 *
+	 * This also handles the ops.select_cpu() direct dispatch: the
+	 * shortcut skips ops.enqueue() but the task still enters BPF custody
+	 * if dispatched to a user DSQ, and thus needs ops.dequeue() when it
+	 * leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_terminal_dsq(dsq->id)) {
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		} else {
+			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
+	}
+
 	/*
 	 * We are in the enqueue path with @rq locked and pinned, and thus can't
 	 * double lock a remote rq and enqueue to its local DSQ. For
@@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not in BPF data structures (either dispatched to
+		 * a DSQ or running). Only call ops.dequeue() if the task
+		 * is still in BPF scheduler's custody
+		 * (%SCX_TASK_OPS_ENQUEUED is set).
+		 *
+		 * If the task has already been dispatched to a terminal
+		 * DSQ (local DSQ or SCX_DSQ_GLOBAL), it has left the BPF
+		 * scheduler's custody and the flag will be clear, so we
+		 * skip ops.dequeue().
+		 *
+		 * If this is a property change (not sleep/core-sched) and
+		 * the task is still in BPF custody, set the
+		 * %SCX_DEQ_SCHED_CHANGE flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
+			u64 flags = deq_flags;
+
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+		 * only for property changes, not for core-sched picks or
+		 * sleep.
+		 *
+		 * Clear the flag after calling ops.dequeue(): the task is
+		 * leaving BPF scheduler's custody.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
+			u64 flags = deq_flags;
+
+			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+				flags |= SCX_DEQ_SCHED_CHANGE;
+
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local DSQ. Call
+	 * ops.dequeue() if the task was in BPF custody.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 
 	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
 
+	/*
+	 * Handle ops.dequeue() based on destination DSQ.
+	 *
+	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
+	 * scheduler is done with the task. Call ops.dequeue() if it was in
+	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
+	 *
+	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
+	 * Mark it so ops.dequeue() will be called when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (!is_terminal_dsq(dsq_id)) {
+			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
+		} else {
+			/*
+			 * Locking: we're holding the @rq lock (the
+			 * dispatch CPU's rq), but not necessarily
+			 * task_rq(p), since @p may be from a remote CPU.
+			 *
+			 * This is safe because SCX_OPSS_DISPATCHING state
+			 * prevents racing dequeues; any concurrent
+			 * ops_dequeue() will wait for this state to clear.
+			 */
+			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+
+			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+		}
+	}
+
 	dsq = find_dsq_for_dispatch(sch, this_rq(), dsq_id, p);
 
 	if (dsq->id == SCX_DSQ_LOCAL)
@@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Clear enqueue/dequeue tracking flags when enabling the task.
+	 * This ensures a clean state when the task enters SCX. Only needed
+	 * if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Clear enqueue/dequeue tracking flags when disabling the task.
+	 * Only needed if ops.dequeue() is implemented.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
 }
 
 static void scx_exit_task(struct task_struct *p)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.53.0



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-04 22:14   ` Tejun Heo
  2026-02-05  9:26     ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-04 22:14 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
> 
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
> 
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.
> 
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
>  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
>    where tasks go directly to execution.
>  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
>    BPF scheduler is considered "done" with the task.
> 
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.
> 
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
> 
...
> +   **Property Change Notifications for Running Tasks**:
> +
> +   For tasks that have left BPF custody (running or on terminal DSQs),
> +   property changes can be intercepted through the dedicated callbacks:

I'm not sure this section is necessary. The way it's phrased makes it sound
like schedulers would use DEQ_SCHED_CHANGE to process property changes but
that's not the case. Relevant property changes will be notified in whatever
ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
necessarily mean there will be an associated property change event either.
e.g. we don't do anything on sched_setnuma().

> @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>  	dsq_mod_nr(dsq, 1);
>  	p->scx.dsq = dsq;
>  
> +	/*
> +	 * Mark task as in BPF scheduler's custody if being queued to a
> +	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> +	 * terminal: tasks on them have left BPF custody.
> +	 *
> +	 * Don't touch the flag if already set (e.g., by
> +	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> +	 * for user DSQs).
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;

given that this is tied to dequeue, maybe a more direct name would be less
confusing? e.g. something like SCX_TASK_NEED_DEQ?

> @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
>  
>  	p->scx.ddsp_dsq_id = dsq_id;
>  	p->scx.ddsp_enq_flags = enq_flags;
> +
> +	/*
> +	 * Mark the task as entering BPF scheduler's custody if it's being
> +	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> +	 * handles the case where ops.select_cpu() directly dispatches - even
> +	 * though ops.enqueue() won't be called, the task enters BPF custody
> +	 * if dispatched to a user DSQ and should get ops.dequeue() when it
> +	 * leaves.
> +	 *
> +	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> +	 * is clear since the BPF scheduler is done with the task.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (!is_terminal_dsq(dsq_id))
> +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		else
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +	}

Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
AND dispatch_enqueue(). The flag should be clear when off SCX. The only
places where it could be set is from the enqueue path - when a task is
direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
captured in do_enqueue_task(), no?

>  static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
...
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (!is_terminal_dsq(dsq->id)) {
> +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		} else {
> +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
> +	}

And when would direct_dispatch() need to call ops.dequeue()?
direct_dispatch() is only used from do_enqueue_task() and there can only be
one direct dispatch attempt on any given enqueue event. A task being
enqueued shouldn't have OPS_ENQUEUED set and would get dispatched once
to either a terminal or non-terminal DSQ. If terminal, there's nothing to
do. If non-terminal, the flag would need to be set. Am I missing something?
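
IOW, something along these lines (user-space sketch with made-up helper
names, just to illustrate the invariant, not actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * need_deq models SCX_TASK_OPS_ENQUEUED: set only on the enqueue path
 * when the task lands somewhere non-terminal, and cleared (with one
 * ops.dequeue() call) whenever custody ends. The enqueue path asserts
 * the flag is clear on entry, which is what the WARN_ON_ONCE() would
 * check.
 */
static bool need_deq;
static int deq_calls;

static void enqueue_task(bool to_terminal_dsq)
{
	assert(!need_deq);		/* would be WARN_ON_ONCE() */
	if (!to_terminal_dsq)
		need_deq = true;	/* only place the flag is set */
}

static void leave_custody(void)
{
	if (need_deq) {
		deq_calls++;		/* ops.dequeue() fires exactly once */
		need_deq = false;
	}
}
```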

> @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
...
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {

nit: () around & expression.

> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> +		 * only for property changes, not for core-sched picks or
> +		 * sleep.
> +		 *
> +		 * Clear the flag after calling ops.dequeue(): the task is
> +		 * leaving BPF scheduler's custody.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
> +			u64 flags = deq_flags;
> +
> +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +				flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;

I wonder whether this and the above block can be factored somehow.

> @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  					 struct scx_dispatch_q *src_dsq,
>  					 struct rq *dst_rq)
>  {
> +	struct scx_sched *sch = scx_root;
>  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>  
>  	/* @dsq is locked and @p is on @dst_rq */
> @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  
>  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>  
> +	/*
> +	 * Task is moving from a non-local DSQ to a local DSQ. Call
> +	 * ops.dequeue() if the task was in BPF custody.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +	}
> +
>  	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
>  		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
>  	else
> @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  
>  	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
>  
> +	/*
> +	 * Handle ops.dequeue() based on destination DSQ.
> +	 *
> +	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> +	 * scheduler is done with the task. Call ops.dequeue() if it was in
> +	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> +	 *
> +	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
> +	 * Mark it so ops.dequeue() will be called when it leaves.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (!is_terminal_dsq(dsq_id)) {
> +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> +		} else {

Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
}". Continuing from earlier, I don't understand why we'd need to set
OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
can't think of conditions where we'd need to set OPS_ENQUEUED from
ops.dispatch().

> +			/*
> +			 * Locking: we're holding the @rq lock (the
> +			 * dispatch CPU's rq), but not necessarily
> +			 * task_rq(p), since @p may be from a remote CPU.
> +			 *
> +			 * This is safe because SCX_OPSS_DISPATCHING state
> +			 * prevents racing dequeues, any concurrent
> +			 * ops_dequeue() will wait for this state to clear.
> +			 */
> +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> +
> +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +		}
> +	}

I'm not sure finish_dispatch() is the right place to do this. e.g.
scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
the above wouldn't cover it. Wouldn't it make more sense to do this in
dispatch_enqueue()?

> @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
>  
>  	lockdep_assert_rq_held(rq);
>  
> +	/*
> +	 * Clear enqueue/dequeue tracking flags when enabling the task.
> +	 * This ensures a clean state when the task enters SCX. Only needed
> +	 * if ops.dequeue() is implemented.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> +
>  	/*
>  	 * Set the weight before calling ops.enable() so that the scheduler
>  	 * doesn't see a stale value if they inspect the task struct.
> @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
>  	if (SCX_HAS_OP(sch, disable))
>  		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
>  	scx_set_task_state(p, SCX_TASK_READY);
> +
> +	/*
> +	 * Clear enqueue/dequeue tracking flags when disabling the task.
> +	 * Only needed if ops.dequeue() is implemented.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;

If we make the flag transitions consistent, we shouldn't need these, right?
We can add WARN_ON_ONCE() at the head of enqueue maybe.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-04 22:14   ` Tejun Heo
@ 2026-02-05  9:26     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-05  9:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hi Tejun,

On Wed, Feb 04, 2026 at 12:14:40PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> > 
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> > 
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
> > 
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> >  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> >    where tasks go directly to execution.
> >  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> >    BPF scheduler is considered "done" with the task.
> > 
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
> > 
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> > 
> ...
> > +   **Property Change Notifications for Running Tasks**:
> > +
> > +   For tasks that have left BPF custody (running or on terminal DSQs),
> > +   property changes can be intercepted through the dedicated callbacks:
> 
> I'm not sure this section is necessary. The way it's phrased makes it sound
> like schedulers would use DEQ_SCHED_CHANGE to process property changes but
> that's not the case. Relevant property changes will be notified in whatever
> ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
> necessarily mean there will be an associated property change event either.
> e.g. We don't do anything re. on sched_setnuma().

Agreed, this section is a bit misleading. DEQ_SCHED_CHANGE is an
informational flag indicating that the ops.dequeue() wasn't triggered by a
dispatch; schedulers shouldn't use it to process property changes. I'll
remove the section.

> 
> > @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> >  	dsq_mod_nr(dsq, 1);
> >  	p->scx.dsq = dsq;
> >  
> > +	/*
> > +	 * Mark task as in BPF scheduler's custody if being queued to a
> > +	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> > +	 * terminal: tasks on them have left BPF custody.
> > +	 *
> > +	 * Don't touch the flag if already set (e.g., by
> > +	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> > +	 * for user DSQs).
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> > +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> 
> given that this is tied to dequeue, maybe a more direct name would be less
> confusing? e.g. something like SCX_TASK_NEED_DEQ?

Ack.

> 
> > @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
> >  
> >  	p->scx.ddsp_dsq_id = dsq_id;
> >  	p->scx.ddsp_enq_flags = enq_flags;
> > +
> > +	/*
> > +	 * Mark the task as entering BPF scheduler's custody if it's being
> > +	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> > +	 * handles the case where ops.select_cpu() directly dispatches - even
> > +	 * though ops.enqueue() won't be called, the task enters BPF custody
> > +	 * if dispatched to a user DSQ and should get ops.dequeue() when it
> > +	 * leaves.
> > +	 *
> > +	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> > +	 * is clear since the BPF scheduler is done with the task.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id))
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		else
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> 
> Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
> AND dispatch_enqueue(). The flag should be clear when off SCX. The only
> places where it could be set is from the enqueue path - when a task is
> direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
> captured in do_enqueue_task(), no?

You're right. I was incorrectly assuming we needed this in
mark_direct_dispatch() to catch direct dispatches to user DSQs from
ops.select_cpu(), but that's not true. All paths go through
do_enqueue_task() which funnels to dispatch_enqueue(), so we can handle it
all in one place.

> 
> >  static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> > @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> ...
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq->id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
> 
> And when would direct_dispatch() need to call ops.dequeue()?
> direct_dispatch() is only used from do_enqueue_task() and there can only be
> one direct dispatch attempt on any given enqueue event. A task being
> enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
> to either a terminal or non-terminal DSQ. If terminal, there's nothing to
> do. If non-terminal, the flag would need to be set. Am I missing something?

Nah, you're right, direct_dispatch() doesn't need to call ops.dequeue() or
manage the flag. I'll remove all the flag management from direct_dispatch()
and centralize it in dispatch_enqueue().

> 
> > @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> ...
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> 
> nit: () around & expression.
> 
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> > @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  		 */
> >  		BUG();
> >  	case SCX_OPSS_QUEUED:
> > -		if (SCX_HAS_OP(sch, dequeue))
> > -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > -					 p, deq_flags);
> > +		/*
> > +		 * Task is still on the BPF scheduler (not dispatched yet).
> > +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > +		 * only for property changes, not for core-sched picks or
> > +		 * sleep.
> > +		 *
> > +		 * Clear the flag after calling ops.dequeue(): the task is
> > +		 * leaving BPF scheduler's custody.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue)) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> 
> I wonder whether this and the above block can be factored somehow.

Ack, we can add a helper for this.

> 
> > @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  					 struct scx_dispatch_q *src_dsq,
> >  					 struct rq *dst_rq)
> >  {
> > +	struct scx_sched *sch = scx_root;
> >  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >  
> >  	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  
> >  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >  
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local DSQ. Call
> > +	 * ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> > +
> >  	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> >  		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> >  	else
> > @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >  
> >  	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
> >  
> > +	/*
> > +	 * Handle ops.dequeue() based on destination DSQ.
> > +	 *
> > +	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> > +	 * scheduler is done with the task. Call ops.dequeue() if it was in
> > +	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> > +	 *
> > +	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
> > +	 * Mark it so ops.dequeue() will be called when it leaves.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> 
> Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
> }". Continuing from earlier, I don't understand why we'd need to set
> OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
> can't think of conditions where we'd need to set OPS_ENQUEUED from
> ops.dispatch().

Right, a task that reaches ops.dispatch() is already in QUEUED state; if
it's in a user DSQ, the flag is already set from when it was enqueued, so
there's no need to set it again in finish_dispatch().

> 
> > +			/*
> > +			 * Locking: we're holding the @rq lock (the
> > +			 * dispatch CPU's rq), but not necessarily
> > +			 * task_rq(p), since @p may be from a remote CPU.
> > +			 *
> > +			 * This is safe because SCX_OPSS_DISPATCHING state
> > +			 * prevents racing dequeues, any concurrent
> > +			 * ops_dequeue() will wait for this state to clear.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
> 
> I'm not sure finish_dispatch() is the right place to do this. e.g.
> scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
> the above wouldn't cover it. Wouldn't it make more sense to do this in
> dispatch_enqueue()?

Agreed.

> 
> > @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
> >  
> >  	lockdep_assert_rq_held(rq);
> >  
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when enabling the task.
> > +	 * This ensures a clean state when the task enters SCX. Only needed
> > +	 * if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +
> >  	/*
> >  	 * Set the weight before calling ops.enable() so that the scheduler
> >  	 * doesn't see a stale value if they inspect the task struct.
> > @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> >  	if (SCX_HAS_OP(sch, disable))
> >  		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> >  	scx_set_task_state(p, SCX_TASK_READY);
> > +
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when disabling the task.
> > +	 * Only needed if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> 
> If we make the flag transitions consistent, we shouldn't need these, right?
> We can add WARN_ON_ONCE() at the head of enqueue maybe.

Correct.

Thanks for the review! I'll post a new version.

-Andrea


* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
@ 2026-02-05 15:32 ` Andrea Righi
  2026-02-05 19:29   ` Kuba Piecuch
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-05 15:32 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in "BPF
scheduler's custody" when it has been queued in user-created DSQs and
the BPF scheduler is responsible for its lifecycle. Custody ends when
the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
selected by core scheduling, or removed due to a property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are not in its custody. Terminal DSQs include:
 - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
   where tasks go directly to execution.
 - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
   BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks dispatched to
terminal DSQs, as the BPF scheduler no longer retains custody of them.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: a task dispatched to a user DSQ is moved to a
      terminal DSQ (ops.dequeue() called without any special flags set),
   b) core scheduling dispatch: core-sched picks task before dispatch,
      ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
   c) property change: task properties modified before dispatch,
      ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         |  53 +++++++
 include/linux/sched/ext.h                     |   1 +
 kernel/sched/ext.c                            | 130 ++++++++++++++++--
 kernel/sched/ext_internal.h                   |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h   |   1 +
 7 files changed, 182 insertions(+), 13 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..ccd1fad3b3b92 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
+   enter the "BPF scheduler's custody" depending on where it's dispatched:
+
+   * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
+
+   * **Queued on BPF side**: The task is in BPF data structures and in BPF
+     custody, ``ops.dequeue()`` will be called when it leaves.
+
+   The key principle: **ops.dequeue() is called when a task leaves the BPF
+   scheduler's custody**.
+
+   This also works with the ``ops.select_cpu()`` direct dispatch
+   optimization: even though it skips ``ops.enqueue()`` invocation, if the
+   task is dispatched to a user-created DSQ, it enters BPF custody and will
+   get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
+   the BPF scheduler is done with it immediately. This provides the
+   performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
+   maintaining correct state tracking.
+
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch workflow**: when the task is dispatched from a
+      user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
+      ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (dispatched to a
+   terminal DSQ), property changes will not trigger ``ops.dequeue()``,
+   since the task is no longer being managed by the BPF scheduler.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +370,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..35a88942810b4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_NEED_DEQ	= 1 << 1, /* task needs ops.dequeue() */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bb8fa927e9e9..9ebca357196b4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
+ * scheduler is considered "done" with the task.
+ *
+ * Builtin DSQs include:
+ *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
+ *    where tasks go directly to execution,
+ *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
+ *  - Bypass DSQ: used during bypass mode.
+ *
+ * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
 		resched_curr(rq);
 }
 
-static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+			     struct scx_dispatch_q *dsq,
 			     struct task_struct *p, u64 enq_flags)
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 	dsq_mod_nr(dsq, 1);
 	p->scx.dsq = dsq;
 
+	/*
+	 * Handle ops.dequeue() and custody tracking.
+	 *
+	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
+	 * scheduler is done with the task. If it was in BPF custody, call
+	 * ops.dequeue() and clear the flag.
+	 *
+	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
+	 * ops.dequeue() will be called when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (is_terminal_dsq(dsq->id)) {
+			if (p->scx.flags & SCX_TASK_NEED_DEQ)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
+						 rq, p, 0);
+			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+		} else {
+			p->scx.flags |= SCX_TASK_NEED_DEQ;
+		}
+	}
+
 	/*
 	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
 	 * direct dispatch path, but we clear them here because the direct
@@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 		return;
 	}
 
-	dispatch_enqueue(sch, dsq, p,
+	dispatch_enqueue(sch, rq, dsq, p,
 			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
@@ -1413,7 +1456,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	direct_dispatch(sch, p, enq_flags);
 	return;
 local_norefill:
-	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
 	return;
 local:
 	dsq = &rq->scx.local_dsq;
@@ -1433,7 +1476,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
-	dispatch_enqueue(sch, dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
@@ -1511,6 +1554,18 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
 }
 
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+			      struct task_struct *p, u64 deq_flags)
+{
+	u64 flags = deq_flags;
+
+	if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+		flags |= SCX_DEQ_SCHED_CHANGE;
+
+	SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+	p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+}
+
 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 {
 	struct scx_sched *sch = scx_root;
@@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not in BPF data structures (either dispatched to
+		 * a DSQ or running). Only call ops.dequeue() if the task
+		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
+		 * is set).
+		 *
+		 * If the task has already been dispatched to a terminal
+		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
+		 * scheduler's custody and the flag will be clear, so we
+		 * skip ops.dequeue().
+		 *
+		 * If this is a property change (not sleep/core-sched) and
+		 * the task is still in BPF custody, set the
+		 * %SCX_DEQ_SCHED_CHANGE flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    (p->scx.flags & SCX_TASK_NEED_DEQ))
+			call_task_dequeue(sch, rq, p, deq_flags);
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
+		 * only for property changes, not for core-sched picks or
+		 * sleep.
+		 */
 		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+			call_task_dequeue(sch, rq, p, deq_flags);
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1631,6 +1709,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1639,6 +1718,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+	 * Call ops.dequeue() if the task was in BPF custody.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -1879,7 +1967,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dispatch_dequeue_locked(p, src_dsq);
 		raw_spin_unlock(&src_dsq->lock);
 
-		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
+		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
 	}
 
 	return dst_rq;
@@ -1969,14 +2057,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 * If dispatching to @rq that @p is already on, no lock dancing needed.
 	 */
 	if (rq == src_rq && rq == dst_rq) {
-		dispatch_enqueue(sch, dst_dsq, p,
+		dispatch_enqueue(sch, rq, dst_dsq, p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
 
 	if (src_rq != dst_rq &&
 	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
+		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
@@ -2014,7 +2102,7 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		 */
 		if (src_rq == dst_rq) {
 			p->scx.holding_cpu = -1;
-			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
 					 enq_flags);
 		} else {
 			move_remote_task_to_local_dsq(p, enq_flags,
@@ -2113,7 +2201,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	if (dsq->id == SCX_DSQ_LOCAL)
 		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	else
-		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
@@ -2414,7 +2502,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 * DSQ.
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
-			dispatch_enqueue(sch, &rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
 					 SCX_ENQ_HEAD);
 			goto switch_class;
 		}
@@ -2898,6 +2986,14 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2929,6 +3025,14 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
 }
 
 static void scx_exit_task(struct task_struct *p)
@@ -3919,7 +4023,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
 		 * between bypass DSQs.
 		 */
 		dispatch_dequeue_locked(p, donor_dsq);
-		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+		dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
 
 		/*
 		 * $donee might have been idle and need to be woken up. No need
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.53.0



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-05 19:29   ` Kuba Piecuch
  2026-02-05 21:32     ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Kuba Piecuch @ 2026-02-05 19:29 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Hi Andrea,

On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in "BPF
> scheduler's custody" when it has been queued in user-created DSQs and
> the BPF scheduler is responsible for its lifecycle. Custody ends when
> the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> selected by core scheduling, or removed due to a property change.

Strictly speaking, a task in BPF scheduler custody doesn't have to be queued
in a user-created DSQ. It could just reside on some custom data structure.

>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are not in its custody. Terminal DSQs include:
>  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
>    where tasks go directly to execution.
>  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
>    BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks dispatched to
> terminal DSQs, as the BPF scheduler no longer retains custody of them.

Shouldn't it be "directly dispatched to terminal DSQs"?

>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> New ops.dequeue() semantics:
>  - ops.dequeue() is invoked exactly once when the task leaves the BPF
>    scheduler's custody, in one of the following cases:
>    a) regular dispatch: a task dispatched to a user DSQ is moved to a
>       terminal DSQ (ops.dequeue() called without any special flags set),

I don't think the task has to be on a user DSQ. How about just "a task in BPF
scheduler's custody is dispatched to a terminal DSQ from ops.dispatch()"?

>    b) core scheduling dispatch: core-sched picks task before dispatch,
>       ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
>    c) property change: task properties modified before dispatch,
>       ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.
>
> This allows BPF schedulers to:
>  - reliably track task ownership and lifecycle,
>  - maintain accurate accounting of managed tasks,
>  - update internal state when tasks change properties.
>
...
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..ccd1fad3b3b92 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> +   * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
> +     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> +     is done with the task - it either goes straight to a CPU's local run
> +     queue or to the global DSQ as a fallback. The task never enters (or
> +     exits) BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> +     BPF scheduler's custody. When the task later leaves BPF custody
> +     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> +     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> +
> +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> +     custody, ``ops.dequeue()`` will be called when it leaves.
> +
> +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> +   scheduler's custody**.
> +
> +   This works also with the ``ops.select_cpu()`` direct dispatch
> +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> +   task is dispatched to a user-created DSQ, it enters BPF custody and will
> +   get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
> +   the BPF scheduler is done with it immediately. This provides the
> +   performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
> +   maintaining correct state tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch workflow**: when the task is dispatched from a
> +      user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
> +      ``ops.dequeue()`` is triggered without any special flags.

There's no requirement for the task to be on a user-created DSQ.

> +
> +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> +      core scheduling picks a task for execution while it's still in BPF
> +      custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> +   3. **Scheduling property change**: when a task property changes (via
> +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> +      priority changes, CPU migrations, etc.) while the task is still in
> +      BPF custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> +   **Important**: Once a task has left BPF custody (dispatched to a
> +   terminal DSQ), property changes will not trigger ``ops.dequeue()``,
> +   since the task is no longer being managed by the BPF scheduler.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
...
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..35a88942810b4 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_NEED_DEQ	= 1 << 1, /* task needs ops.dequeue() */

I think this could use a comment that connects this flag to the concept of
BPF custody, so how about something like "task is in BPF custody, needs
ops.dequeue() when leaving it"?

>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bb8fa927e9e9..9ebca357196b4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
...
> @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>       dsq_mod_nr(dsq, 1);
>       p->scx.dsq = dsq;
>
> +     /*
> +      * Handle ops.dequeue() and custody tracking.
> +      *
> +      * Builtin DSQs (local, global, bypass) are terminal: the BPF
> +      * scheduler is done with the task. If it was in BPF custody, call
> +      * ops.dequeue() and clear the flag.
> +      *
> +      * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> +      * ops.dequeue() will be called when it leaves.
> +      */
> +     if (SCX_HAS_OP(sch, dequeue)) {
> +             if (is_terminal_dsq(dsq->id)) {
> +                     if (p->scx.flags & SCX_TASK_NEED_DEQ)
> +                             SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> +                                              rq, p, 0);
> +                     p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +             } else {
> +                     p->scx.flags |= SCX_TASK_NEED_DEQ;
> +             }
> +     }
> +

This is the only place where I see SCX_TASK_NEED_DEQ being set, which means
it won't be set if the enqueued task is queued on the BPF scheduler's internal
data structures rather than dispatched to a user-created DSQ. I don't think
that's the behavior we're aiming for.

> @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> +		 * is set).
> +		 *
> +		 * If the task has already been dispatched to a terminal
> +		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> +		 * scheduler's custody and the flag will be clear, so we
> +		 * skip ops.dequeue().
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    (p->scx.flags & SCX_TASK_NEED_DEQ))
> +			call_task_dequeue(sch, rq, p, deq_flags);
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> +		 * only for property changes, not for core-sched picks or
> +		 * sleep.
> +		 */

The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in
call_task_dequeue(), not here.

>  		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +			call_task_dequeue(sch, rq, p, deq_flags);

How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in
call_task_dequeue()?

Thanks,
Kuba

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-05 19:29   ` Kuba Piecuch
@ 2026-02-05 21:32     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-05 21:32 UTC (permalink / raw)
  To: Kuba Piecuch
  Cc: Tejun Heo, David Vernet, Changwoo Min, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hi Kuba,

On Thu, Feb 05, 2026 at 07:29:42PM +0000, Kuba Piecuch wrote:
> Hi Andrea,
> 
> On Thu Feb 5, 2026 at 3:32 PM UTC, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> >
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> >
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
> 
> Strictly speaking, a task in BPF scheduler custody doesn't have to be queued
> in a user-created DSQ. It could just reside on some custom data structure.

Yeah... we definitely need to consider internal BPF queues.

> 
> >
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> >  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> >    where tasks go directly to execution.
> >  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> >    BPF scheduler is considered "done" with the task.
> >
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
> 
> Shouldn't it be "directly dispatched to terminal DSQs"?

Ack.

> 
> >
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> >
> > New ops.dequeue() semantics:
> >  - ops.dequeue() is invoked exactly once when the task leaves the BPF
> >    scheduler's custody, in one of the following cases:
> >    a) regular dispatch: a task dispatched to a user DSQ is moved to a
> >       terminal DSQ (ops.dequeue() called without any special flags set),
> 
> I don't think the task has to be on a user DSQ. How about just "a task in BPF
> scheduler's custody is dispatched to a terminal DSQ from ops.dispatch()"?

Right.

> 
> >    b) core scheduling dispatch: core-sched picks task before dispatch,
> >       ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set,
> >    c) property change: task properties modified before dispatch,
> >       ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set.
> >
> > This allows BPF schedulers to:
> >  - reliably track task ownership and lifecycle,
> >  - maintain accurate accounting of managed tasks,
> >  - update internal state when tasks change properties.
> >
> ...
> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> > index 404fe6126a769..ccd1fad3b3b92 100644
> > --- a/Documentation/scheduler/sched-ext.rst
> > +++ b/Documentation/scheduler/sched-ext.rst
> > @@ -252,6 +252,57 @@ The following briefly shows how a waking task is scheduled and executed.
> >  
> >     * Queue the task on the BPF side.
> >  
> > +   **Task State Tracking and ops.dequeue() Semantics**
> > +
> > +   Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> > +   enter the "BPF scheduler's custody" depending on where it's dispatched:
> > +
> > +   * **Direct dispatch to terminal DSQs** (``SCX_DSQ_LOCAL``,
> > +     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> > +     is done with the task - it either goes straight to a CPU's local run
> > +     queue or to the global DSQ as a fallback. The task never enters (or
> > +     exits) BPF custody, and ``ops.dequeue()`` will not be called.
> > +
> > +   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> > +     BPF scheduler's custody. When the task later leaves BPF custody
> > +     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> > +     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> > +
> > +   * **Queued on BPF side**: The task is in BPF data structures and in BPF
> > +     custody, ``ops.dequeue()`` will be called when it leaves.
> > +
> > +   The key principle: **ops.dequeue() is called when a task leaves the BPF
> > +   scheduler's custody**.
> > +
> > +   This works also with the ``ops.select_cpu()`` direct dispatch
> > +   optimization: even though it skips ``ops.enqueue()`` invocation, if the
> > +   task is dispatched to a user-created DSQ, it enters BPF custody and will
> > +   get ``ops.dequeue()`` when it leaves. If dispatched to a terminal DSQ,
> > +   the BPF scheduler is done with it immediately. This provides the
> > +   performance benefit of avoiding the ``ops.enqueue()`` roundtrip while
> > +   maintaining correct state tracking.
> > +
> > +   The dequeue can happen for different reasons, distinguished by flags:
> > +
> > +   1. **Regular dispatch workflow**: when the task is dispatched from a
> > +      user-created DSQ to a terminal DSQ (leaving BPF custody for execution),
> > +      ``ops.dequeue()`` is triggered without any special flags.
> 
> There's no requirement for the task to be on a user-created DSQ.

Ditto.

> 
> > +
> > +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> > +      core scheduling picks a task for execution while it's still in BPF
> > +      custody, ``ops.dequeue()`` is called with the
> > +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> > +
> > +   3. **Scheduling property change**: when a task property changes (via
> > +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> > +      priority changes, CPU migrations, etc.) while the task is still in
> > +      BPF custody, ``ops.dequeue()`` is called with the
> > +      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> > +
> > +   **Important**: Once a task has left BPF custody (dispatched to a
> > +   terminal DSQ), property changes will not trigger ``ops.dequeue()``,
> > +   since the task is no longer being managed by the BPF scheduler.
> > +
> >  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
> >     empty, it then looks at the global DSQ. If there still isn't a task to
> >     run, ``ops.dispatch()`` is invoked which can use the following two
> ...
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..35a88942810b4 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	SCX_TASK_NEED_DEQ	= 1 << 1, /* task needs ops.dequeue() */
> 
> I think this could use a comment that connects this flag to the concept of
> BPF custody, so how about something like "task is in BPF custody, needs
> ops.dequeue() when leaving it"?

Ack.

> 
> >  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> >  
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0bb8fa927e9e9..9ebca357196b4 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> ...
> > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> >       dsq_mod_nr(dsq, 1);
> >       p->scx.dsq = dsq;
> >
> > +     /*
> > +      * Handle ops.dequeue() and custody tracking.
> > +      *
> > +      * Builtin DSQs (local, global, bypass) are terminal: the BPF
> > +      * scheduler is done with the task. If it was in BPF custody, call
> > +      * ops.dequeue() and clear the flag.
> > +      *
> > +      * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> > +      * ops.dequeue() will be called when it leaves.
> > +      */
> > +     if (SCX_HAS_OP(sch, dequeue)) {
> > +             if (is_terminal_dsq(dsq->id)) {
> > +                     if (p->scx.flags & SCX_TASK_NEED_DEQ)
> > +                             SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> > +                                              rq, p, 0);
> > +                     p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> > +             } else {
> > +                     p->scx.flags |= SCX_TASK_NEED_DEQ;
> > +             }
> > +     }
> > +
> 
> This is the only place where I see SCX_TASK_NEED_DEQ being set, which means
> it won't be set if the enqueued task is queued on the BPF scheduler's internal
> data structures rather than dispatched to a user-created DSQ. I don't think
> that's the behavior we're aiming for.

Right, I'll implement the right behavior (calling ops.dequeue()) for tasks
stored in internal BPF queues.

> 
> > @@ -1524,6 +1579,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * Task is not in BPF data structures (either dispatched to
> > +		 * a DSQ or running). Only call ops.dequeue() if the task
> > +		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> > +		 * is set).
> > +		 *
> > +		 * If the task has already been dispatched to a terminal
> > +		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> > +		 * scheduler's custody and the flag will be clear, so we
> > +		 * skip ops.dequeue().
> > +		 *
> > +		 * If this is a property change (not sleep/core-sched) and
> > +		 * the task is still in BPF custody, set the
> > +		 * %SCX_DEQ_SCHED_CHANGE flag.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    (p->scx.flags & SCX_TASK_NEED_DEQ))
> > +			call_task_dequeue(sch, rq, p, deq_flags);
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> > @@ -1532,9 +1605,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  		 */
> >  		BUG();
> >  	case SCX_OPSS_QUEUED:
> > +		/*
> > +		 * Task is still on the BPF scheduler (not dispatched yet).
> > +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > +		 * only for property changes, not for core-sched picks or
> > +		 * sleep.
> > +		 */
> 
> The part of the comment about SCX_DEQ_SCHED_CHANGE looks like it belongs in
> call_task_dequeue(), not here.

Ack.

> 
> >  		if (SCX_HAS_OP(sch, dequeue))
> > -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > -					 p, deq_flags);
> > +			call_task_dequeue(sch, rq, p, deq_flags);
> 
> How about adding WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ)) here or in
> call_task_dequeue()?

Ack.

Thanks for the review!

-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-06 13:54 [PATCHSET v7] sched_ext: Fix " Andrea Righi
@ 2026-02-06 13:54 ` Andrea Righi
  2026-02-06 20:35   ` Emil Tsalapatis
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-06 13:54 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures. Custody ends when the
task is dispatched to a terminal DSQ (such as the local DSQ or
%SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a
property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
 - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
   where tasks go directly to execution.
 - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
   BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: a task dispatched to a user DSQ or stored in
      internal BPF data structures is moved to a terminal DSQ
      (ops.dequeue() called without any special flags set),
   b) core scheduling dispatch: core-sched picks task before dispatch
      (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
   c) property change: task properties modified before dispatch,
      (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         |  58 +++++++
 include/linux/sched/ext.h                     |   1 +
 kernel/sched/ext.c                            | 157 ++++++++++++++++--
 kernel/sched/ext_internal.h                   |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h   |   1 +
 7 files changed, 213 insertions(+), 14 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..fe8c59b0c1477 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -252,6 +252,62 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   A task is in the "BPF scheduler's custody" when the BPF scheduler is
+   responsible for managing its lifecycle. That includes tasks dispatched
+   to user-created DSQs or stored in the BPF scheduler's internal data
+   structures. Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called,
+   the task may or may not enter custody depending on what the scheduler
+   does:
+
+   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
+
+   * **Queued on BPF side** (e.g., internal queues, no DSQ): The task is in
+     BPF custody. ``ops.dequeue()`` will be called when it leaves (e.g.
+     when ``ops.dispatch()`` moves it to a terminal DSQ, or on property
+     change / sleep).
+
+   **NOTE**: this concept is valid also with the ``ops.select_cpu()``
+   direct dispatch optimization. Even though it skips ``ops.enqueue()``
+   invocation, if the task is dispatched to a user-created DSQ or internal
+   BPF structure, it enters BPF custody and will get ``ops.dequeue()`` when
+   it leaves. If dispatched to a terminal DSQ, the BPF scheduler is done
+   with it immediately. This provides the performance benefit of avoiding
+   the ``ops.enqueue()`` roundtrip while maintaining correct state
+   tracking.
+
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
+      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
+      execution), ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (e.g. after being
+   dispatched to a terminal DSQ), property changes will not trigger
+   ``ops.dequeue()``, since the task is no longer being managed by the BPF
+   scheduler.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +375,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..c48f818eee9b8 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_NEED_DEQ	= 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bb8fa927e9e9..d17fd9141adf4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
+ * scheduler is considered "done" with the task.
+ *
+ * Builtin DSQs include:
+ *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
+ *    where tasks go directly to execution,
+ *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
+ *  - Bypass DSQ: used during bypass mode.
+ *
+ * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
 		resched_curr(rq);
 }
 
-static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+			     struct scx_dispatch_q *dsq,
 			     struct task_struct *p, u64 enq_flags)
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 	dsq_mod_nr(dsq, 1);
 	p->scx.dsq = dsq;
 
+	/*
+	 * Handle ops.dequeue() and custody tracking.
+	 *
+	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
+	 * scheduler is done with the task. If it was in BPF custody, call
+	 * ops.dequeue() and clear the flag.
+	 *
+	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
+	 * ops.dequeue() will be called when it leaves.
+	 */
+	if (SCX_HAS_OP(sch, dequeue)) {
+		if (is_terminal_dsq(dsq->id)) {
+			if (p->scx.flags & SCX_TASK_NEED_DEQ)
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
+						 rq, p, 0);
+			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+		} else {
+			p->scx.flags |= SCX_TASK_NEED_DEQ;
+		}
+	}
+
 	/*
 	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
 	 * direct dispatch path, but we clear them here because the direct
@@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 		return;
 	}
 
-	dispatch_enqueue(sch, dsq, p,
+	dispatch_enqueue(sch, rq, dsq, p,
 			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
@@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 * dequeue may be waiting. The store_release matches their load_acquire.
 	 */
 	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+
+	/*
+	 * Task is now in BPF scheduler's custody (queued on BPF internal
+	 * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called
+	 * when it leaves custody (e.g. dispatched to a terminal DSQ or on
+	 * property change).
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		p->scx.flags |= SCX_TASK_NEED_DEQ;
 	return;
 
 direct:
 	direct_dispatch(sch, p, enq_flags);
 	return;
 local_norefill:
-	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
 	return;
 local:
 	dsq = &rq->scx.local_dsq;
@@ -1433,7 +1485,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
-	dispatch_enqueue(sch, dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
@@ -1511,6 +1563,22 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
 }
 
+/*
+ * Call ops.dequeue() for a task leaving BPF custody. Adds %SCX_DEQ_SCHED_CHANGE
+ * when the dequeue is due to a property change (not sleep or core-sched pick).
+ */
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+			      struct task_struct *p, u64 deq_flags)
+{
+	u64 flags = deq_flags;
+
+	if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+		flags |= SCX_DEQ_SCHED_CHANGE;
+
+	SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
+	p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+}
+
 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 {
 	struct scx_sched *sch = scx_root;
@@ -1524,6 +1592,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * Task is not in BPF data structures (either dispatched to
+		 * a DSQ or running). Only call ops.dequeue() if the task
+		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
+		 * is set).
+		 *
+		 * If the task has already been dispatched to a terminal
+		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
+		 * scheduler's custody and the flag will be clear, so we
+		 * skip ops.dequeue().
+		 *
+		 * If this is a property change (not sleep/core-sched) and
+		 * the task is still in BPF custody, set the
+		 * %SCX_DEQ_SCHED_CHANGE flag.
+		 */
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    (p->scx.flags & SCX_TASK_NEED_DEQ))
+			call_task_dequeue(sch, rq, p, deq_flags);
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1532,9 +1618,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is still on the BPF scheduler (not dispatched yet).
+		 * Call ops.dequeue() to notify it is leaving BPF custody.
+		 */
+		if (SCX_HAS_OP(sch, dequeue)) {
+			WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ));
+			call_task_dequeue(sch, rq, p, deq_flags);
+		}
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1631,6 +1722,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1639,6 +1731,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+	 * Call ops.dequeue() if the task was in BPF custody.
+	 */
+	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) {
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -1879,7 +1980,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dispatch_dequeue_locked(p, src_dsq);
 		raw_spin_unlock(&src_dsq->lock);
 
-		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
+		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
 	}
 
 	return dst_rq;
@@ -1969,14 +2070,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 * If dispatching to @rq that @p is already on, no lock dancing needed.
 	 */
 	if (rq == src_rq && rq == dst_rq) {
-		dispatch_enqueue(sch, dst_dsq, p,
+		dispatch_enqueue(sch, rq, dst_dsq, p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
 
 	if (src_rq != dst_rq &&
 	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
+		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
@@ -2014,9 +2115,21 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		 */
 		if (src_rq == dst_rq) {
 			p->scx.holding_cpu = -1;
-			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
 					 enq_flags);
 		} else {
+			/*
+			 * Moving to a remote local DSQ. dispatch_enqueue() is
+			 * not used (we go through deactivate/activate), so
+			 * call ops.dequeue() here if the task was in BPF
+			 * custody.
+			 */
+			if (SCX_HAS_OP(sch, dequeue) &&
+			    (p->scx.flags & SCX_TASK_NEED_DEQ)) {
+				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
+						 src_rq, p, 0);
+				p->scx.flags &= ~SCX_TASK_NEED_DEQ;
+			}
 			move_remote_task_to_local_dsq(p, enq_flags,
 						      src_rq, dst_rq);
 			/* task has been moved to dst_rq, which is now locked */
@@ -2113,7 +2226,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	if (dsq->id == SCX_DSQ_LOCAL)
 		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	else
-		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
@@ -2414,7 +2527,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 * DSQ.
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
-			dispatch_enqueue(sch, &rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
 					 SCX_ENQ_HEAD);
 			goto switch_class;
 		}
@@ -2898,6 +3011,14 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2929,6 +3050,14 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	if (SCX_HAS_OP(sch, dequeue))
+		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
 }
 
 static void scx_exit_task(struct task_struct *p)
@@ -3919,7 +4048,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
 		 * between bypass DSQs.
 		 */
 		dispatch_dequeue_locked(p, donor_dsq);
-		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+		dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
 
 		/*
 		 * $donee might have been idle and need to be woken up. No need
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.53.0



* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-06 20:35   ` Emil Tsalapatis
  2026-02-07  9:26     ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Emil Tsalapatis @ 2026-02-06 20:35 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Christian Loehle, Daniel Hodges, sched-ext,
	linux-kernel

On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
> Currently, ops.dequeue() is only invoked when the sched_ext core knows
> that a task resides in BPF-managed data structures, which causes it to
> miss scheduling property change events. In addition, ops.dequeue()
> callbacks are completely skipped when tasks are dispatched to non-local
> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> track task state.
>
> Fix this by guaranteeing that each task entering the BPF scheduler's
> custody triggers exactly one ops.dequeue() call when it leaves that
> custody, whether the exit is due to a dispatch (regular or via a core
> scheduling pick) or to a scheduling property change (e.g.
> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> balancing, etc.).
>
> BPF scheduler custody concept: a task is considered to be in the BPF
> scheduler's custody when the scheduler is responsible for managing its
> lifecycle. This includes tasks dispatched to user-created DSQs or stored
> in the BPF scheduler's internal data structures. Custody ends when the
> task is dispatched to a terminal DSQ (such as the local DSQ or
> %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a
> property change.
>
> Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> entirely and are never in its custody. Terminal DSQs include:
>  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
>    where tasks go directly to execution.
>  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
>    BPF scheduler is considered "done" with the task.
>
> As a result, ops.dequeue() is not invoked for tasks directly dispatched
> to terminal DSQs.
>
> To identify dequeues triggered by scheduling property changes, introduce
> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> the dequeue was caused by a scheduling property change.
>
> New ops.dequeue() semantics:
>  - ops.dequeue() is invoked exactly once when the task leaves the BPF
>    scheduler's custody, in one of the following cases:
>    a) regular dispatch: a task dispatched to a user DSQ or stored in
>       internal BPF data structures is moved to a terminal DSQ
>       (ops.dequeue() called without any special flags set),
>    b) core scheduling dispatch: core-sched picks task before dispatch
>       (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
>    c) property change: task properties modified before dispatch,
>       (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).
>
> This allows BPF schedulers to:
>  - reliably track task ownership and lifecycle,
>  - maintain accurate accounting of managed tasks,
>  - update internal state when tasks change properties.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Cc: Kuba Piecuch <jpiecuch@google.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---

Hi Andrea,

>  Documentation/scheduler/sched-ext.rst         |  58 +++++++
>  include/linux/sched/ext.h                     |   1 +
>  kernel/sched/ext.c                            | 157 ++++++++++++++++--
>  kernel/sched/ext_internal.h                   |   7 +
>  .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
>  .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
>  tools/sched_ext/include/scx/enums.autogen.h   |   1 +
>  7 files changed, 213 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..fe8c59b0c1477 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,62 @@ The following briefly shows how a waking task is scheduled and executed.
>  
>     * Queue the task on the BPF side.
>  
> +   **Task State Tracking and ops.dequeue() Semantics**
> +
> +   A task is in the "BPF scheduler's custody" when the BPF scheduler is
> +   responsible for managing its lifecycle. That includes tasks dispatched
> +   to user-created DSQs or stored in the BPF scheduler's internal data
> +   structures. Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called,
> +   the task may or may not enter custody depending on what the scheduler
> +   does:
> +
> +   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
> +     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): The BPF scheduler
> +     is done with the task - it either goes straight to a CPU's local run
> +     queue or to the global DSQ as a fallback. The task never enters (or
> +     exits) BPF custody, and ``ops.dequeue()`` will not be called.
> +
> +   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
> +     BPF scheduler's custody. When the task later leaves BPF custody
> +     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
> +     sleep/property changes), ``ops.dequeue()`` will be called exactly once.
> +
> +   * **Queued on BPF side** (e.g., internal queues, no DSQ): The task is in
> +     BPF custody. ``ops.dequeue()`` will be called when it leaves (e.g.
> +     when ``ops.dispatch()`` moves it to a terminal DSQ, or on property
> +     change / sleep).
> +
> +   **NOTE**: this concept also applies to the ``ops.select_cpu()``
> +   direct dispatch optimization. Even though ``ops.enqueue()`` is
> +   skipped, a task dispatched to a user-created DSQ or internal BPF
> +   structure still enters BPF custody and will get ``ops.dequeue()``
> +   when it leaves. If dispatched to a terminal DSQ, the BPF scheduler is done
> +   with it immediately. This provides the performance benefit of avoiding
> +   the ``ops.enqueue()`` roundtrip while maintaining correct state
> +   tracking.
> +
> +   The dequeue can happen for different reasons, distinguished by flags:
> +
> +   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
> +      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
> +      execution), ``ops.dequeue()`` is triggered without any special flags.
> +
> +   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
> +      core scheduling picks a task for execution while it's still in BPF
> +      custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
> +
> +   3. **Scheduling property change**: when a task property changes (via
> +      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
> +      priority changes, CPU migrations, etc.) while the task is still in
> +      BPF custody, ``ops.dequeue()`` is called with the
> +      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
> +
> +   **Important**: Once a task has left BPF custody (e.g. after being
> +   dispatched to a terminal DSQ), property changes will not trigger
> +   ``ops.dequeue()``, since the task is no longer being managed by the BPF
> +   scheduler.
> +
>  3. When a CPU is ready to schedule, it first looks at its local DSQ. If
>     empty, it then looks at the global DSQ. If there still isn't a task to
>     run, ``ops.dispatch()`` is invoked which can use the following two
> @@ -319,6 +375,8 @@ by a sched_ext scheduler:
>                  /* Any usable CPU becomes available */
>  
>                  ops.dispatch(); /* Task is moved to a local DSQ */
> +
> +                ops.dequeue(); /* Exiting BPF scheduler */
>              }
>              ops.running();      /* Task starts running on its assigned CPU */
>              while (task->scx.slice > 0 && task is runnable)
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..c48f818eee9b8 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
>  /* scx_entity.flags */
>  enum scx_ent_flags {
>  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_NEED_DEQ	= 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */

Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to
be in BPF custody vs. the core scx scheduler (terminal DSQs), this is a more
general property that can be useful to check in the future. For example, we
can now assert that a task's BPF state is consistent with its actual kernel
state when using BPF-based data structures to manage tasks.

>  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
>  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
>  
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 0bb8fa927e9e9..d17fd9141adf4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
>  #endif
>  }
>  
> +/**
> + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> + * @dsq_id: DSQ ID to check
> + *
> + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> + * scheduler is considered "done" with the task.
> + *
> + * Builtin DSQs include:
> + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> + *    where tasks go directly to execution,
> + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> + *  - Bypass DSQ: used during bypass mode.
> + *
> + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> + * trigger ops.dequeue() when they are later consumed.
> + */
> +static inline bool is_terminal_dsq(u64 dsq_id)
> +{
> +	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
> +}
> +
>  /**
>   * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
>   * @rq: rq to read clock from, must be locked
> @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
>  		resched_curr(rq);
>  }
>  
> -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> +			     struct scx_dispatch_q *dsq,
>  			     struct task_struct *p, u64 enq_flags)
>  {
>  	bool is_local = dsq->id == SCX_DSQ_LOCAL;
> @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
>  	dsq_mod_nr(dsq, 1);
>  	p->scx.dsq = dsq;
>  
> +	/*
> +	 * Handle ops.dequeue() and custody tracking.
> +	 *
> +	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
> +	 * scheduler is done with the task. If it was in BPF custody, call
> +	 * ops.dequeue() and clear the flag.
> +	 *
> +	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> +	 * ops.dequeue() will be called when it leaves.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		if (is_terminal_dsq(dsq->id)) {
> +			if (p->scx.flags & SCX_TASK_NEED_DEQ)
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> +						 rq, p, 0);
> +			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +		} else {
> +			p->scx.flags |= SCX_TASK_NEED_DEQ;
> +		}
> +	}
> +
>  	/*
>  	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
>  	 * direct dispatch path, but we clear them here because the direct
> @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
>  		return;
>  	}
>  
> -	dispatch_enqueue(sch, dsq, p,
> +	dispatch_enqueue(sch, rq, dsq, p,
>  			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
>  }
>  
> @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 * dequeue may be waiting. The store_release matches their load_acquire.
>  	 */
>  	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> +
> +	/*
> +	 * Task is now in BPF scheduler's custody (queued on BPF internal
> +	 * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called
> +	 * when it leaves custody (e.g. dispatched to a terminal DSQ or on
> +	 * property change).
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))

Related to the rename: can we remove the guards and track the flag
regardless of whether ops.dequeue() is present?

There is no reason not to track whether a task is in BPF custody or in the
core scheduler, and that property is independent of whether ops.dequeue() is
implemented. This also simplifies the code, since we then only need to guard
the actual ops.dequeue() call.

> +		p->scx.flags |= SCX_TASK_NEED_DEQ;
>  	return;
>  
>  direct:
>  	direct_dispatch(sch, p, enq_flags);
>  	return;
>  local_norefill:
> -	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
> +	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
>  	return;
>  local:
>  	dsq = &rq->scx.local_dsq;
> @@ -1433,7 +1485,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 */
>  	touch_core_sched(rq, p);
>  	refill_task_slice_dfl(sch, p);
> -	dispatch_enqueue(sch, dsq, p, enq_flags);
> +	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
>  }
>  
>  static bool task_runnable(const struct task_struct *p)
> @@ -1511,6 +1563,22 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
>  		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
>  }
>  
> +/*
> + * Call ops.dequeue() for a task leaving BPF custody. Adds %SCX_DEQ_SCHED_CHANGE
> + * when the dequeue is due to a property change (not sleep or core-sched pick).
> + */
> +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> +			      struct task_struct *p, u64 deq_flags)
> +{
> +	u64 flags = deq_flags;
> +
> +	if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +		flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +	SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> +	p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +}
> +
>  static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  {
>  	struct scx_sched *sch = scx_root;
> @@ -1524,6 +1592,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * Task is not in BPF data structures (either dispatched to
> +		 * a DSQ or running). Only call ops.dequeue() if the task
> +		 * is still in BPF scheduler's custody (%SCX_TASK_NEED_DEQ
> +		 * is set).
> +		 *
> +		 * If the task has already been dispatched to a terminal
> +		 * DSQ (local DSQ or %SCX_DSQ_GLOBAL), it has left the BPF
> +		 * scheduler's custody and the flag will be clear, so we
> +		 * skip ops.dequeue().
> +		 *
> +		 * If this is a property change (not sleep/core-sched) and
> +		 * the task is still in BPF custody, set the
> +		 * %SCX_DEQ_SCHED_CHANGE flag.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue) &&
> +		    (p->scx.flags & SCX_TASK_NEED_DEQ))
> +			call_task_dequeue(sch, rq, p, deq_flags);
>  		break;
>  	case SCX_OPSS_QUEUEING:
>  		/*
> @@ -1532,9 +1618,14 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  		 */
>  		BUG();
>  	case SCX_OPSS_QUEUED:
> -		if (SCX_HAS_OP(sch, dequeue))
> -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> -					 p, deq_flags);
> +		/*
> +		 * Task is still on the BPF scheduler (not dispatched yet).
> +		 * Call ops.dequeue() to notify it is leaving BPF custody.
> +		 */
> +		if (SCX_HAS_OP(sch, dequeue)) {
> +			WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_NEED_DEQ));
> +			call_task_dequeue(sch, rq, p, deq_flags);
> +		}
>  
>  		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
>  					    SCX_OPSS_NONE))
> @@ -1631,6 +1722,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  					 struct scx_dispatch_q *src_dsq,
>  					 struct rq *dst_rq)
>  {
> +	struct scx_sched *sch = scx_root;
>  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>  
>  	/* @dsq is locked and @p is on @dst_rq */
> @@ -1639,6 +1731,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  
>  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>  
> +	/*
> +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> +	 * Call ops.dequeue() if the task was in BPF custody.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_NEED_DEQ)) {
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> +		p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +	}
> +
>  	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
>  		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
>  	else
> @@ -1879,7 +1980,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  		dispatch_dequeue_locked(p, src_dsq);
>  		raw_spin_unlock(&src_dsq->lock);
>  
> -		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
> +		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
>  	}
>  
>  	return dst_rq;
> @@ -1969,14 +2070,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  	 * If dispatching to @rq that @p is already on, no lock dancing needed.
>  	 */
>  	if (rq == src_rq && rq == dst_rq) {
> -		dispatch_enqueue(sch, dst_dsq, p,
> +		dispatch_enqueue(sch, rq, dst_dsq, p,
>  				 enq_flags | SCX_ENQ_CLEAR_OPSS);
>  		return;
>  	}
>  
>  	if (src_rq != dst_rq &&
>  	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
> -		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
> +		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
>  				 enq_flags | SCX_ENQ_CLEAR_OPSS);
>  		return;
>  	}
> @@ -2014,9 +2115,21 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		 */
>  		if (src_rq == dst_rq) {
>  			p->scx.holding_cpu = -1;
> -			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> +			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
>  					 enq_flags);
>  		} else {
> +			/*
> +			 * Moving to a remote local DSQ. dispatch_enqueue() is
> +			 * not used (we go through deactivate/activate), so
> +			 * call ops.dequeue() here if the task was in BPF
> +			 * custody.
> +			 */
> +			if (SCX_HAS_OP(sch, dequeue) &&
> +			    (p->scx.flags & SCX_TASK_NEED_DEQ)) {
> +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> +						 src_rq, p, 0);
> +				p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> +			}
>  			move_remote_task_to_local_dsq(p, enq_flags,
>  						      src_rq, dst_rq);
>  			/* task has been moved to dst_rq, which is now locked */
> @@ -2113,7 +2226,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
>  	if (dsq->id == SCX_DSQ_LOCAL)
>  		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
>  	else
> -		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> +		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
>  }
>  
>  static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
> @@ -2414,7 +2527,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
>  		 * DSQ.
>  		 */
>  		if (p->scx.slice && !scx_rq_bypassing(rq)) {
> -			dispatch_enqueue(sch, &rq->scx.local_dsq, p,
> +			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
>  					 SCX_ENQ_HEAD);
>  			goto switch_class;
>  		}
> @@ -2898,6 +3011,14 @@ static void scx_enable_task(struct task_struct *p)
>  
>  	lockdep_assert_rq_held(rq);
>  
> +	/*
> +	 * Verify the task is not in BPF scheduler's custody. If flag
> +	 * transitions are consistent, the flag should always be clear
> +	 * here.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
> +
>  	/*
>  	 * Set the weight before calling ops.enable() so that the scheduler
>  	 * doesn't see a stale value if they inspect the task struct.
> @@ -2929,6 +3050,14 @@ static void scx_disable_task(struct task_struct *p)
>  	if (SCX_HAS_OP(sch, disable))
>  		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
>  	scx_set_task_state(p, SCX_TASK_READY);
> +
> +	/*
> +	 * Verify the task is not in BPF scheduler's custody. If flag
> +	 * transitions are consistent, the flag should always be clear
> +	 * here.
> +	 */
> +	if (SCX_HAS_OP(sch, dequeue))
> +		WARN_ON_ONCE(p->scx.flags & SCX_TASK_NEED_DEQ);
>  }
>  
>  static void scx_exit_task(struct task_struct *p)
> @@ -3919,7 +4048,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
>  		 * between bypass DSQs.
>  		 */
>  		dispatch_dequeue_locked(p, donor_dsq);
> -		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
> +		dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
>  
>  		/*
>  		 * $donee might have been idle and need to be woken up. No need
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 386c677e4c9a0..befa9a5d6e53f 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -982,6 +982,13 @@ enum scx_deq_flags {
>  	 * it hasn't been dispatched yet. Dequeue from the BPF side.
>  	 */
>  	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
> +
> +	/*
> +	 * The task is being dequeued due to a property change (e.g.,
> +	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
> +	 * etc.).
> +	 */
> +	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
>  };
>  
>  enum scx_pick_idle_cpu_flags {
> diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
> index c2c33df9292c2..dcc945304760f 100644
> --- a/tools/sched_ext/include/scx/enum_defs.autogen.h
> +++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
> @@ -21,6 +21,7 @@
>  #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
>  #define HAVE_SCX_DEQ_SLEEP
>  #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
> +#define HAVE_SCX_DEQ_SCHED_CHANGE
>  #define HAVE_SCX_DSQ_FLAG_BUILTIN
>  #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
>  #define HAVE_SCX_DSQ_INVALID
> diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> index 2f8002bcc19ad..5da50f9376844 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
> @@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
>  const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
>  #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
>  
> +const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
> +#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
> diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
> index fedec938584be..fc9a7a4d9dea5 100644
> --- a/tools/sched_ext/include/scx/enums.autogen.h
> +++ b/tools/sched_ext/include/scx/enums.autogen.h
> @@ -46,4 +46,5 @@
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
>  	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
> +	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
>  } while (0)


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-06 20:35   ` Emil Tsalapatis
@ 2026-02-07  9:26     ` Andrea Righi
  2026-02-09 17:28       ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-07  9:26 UTC (permalink / raw)
  To: Emil Tsalapatis
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hi Emil,

On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote:
> On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
...
> > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > index bcb962d5ee7d8..c48f818eee9b8 100644
> > --- a/include/linux/sched/ext.h
> > +++ b/include/linux/sched/ext.h
> > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> >  /* scx_entity.flags */
> >  enum scx_ent_flags {
> >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > +	SCX_TASK_NEED_DEQ	= 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */
> 
> Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be
> in BPF custody vs the core scx scheduler (terminal DSQs) this is a more
> general property that can be useful to check in the future. An example:
> We can now assert that a task's BPF state is consistent with its actual 
> kernel state when using BPF-based data structures to manage tasks.

Ack. I like SCX_TASK_IN_BPF and I also like the idea of reusing the flag
for other purposes. It can be helpful for debugging as well.

> 
> >  	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
> >  	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
> >  
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 0bb8fa927e9e9..d17fd9141adf4 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
> >  #endif
> >  }
> >  
> > +/**
> > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> > + * @dsq_id: DSQ ID to check
> > + *
> > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> > + * scheduler is considered "done" with the task.
> > + *
> > + * Builtin DSQs include:
> > + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > + *    where tasks go directly to execution,
> > + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> > + *  - Bypass DSQ: used during bypass mode.
> > + *
> > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> > + * trigger ops.dequeue() when they are later consumed.
> > + */
> > +static inline bool is_terminal_dsq(u64 dsq_id)
> > +{
> > +	return dsq_id & SCX_DSQ_FLAG_BUILTIN;
> > +}
> > +
> >  /**
> >   * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
> >   * @rq: rq to read clock from, must be locked
> > @@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
> >  		resched_curr(rq);
> >  }
> >  
> > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > +			     struct scx_dispatch_q *dsq,
> >  			     struct task_struct *p, u64 enq_flags)
> >  {
> >  	bool is_local = dsq->id == SCX_DSQ_LOCAL;
> > @@ -1103,6 +1125,27 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> >  	dsq_mod_nr(dsq, 1);
> >  	p->scx.dsq = dsq;
> >  
> > +	/*
> > +	 * Handle ops.dequeue() and custody tracking.
> > +	 *
> > +	 * Builtin DSQs (local, global, bypass) are terminal: the BPF
> > +	 * scheduler is done with the task. If it was in BPF custody, call
> > +	 * ops.dequeue() and clear the flag.
> > +	 *
> > +	 * User DSQs: Task is in BPF scheduler's custody. Set the flag so
> > +	 * ops.dequeue() will be called when it leaves.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (is_terminal_dsq(dsq->id)) {
> > +			if (p->scx.flags & SCX_TASK_NEED_DEQ)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue,
> > +						 rq, p, 0);
> > +			p->scx.flags &= ~SCX_TASK_NEED_DEQ;
> > +		} else {
> > +			p->scx.flags |= SCX_TASK_NEED_DEQ;
> > +		}
> > +	}
> > +
> >  	/*
> >  	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
> >  	 * direct dispatch path, but we clear them here because the direct
> > @@ -1323,7 +1366,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> >  		return;
> >  	}
> >  
> > -	dispatch_enqueue(sch, dsq, p,
> > +	dispatch_enqueue(sch, rq, dsq, p,
> >  			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
> >  }
> >  
> > @@ -1407,13 +1450,22 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> >  	 * dequeue may be waiting. The store_release matches their load_acquire.
> >  	 */
> >  	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> > +
> > +	/*
> > +	 * Task is now in BPF scheduler's custody (queued on BPF internal
> > +	 * structures). Set %SCX_TASK_NEED_DEQ so ops.dequeue() is called
> > +	 * when it leaves custody (e.g. dispatched to a terminal DSQ or on
> > +	 * property change).
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> 
> Related to the rename: Can we remove the guards and track the flag
> regardless of whether ops.dequeue() is present?
> 
> There is no reason not to track whether a task is in BPF or the core, 
> and it is a property that's independent of whether we implement ops.dequeue(). 
> This also simplifies the code since we now just guard the actual ops.dequeue()
> call.

I was concerned about introducing overhead; with the guard we can save a
few memory writes to p->scx.flags. But I don't have numbers, and the
overhead is probably negligible.

Also, if we have a working ops.dequeue(), I guess more schedulers will
start implementing an ops.dequeue() callback, so the guard itself may
actually become the extra overhead.

So, I guess we can remove the guard and just set/clear the flag even
without an ops.dequeue() callback...

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-07  9:26     ` Andrea Righi
@ 2026-02-09 17:28       ` Tejun Heo
  2026-02-09 19:06         ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-09 17:28 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Emil Tsalapatis, David Vernet, Changwoo Min, Kuba Piecuch,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Sat, Feb 07, 2026 at 10:26:17AM +0100, Andrea Righi wrote:
> Hi Emil,
> 
> On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote:
> > On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
> ...
> > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > > index bcb962d5ee7d8..c48f818eee9b8 100644
> > > --- a/include/linux/sched/ext.h
> > > +++ b/include/linux/sched/ext.h
> > > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > >  /* scx_entity.flags */
> > >  enum scx_ent_flags {
> > >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > > +	SCX_TASK_NEED_DEQ	= 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */
> > 
> > Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be
> > in BPF custody vs the core scx scheduler (terminal DSQs) this is a more
> > general property that can be useful to check in the future. An example:
> > We can now assert that a task's BPF state is consistent with its actual 
> > kernel state when using BPF-based data structures to manage tasks.
> 
> Ack. I like SCX_TASK_IN_BPF and I also like the idea of resuing the flag
> for other purposes. It can be helpful for debugging as well.

One problem with the name is that when a task is in the BPF scheduler's
custody, it can still be on the kernel side in a DSQ or can be on the BPF
side on a BPF data structure. This is currently distinguished by SCX_OPSS
state (queued on the ops side or not). We do say things like "the task is in
BPF" to note that the task is not on a DSQ but in BPF proper, so I think
SCX_TASK_IN_BPF can become confusing.

I don't know what the right name is. When we write it out, we say "in BPF
sched's custody" where "BPF sched" means the whole SCX scheduler. Maybe just
SCX_TASK_IN_CUSTODY?

Thanks.

-- 
tejun


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-09 17:28       ` Tejun Heo
@ 2026-02-09 19:06         ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-09 19:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Emil Tsalapatis, David Vernet, Changwoo Min, Kuba Piecuch,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Mon, Feb 09, 2026 at 07:28:50AM -1000, Tejun Heo wrote:
> On Sat, Feb 07, 2026 at 10:26:17AM +0100, Andrea Righi wrote:
> > Hi Emil,
> > 
> > On Fri, Feb 06, 2026 at 03:35:34PM -0500, Emil Tsalapatis wrote:
> > > On Fri Feb 6, 2026 at 8:54 AM EST, Andrea Righi wrote:
> > ...
> > > > diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> > > > index bcb962d5ee7d8..c48f818eee9b8 100644
> > > > --- a/include/linux/sched/ext.h
> > > > +++ b/include/linux/sched/ext.h
> > > > @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> > > >  /* scx_entity.flags */
> > > >  enum scx_ent_flags {
> > > >  	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> > > > +	SCX_TASK_NEED_DEQ	= 1 << 1, /* in BPF custody, needs ops.dequeue() when leaving */
> > > 
> > > Can we make this "SCX_TASK_IN_BPF"? Since we've now defined what it means to be
> > > in BPF custody vs the core scx scheduler (terminal DSQs) this is a more
> > > general property that can be useful to check in the future. An example:
> > > We can now assert that a task's BPF state is consistent with its actual 
> > > kernel state when using BPF-based data structures to manage tasks.
> > 
> > Ack. I like SCX_TASK_IN_BPF and I also like the idea of resuing the flag
> > for other purposes. It can be helpful for debugging as well.
> 
> One problem with the name is that when a task is in the BPF scheduler's
> custody, it can be still be on the kernel side in a DSQ or can be on the BPF
> side on a BPF data structure. This is currently distinguished by SCX_OPSS
> state (queued on the ops side or not). We do say things like "the task is in
> BPF" to note that the task is not on a DSQ but in BPF proper, so I think
> SCX_TASK_IN_BPF can become confusing.
> 
> I don't know what the right name is. When we write it out, we say "in BPF
> sched's custody" where "BPF sched" means the whole SCX scheduler. Maybe just
> SCX_TASK_IN_CUSTODY?

Yeah, I agree that the "task in BPF" concept is a bit too overloaded. I
think SCX_TASK_IN_CUSTODY is clear enough and it doesn't overlap with the
"in BPF" concept. I'll rename the flag to SCX_TASK_IN_CUSTODY.

Thanks,
-Andrea


* [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics
@ 2026-02-10 21:26 Andrea Righi
  2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
  2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  0 siblings, 2 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-10 21:26 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.

In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change) the BPF scheduler loses visibility of
the task and the sched_ext core may not always trigger ops.dequeue().

This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.

This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g., sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).

To identify property-change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.

Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.

Changes in v8:
 - Rename SCX_TASK_NEED_DEQ -> SCX_TASK_IN_CUSTODY and set/clear this flag
   also when ops.dequeue() is not implemented (can be used for other
   purposes in the future)
 - Clarify ops.select_cpu() behavior: dispatch to terminal DSQs doesn't
   trigger ops.dequeue(), dispatch to user DSQs triggers ops.dequeue(),
   store to BPF-internal data structure is discouraged
 - Link to v7:
   https://lore.kernel.org/all/20260206135742.2339918-1-arighi@nvidia.com

Changes in v7:
 - Handle tasks stored to BPF internal data structures (trigger
   ops.dequeue())
 - Add a kselftest scenario with a BPF queue to verify ops.dequeue()
   behavior with tasks stored in internal BPF data structures
 - Link to v6:
   https://lore.kernel.org/all/20260205153304.1996142-1-arighi@nvidia.com

Changes in v6:
 - Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DEQ
 - Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
   DSQs (local, global, bypass)
 - centralize ops.dequeue() logic in dispatch_enqueue()
 - Remove "Property Change Notifications for Running Tasks" section from
   the documentation
 - The kselftest now validates the right behavior both from ops.enqueue()
   and ops.select_cpu()
 - Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@nvidia.com

Changes in v5:
 - Introduce the concept of "terminal DSQ" (when a task is dispatched to a
   terminal DSQ, the task leaves the BPF scheduler's custody)
 - Consider SCX_DSQ_GLOBAL as a terminal DSQ
 - Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@nvidia.com

Changes in v4:
 - Introduce the concept of "BPF scheduler custody"
 - Do not trigger ops.dequeue() for direct dispatches to local DSQs
 - Trigger ops.dequeue() only once; after the task leaves BPF scheduler
   custody, further dequeue events are not reported.
 - Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@nvidia.com

Changes in v3:
 - Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
 - Handle core-sched dequeues (Kuba)
 - Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@nvidia.com

Changes in v2:
 - Distinguish between "dispatch" dequeues and "property change" dequeues
   (flag SCX_DEQ_ASYNC)
 - Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@nvidia.com

Andrea Righi (2):
      sched_ext: Fix ops.dequeue() semantics
      selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  78 ++++-
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 155 ++++++++--
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 368 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 265 +++++++++++++++++
 10 files changed, 855 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c


* [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics Andrea Righi
@ 2026-02-10 21:26 ` Andrea Righi
  2026-02-10 23:20   ` Tejun Heo
  2026-02-10 23:54   ` Tejun Heo
  2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  1 sibling, 2 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-10 21:26 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
 - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
   where tasks go directly to execution.
 - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
   BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.

To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
 - ops.dequeue() is invoked exactly once when the task leaves the BPF
   scheduler's custody, in one of the following cases:
   a) regular dispatch: a task dispatched to a user DSQ or stored in
      internal BPF data structures is moved to a terminal DSQ
      (ops.dequeue() called without any special flags set),
   b) core scheduling dispatch: core-sched picks the task before dispatch
      (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
   c) property change: task properties modified before dispatch
      (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).

This allows BPF schedulers to:
 - reliably track task ownership and lifecycle,
 - maintain accurate accounting of managed tasks,
 - update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 Documentation/scheduler/sched-ext.rst         |  78 ++++++++-
 include/linux/sched/ext.h                     |   1 +
 kernel/sched/ext.c                            | 155 ++++++++++++++++--
 kernel/sched/ext_internal.h                   |   7 +
 .../sched_ext/include/scx/enum_defs.autogen.h |   1 +
 .../sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h   |   1 +
 7 files changed, 221 insertions(+), 24 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..21c65e504da7c 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -229,16 +229,23 @@ The following briefly shows how a waking task is scheduled and executed.
    scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
    using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
 
-   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
-   by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
-   ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
-   local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
-   Additionally, inserting directly from ``ops.select_cpu()`` will cause the
-   ``ops.enqueue()`` callback to be skipped.
-
    Note that the scheduler core will ignore an invalid CPU selection, for
    example, if it's outside the allowed cpumask of the task.
 
+   A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+   by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.
+
+   If the task is inserted into ``SCX_DSQ_LOCAL`` from
+   ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
+   is returned from ``ops.select_cpu()``. Additionally, inserting directly
+   from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
+   be skipped.
+
+   Any other attempt to store a task in BPF-internal data structures from
+   ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
+   invoked. This is discouraged, as it can introduce racy or inconsistent
+   state.
+
 2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
    task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
    can make one of the following decisions:
@@ -252,6 +259,61 @@ The following briefly shows how a waking task is scheduled and executed.
 
    * Queue the task on the BPF side.
 
+   **Task State Tracking and ops.dequeue() Semantics**
+
+   A task is in the "BPF scheduler's custody" when the BPF scheduler is
+   responsible for managing its lifecycle. A task enters custody when it is
+   dispatched to a user DSQ or stored in the BPF scheduler's internal data
+   structures. Custody is entered only from ``ops.enqueue()`` for those
+   operations. The only exception is dispatching to a user DSQ from
+   ``ops.select_cpu()``: although the task is not yet technically in BPF
+   scheduler custody at that point, the dispatch has the same
+   custody-related effect as dispatching from ``ops.enqueue()``.
+
+   Once ``ops.enqueue()`` is called, the task may or may not enter custody
+   depending on what the scheduler does:
+
+   * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
+     ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
+     is done with the task - it either goes straight to a CPU's local run
+     queue or to the global DSQ as a fallback. The task never enters (or
+     exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+   * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+     BPF scheduler's custody. When the task later leaves BPF custody
+     (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+     sleep/property changes), ``ops.dequeue()`` will be called exactly
+     once.
+
+   * **Stored in BPF data structures** (e.g., internal BPF queues): the
+     task is in BPF custody. ``ops.dequeue()`` will be called when it
+     leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
+     on property change / sleep).
+
+   When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
+   The dequeue can happen for different reasons, distinguished by flags:
+
+   1. **Regular dispatch**: when a task in BPF custody is dispatched to a
+      terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
+      execution), ``ops.dequeue()`` is triggered without any special flags.
+
+   2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+      core scheduling picks a task for execution while it's still in BPF
+      custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+   3. **Scheduling property change**: when a task property changes (via
+      operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+      priority changes, CPU migrations, etc.) while the task is still in
+      BPF custody, ``ops.dequeue()`` is called with the
+      ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+   **Important**: Once a task has left BPF custody (e.g. after being
+   dispatched to a terminal DSQ), property changes will not trigger
+   ``ops.dequeue()``, since the task is no longer managed by the BPF
+   scheduler.
+
 3. When a CPU is ready to schedule, it first looks at its local DSQ. If
    empty, it then looks at the global DSQ. If there still isn't a task to
    run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +381,8 @@ by a sched_ext scheduler:
                 /* Any usable CPU becomes available */
 
                 ops.dispatch(); /* Task is moved to a local DSQ */
+
+                ops.dequeue(); /* Exiting BPF scheduler */
             }
             ops.running();      /* Task starts running on its assigned CPU */
             while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..4601e5ecb43c0 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
 /* scx_entity.flags */
 enum scx_ent_flags {
 	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
+	SCX_TASK_IN_CUSTODY	= 1 << 1, /* in custody, needs ops.dequeue() when leaving */
 	SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
 	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 3, /* last dequeue was for SLEEP */
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bb8fa927e9e9..5f7c9088f90a9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
+ * scheduler is considered "done" with the task.
+ *
+ * Builtin DSQs include:
+ *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
+ *    where tasks go directly to execution,
+ *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
+ *  - Bypass DSQ: used during bypass mode.
+ *
+ * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
+}
+
 /**
  * touch_core_sched_dispatch - Update core-sched timestamp on dispatch
  * @rq: rq to read clock from, must be locked
@@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
 		resched_curr(rq);
 }
 
-static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+			     struct scx_dispatch_q *dsq,
 			     struct task_struct *p, u64 enq_flags)
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1103,6 +1125,23 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 	dsq_mod_nr(dsq, 1);
 	p->scx.dsq = dsq;
 
+	/*
+	 * Handle ops.dequeue() and custody tracking.
+	 *
+	 * Terminal DSQs: the BPF scheduler is done with the task. If it
+	 * was in BPF custody, call ops.dequeue() and clear the flag.
+	 *
+	 * Non-terminal DSQs: task is in BPF scheduler's custody.
+	 */
+	if (is_terminal_dsq(dsq->id)) {
+		if (SCX_HAS_OP(sch, dequeue) &&
+		    (p->scx.flags & SCX_TASK_IN_CUSTODY))
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+	} else {
+		p->scx.flags |= SCX_TASK_IN_CUSTODY;
+	}
+
 	/*
 	 * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
 	 * direct dispatch path, but we clear them here because the direct
@@ -1323,7 +1362,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 		return;
 	}
 
-	dispatch_enqueue(sch, dsq, p,
+	dispatch_enqueue(sch, rq, dsq, p,
 			 p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
@@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 * dequeue may be waiting. The store_release matches their load_acquire.
 	 */
 	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+
+	/*
+	 * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY
+	 * so ops.dequeue() is called when it leaves custody.
+	 */
+	p->scx.flags |= SCX_TASK_IN_CUSTODY;
 	return;
 
 direct:
 	direct_dispatch(sch, p, enq_flags);
 	return;
 local_norefill:
-	dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
 	return;
 local:
 	dsq = &rq->scx.local_dsq;
@@ -1433,7 +1478,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 */
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
-	dispatch_enqueue(sch, dsq, p, enq_flags);
+	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
@@ -1511,6 +1556,27 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
 		__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
 }
 
+/*
+ * Call ops.dequeue() for a task leaving BPF custody.
+ */
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+			      struct task_struct *p, u64 deq_flags,
+			      bool is_sched_change)
+{
+	if (SCX_HAS_OP(sch, dequeue)) {
+		/*
+		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
+		 * property change (not sleep or core-sched pick).
+		 */
+		if (is_sched_change &&
+		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+			deq_flags |= SCX_DEQ_SCHED_CHANGE;
+
+		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
+	}
+	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+}
+
 static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 {
 	struct scx_sched *sch = scx_root;
@@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 
 	switch (opss & SCX_OPSS_STATE_MASK) {
 	case SCX_OPSS_NONE:
+		/*
+		 * If the task is still in BPF scheduler's custody
+		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
+		 */
+		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+			call_task_dequeue(sch, rq, p, deq_flags, true);
 		break;
 	case SCX_OPSS_QUEUEING:
 		/*
@@ -1532,9 +1604,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
 		 */
 		BUG();
 	case SCX_OPSS_QUEUED:
-		if (SCX_HAS_OP(sch, dequeue))
-			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
-					 p, deq_flags);
+		/*
+		 * Task is in BPF scheduler's custody (not dispatched yet).
+		 * Call ops.dequeue() to notify that it's leaving custody.
+		 */
+		WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
+		call_task_dequeue(sch, rq, p, deq_flags, true);
 
 		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
 					    SCX_OPSS_NONE))
@@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
+	struct scx_sched *sch = scx_root;
 	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
 
 	/* @dsq is locked and @p is on @dst_rq */
@@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
 
 	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
 
+	/*
+	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+	 * Call ops.dequeue() if the task was in BPF custody.
+	 */
+	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
+		if (SCX_HAS_OP(sch, dequeue))
+			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+	}
+
 	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
 		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
 	else
@@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
 		!WARN_ON_ONCE(src_rq != task_rq(p));
 }
 
-static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
-				struct scx_dispatch_q *dsq, struct rq *src_rq)
+static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
+			       struct task_struct *p,
+			       struct scx_dispatch_q *dsq, struct rq *src_rq)
 {
 	raw_spin_rq_unlock(this_rq);
 
 	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+		/*
+		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+		 * Call ops.dequeue() if the task was in BPF custody.
+		 */
+		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+			call_task_dequeue(sch, src_rq, p, 0, false);
 		move_remote_task_to_local_dsq(p, 0, src_rq, this_rq);
 		return true;
 	} else {
@@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 						     src_dsq, dst_rq);
 			raw_spin_unlock(&src_dsq->lock);
 		} else {
+			/*
+			 * Moving to a local DSQ, dispatch_enqueue() is not
+			 * used, so call ops.dequeue() here if the task was
+			 * in BPF scheduler's custody.
+			 */
+			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+				call_task_dequeue(sch, src_rq, p, 0, false);
 			raw_spin_unlock(&src_dsq->lock);
 			move_remote_task_to_local_dsq(p, enq_flags,
 						      src_rq, dst_rq);
@@ -1879,7 +1979,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dispatch_dequeue_locked(p, src_dsq);
 		raw_spin_unlock(&src_dsq->lock);
 
-		dispatch_enqueue(sch, dst_dsq, p, enq_flags);
+		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
 	}
 
 	return dst_rq;
@@ -1922,7 +2022,7 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 		}
 
 		if (task_can_run_on_remote_rq(sch, p, rq, false)) {
-			if (likely(consume_remote_task(rq, p, dsq, task_rq)))
+			if (likely(consume_remote_task(sch, rq, p, dsq, task_rq)))
 				return true;
 			goto retry;
 		}
@@ -1969,14 +2069,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 * If dispatching to @rq that @p is already on, no lock dancing needed.
 	 */
 	if (rq == src_rq && rq == dst_rq) {
-		dispatch_enqueue(sch, dst_dsq, p,
+		dispatch_enqueue(sch, rq, dst_dsq, p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
 
 	if (src_rq != dst_rq &&
 	    unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
-		dispatch_enqueue(sch, find_global_dsq(sch, p), p,
+		dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
 				 enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
@@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		 */
 		if (src_rq == dst_rq) {
 			p->scx.holding_cpu = -1;
-			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
 					 enq_flags);
 		} else {
+			/*
+			 * Moving to a local DSQ, dispatch_enqueue() is not
+			 * used, so call ops.dequeue() here if the task was
+			 * in BPF scheduler's custody.
+			 */
+			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+				call_task_dequeue(sch, src_rq, p, 0, false);
 			move_remote_task_to_local_dsq(p, enq_flags,
 						      src_rq, dst_rq);
 			/* task has been moved to dst_rq, which is now locked */
@@ -2113,7 +2220,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	if (dsq->id == SCX_DSQ_LOCAL)
 		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	else
-		dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
@@ -2414,7 +2521,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 * DSQ.
 		 */
 		if (p->scx.slice && !scx_rq_bypassing(rq)) {
-			dispatch_enqueue(sch, &rq->scx.local_dsq, p,
+			dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
 					 SCX_ENQ_HEAD);
 			goto switch_class;
 		}
@@ -2898,6 +3005,13 @@ static void scx_enable_task(struct task_struct *p)
 
 	lockdep_assert_rq_held(rq);
 
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
+
 	/*
 	 * Set the weight before calling ops.enable() so that the scheduler
 	 * doesn't see a stale value if they inspect the task struct.
@@ -2929,6 +3043,13 @@ static void scx_disable_task(struct task_struct *p)
 	if (SCX_HAS_OP(sch, disable))
 		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
 	scx_set_task_state(p, SCX_TASK_READY);
+
+	/*
+	 * Verify the task is not in BPF scheduler's custody. If flag
+	 * transitions are consistent, the flag should always be clear
+	 * here.
+	 */
+	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
 }
 
 static void scx_exit_task(struct task_struct *p)
@@ -3919,7 +4040,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
 		 * between bypass DSQs.
 		 */
 		dispatch_dequeue_locked(p, donor_dsq);
-		dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+		dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);
 
 		/*
 		 * $donee might have been idle and need to be woken up. No need
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
 	 * it hasn't been dispatched yet. Dequeue from the BPF side.
 	 */
 	SCX_DEQ_CORE_SCHED_EXEC	= 1LLU << 32,
+
+	/*
+	 * The task is being dequeued due to a property change (e.g.,
+	 * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+	 * etc.).
+	 */
+	SCX_DEQ_SCHED_CHANGE	= 1LLU << 33,
 };
 
 enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
 #define HAVE_SCX_CPU_PREEMPT_UNKNOWN
 #define HAVE_SCX_DEQ_SLEEP
 #define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
 #define HAVE_SCX_DSQ_FLAG_BUILTIN
 #define HAVE_SCX_DSQ_FLAG_LOCAL_ON
 #define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
 const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
 #define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ
 
+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
 	SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+	SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
 } while (0)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() semantics
  2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics Andrea Righi
  2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-10 21:26 ` Andrea Righi
  2026-02-12 17:15   ` Christian Loehle
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-10 21:26 UTC (permalink / raw)
  To: Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Christian Loehle, Daniel Hodges,
	sched-ext, linux-kernel

Add a new kselftest to validate that the new ops.dequeue() semantics
work correctly for all task lifecycle scenarios, including the
distinction between terminal DSQs (where the BPF scheduler is done with
the task), user DSQs (where the BPF scheduler manages the task
lifecycle), and BPF data structures, regardless of which event performs
the dispatch.

The test validates the following scenarios:

 - From ops.select_cpu():
     - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
       the BPF scheduler entirely; they never enter BPF custody, so
       ops.dequeue() is not called,
     - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
       bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
       not called,
     - scenario 2 (user DSQ): tasks dispatched to user DSQs from
       ops.select_cpu() enter the BPF scheduler's custody with full
       enqueue/dequeue lifecycle tracking and state machine validation;
       expects exact 1:1 enqueue/dequeue pairing,

 - From ops.enqueue():
     - scenario 3 (local DSQ): same behavior as scenario 0,
     - scenario 4 (global DSQ): same behavior as scenario 1,
     - scenario 5 (user DSQ): same behavior as scenario 2,
     - scenario 6 (BPF internal queue): tasks are stored in a BPF queue
       from ops.enqueue() and consumed from ops.dispatch(); similarly to
       scenario 5, tasks enter BPF scheduler's custody with full
       lifecycle tracking and 1:1 enqueue/dequeue validation.

This verifies that:
 - terminal DSQ dispatches (local, global) don't trigger ops.dequeue(),
 - tasks dispatched to user DSQs, either from ops.select_cpu() or
   ops.enqueue(), enter BPF scheduler's custody and have exact 1:1
   enqueue/dequeue pairing,
 - tasks stored to internal BPF data structures from ops.enqueue() enter
   BPF scheduler's custody and have exact 1:1 enqueue/dequeue pairing,
 - dispatch dequeues have no flags (normal workflow),
 - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set,
 - no duplicate enqueues or invalid state transitions are happening.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../testing/selftests/sched_ext/dequeue.bpf.c | 368 ++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c   | 265 +++++++++++++
 3 files changed, 634 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 5fe45f9c5f8fd..764e91edabf93 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
 
 auto-test-targets :=			\
 	create_dsq			\
+	dequeue				\
 	enq_last_no_enq_fails		\
 	ddsp_bogus_dsq_fail		\
 	ddsp_vtimelocal_fail		\
diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c
new file mode 100644
index 0000000000000..d9d12f14cd673
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c
@@ -0,0 +1,368 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A scheduler that validates ops.dequeue() is called correctly:
+ * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
+ *   scheduler entirely: no ops.dequeue() should be called
+ * - Tasks dispatched to user DSQs from ops.enqueue() enter BPF custody:
+ *   ops.dequeue() must be called when they leave custody
+ * - Every ops.enqueue() dispatch to non-terminal DSQs is followed by
+ *   exactly one ops.dequeue() (validate 1:1 pairing and state machine)
+ *
+ * Copyright (c) 2026 NVIDIA Corporation.
+ */
+
+#include <scx/common.bpf.h>
+
+#define SHARED_DSQ	0
+
+/*
+ * BPF internal queue.
+ *
+ * Tasks are stored here and consumed from ops.dispatch(), validating that
+ * tasks on BPF internal structures still get ops.dequeue() when they
+ * leave.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_QUEUE);
+	__uint(max_entries, 32768);
+	__type(value, s32);
+} global_queue SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+/*
+ * Counters to track the lifecycle of tasks:
+ * - enqueue_cnt: Number of times ops.enqueue() was called
+ * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
+ * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
+ * - change_dequeue_cnt: Number of property change dequeues
+ * - bpf_queue_full: Number of times the BPF internal queue was full
+ */
+u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt, bpf_queue_full;
+
+/*
+ * Test scenarios:
+ * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
+ *    scheduler, no dequeue callbacks)
+ * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
+ *    scheduler, no dequeue callbacks)
+ * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
+ *    dequeue callbacks expected)
+ * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
+ *    scheduler, no dequeue callbacks)
+ * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
+ *    scheduler, no dequeue callbacks)
+ * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
+ *    dequeue callbacks expected)
+ * 6) BPF internal queue from ops.enqueue(): store task PIDs in ops.enqueue(),
+ *    consume in ops.dispatch() and dispatch to local DSQ (validates dequeue
+ *    for tasks stored in internal BPF data structures)
+ */
+u32 test_scenario;
+
+/*
+ * Per-task state to track lifecycle and validate workflow semantics.
+ * State transitions:
+ *   NONE -> ENQUEUED (on enqueue)
+ *   ENQUEUED -> DISPATCHED (on dispatch dequeue)
+ *   DISPATCHED -> NONE (on property change dequeue or re-enqueue)
+ *   ENQUEUED -> NONE (on property change dequeue before dispatch)
+ */
+enum task_state {
+	TASK_NONE = 0,
+	TASK_ENQUEUED,
+	TASK_DISPATCHED,
+};
+
+struct task_ctx {
+	enum task_state state; /* Current state in the workflow */
+	u64 enqueue_seq;       /* Sequence number for debugging */
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct task_ctx);
+} task_ctx_stor SEC(".maps");
+
+static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
+{
+	return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
+}
+
+s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
+		   s32 prev_cpu, u64 wake_flags)
+{
+	struct task_ctx *tctx;
+	s32 pid = p->pid;
+
+	tctx = try_lookup_task_ctx(p);
+	if (!tctx)
+		return prev_cpu;
+
+	switch (test_scenario) {
+	case 0:
+		/*
+		 * Direct dispatch to the local DSQ.
+		 *
+		 * Task bypasses BPF scheduler entirely: no enqueue
+		 * tracking, no ops.dequeue() callbacks.
+		 */
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+		tctx->state = TASK_DISPATCHED;
+		break;
+	case 1:
+		/*
+		 * Direct dispatch to the global DSQ.
+		 *
+		 * Task bypasses BPF scheduler entirely: no enqueue
+		 * tracking, no ops.dequeue() callbacks.
+		 */
+		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
+		tctx->state = TASK_DISPATCHED;
+		break;
+	case 2:
+		/*
+		 * Dispatch to a shared user DSQ.
+		 *
+		 * Task enters BPF scheduler management: track
+		 * enqueue/dequeue lifecycle and validate state
+		 * transitions.
+		 */
+		if (tctx->state == TASK_ENQUEUED)
+			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
+				      p->pid, p->comm, tctx->enqueue_seq);
+
+		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
+
+		__sync_fetch_and_add(&enqueue_cnt, 1);
+
+		tctx->state = TASK_ENQUEUED;
+		tctx->enqueue_seq++;
+		break;
+	}
+
+	return prev_cpu;
+}
+
+void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	struct task_ctx *tctx;
+	s32 pid = p->pid;
+
+	tctx = try_lookup_task_ctx(p);
+	if (!tctx)
+		return;
+
+	switch (test_scenario) {
+	case 3:
+		/*
+		 * Direct dispatch to the local DSQ.
+		 *
+		 * Task bypasses BPF scheduler entirely: no enqueue
+		 * tracking, no ops.dequeue() callbacks.
+		 */
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
+		break;
+	case 4:
+		/*
+		 * Direct dispatch to the global DSQ.
+		 *
+		 * Task bypasses BPF scheduler entirely: no enqueue
+		 * tracking, no ops.dequeue() callbacks.
+		 */
+		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+		break;
+	case 5:
+		/*
+		 * Dispatch to shared user DSQ.
+		 *
+		 * Task enters BPF scheduler management: track
+		 * enqueue/dequeue lifecycle and validate state
+		 * transitions.
+		 */
+		if (tctx->state == TASK_ENQUEUED)
+			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
+				      p->pid, p->comm, tctx->enqueue_seq);
+
+		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
+
+		__sync_fetch_and_add(&enqueue_cnt, 1);
+
+		tctx->state = TASK_ENQUEUED;
+		tctx->enqueue_seq++;
+		break;
+	case 6:
+		/*
+		 * Store task in BPF internal queue.
+		 *
+		 * Task enters BPF scheduler management: track
+		 * enqueue/dequeue lifecycle and validate state
+		 * transitions.
+		 */
+		if (tctx->state == TASK_ENQUEUED)
+			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
+				      p->pid, p->comm, tctx->enqueue_seq);
+
+		if (bpf_map_push_elem(&global_queue, &pid, 0)) {
+			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+			__sync_fetch_and_add(&bpf_queue_full, 1);
+
+			tctx->state = TASK_DISPATCHED;
+		} else {
+			__sync_fetch_and_add(&enqueue_cnt, 1);
+
+			tctx->state = TASK_ENQUEUED;
+			tctx->enqueue_seq++;
+		}
+		break;
+	default:
+		/* For all other scenarios, dispatch to the global DSQ */
+		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+		tctx->state = TASK_DISPATCHED;
+		break;
+	}
+
+	scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
+}
+
+void BPF_STRUCT_OPS(dequeue_dequeue, struct task_struct *p, u64 deq_flags)
+{
+	struct task_ctx *tctx;
+
+	__sync_fetch_and_add(&dequeue_cnt, 1);
+
+	tctx = try_lookup_task_ctx(p);
+	if (!tctx)
+		return;
+
+	/*
+	 * For scenarios 0, 1, 3, and 4 (terminal DSQs: local and global),
+	 * ops.dequeue() should never be called because tasks bypass the
+	 * BPF scheduler entirely. If we get here, it's a kernel bug.
+	 */
+	if (test_scenario == 0 || test_scenario == 3) {
+		scx_bpf_error("%d (%s): dequeue called for local DSQ scenario",
+			      p->pid, p->comm);
+		return;
+	}
+
+	if (test_scenario == 1 || test_scenario == 4) {
+		scx_bpf_error("%d (%s): dequeue called for global DSQ scenario",
+			      p->pid, p->comm);
+		return;
+	}
+
+	if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
+		/*
+		 * Property change interrupting the workflow. Valid from
+		 * both ENQUEUED and DISPATCHED states. Transitions task
+		 * back to NONE state.
+		 */
+		__sync_fetch_and_add(&change_dequeue_cnt, 1);
+
+		/* Validate state transition */
+		if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_DISPATCHED)
+			scx_bpf_error("%d (%s): invalid property change dequeue state=%d seq=%llu",
+				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
+
+		/* Transition back to NONE: task outside scheduler control */
+		tctx->state = TASK_NONE;
+	} else {
+		/*
+		 * Regular dispatch dequeue: normal workflow step. Valid
+		 * only from ENQUEUED state (after enqueue, before dispatch
+		 * dequeue). Transitions to DISPATCHED state.
+		 */
+		__sync_fetch_and_add(&dispatch_dequeue_cnt, 1);
+
+		/*
+		 * Dispatch dequeue should not have %SCX_DEQ_SCHED_CHANGE
+		 * flag.
+		 */
+		if (deq_flags & SCX_DEQ_SCHED_CHANGE)
+			scx_bpf_error("%d (%s): SCX_DEQ_SCHED_CHANGE in dispatch dequeue seq=%llu",
+				      p->pid, p->comm, tctx->enqueue_seq);
+
+		/*
+		 * Must be in ENQUEUED state.
+		 */
+		if (tctx->state != TASK_ENQUEUED)
+			scx_bpf_error("%d (%s): dispatch dequeue from state %d seq=%llu",
+				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
+
+		/*
+		 * Transition to DISPATCHED: normal cycle completed
+		 * dispatch.
+		 */
+		tctx->state = TASK_DISPATCHED;
+	}
+}
+
+void BPF_STRUCT_OPS(dequeue_dispatch, s32 cpu, struct task_struct *prev)
+{
+	if (test_scenario == 6) {
+		struct task_struct *p;
+		s32 pid;
+
+		if (bpf_map_pop_elem(&global_queue, &pid))
+			return;
+
+		p = bpf_task_from_pid(pid);
+		if (!p)
+			return;
+
+		if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
+			cpu = scx_bpf_task_cpu(p);
+
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
+		bpf_task_release(p);
+	} else {
+		scx_bpf_dsq_move_to_local(SHARED_DSQ);
+	}
+}
+
+s32 BPF_STRUCT_OPS(dequeue_init_task, struct task_struct *p,
+		   struct scx_init_task_args *args)
+{
+	struct task_ctx *tctx;
+
+	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0,
+				   BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!tctx)
+		return -ENOMEM;
+
+	return 0;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(dequeue_init)
+{
+	s32 ret;
+
+	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+void BPF_STRUCT_OPS(dequeue_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops dequeue_ops = {
+	.select_cpu		= (void *)dequeue_select_cpu,
+	.enqueue		= (void *)dequeue_enqueue,
+	.dequeue		= (void *)dequeue_dequeue,
+	.dispatch		= (void *)dequeue_dispatch,
+	.init_task		= (void *)dequeue_init_task,
+	.init			= (void *)dequeue_init,
+	.exit			= (void *)dequeue_exit,
+	.timeout_ms		= 5000,
+	.name			= "dequeue_test",
+};
diff --git a/tools/testing/selftests/sched_ext/dequeue.c b/tools/testing/selftests/sched_ext/dequeue.c
new file mode 100644
index 0000000000000..8bc9d263aa05c
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/dequeue.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 NVIDIA Corporation.
+ */
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <signal.h>
+#include <time.h>
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <sched.h>
+#include <pthread.h>
+#include "scx_test.h"
+#include "dequeue.bpf.skel.h"
+
+#define NUM_WORKERS 8
+#define AFFINITY_HAMMER_MS 50
+
+/*
+ * Worker function that creates enqueue/dequeue events via CPU work and
+ * sleeping. Property-change dequeues are triggered by the affinity hammer
+ * thread (external sched_setaffinity on worker PIDs).
+ */
+static void worker_fn(int id)
+{
+	int i;
+	volatile int sum = 0;
+
+	for (i = 0; i < 1000; i++) {
+		int j;
+
+		/* Do some work to trigger scheduling events */
+		for (j = 0; j < 10000; j++)
+			sum += j;
+
+		/* Sleep to trigger dequeue */
+		usleep(1000 + (id * 100));
+	}
+
+	exit(0);
+}
+
+/*
+ * Property-change dequeues only happen when a task gets a property change
+ * while still in the queue. This thread changes workers' affinity from
+ * outside so that some changes hit tasks while they are still in the
+ * queue.
+ */
+static void *affinity_hammer_fn(void *arg)
+{
+	pid_t *pids = arg;
+	cpu_set_t cpuset;
+	int i, n = NUM_WORKERS;
+	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1ms */
+
+	for (i = 0; i < (AFFINITY_HAMMER_MS * 1000 / 100); i++) {
+		int w = i % n;
+		int cpu = (i / n) % 4;
+
+		CPU_ZERO(&cpuset);
+		CPU_SET(cpu, &cpuset);
+		sched_setaffinity(pids[w], sizeof(cpuset), &cpuset);
+		nanosleep(&ts, NULL);
+	}
+
+	return NULL;
+}
+
+static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario,
+					 const char *scenario_name)
+{
+	struct bpf_link *link;
+	pid_t pids[NUM_WORKERS];
+	pthread_t hammer;
+
+	int i, status;
+	u64 enq_start, deq_start,
+	    dispatch_deq_start, change_deq_start, bpf_queue_full_start;
+	u64 enq_delta, deq_delta,
+	    dispatch_deq_delta, change_deq_delta, bpf_queue_full_delta;
+
+	/* Set the test scenario */
+	skel->bss->test_scenario = scenario;
+
+	/* Record starting counts */
+	enq_start = skel->bss->enqueue_cnt;
+	deq_start = skel->bss->dequeue_cnt;
+	dispatch_deq_start = skel->bss->dispatch_dequeue_cnt;
+	change_deq_start = skel->bss->change_dequeue_cnt;
+	bpf_queue_full_start = skel->bss->bpf_queue_full;
+
+	link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops);
+	SCX_FAIL_IF(!link, "Failed to attach struct_ops for scenario %s", scenario_name);
+
+	/* Fork worker processes to generate enqueue/dequeue events */
+	for (i = 0; i < NUM_WORKERS; i++) {
+		pids[i] = fork();
+		SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i);
+
+		if (pids[i] == 0) {
+			worker_fn(i);
+			/* Should not reach here */
+			exit(1);
+		}
+	}
+
+	/*
+	 * Run an "affinity hammer" so that some property changes hit tasks
+	 * while they are still in BPF custody (e.g. in user DSQ or BPF queue),
+	 * triggering SCX_DEQ_SCHED_CHANGE dequeues in scenarios 2, 5, and 6.
+	 */
+	SCX_FAIL_IF(pthread_create(&hammer, NULL, affinity_hammer_fn, pids) != 0,
+		    "Failed to create affinity hammer thread");
+	pthread_join(hammer, NULL);
+
+	/* Wait for all workers to complete */
+	for (i = 0; i < NUM_WORKERS; i++) {
+		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
+			    "Failed to wait for worker %d", i);
+		SCX_FAIL_IF(status != 0, "Worker %d exited with status %d", i, status);
+	}
+
+	bpf_link__destroy(link);
+
+	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG));
+
+	/* Calculate deltas */
+	enq_delta = skel->bss->enqueue_cnt - enq_start;
+	deq_delta = skel->bss->dequeue_cnt - deq_start;
+	dispatch_deq_delta = skel->bss->dispatch_dequeue_cnt - dispatch_deq_start;
+	change_deq_delta = skel->bss->change_dequeue_cnt - change_deq_start;
+	bpf_queue_full_delta = skel->bss->bpf_queue_full - bpf_queue_full_start;
+
+	printf("%s:\n", scenario_name);
+	printf("  enqueues: %lu\n", (unsigned long)enq_delta);
+	printf("  dequeues: %lu (dispatch: %lu, property_change: %lu)\n",
+	       (unsigned long)deq_delta,
+	       (unsigned long)dispatch_deq_delta,
+	       (unsigned long)change_deq_delta);
+	printf("  BPF queue full: %lu\n", (unsigned long)bpf_queue_full_delta);
+
+	/*
+	 * Validate enqueue/dequeue lifecycle tracking.
+	 *
+	 * For scenarios 0, 1, 3, 4 (local and global DSQs from
+	 * ops.select_cpu() and ops.enqueue()), both enqueues and dequeues
+	 * should be 0 because tasks bypass the BPF scheduler entirely:
+	 * tasks never enter BPF scheduler's custody.
+	 *
+	 * For scenarios 2, 5, 6 (user DSQ or BPF internal queue) we expect
+	 * both enqueues and dequeues.
+	 *
+	 * The BPF code does strict state machine validation with
+	 * scx_bpf_error() to ensure the workflow semantics are correct.
+	 *
+	 * If we reach this point without errors, the semantics are
+	 * validated correctly.
+	 */
+	if (scenario == 0 || scenario == 1 ||
+	    scenario == 3 || scenario == 4) {
+		/* Tasks bypass BPF scheduler completely */
+		SCX_EQ(enq_delta, 0);
+		SCX_EQ(deq_delta, 0);
+		SCX_EQ(dispatch_deq_delta, 0);
+		SCX_EQ(change_deq_delta, 0);
+	} else {
+		/*
+		 * User DSQ from ops.enqueue() or ops.select_cpu(): tasks
+		 * enter BPF scheduler's custody.
+		 *
+		 * Also validate 1:1 enqueue/dequeue pairing.
+		 */
+		SCX_GT(enq_delta, 0);
+		SCX_GT(deq_delta, 0);
+		SCX_EQ(enq_delta, deq_delta);
+	}
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct dequeue *skel;
+
+	skel = dequeue__open();
+	SCX_FAIL_IF(!skel, "Failed to open skel");
+	SCX_ENUM_INIT(skel);
+	SCX_FAIL_IF(dequeue__load(skel), "Failed to load skel");
+
+	*ctx = skel;
+
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct dequeue *skel = ctx;
+	enum scx_test_status status;
+
+	status = run_scenario(skel, 0, "Scenario 0: Local DSQ from ops.select_cpu()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_scenario(skel, 1, "Scenario 1: Global DSQ from ops.select_cpu()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_scenario(skel, 2, "Scenario 2: User DSQ from ops.select_cpu()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_scenario(skel, 3, "Scenario 3: Local DSQ from ops.enqueue()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_scenario(skel, 4, "Scenario 4: Global DSQ from ops.enqueue()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_scenario(skel, 5, "Scenario 5: User DSQ from ops.enqueue()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	status = run_scenario(skel, 6, "Scenario 6: BPF queue from ops.enqueue()");
+	if (status != SCX_TEST_PASS)
+		return status;
+
+	printf("\n=== Summary ===\n");
+	printf("Total enqueues: %lu\n", (unsigned long)skel->bss->enqueue_cnt);
+	printf("Total dequeues: %lu\n", (unsigned long)skel->bss->dequeue_cnt);
+	printf("  Dispatch dequeues: %lu (no flag, normal workflow)\n",
+	       (unsigned long)skel->bss->dispatch_dequeue_cnt);
+	printf("  Property change dequeues: %lu (SCX_DEQ_SCHED_CHANGE flag)\n",
+	       (unsigned long)skel->bss->change_dequeue_cnt);
+	printf("  BPF queue full: %lu\n",
+	       (unsigned long)skel->bss->bpf_queue_full);
+	printf("\nAll scenarios passed - no state machine violations detected\n");
+	printf("-> Validated: Local DSQ dispatch bypasses BPF scheduler\n");
+	printf("-> Validated: Global DSQ dispatch bypasses BPF scheduler\n");
+	printf("-> Validated: User DSQ dispatch triggers ops.dequeue() callbacks\n");
+	printf("-> Validated: Dispatch dequeues have no flags (normal workflow)\n");
+	printf("-> Validated: Property change dequeues have SCX_DEQ_SCHED_CHANGE flag\n");
+	printf("-> Validated: No duplicate enqueues or invalid state transitions\n");
+
+	return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+	struct dequeue *skel = ctx;
+
+	dequeue__destroy(skel);
+}
+
+struct scx_test dequeue_test = {
+	.name = "dequeue",
+	.description = "Verify ops.dequeue() semantics",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+
+REGISTER_SCX_TEST(&dequeue_test)
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
@ 2026-02-10 23:20   ` Tejun Heo
  2026-02-11 16:06     ` Andrea Righi
  2026-02-10 23:54   ` Tejun Heo
  1 sibling, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-10 23:20 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> +/**
> + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> + * @dsq_id: DSQ ID to check
> + *
> + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> + * scheduler is considered "done" with the task.
> + *
> + * Builtin DSQs include:
> + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> + *    where tasks go directly to execution,
> + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> + *  - Bypass DSQ: used during bypass mode.
> + *
> + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> + * trigger ops.dequeue() when they are later consumed.
> + */
> +static inline bool is_terminal_dsq(u64 dsq_id)
> +{
> +	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
> +}

Please use () to clarify ordering between & and &&. It's just visually
confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
@dsq_id and then it can just do:

        return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;

because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
id, and the above code positively identifies what's terminal.

> -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> +			     struct scx_dispatch_q *dsq,
>  			     struct task_struct *p, u64 enq_flags)

While minor, this patch would be easier to read if the @rq addition were
done in a separate patch.

> +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> +			      struct task_struct *p, u64 deq_flags,
> +			      bool is_sched_change)

Isn't @is_sched_change a bit of a misnomer given that it needs to exclude
SCX_DEQ_CORE_SCHED_EXEC? I wonder whether it'd be easier if @deq_flags
handling is separated out. This part is ops_dequeue() specific, right?
Everyone else statically knows what DEQ flags to use. That might make
ops_dequeue() calculate flags unnecessarily but ops_dequeue() is not
particularly hot, so I don't think that'd matter.

> +{
> +	if (SCX_HAS_OP(sch, dequeue)) {
> +		/*
> +		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
> +		 * property change (not sleep or core-sched pick).
> +		 */
> +		if (is_sched_change &&
> +		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> +			deq_flags |= SCX_DEQ_SCHED_CHANGE;
> +
> +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
> +	}
> +	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;

Let's move flag clearing to the call sites. It's a bit confusing w/ the
function name.

>  static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  {
>  	struct scx_sched *sch = scx_root;
> @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>  
>  	switch (opss & SCX_OPSS_STATE_MASK) {
>  	case SCX_OPSS_NONE:
> +		/*
> +		 * If the task is still in BPF scheduler's custody
> +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> +		 */
> +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> +			call_task_dequeue(sch, rq, p, deq_flags, true);

Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
responsible for clearing IN_CUSTODY too?

> @@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  					 struct scx_dispatch_q *src_dsq,
>  					 struct rq *dst_rq)
>  {
> +	struct scx_sched *sch = scx_root;
>  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
>  
>  	/* @dsq is locked and @p is on @dst_rq */
> @@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
>  
>  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
>  
> +	/*
> +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> +	 * Call ops.dequeue() if the task was in BPF custody.
> +	 */
> +	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
> +		if (SCX_HAS_OP(sch, dequeue))
> +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> +		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> +	}

I think a better place to put this would be inside local_dsq_post_enq() so
that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
path. This would mean breaking out local and global cases in
dispatch_enqueue(). ie. at the end of dispatch_enqueue():

        if (is_local) {
                local_dsq_post_enq(...);
        } else {
                if (dsq->id == SCX_DSQ_GLOBAL)
                        global_dsq_post_enq(...);       /* or open code with comment */
                raw_spin_unlock(&dsq->lock);
        }

> @@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
>  		!WARN_ON_ONCE(src_rq != task_rq(p));
>  }
>  
> -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> -				struct scx_dispatch_q *dsq, struct rq *src_rq)
> +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> +			       struct task_struct *p,
> +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
>  {
>  	raw_spin_rq_unlock(this_rq);
>  
>  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> +		/*
> +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> +		 * Call ops.dequeue() if the task was in BPF custody.
> +		 */
> +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> +			call_task_dequeue(sch, src_rq, p, 0, false);

and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
and reactivates the task. The deactivation invokes ops_dequeue() but that
should suppress dequeue invocation as that's internal transfer (this is
discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
trigger dequeue invocation, right?

> @@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
>  						     src_dsq, dst_rq);
>  			raw_spin_unlock(&src_dsq->lock);
>  		} else {
> +			/*
> +			 * Moving to a local DSQ, dispatch_enqueue() is not
> +			 * used, so call ops.dequeue() here if the task was
> +			 * in BPF scheduler's custody.
> +			 */
> +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> +				call_task_dequeue(sch, src_rq, p, 0, false);

and then this becomes unnecessary too.

> @@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>  		 */
>  		if (src_rq == dst_rq) {
>  			p->scx.holding_cpu = -1;
> -			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> +			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
>  					 enq_flags);
>  		} else {
> +			/*
> +			 * Moving to a local DSQ, dispatch_enqueue() is not
> +			 * used, so call ops.dequeue() here if the task was
> +			 * in BPF scheduler's custody.
> +			 */
> +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> +				call_task_dequeue(sch, src_rq, p, 0, false);

ditto.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
  2026-02-10 23:20   ` Tejun Heo
@ 2026-02-10 23:54   ` Tejun Heo
  2026-02-11 16:07     ` Andrea Righi
  1 sibling, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-10 23:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

One more comment.

On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> @@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
>  	 * dequeue may be waiting. The store_release matches their load_acquire.
>  	 */
>  	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> +
> +	/*
> +	 * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY
> +	 * so ops.dequeue() is called when it leaves custody.
> +	 */
> +	p->scx.flags |= SCX_TASK_IN_CUSTODY;

As this is protected by task's rq lock, doing it here is okay but can you
move this above atomic_long_set_release()? That's conceptually more
straightforward as that set_release() is supposed to be the "I'm done with
this task" point.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 23:20   ` Tejun Heo
@ 2026-02-11 16:06     ` Andrea Righi
  2026-02-11 19:47       ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-11 16:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hi Tejun,

On Tue, Feb 10, 2026 at 01:20:11PM -1000, Tejun Heo wrote:
> On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> > +/**
> > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> > + * @dsq_id: DSQ ID to check
> > + *
> > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> > + * scheduler is considered "done" with the task.
> > + *
> > + * Builtin DSQs include:
> > + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > + *    where tasks go directly to execution,
> > + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> > + *  - Bypass DSQ: used during bypass mode.
> > + *
> > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> > + * trigger ops.dequeue() when they are later consumed.
> > + */
> > +static inline bool is_terminal_dsq(u64 dsq_id)
> > +{
> > +	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
> > +}
> 
> Please use () to clarify ordering between & and &&. It's just visually
> confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
> @dsq_id and then it can just do:
> 
>         return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> 
> because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> id, and the above code positively identifies what's terminal.

Ok, but we also need to include SCX_DSQ_BYPASS; in that case maybe checking
SCX_DSQ_FLAG_BUILTIN is more generic?

> 
> > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > +			     struct scx_dispatch_q *dsq,
> >  			     struct task_struct *p, u64 enq_flags)
> 
> While minor, this patch would be easier to read if the @rq addition were
> done in a separate patch.

Ack. I'll split that out.

> 
> > +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> > +			      struct task_struct *p, u64 deq_flags,
> > +			      bool is_sched_change)
> 
> Isn't @is_sched_change a bit of a misnomer given that it needs to exclude
> SCX_DEQ_CORE_SCHED_EXEC? I wonder whether it'd be easier if @deq_flags
> handling is separated out. This part is ops_dequeue() specific, right?
> Everyone else statically knows what DEQ flags to use. That might make
> ops_dequeue() calculate flags unnecessarily but ops_dequeue() is not
> particularly hot, so I don't think that'd matter.

Ack, I'll handle deq_flags in ops_dequeue() and simplify
call_task_dequeue() accordingly.

> 
> > +{
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		/*
> > +		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
> > +		 * property change (not sleep or core-sched pick).
> > +		 */
> > +		if (is_sched_change &&
> > +		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +			deq_flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
> > +	}
> > +	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> 
> Let's move flag clearing to the call sites. It's a bit confusing w/ the
> function name.

Ack.

> 
> >  static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  {
> >  	struct scx_sched *sch = scx_root;
> > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * If the task is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> 
> Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> responsible for clearing IN_CUSTODY too?

The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
to a remote CPU's local DSQ, we only set ops_state to NONE, so a concurrent
dequeue can proceed, but we only clear IN_CUSTODY when we later enqueue or
move the task. So we can see NONE + IN_CUSTODY here and need to handle it.
And we can't clear IN_CUSTODY at the same time we set NONE there, because
we don't hold the task's rq lock yet and we can't trigger ops.dequeue().

> 
> > @@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  					 struct scx_dispatch_q *src_dsq,
> >  					 struct rq *dst_rq)
> >  {
> > +	struct scx_sched *sch = scx_root;
> >  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >  
> >  	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  
> >  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >  
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +	 * Call ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
> > +		if (SCX_HAS_OP(sch, dequeue))
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > +	}
> 
> I think a better place to put this would be inside local_dsq_post_enq() so
> that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
> path. This would mean breaking out local and global cases in
> dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> 
>         if (is_local) {
>                 local_dsq_post_enq(...);
>         } else {
>                 if (dsq->id == SCX_DSQ_GLOBAL)
>                         global_dsq_post_enq(...);       /* or open code with comment */
>                 raw_spin_unlock(&dsq->lock);
>         }

Agreed, I'll move this into local_dsq_post_enq() and introduce
a global_dsq_post_enq().

> 
> > @@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
> >  		!WARN_ON_ONCE(src_rq != task_rq(p));
> >  }
> >  
> > -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> > -				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > +			       struct task_struct *p,
> > +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
> >  {
> >  	raw_spin_rq_unlock(this_rq);
> >  
> >  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > +		/*
> > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +		 * Call ops.dequeue() if the task was in BPF custody.
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
> and reactivates the task. The deactivation invokes ops_dequeue() but that
> should suppress dequeue invocation as that's internal transfer (this is
> discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
> gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
> trigger dequeue invocation, right?

Should we trigger ops.dequeue() when the task is dequeued inside
move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
deactivate_task() there) instead of suppressing it and invoking on the
target in local_dsq_post_enq()?

That way the BPF scheduler sees a dequeue on the source and then an
enqueue on the target, we avoid special-casing SCX_TASK_IN_CUSTODY in
do_enqueue_task(), and the "when to call dequeue" logic stays consistent
in ops_dequeue() and the terminal local/global post_enq paths.

Does it make sense or would you rather suppress it and only invoke on the
target when the task lands on the local DSQ?

> 
> > @@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  						     src_dsq, dst_rq);
> >  			raw_spin_unlock(&src_dsq->lock);
> >  		} else {
> > +			/*
> > +			 * Moving to a local DSQ, dispatch_enqueue() is not
> > +			 * used, so call ops.dequeue() here if the task was
> > +			 * in BPF scheduler's custody.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +				call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and then this becomes unnecessary too.

Ack + same comment about consume_remote_task().

> 
> > @@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> >  		 */
> >  		if (src_rq == dst_rq) {
> >  			p->scx.holding_cpu = -1;
> > -			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> > +			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
> >  					 enq_flags);
> >  		} else {
> > +			/*
> > +			 * Moving to a local DSQ, dispatch_enqueue() is not
> > +			 * used, so call ops.dequeue() here if the task was
> > +			 * in BPF scheduler's custody.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +				call_task_dequeue(sch, src_rq, p, 0, false);
> 
> ditto.

Ack + same as above.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-10 23:54   ` Tejun Heo
@ 2026-02-11 16:07     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-11 16:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Tue, Feb 10, 2026 at 01:54:39PM -1000, Tejun Heo wrote:
> One more comment.
> 
> On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> > @@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> >  	 * dequeue may be waiting. The store_release matches their load_acquire.
> >  	 */
> >  	atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> > +
> > +	/*
> > +	 * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY
> > +	 * so ops.dequeue() is called when it leaves custody.
> > +	 */
> > +	p->scx.flags |= SCX_TASK_IN_CUSTODY;
> 
> As this is protected by task's rq lock, doing it here is okay but can you
> move this above atomic_long_set_release()? That's conceptually more
> straightforward as that set_release() is supposed to be the "I'm done with
> this task" point.

Agreed, it definitely looks more correct to move this before the
atomic_long_set_release().

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 16:06     ` Andrea Righi
@ 2026-02-11 19:47       ` Tejun Heo
  2026-02-11 22:34         ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-11 19:47 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Wed, Feb 11, 2026 at 05:06:20PM +0100, Andrea Righi wrote:
...
> > Please use () to clarify ordering between & and &&. It's just visually
> > confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
> > @dsq_id and then it can just do:
> > 
> >         return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> > 
> > because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> > id, and the above code positively identifies what's terminal.
> 
> Ok, but we also need to include SCX_DSQ_BYPASS; in that case maybe checking
> SCX_DSQ_FLAG_BUILTIN is more generic?

Ah, forgot about that. Hmm... we can do:

        switch (dsq->id) {
                case SCX_DSQ_LOCAL:
                case SCX_DSQ_GLOBAL:
                case SCX_DSQ_BYPASS:
                        return true;
                default:
                        return false;
        }

I just feel iffy about not being specific. Easier to make mistakes in the
future and more difficult to notice after doing so, but I think this point
is kinda moot. If we break up LOCAL and GLOBAL/BYPASS handling into separate
paths in dispatch_enqueue(), we won't need this function anyway.

> > > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > >  
> > >  	switch (opss & SCX_OPSS_STATE_MASK) {
> > >  	case SCX_OPSS_NONE:
> > > +		/*
> > > +		 * If the task is still in BPF scheduler's custody
> > > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > > +		 */
> > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> > 
> > Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> > responsible for clearing IN_CUSTODY too?
> 
> The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
> dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
> to a remote CPU's local DSQ, we only set ops_state to NONE, so a concurrent
> dequeue can proceed, but we only clear IN_CUSTODY when we later enqueue or
> move the task. So we can see NONE + IN_CUSTODY here and need to handle it.
> And we can't clear IN_CUSTODY at the same time we set NONE there, because
> we don't hold the task's rq lock yet and we can't trigger ops.dequeue().

I see. Can you please add a comment with the above?

...
> > I think a better place to put this would be inside local_dsq_post_enq() so
> > that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
> > path. This would mean breaking out local and global cases in
> > dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> > 
> >         if (is_local) {
> >                 local_dsq_post_enq(...);
> >         } else {
> >                 if (dsq->id == SCX_DSQ_GLOBAL)
> >                         global_dsq_post_enq(...);       /* or open code with comment */
> >                 raw_spin_unlock(&dsq->lock);
> >         }
> 
> Agreed, I'll move this into local_dsq_post_enq() and introduce
> a global_dsq_post_enq().

Yeah, and as you pointed out, BYPASS.

> > > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > > +			       struct task_struct *p,
> > > +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
> > >  {
> > >  	raw_spin_rq_unlock(this_rq);
> > >  
> > >  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > > +		/*
> > > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > > +		 * Call ops.dequeue() if the task was in BPF custody.
> > > +		 */
> > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > +			call_task_dequeue(sch, src_rq, p, 0, false);
> > 
> > and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
> > and reactivates the task. The deactivation invokes ops_dequeue() but that
> > should suppress dequeue invocation as that's internal transfer (this is
> > discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
> > gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
> > trigger dequeue invocation, right?
> 
> Should we trigger ops.dequeue() when the task is dequeued inside
> move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
> deactivate_task() there) instead of suppressing it and invoking on the
> target in local_dsq_post_enq()?
> 
> That way the BPF scheduler sees a dequeue on the source and then an
> enqueue on the target, we avoid special-casing SCX_TASK_IN_CUSTODY in
> do_enqueue_task(), and the "when to call dequeue" logic stays consistent
> in ops_dequeue() and the terminal local/global post_enq paths.
> 
> Does it make sense or would you rather suppress it and only invoke on the
> target when the task lands on the local DSQ?

The end result is about the same because whenever we migrate we're sending
it to the local DSQ of the destination CPU, so whether we generate the event
on deactivation of the source CPU or activation on the destination doesn't
make a *whole* lot of difference. However, conceptually, migrations are
internal events. There isn't anything actionable for the BPF scheduler. The
reason why ops.dequeue() should be emitted is not because the task is
changing CPUs (which caused the deactivation) but the fact that it ends up
in a local DSQ afterwards. I think it'll be cleaner both conceptually and
code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
paths.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 19:47       ` Tejun Heo
@ 2026-02-11 22:34         ` Andrea Righi
  2026-02-11 22:37           ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-11 22:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed, Feb 11, 2026 at 09:47:57AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 11, 2026 at 05:06:20PM +0100, Andrea Righi wrote:
> ...
> > > Please use () to clarify ordering between & and &&. It's just visually
> > > confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
> > > @dsq_id and then it can just do:
> > > 
> > >         return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> > > 
> > > because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> > > id, and the above code positively identifies what's terminal.
> > 
> > Ok, but we also need to include SCX_DSQ_BYPASS; in that case maybe checking
> > SCX_DSQ_FLAG_BUILTIN is more generic?
> 
> Ah, forgot about that. Hmm... we can do:
> 
>         switch (dsq->id) {
>                 case SCX_DSQ_LOCAL:
>                 case SCX_DSQ_GLOBAL:
>                 case SCX_DSQ_BYPASS:
>                         return true;
>                 default:
>                         return false;
>         }
> 
> I just feel iffy about not being specific. Easier to make mistakes in the
> future and more difficult to notice after doing so, but I think this point
> is kinda moot. If we break up LOCAL and GLOBAL/BYPASS handling into separate
> paths in dispatch_enqueue(), we won't need this function anyway.

Ack, makes sense.

> 
> > > > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > > >  
> > > >  	switch (opss & SCX_OPSS_STATE_MASK) {
> > > >  	case SCX_OPSS_NONE:
> > > > +		/*
> > > > +		 * If the task is still in BPF scheduler's custody
> > > > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > > > +		 */
> > > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> > > 
> > > Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> > > responsible for clearing IN_CUSTODY too?
> > 
> > The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
> > dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
> > to a remote CPU's local DSQ, we only set ops_state to NONE, so a concurrent
> > dequeue can proceed, but we only clear IN_CUSTODY when we later enqueue or
> > move the task. So we can see NONE + IN_CUSTODY here and need to handle it.
> > And we can't clear IN_CUSTODY at the same time we set NONE there, because
> > we don't hold the task's rq lock yet and we can't trigger ops.dequeue().
> 
> I see. Can you please add a comment with the above?

Ok.

> 
> ...
> > > I think a better place to put this would be inside local_dsq_post_enq() so
> > > that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
> > > path. This would mean breaking out local and global cases in
> > > dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> > > 
> > >         if (is_local) {
> > >                 local_dsq_post_enq(...);
> > >         } else {
> > >                 if (dsq->id == SCX_DSQ_GLOBAL)
> > >                         global_dsq_post_enq(...);       /* or open code with comment */
> > >                 raw_spin_unlock(&dsq->lock);
> > >         }
> > 
> > Agreed, I'll move this into local_dsq_post_enq() and introduce
> > a global_dsq_post_enq().
> 
> Yeah, and as you pointed out, BYPASS.

Ok.

> 
> > > > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > > > +			       struct task_struct *p,
> > > > +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
> > > >  {
> > > >  	raw_spin_rq_unlock(this_rq);
> > > >  
> > > >  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > > > +		/*
> > > > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > > > +		 * Call ops.dequeue() if the task was in BPF custody.
> > > > +		 */
> > > > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > > > +			call_task_dequeue(sch, src_rq, p, 0, false);
> > > 
> > > and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
> > > and reactivates the task. The deactivation invokes ops_dequeue() but that
> > > should suppress dequeue invocation as that's internal transfer (this is
> > > discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
> > > gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
> > > trigger dequeue invocation, right?
> > 
> > Should we trigger ops.dequeue() when the task is dequeued inside
> > move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
> > deactivate_task() there) instead of suppressing it and invoking on the
> > target in local_dsq_post_enq()?
> > 
> > That way the BPF scheduler sees a dequeue on the source and then an
> > enqueue on the target, we avoid special-casing SCX_TASK_IN_CUSTODY in
> > do_enqueue_task(), and the "when to call dequeue" logic stays consistent
> > in ops_dequeue() and the terminal local/global post_enq paths.
> > 
> > Does it make sense or would you rather suppress it and only invoke on the
> > target when the task lands on the local DSQ?
> 
> The end result is about the same because whenever we migrate we're sending
> it to the local DSQ of the destination CPU, so whether we generate the event
> on deactivation of the source CPU or activation on the destination doesn't
> make a *whole* lot of difference. However, conceptually, migrations are
> internal events. There isn't anything actionable for the BPF scheduler. The
> reason why ops.dequeue() should be emitted is not because the task is
> changing CPUs (which caused the deactivation) but the fact that it ends up
> in a local DSQ afterwards. I think it'll be cleaner both conceptually and
> code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
> paths.

Does this include core scheduler migrations or just SCX-initiated
migrations (move_remote_task_to_local_dsq())?

Because with core scheduler migrations we trigger ops.enqueue(), so we
should also trigger ops.dequeue(). Or we need to send the task straight to
local to prevent calling ops.enqueue().

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 22:34         ` Andrea Righi
@ 2026-02-11 22:37           ` Tejun Heo
  2026-02-11 22:48             ` Andrea Righi
  2026-02-12 10:16             ` Andrea Righi
  0 siblings, 2 replies; 83+ messages in thread
From: Tejun Heo @ 2026-02-11 22:37 UTC (permalink / raw)
  To: Andrea Righi
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> > The end result is about the same because whenever we migrate we're sending
> > it to the local DSQ of the destination CPU, so whether we generate the event
> > on deactivation of the source CPU or activation on the destination doesn't
> > make *whole* lot of difference. However, conceptually, migrations are
> > internal events. There isn't anything actionable for the BPF scheduler. The
> > reason why ops.dequeue() should be emitted is not because the task is
> > changing CPUs (which caused the deactivation) but the fact that it ends up
> > in a local DSQ afterwards. I think it'll be cleaner both conceptually and
> > code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
> > paths.
> 
> Does this include core scheduler migrations or just SCX-initiated
> migrations (move_remote_task_to_local_dsq())?
> 
> Because with core scheduler migrations we trigger ops.enqueue(), so we
> should also trigger ops.dequeue(). Or we need to send the task straight to
> local to prevent calling ops.enqueue().

I'm a bit lost. Can you elaborate on core scheduler migrations triggering
ops.enqueue()?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 22:37           ` Tejun Heo
@ 2026-02-11 22:48             ` Andrea Righi
  2026-02-12 10:16             ` Andrea Righi
  1 sibling, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-11 22:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> > > The end result is about the same because whenever we migrate we're sending
> > > it to the local DSQ of the destination CPU, so whether we generate the event
> > > on deactivation of the source CPU or activation on the destination doesn't
> > > make *whole* lot of difference. However, conceptually, migrations are
> > > internal events. There isn't anything actionable for the BPF scheduler. The
> > > reason why ops.dequeue() should be emitted is not because the task is
> > > changing CPUs (which caused the deactivation) but the fact that it ends up
> > > in a local DSQ afterwards. I think it'll be cleaner both conceptually and
> > > code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
> > > paths.
> > 
> > Does this include core scheduler migrations or just SCX-initiated
> > migrations (move_remote_task_to_local_dsq())?
> > 
> > Because with core scheduler migrations we trigger ops.enqueue(), so we
> > should also trigger ops.dequeue(). Or we need to send the task straight to
> > local to prevent calling ops.enqueue().
> 
> I'm a bit lost. Can you elaborate on core scheduler migrations triggering
> ops.enqueue()?

Nevermind, just ignore that comment, we clearly want to trigger
ops.dequeue/enqueue() in that case, it's the whole point of
SCX_DEQ_SCHED_CHANGE. I should probably go to bed and get some sleep. :)

-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-11 22:37           ` Tejun Heo
  2026-02-11 22:48             ` Andrea Righi
@ 2026-02-12 10:16             ` Andrea Righi
  2026-02-12 14:32               ` Christian Loehle
  1 sibling, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-12 10:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Christian Loehle, Daniel Hodges, sched-ext, linux-kernel

On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> > > The end result is about the same because whenever we migrate we're sending
> > > it to the local DSQ of the destination CPU, so whether we generate the event
> > > on deactivation of the source CPU or activation on the destination doesn't
> > > make *whole* lot of difference. However, conceptually, migrations are
> > > internal events. There isn't anything actionable for the BPF scheduler. The
> > > reason why ops.dequeue() should be emitted is not because the task is
> > > changing CPUs (which caused the deactivation) but the fact that it ends up
> > > in a local DSQ afterwards. I think it'll be cleaner both conceptually and
> > > code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
> > > paths.
> > 
> > Does this include core scheduler migrations or just SCX-initiated
> > migrations (move_remote_task_to_local_dsq())?
> > 
> > Because with core scheduler migrations we trigger ops.enqueue(), so we
> > should also trigger ops.dequeue(). Or we need to send the task straight to
> > local to prevent calling ops.enqueue().
> 
> I'm a bit lost. Can you elaborate on core scheduler migrations triggering
> ops.enqueue()?

Alright, let me elaborate more on this with a (slightly) fresher brain.

We have two main classes of migrations:

 1) Internal SCX-initiated migrations: e.g.,
    dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
    consume_remote_task() -> move_remote_task_to_local_dsq(), these
    are completely internal to SCX and shouldn't trigger
    ops.dequeue/enqueue()

 2) Core scheduler migrations
  - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
    affine_move_task -> move_queued_task migrates it -> we trigger
    ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue() on
    the target.

  - Core scheduling (CONFIG_SCHED_CORE): two different cases:
    - Migration (task moved between runqueues via move_queued_task_locked()
      to satisfy core cookie)

  - NUMA balancing: migrate_task_to() can move an SCX task to another CPU

  - CPU hotplug: on CPU down, runnable tasks are pushed off via
    __balance_push_cpu_stop() -> __migrate_task()

If we want to skip ops.dequeue() only for internal SCX migrations (and
maybe also for NUMA and hotplug?), then only checking
task_on_rq_migrating(p) is not enough, because that's true for every
migration listed above and we'd skip all of them.

So, we need a way to mark "this migration is internal to SCX", like a new
SCX_TASK_MIGRATING_INTERNAL flag?

The alternative is to always trigger ops.dequeue/enqueue() on every
migration (no flag): even for internal SCX migrations the BPF scheduler
could use it to track task movements, though there's nothing it can do.
That way we don't need the additional flag.

Does one of these directions fit better with what you have in mind?

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 10:16             ` Andrea Righi
@ 2026-02-12 14:32               ` Christian Loehle
  2026-02-12 15:45                 ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Christian Loehle @ 2026-02-12 14:32 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo
  Cc: David Vernet, Changwoo Min, Kuba Piecuch, Emil Tsalapatis,
	Daniel Hodges, sched-ext, linux-kernel

On 2/12/26 10:16, Andrea Righi wrote:
> On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
>>>> The end result is about the same because whenever we migrate we're sending
>>>> it to the local DSQ of the destination CPU, so whether we generate the event
>>>> on deactivation of the source CPU or activation on the destination doesn't
>>>> make *whole* lot of difference. However, conceptually, migrations are
>>>> internal events. There isn't anything actionable for the BPF scheduler. The
>>>> reason why ops.dequeue() should be emitted is not because the task is
>>>> changing CPUs (which caused the deactivation) but the fact that it ends up
>>>> in a local DSQ afterwards. I think it'll be cleaner both conceptually and
>>>> code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
>>>> paths.
>>>
>>> Does this include core scheduler migrations or just SCX-initiated
>>> migrations (move_remote_task_to_local_dsq())?
>>>
>>> Because with core scheduler migrations we trigger ops.enqueue(), so we
>>> should also trigger ops.dequeue(). Or we need to send the task straight to
>>> local to prevent calling ops.enqueue().
>>
>> I'm a bit lost. Can you elaborate on core scheduler migrations triggering
>> ops.enqueue()?
> 
> Alright, let me re-elaborate more on this with a (slightly) fresher brain.
> 
> We have two main classes of migrations:
> 
>  1) Internal SCX-initiated migrations: e.g.,
>     dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
>     consume_remote_task() -> move_remote_task_to_local_dsq(), these
>     are completely internal to SCX and shouldn't trigger
>     ops.dequeue/enqueue()
> 
>  2) Core scheduler migrations
>   - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
>     affine_move_task -> move_queued_task migrates it -> we trigger
>     ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue() on
>     the target.
> 
>   - Core scheduling (CONFIG_SCHED_CORE): two different cases:
>     - Migration (task moved between runqueues via move_queued_task_locked()
>       to satisfy core cookie)
> 
>   - NUMA balancing: migrate_task_to() can move an SCX task to another CPU
> 
>   - CPU hotplug: on CPU down, runnable tasks are pushed off via
>     __balance_push_cpu_stop() -> __migrate_task()
> 
> If we want to skip ops.dequeue() only for internal SCX migrations (and
> maybe also for NUMA and hotplug?), then only checking
> task_on_rq_migrating(p) is not enough, because that's true for every
> migration listed above and we'd skip all of them.
> 
> So, we need a way to mark "this migration is internal to SCX", like a new
> SCX_TASK_MIGRATING_INTERNAL flag?
> 
> The alternative is to always trigger ops.dequeue/enqueue() on every
> migration (no flag): even for internal SCX migrations the BPF scheduler
> could use it to track task movements, though there's nothing it can do.
> That way we don't need the additional flag.
> 
> Does one of these directions fit better with what you have in mind?
IIUC one example might sway your opinion (or not):
Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ
(and maybe being enqueued at another) prevents e.g. accurate PELT load
tracking on the BPF side.
Regular utilization tracking works through ops.running() and
ops.stopping(), but I don't think load can be implemented accurately.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 14:32               ` Christian Loehle
@ 2026-02-12 15:45                 ` Andrea Righi
  2026-02-12 17:07                   ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-12 15:45 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Thu, Feb 12, 2026 at 02:32:02PM +0000, Christian Loehle wrote:
> On 2/12/26 10:16, Andrea Righi wrote:
> > On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> >>>> The end result is about the same because whenever we migrate we're sending
> >>>> it to the local DSQ of the destination CPU, so whether we generate the event
> >>>> on deactivation of the source CPU or activation on the destination doesn't
> >>>> make *whole* lot of difference. However, conceptually, migrations are
> >>>> internal events. There isn't anything actionable for the BPF scheduler. The
> >>>> reason why ops.dequeue() should be emitted is not because the task is
> >>>> changing CPUs (which caused the deactivation) but the fact that it ends up
> >>>> in a local DSQ afterwards. I think it'll be cleaner both conceptually and
> >>>> code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
> >>>> paths.
> >>>
> >>> Does this include core scheduler migrations or just SCX-initiated
> >>> migrations (move_remote_task_to_local_dsq())?
> >>>
> >>> Because with core scheduler migrations we trigger ops.enqueue(), so we
> >>> should also trigger ops.dequeue(). Or we need to send the task straight to
> >>> local to prevent calling ops.enqueue().
> >>
> >> I'm a bit lost. Can you elaborate on core scheduler migrations triggering
> >> ops.enqueue()?
> > 
> > Alright, let me re-elaborate more on this with a (slightly) fresher brain.
> > 
> > We have two main classes of migrations:
> > 
> >  1) Internal SCX-initiated migrations: e.g.,
> >     dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
> >     consume_remote_task() -> move_remote_task_to_local_dsq(), these
> >     are completely internal to SCX and shouldn't trigger
> >     ops.dequeue/enqueue()
> > 
> >  2) Core scheduler migrations
> >   - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
> >     affine_move_task -> move_queued_task migrates it -> we trigger
> >     ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue() on
> >     the target.
> > 
> >   - Core scheduling (CONFIG_SCHED_CORE): two different cases:
> >     - Migration (task moved between runqueues via move_queued_task_locked()
> >       to satisfy core cookie)
> > 
> >   - NUMA balancing: migrate_task_to() can move an SCX task to another CPU
> > 
> >   - CPU hotplug: on CPU down, runnable tasks are pushed off via
> >     __balance_push_cpu_stop() -> __migrate_task()
> > 
> > If we want to skip ops.dequeue() only for internal SCX migrations (and
> > maybe also for NUMA and hotplug?), then only checking
> > task_on_rq_migrating(p) is not enough, because that's true for every
> > migration listed above and we'd skip all of them.
> > 
> > So, we need a way to mark "this migration is internal to SCX", like a new
> > SCX_TASK_MIGRATING_INTERNAL flag?
> > 
> > The alternative is to always trigger ops.dequeue/enqueue() on every
> > migration (no flag): even for internal SCX migrations the BPF scheduler
> > could use it to track task movements, though there's nothing it can do.
> > That way we don't need the additional flag.
> > 
> > Does one of these directions fit better with what you have in mind?
> IIUC one example might sway your opinion (or not):
> Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ
> (and maybe being enqueued at another) prevents e.g. accurate PELT load
> tracking on the BPF side.
> Regular utilization tracking works through ops.running() and
> ops.stopping(), but I don't think load can be implemented accurately.

It makes sense to me and I think it's actually a valid reason to prefer the
"always trigger" way.

We have DSQs and potentially BPF can have its own queues, but to implement
accurate PELT (runnable contribution to a runqueue, possibly with decay),
we'd also need to know exactly when a task leaves one runqueue and joins
another.

Essentially we could get the full task lifecycle in BPF:
 - runnable lifecycle:
   - ops.dequeue(): task leaves runqueue, source CPU = scx_bpf_task_cpu(p),
   - ops.enqueue(): task wants to run, curr CPU = scx_bpf_task_cpu(p),
 - running lifecycle:
   - ops.running(p): task starts running on scx_bpf_task_cpu(p),
   - ops.stopping(p): task stops running on scx_bpf_task_cpu(p).

A potential concern could be about introducing more overhead, but I don't
think it matters much, especially since schedulers that don't implement
ops.dequeue() effectively pay no cost for these events.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 15:45                 ` Andrea Righi
@ 2026-02-12 17:07                   ` Tejun Heo
  2026-02-12 18:14                     ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-12 17:07 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hello,

On Thu, Feb 12, 2026 at 04:45:43PM +0100, Andrea Righi wrote:
> > > So, we need a way to mark "this migration is internal to SCX", like a new
> > > SCX_TASK_MIGRATING_INTERNAL flag?

Yeah, I think this is what we should do. That's the only ops.dequeue()
without matching ops.enqueue(), right?

...
> > IIUC one example might sway your opinion (or not):
> > Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ
> > (and maybe being enqueued at another) prevents e.g. accurate PELT load
> > tracking on the BPF side.
> > Regular utilization tracking works through ops.running() and 
> > ops.stopping() but load I don't think load can be implemented accurately.
> 
> It makes sense to me and I think it's actually valid reason to prefer the
> "always trigger" way.

I don't think this is a valid argument. PELT is done that way because the
association of the task and the CPU is meaningful for in-kernel schedulers.
The queues are actually per-CPU. For SCX scheds, the relationship is not
known to the kernel. Only the BPF scheduler itself knows, if it wants to
attribute per-task load to a specific CPU, which CPU it should be attributed
to. What's the point of following in-kernel association for PELT if the task
was going to be hot migrated to another CPU on execution?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() semantics
  2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
@ 2026-02-12 17:15   ` Christian Loehle
  2026-02-12 18:25     ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Christian Loehle @ 2026-02-12 17:15 UTC (permalink / raw)
  To: Andrea Righi, Tejun Heo, David Vernet, Changwoo Min
  Cc: Kuba Piecuch, Emil Tsalapatis, Daniel Hodges, sched-ext,
	linux-kernel

On 2/10/26 21:26, Andrea Righi wrote:
> Add a new kselftest to validate that the new ops.dequeue() semantics
> work correctly for all task lifecycle scenarios, including the
> distinction between terminal DSQs (where BPF scheduler is done with the
> task), user DSQs (where BPF scheduler manages the task lifecycle) and
> BPF data structures, regardless of which event performs the dispatch.
> 
> The test validates the following scenarios:
> 
>  - From ops.select_cpu():
>      - scenario 0 (local DSQ): tasks dispatched to the local DSQ bypass
>        the BPF scheduler entirely; they never enter BPF custody, so
>        ops.dequeue() is not called,
>      - scenario 1 (global DSQ): tasks dispatched to SCX_DSQ_GLOBAL also
>        bypass the BPF scheduler, like the local DSQ; ops.dequeue() is
>        not called,
>      - scenario 2 (user DSQ): tasks dispatched to user DSQs from
>        ops.select_cpu(): tasks enter BPF scheduler's custody with full
>        enqueue/dequeue lifecycle tracking and state machine validation,
>        expects 1:1 enqueue/dequeue pairing,
> 
>    - From ops.enqueue():
>      - scenario 3 (local DSQ): same behavior as scenario 0,
>      - scenario 4 (global DSQ): same behavior as scenario 1,
>      - scenario 5 (user DSQ): same behavior as scenario 2,
>      - scenario 6 (BPF internal queue): tasks are stored in a BPF queue
>        from ops.enqueue() and consumed from ops.dispatch(); similarly to
>        scenario 5, tasks enter BPF scheduler's custody with full
>        lifecycle tracking and 1:1 enqueue/dequeue validation.
> 
> This verifies that:
>  - terminal DSQ dispatch (local, global) don't trigger ops.dequeue(),
>  - tasks dispatched to user DSQs, either from ops.select_cpu() or
>    ops.enqueue(), enter BPF scheduler's custody and have exact 1:1
>    enqueue/dequeue pairing,
>  - tasks stored to internal BPF data structures from ops.enqueue() enter
>    BPF scheduler's custody and have exact 1:1 enqueue/dequeue pairing,
>  - dispatch dequeues have no flags (normal workflow),
>  - property change dequeues have the %SCX_DEQ_SCHED_CHANGE flag set,
>  - no duplicate enqueues or invalid state transitions are happening.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Emil Tsalapatis <emil@etsalapatis.com>
> Cc: Kuba Piecuch <jpiecuch@google.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
>  tools/testing/selftests/sched_ext/Makefile    |   1 +
>  .../testing/selftests/sched_ext/dequeue.bpf.c | 368 ++++++++++++++++++
>  tools/testing/selftests/sched_ext/dequeue.c   | 265 +++++++++++++
>  3 files changed, 634 insertions(+)
>  create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/dequeue.c
> 
> diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
> index 5fe45f9c5f8fd..764e91edabf93 100644
> --- a/tools/testing/selftests/sched_ext/Makefile
> +++ b/tools/testing/selftests/sched_ext/Makefile
> @@ -161,6 +161,7 @@ all_test_bpfprogs := $(foreach prog,$(wildcard *.bpf.c),$(INCLUDE_DIR)/$(patsubs
>  
>  auto-test-targets :=			\
>  	create_dsq			\
> +	dequeue				\
>  	enq_last_no_enq_fails		\
>  	ddsp_bogus_dsq_fail		\
>  	ddsp_vtimelocal_fail		\
> diff --git a/tools/testing/selftests/sched_ext/dequeue.bpf.c b/tools/testing/selftests/sched_ext/dequeue.bpf.c
> new file mode 100644
> index 0000000000000..d9d12f14cd673
> --- /dev/null
> +++ b/tools/testing/selftests/sched_ext/dequeue.bpf.c
> @@ -0,0 +1,368 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * A scheduler that validates ops.dequeue() is called correctly:
> + * - Tasks dispatched to terminal DSQs (local, global) bypass the BPF
> + *   scheduler entirely: no ops.dequeue() should be called
> + * - Tasks dispatched to user DSQs from ops.enqueue() enter BPF custody:
> + *   ops.dequeue() must be called when they leave custody
> + * - Every ops.enqueue() dispatch to non-terminal DSQs is followed by
> + *   exactly one ops.dequeue() (validate 1:1 pairing and state machine)
> + *
> + * Copyright (c) 2026 NVIDIA Corporation.
> + */
> +
> +#include <scx/common.bpf.h>
> +
> +#define SHARED_DSQ	0
> +
> +/*
> + * BPF internal queue.
> + *
> + * Tasks are stored here and consumed from ops.dispatch(), validating that
> + * tasks on BPF internal structures still get ops.dequeue() when they
> + * leave.
> + */
> +struct {
> +	__uint(type, BPF_MAP_TYPE_QUEUE);
> +	__uint(max_entries, 32768);
> +	__type(value, s32);
> +} global_queue SEC(".maps");
> +
> +char _license[] SEC("license") = "GPL";
> +
> +UEI_DEFINE(uei);
> +
> +/*
> + * Counters to track the lifecycle of tasks:
> + * - enqueue_cnt: Number of times ops.enqueue() was called
> + * - dequeue_cnt: Number of times ops.dequeue() was called (any type)
> + * - dispatch_dequeue_cnt: Number of regular dispatch dequeues (no flag)
> + * - change_dequeue_cnt: Number of property change dequeues
> + * - bpf_queue_full: Number of times the BPF internal queue was full
> + */
> +u64 enqueue_cnt, dequeue_cnt, dispatch_dequeue_cnt, change_dequeue_cnt, bpf_queue_full;
> +
> +/*
> + * Test scenarios:
> + * 0) Dispatch to local DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
> + *    scheduler, no dequeue callbacks)
> + * 1) Dispatch to global DSQ from ops.select_cpu() (terminal DSQ, bypasses BPF
> + *    scheduler, no dequeue callbacks)
> + * 2) Dispatch to shared user DSQ from ops.select_cpu() (enters BPF scheduler,
> + *    dequeue callbacks expected)
> + * 3) Dispatch to local DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
> + *    scheduler, no dequeue callbacks)
> + * 4) Dispatch to global DSQ from ops.enqueue() (terminal DSQ, bypasses BPF
> + *    scheduler, no dequeue callbacks)
> + * 5) Dispatch to shared user DSQ from ops.enqueue() (enters BPF scheduler,
> + *    dequeue callbacks expected)
> + * 6) BPF internal queue from ops.enqueue(): store task PIDs in ops.enqueue(),
> + *    consume in ops.dispatch() and dispatch to local DSQ (validates dequeue
> + *    for tasks stored in internal BPF data structures)
> + */
> +u32 test_scenario;
> +
> +/*
> + * Per-task state to track lifecycle and validate workflow semantics.
> + * State transitions:
> + *   NONE -> ENQUEUED (on enqueue)
> + *   ENQUEUED -> DISPATCHED (on dispatch dequeue)
> + *   DISPATCHED -> NONE (on property change dequeue or re-enqueue)
> + *   ENQUEUED -> NONE (on property change dequeue before dispatch)
> + */
> +enum task_state {
> +	TASK_NONE = 0,
> +	TASK_ENQUEUED,
> +	TASK_DISPATCHED,
> +};
> +
> +struct task_ctx {
> +	enum task_state state; /* Current state in the workflow */
> +	u64 enqueue_seq;       /* Sequence number for debugging */
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, struct task_ctx);
> +} task_ctx_stor SEC(".maps");
> +
> +static struct task_ctx *try_lookup_task_ctx(struct task_struct *p)
> +{
> +	return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
> +}
> +
> +s32 BPF_STRUCT_OPS(dequeue_select_cpu, struct task_struct *p,
> +		   s32 prev_cpu, u64 wake_flags)
> +{
> +	struct task_ctx *tctx;
> +	s32 pid = p->pid;
> +
> +	tctx = try_lookup_task_ctx(p);
> +	if (!tctx)
> +		return prev_cpu;
> +
> +	switch (test_scenario) {
> +	case 0:
> +		/*
> +		 * Direct dispatch to the local DSQ.
> +		 *
> +		 * Task bypasses BPF scheduler entirely: no enqueue
> +		 * tracking, no ops.dequeue() callbacks.
> +		 */
> +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> +		tctx->state = TASK_DISPATCHED;
> +		break;
> +	case 1:
> +		/*
> +		 * Direct dispatch to the global DSQ.
> +		 *
> +		 * Task bypasses BPF scheduler entirely: no enqueue
> +		 * tracking, no ops.dequeue() callbacks.
> +		 */
> +		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, 0);
> +		tctx->state = TASK_DISPATCHED;
> +		break;
> +	case 2:
> +		/*
> +		 * Dispatch to a shared user DSQ.
> +		 *
> +		 * Task enters BPF scheduler management: track
> +		 * enqueue/dequeue lifecycle and validate state
> +		 * transitions.
> +		 */
> +		if (tctx->state == TASK_ENQUEUED)
> +			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
> +				      p->pid, p->comm, tctx->enqueue_seq);
> +
> +		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, 0);
> +
> +		__sync_fetch_and_add(&enqueue_cnt, 1);
> +
> +		tctx->state = TASK_ENQUEUED;
> +		tctx->enqueue_seq++;
> +		break;
> +	}
> +
> +	return prev_cpu;
> +}
> +
> +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_ctx *tctx;
> +	s32 pid = p->pid;

unused

> +
> +	tctx = try_lookup_task_ctx(p);
> +	if (!tctx)
> +		return;
> +
> +	switch (test_scenario) {
> +	case 3:
> +		/*
> +		 * Direct dispatch to the local DSQ.
> +		 *
> +		 * Task bypasses BPF scheduler entirely: no enqueue
> +		 * tracking, no ops.dequeue() callbacks.
> +		 */
> +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
> +		break;
> +	case 4:
> +		/*
> +		 * Direct dispatch to the global DSQ.
> +		 *
> +		 * Task bypasses BPF scheduler entirely: no enqueue
> +		 * tracking, no ops.dequeue() callbacks.
> +		 */
> +		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
> +		break;
> +	case 5:
> +		/*
> +		 * Dispatch to shared user DSQ.
> +		 *
> +		 * Task enters BPF scheduler management: track
> +		 * enqueue/dequeue lifecycle and validate state
> +		 * transitions.
> +		 */
> +		if (tctx->state == TASK_ENQUEUED)
> +			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
> +				      p->pid, p->comm, tctx->enqueue_seq);
> +
> +		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
> +
> +		__sync_fetch_and_add(&enqueue_cnt, 1);
> +
> +		tctx->state = TASK_ENQUEUED;
> +		tctx->enqueue_seq++;
> +		break;
> +	case 6:
> +		/*
> +		 * Store task in BPF internal queue.
> +		 *
> +		 * Task enters BPF scheduler management: track
> +		 * enqueue/dequeue lifecycle and validate state
> +		 * transitions.
> +		 */
> +		if (tctx->state == TASK_ENQUEUED)
> +			scx_bpf_error("%d (%s): enqueue while in ENQUEUED state seq=%llu",
> +				      p->pid, p->comm, tctx->enqueue_seq);
> +
> +		if (bpf_map_push_elem(&global_queue, &pid, 0)) {
> +			scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
> +			__sync_fetch_and_add(&bpf_queue_full, 1);
> +
> +			tctx->state = TASK_DISPATCHED;
> +		} else {
> +			__sync_fetch_and_add(&enqueue_cnt, 1);
> +
> +			tctx->state = TASK_ENQUEUED;
> +			tctx->enqueue_seq++;
> +		}
> +		break;
> +	default:
> +		/* For all other scenarios, dispatch to the global DSQ */
> +		scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
> +		tctx->state = TASK_DISPATCHED;
> +		break;
> +	}
> +
> +	scx_bpf_kick_cpu(scx_bpf_task_cpu(p), SCX_KICK_IDLE);
> +}
> +
> +void BPF_STRUCT_OPS(dequeue_dequeue, struct task_struct *p, u64 deq_flags)
> +{
> +	struct task_ctx *tctx;
> +
> +	__sync_fetch_and_add(&dequeue_cnt, 1);
> +
> +	tctx = try_lookup_task_ctx(p);
> +	if (!tctx)
> +		return;
> +
> +	/*
> +	 * For scenarios 0, 1, 3, and 4 (terminal DSQs: local and global),
> +	 * ops.dequeue() should never be called because tasks bypass the
> +	 * BPF scheduler entirely. If we get here, it's a kernel bug.
> +	 */
> +	if (test_scenario == 0 || test_scenario == 3) {
> +		scx_bpf_error("%d (%s): dequeue called for local DSQ scenario",
> +			      p->pid, p->comm);
> +		return;
> +	}
> +
> +	if (test_scenario == 1 || test_scenario == 4) {
> +		scx_bpf_error("%d (%s): dequeue called for global DSQ scenario",
> +			      p->pid, p->comm);
> +		return;
> +	}
> +
> +	if (deq_flags & SCX_DEQ_SCHED_CHANGE) {
> +		/*
> +		 * Property change interrupting the workflow. Valid from
> +		 * both ENQUEUED and DISPATCHED states. Transitions task
> +		 * back to NONE state.
> +		 */
> +		__sync_fetch_and_add(&change_dequeue_cnt, 1);
> +
> +		/* Validate state transition */
> +		if (tctx->state != TASK_ENQUEUED && tctx->state != TASK_DISPATCHED)
> +			scx_bpf_error("%d (%s): invalid property change dequeue state=%d seq=%llu",
> +				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
> +
> +		/* Transition back to NONE: task outside scheduler control */
> +		tctx->state = TASK_NONE;
> +	} else {
> +		/*
> +		 * Regular dispatch dequeue: normal workflow step. Valid
> +		 * only from ENQUEUED state (after enqueue, before dispatch
> +		 * dequeue). Transitions to DISPATCHED state.
> +		 */
> +		__sync_fetch_and_add(&dispatch_dequeue_cnt, 1);
> +
> +		/*
> +		 * Dispatch dequeue should not have %SCX_DEQ_SCHED_CHANGE
> +		 * flag.
> +		 */
> +		if (deq_flags & SCX_DEQ_SCHED_CHANGE)
> +			scx_bpf_error("%d (%s): SCX_DEQ_SCHED_CHANGE in dispatch dequeue seq=%llu",
> +				      p->pid, p->comm, tctx->enqueue_seq);
> +
> +		/*
> +		 * Must be in ENQUEUED state.
> +		 */
> +		if (tctx->state != TASK_ENQUEUED)
> +			scx_bpf_error("%d (%s): dispatch dequeue from state %d seq=%llu",
> +				      p->pid, p->comm, tctx->state, tctx->enqueue_seq);
> +
> +		/*
> +		 * Transition to DISPATCHED: normal cycle completed
> +		 * dispatch.
> +		 */
> +		tctx->state = TASK_DISPATCHED;
> +	}
> +}
> +
> +void BPF_STRUCT_OPS(dequeue_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	if (test_scenario == 6) {
> +		struct task_struct *p;
> +		s32 pid;
> +
> +		if (bpf_map_pop_elem(&global_queue, &pid))
> +			return;
> +
> +		p = bpf_task_from_pid(pid);
> +		if (!p)
> +			return;
> +
> +		if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
> +			cpu = scx_bpf_task_cpu(p);
> +
> +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, 0);
> +		bpf_task_release(p);
> +	} else {
> +		scx_bpf_dsq_move_to_local(SHARED_DSQ);
> +	}
> +}
> +
> +s32 BPF_STRUCT_OPS(dequeue_init_task, struct task_struct *p,
> +		   struct scx_init_task_args *args)
> +{
> +	struct task_ctx *tctx;
> +
> +	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0,
> +				   BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (!tctx)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +s32 BPF_STRUCT_OPS_SLEEPABLE(dequeue_init)
> +{
> +	s32 ret;
> +
> +	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +void BPF_STRUCT_OPS(dequeue_exit, struct scx_exit_info *ei)
> +{
> +	UEI_RECORD(uei, ei);
> +}
> +
> +SEC(".struct_ops.link")
> +struct sched_ext_ops dequeue_ops = {
> +	.select_cpu		= (void *)dequeue_select_cpu,
> +	.enqueue		= (void *)dequeue_enqueue,
> +	.dequeue		= (void *)dequeue_dequeue,
> +	.dispatch		= (void *)dequeue_dispatch,
> +	.init_task		= (void *)dequeue_init_task,
> +	.init			= (void *)dequeue_init,
> +	.exit			= (void *)dequeue_exit,
> +	.timeout_ms		= 5000,
> +	.name			= "dequeue_test",
> +};
> diff --git a/tools/testing/selftests/sched_ext/dequeue.c b/tools/testing/selftests/sched_ext/dequeue.c
> new file mode 100644
> index 0000000000000..8bc9d263aa05c
> --- /dev/null
> +++ b/tools/testing/selftests/sched_ext/dequeue.c
> @@ -0,0 +1,265 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2025 NVIDIA Corporation.
> + */
> +#define _GNU_SOURCE
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <time.h>
> +#include <bpf/bpf.h>
> +#include <scx/common.h>
> +#include <sys/wait.h>
> +#include <sched.h>
> +#include <pthread.h>
> +#include "scx_test.h"
> +#include "dequeue.bpf.skel.h"
> +
> +#define NUM_WORKERS 8
> +#define AFFINITY_HAMMER_MS 50
> +
> +/*
> + * Worker function that creates enqueue/dequeue events via CPU work and
> + * sleeping. Property-change dequeues are triggered by the affinity hammer
> + * thread (external sched_setaffinity on worker PIDs).
> + */
> +static void worker_fn(int id)
> +{
> +	int i;
> +	volatile int sum = 0;
> +
> +	for (i = 0; i < 1000; i++) {
> +		int j;
> +
> +		/* Do some work to trigger scheduling events */
> +		for (j = 0; j < 10000; j++)
> +			sum += j;
> +
> +		/* Sleep to trigger dequeue */
> +		usleep(1000 + (id * 100));
> +	}
> +
> +	exit(0);
> +}
> +
> +/*
> + * Property-change dequeues only happen when a task gets a property change
> + * while still in the queue. This thread changes workers' affinity from
> + * outside so that some changes hit tasks while they are still in the
> + * queue.
> + */
> +static void *affinity_hammer_fn(void *arg)
> +{
> +	pid_t *pids = arg;
> +	cpu_set_t cpuset;
> +	int i, n = NUM_WORKERS;
> +	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 }; /* 1ms */
> +
> +	for (i = 0; i < (AFFINITY_HAMMER_MS * 1000 / 100); i++) {
> +		int w = i % n;
> +		int cpu = (i / n) % 4;
> +
> +		CPU_ZERO(&cpuset);
> +		CPU_SET(cpu, &cpuset);
> +		sched_setaffinity(pids[w], sizeof(cpuset), &cpuset);
> +		nanosleep(&ts, NULL);
> +	}
> +
> +	return NULL;
> +}
> +
> +static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario,
> +					 const char *scenario_name)
> +{
> +	struct bpf_link *link;
> +	pid_t pids[NUM_WORKERS];
> +	pthread_t hammer;
> +
> +	int i, status;
> +	u64 enq_start, deq_start,
> +	    dispatch_deq_start, change_deq_start, bpf_queue_full_start;
> +	u64 enq_delta, deq_delta,
> +	    dispatch_deq_delta, change_deq_delta, bpf_queue_full_delta;
> +
> +	/* Set the test scenario */
> +	skel->bss->test_scenario = scenario;
> +
> +	/* Record starting counts */
> +	enq_start = skel->bss->enqueue_cnt;
> +	deq_start = skel->bss->dequeue_cnt;
> +	dispatch_deq_start = skel->bss->dispatch_dequeue_cnt;
> +	change_deq_start = skel->bss->change_dequeue_cnt;
> +	bpf_queue_full_start = skel->bss->bpf_queue_full;
> +
> +	link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops);
> +	SCX_FAIL_IF(!link, "Failed to attach struct_ops for scenario %s", scenario_name);
> +
> +	/* Fork worker processes to generate enqueue/dequeue events */
> +	for (i = 0; i < NUM_WORKERS; i++) {
> +		pids[i] = fork();
> +		SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i);
> +
> +		if (pids[i] == 0) {
> +			worker_fn(i);
> +			/* Should not reach here */
> +			exit(1);
> +		}
> +	}
> +
> +	/*
> +	 * Run an "affinity hammer" so that some property changes hit tasks
> +	 * while they are still in BPF custody (e.g. in user DSQ or BPF queue),
> +	 * triggering SCX_DEQ_SCHED_CHANGE dequeues in scenarios 2, 3, 6 and 7.

Not true for 3, right?

> +	 */
> +	SCX_FAIL_IF(pthread_create(&hammer, NULL, affinity_hammer_fn, pids) != 0,
> +		    "Failed to create affinity hammer thread");
> +	pthread_join(hammer, NULL);
> +
> +	/* Wait for all workers to complete */
> +	for (i = 0; i < NUM_WORKERS; i++) {
> +		SCX_FAIL_IF(waitpid(pids[i], &status, 0) != pids[i],
> +			    "Failed to wait for worker %d", i);
> +		SCX_FAIL_IF(status != 0, "Worker %d exited with status %d", i, status);
> +	}
> +
> +	bpf_link__destroy(link);
> +
> +	SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_UNREG));
> +
> +	/* Calculate deltas */
> +	enq_delta = skel->bss->enqueue_cnt - enq_start;
> +	deq_delta = skel->bss->dequeue_cnt - deq_start;
> +	dispatch_deq_delta = skel->bss->dispatch_dequeue_cnt - dispatch_deq_start;
> +	change_deq_delta = skel->bss->change_dequeue_cnt - change_deq_start;
> +	bpf_queue_full_delta = skel->bss->bpf_queue_full - bpf_queue_full_start;
> +
> +	printf("%s:\n", scenario_name);
> +	printf("  enqueues: %lu\n", (unsigned long)enq_delta);
> +	printf("  dequeues: %lu (dispatch: %lu, property_change: %lu)\n",
> +	       (unsigned long)deq_delta,
> +	       (unsigned long)dispatch_deq_delta,
> +	       (unsigned long)change_deq_delta);
> +	printf("  BPF queue full: %lu\n", (unsigned long)bpf_queue_full_delta);
> +
> +	/*
> +	 * Validate enqueue/dequeue lifecycle tracking.
> +	 *
> +	 * For scenarios 0, 1, 3, 4 (local and global DSQs from
> +	 * ops.select_cpu() and ops.enqueue()), both enqueues and dequeues
> +	 * should be 0 because tasks bypass the BPF scheduler entirely:
> +	 * tasks never enter BPF scheduler's custody.
> +	 *
> +	 * For scenarios 2, 5, 6 (user DSQ or BPF internal queue) we expect
> +	 * both enqueues and dequeues.
> +	 *
> +	 * The BPF code does strict state machine validation with
> +	 * scx_bpf_error() to ensure the workflow semantics are correct.
> +	 *
> +	 * If we reach this point without errors, the semantics are
> +	 * validated correctly.
> +	 */
> +	if (scenario == 0 || scenario == 1 ||
> +	    scenario == 3 || scenario == 4) {
> +		/* Tasks bypass BPF scheduler completely */
> +		SCX_EQ(enq_delta, 0);
> +		SCX_EQ(deq_delta, 0);
> +		SCX_EQ(dispatch_deq_delta, 0);
> +		SCX_EQ(change_deq_delta, 0);
> +	} else {
> +		/*
> +		 * User DSQ from ops.enqueue() or ops.select_cpu(): tasks
> +		 * enter BPF scheduler's custody.
> +		 *
> +		 * Also validate 1:1 enqueue/dequeue pairing.
> +		 */
> +		SCX_GT(enq_delta, 0);
> +		SCX_GT(deq_delta, 0);
> +		SCX_EQ(enq_delta, deq_delta);
> +	}
> +
> +	return SCX_TEST_PASS;
> +}
> +
> +static enum scx_test_status setup(void **ctx)
> +{
> +	struct dequeue *skel;
> +
> +	skel = dequeue__open();
> +	SCX_FAIL_IF(!skel, "Failed to open skel");
> +	SCX_ENUM_INIT(skel);
> +	SCX_FAIL_IF(dequeue__load(skel), "Failed to load skel");
> +
> +	*ctx = skel;
> +
> +	return SCX_TEST_PASS;
> +}
> +
> +static enum scx_test_status run(void *ctx)
> +{
> +	struct dequeue *skel = ctx;
> +	enum scx_test_status status;
> +
> +	status = run_scenario(skel, 0, "Scenario 0: Local DSQ from ops.select_cpu()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	status = run_scenario(skel, 1, "Scenario 1: Global DSQ from ops.select_cpu()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	status = run_scenario(skel, 2, "Scenario 2: User DSQ from ops.select_cpu()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	status = run_scenario(skel, 3, "Scenario 3: Local DSQ from ops.enqueue()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	status = run_scenario(skel, 4, "Scenario 4: Global DSQ from ops.enqueue()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	status = run_scenario(skel, 5, "Scenario 5: User DSQ from ops.enqueue()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	status = run_scenario(skel, 6, "Scenario 6: BPF queue from ops.enqueue()");
> +	if (status != SCX_TEST_PASS)
> +		return status;
> +
> +	printf("\n=== Summary ===\n");
> +	printf("Total enqueues: %lu\n", (unsigned long)skel->bss->enqueue_cnt);
> +	printf("Total dequeues: %lu\n", (unsigned long)skel->bss->dequeue_cnt);
> +	printf("  Dispatch dequeues: %lu (no flag, normal workflow)\n",
> +	       (unsigned long)skel->bss->dispatch_dequeue_cnt);
> +	printf("  Property change dequeues: %lu (SCX_DEQ_SCHED_CHANGE flag)\n",
> +	       (unsigned long)skel->bss->change_dequeue_cnt);
> +	printf("  BPF queue full: %lu\n",
> +	       (unsigned long)skel->bss->bpf_queue_full);
> +	printf("\nAll scenarios passed - no state machine violations detected\n");
> +	printf("-> Validated: Local DSQ dispatch bypasses BPF scheduler\n");
> +	printf("-> Validated: Global DSQ dispatch bypasses BPF scheduler\n");
> +	printf("-> Validated: User DSQ dispatch triggers ops.dequeue() callbacks\n");
> +	printf("-> Validated: Dispatch dequeues have no flags (normal workflow)\n");
> +	printf("-> Validated: Property change dequeues have SCX_DEQ_SCHED_CHANGE flag\n");
> +	printf("-> Validated: No duplicate enqueues or invalid state transitions\n");
> +
> +	return SCX_TEST_PASS;
> +}
> +
> +static void cleanup(void *ctx)
> +{
> +	struct dequeue *skel = ctx;
> +
> +	dequeue__destroy(skel);
> +}
> +
> +struct scx_test dequeue_test = {
> +	.name = "dequeue",
> +	.description = "Verify ops.dequeue() semantics",
> +	.setup = setup,
> +	.run = run,
> +	.cleanup = cleanup,
> +};
> +
> +REGISTER_SCX_TEST(&dequeue_test)


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 17:07                   ` Tejun Heo
@ 2026-02-12 18:14                     ` Andrea Righi
  2026-02-12 18:35                       ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-12 18:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Thu, Feb 12, 2026 at 07:07:05AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Feb 12, 2026 at 04:45:43PM +0100, Andrea Righi wrote:
> > > > So, we need a way to mark "this migration is internal to SCX", like a new
> > > > SCX_TASK_MIGRATING_INTERNAL flag?
> 
> Yeah, I think this is what we should do. That's the only ops.dequeue()
> without matching ops.enqueue(), right?

Correct.

> 
> ...
> > > IIUC one example might sway your opinion (or not):
> > > Note that not receiving an ops.dequeue() for tasks leaving one LOCAL_DSQ
> > > (and maybe being enqueued at another) prevents e.g. accurate PELT load
> > > tracking on the BPF side.
> > > Regular utilization tracking works through ops.running() and 
> > > ops.stopping() but I don't think load can be implemented accurately.
> > 
> > It makes sense to me and I think it's actually valid reason to prefer the
> > "always trigger" way.
> 
> I don't think this is a valid argument. PELT is done that way because the
> association of the task and the CPU is meaningful for in-kernel schedulers.
> The queues are actually per-CPU. For SCX scheds, the relationship is not
> known to the kernel. Only the BPF scheduler itself knows, if it wants to
> attribute per-task load to a specific CPU, which CPU it should be attributed
> to. What's the point of following in-kernel association for PELT if the task
> was going to be hot migrated to another CPU on execution?

I see, let me elaborate more on this to make sure we're on the same page.

In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
it can put the task on an arbitrary DSQ or even in some internal BPF data
structures. The task is still associated with a runqueue, but only to
satisfy a kernel requirement; for sched_ext that association isn't
meaningful, because the task isn't really "on" that CPU (in fact
ops.dispatch() can do the "last minute" migration).

Therefore, keeping accurate per-CPU information from the kernel's
perspective doesn't buy us much, given that the BPF scheduler can keep
tasks in its own queues or structures.

Accurate PELT is still doable: the BPF scheduler can track where it puts
each task in its own state, update runnable load when it places the task
in a DSQ / data structure and when the task leaves (dequeue). And it can
use ops.running() / ops.stopping() for utilization.

And with a proper ops.dequeue() semantics, PELT can be driven by the BPF
scheduler's own placement and the scx callbacks, not by the specific rq a
task is on.

If all of the above makes sense for everyone, I agree that we don't need to
notify all the internal migrations.

Thanks,
-Andrea


* Re: [PATCH 2/2] selftests/sched_ext: Add test to validate ops.dequeue() semantics
  2026-02-12 17:15   ` Christian Loehle
@ 2026-02-12 18:25     ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-12 18:25 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Tejun Heo, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Thu, Feb 12, 2026 at 05:15:28PM +0000, Christian Loehle wrote:
> On 2/10/26 21:26, Andrea Righi wrote:
...
> > +void BPF_STRUCT_OPS(dequeue_enqueue, struct task_struct *p, u64 enq_flags)
> > +{
> > +	struct task_ctx *tctx;
> > +	s32 pid = p->pid;
> 
> unused

This one is used, but the one in dequeue_select_cpu() is not. I'll remove
that. :)

> > +static enum scx_test_status run_scenario(struct dequeue *skel, u32 scenario,
> > +					 const char *scenario_name)
> > +{
> > +	struct bpf_link *link;
> > +	pid_t pids[NUM_WORKERS];
> > +	pthread_t hammer;
> > +
> > +	int i, status;
> > +	u64 enq_start, deq_start,
> > +	    dispatch_deq_start, change_deq_start, bpf_queue_full_start;
> > +	u64 enq_delta, deq_delta,
> > +	    dispatch_deq_delta, change_deq_delta, bpf_queue_full_delta;
> > +
> > +	/* Set the test scenario */
> > +	skel->bss->test_scenario = scenario;
> > +
> > +	/* Record starting counts */
> > +	enq_start = skel->bss->enqueue_cnt;
> > +	deq_start = skel->bss->dequeue_cnt;
> > +	dispatch_deq_start = skel->bss->dispatch_dequeue_cnt;
> > +	change_deq_start = skel->bss->change_dequeue_cnt;
> > +	bpf_queue_full_start = skel->bss->bpf_queue_full;
> > +
> > +	link = bpf_map__attach_struct_ops(skel->maps.dequeue_ops);
> > +	SCX_FAIL_IF(!link, "Failed to attach struct_ops for scenario %s", scenario_name);
> > +
> > +	/* Fork worker processes to generate enqueue/dequeue events */
> > +	for (i = 0; i < NUM_WORKERS; i++) {
> > +		pids[i] = fork();
> > +		SCX_FAIL_IF(pids[i] < 0, "Failed to fork worker %d", i);
> > +
> > +		if (pids[i] == 0) {
> > +			worker_fn(i);
> > +			/* Should not reach here */
> > +			exit(1);
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Run an "affinity hammer" so that some property changes hit tasks
> > +	 * while they are still in BPF custody (e.g. in user DSQ or BPF queue),
> > +	 * triggering SCX_DEQ_SCHED_CHANGE dequeues in scenarios 2, 3, 6 and 7.
> 
> Not true for 3, right?

Oh yes, this selftest has been changed so many times that I was sure I
forgot to update some comments (also, scenario 7 doesn't exist anymore).

Thanks!
-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 18:14                     ` Andrea Righi
@ 2026-02-12 18:35                       ` Tejun Heo
  2026-02-12 22:30                         ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-12 18:35 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hello, Andrea.

On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote:
...
> In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
> it can put the task on an arbitrary DSQ or even in some internal BPF data
> structures. The task is still associated with a runqueue, but only to
> satisfy a kernel requirement; for sched_ext that association isn't
> meaningful, because the task isn't really "on" that CPU (in fact
> ops.dispatch() can do the "last minute" migration).

Yes.

> Therefore, keeping accurate per-CPU information from the kernel's
> perspective doesn't buy us much, given that the BPF scheduler can keep
> tasks in its own queues or structures.
> 
> Accurate PELT is still doable: the BPF scheduler can track where it puts
> each task in its own state, update runnable load when it places the task
> in a DSQ / data structure and when the task leaves (dequeue). And it can
> use ops.running() / ops.stopping() for utilization.

And the BPF sched might choose to do load aggregation at a different level
too - e.g. maybe per-CPU load metric doesn't make sense given the machine
and scheduler and only per-LLC level aggregation would be meaningful, which
would be true for multiple of the current SCX schedulers given the per-LLC
DSQ usage.

> And with a proper ops.dequeue() semantics, PELT can be driven by the BPF
> scheduler's own placement and the scx callbacks, not by the specific rq a
> task is on.
> 
> If all of the above makes sense for everyone, I agree that we don't need to
> notify all the internal migrations.

Yeah, I think we're on the same page. BTW, I wonder whether we could use
p->scx.sticky_cpu to detect internal migrations. It's only used for internal
migrations, so maybe it can be used for detection.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 18:35                       ` Tejun Heo
@ 2026-02-12 22:30                         ` Andrea Righi
  2026-02-14 10:16                           ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-12 22:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Thu, Feb 12, 2026 at 08:35:55AM -1000, Tejun Heo wrote:
> Hello, Andrea.
> 
> On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote:
> ...
> > In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
> > it can put the task on an arbitrary DSQ or even in some internal BPF data
> > structures. The task is still associated with a runqueue, but only to
> > satisfy a kernel requirement; for sched_ext that association isn't
> > meaningful, because the task isn't really "on" that CPU (in fact
> > ops.dispatch() can do the "last minute" migration).
> 
> Yes.
> 
> > Therefore, keeping accurate per-CPU information from the kernel's
> > perspective doesn't buy us much, given that the BPF scheduler can keep
> > tasks in its own queues or structures.
> > 
> > Accurate PELT is still doable: the BPF scheduler can track where it puts
> > each task in its own state, update runnable load when it places the task
> > in a DSQ / data structure and when the task leaves (dequeue). And it can
> > use ops.running() / ops.stopping() for utilization.
> 
> And the BPF sched might choose to do load aggregation at a different level
> too - e.g. maybe per-CPU load metric doesn't make sense given the machine
> and scheduler and only per-LLC level aggregation would be meaningful, which
> would be true for multiple of the current SCX schedulers given the per-LLC
> DSQ usage.
> 
> > And with a proper ops.dequeue() semantics, PELT can be driven by the BPF
> > scheduler's own placement and the scx callbacks, not by the specific rq a
> > task is on.
> > 
> > If all of the above makes sense for everyone, I agree that we don't need to
> > notify all the internal migrations.
> 
> Yeah, I think we're on the same page. BTW, I wonder whether we could use
> p->scx.sticky_cpu to detect internal migrations. It's only used for internal
> migrations, so maybe it can be used for detection.

Perfect. And yes, I think if we set p->scx.sticky_cpu before
deactivate_task() in move_remote_task_to_local_dsq(), then in ops_dequeue()
we should be able to catch the internal migrations by checking
task_on_rq_migrating(p) && p->scx.sticky_cpu >= 0.

I'll run some tests with that.

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-12 22:30                         ` Andrea Righi
@ 2026-02-14 10:16                           ` Andrea Righi
  2026-02-14 17:56                             ` Tejun Heo
  0 siblings, 1 reply; 83+ messages in thread
From: Andrea Righi @ 2026-02-14 10:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Thu, Feb 12, 2026 at 11:30:14PM +0100, Andrea Righi wrote:
> On Thu, Feb 12, 2026 at 08:35:55AM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> > 
> > On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote:
> > ...
> > > In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
> > > it can put the task on an arbitrary DSQ or even in some internal BPF data
> > > structures. The task is still associated with a runqueue, but only to
> > > satisfy a kernel requirement; for sched_ext that association isn't
> > > meaningful, because the task isn't really "on" that CPU (in fact
> > > ops.dispatch() can do the "last minute" migration).
> > 
> > Yes.
> > 
> > > Therefore, keeping accurate per-CPU information from the kernel's
> > > perspective doesn't buy us much, given that the BPF scheduler can keep
> > > tasks in its own queues or structures.
> > > 
> > > Accurate PELT is still doable: the BPF scheduler can track where it puts
> > > each task in its own state, update runnable load when it places the task
> > > in a DSQ / data structure and when the task leaves (dequeue). And it can
> > > use ops.running() / ops.stopping() for utilization.
> > 
> > And the BPF sched might choose to do load aggregation at a different level
> > too - e.g. maybe per-CPU load metric doesn't make sense given the machine
> > and scheduler and only per-LLC level aggregation would be meaningful, which
> > would be true for multiple of the current SCX schedulers given the per-LLC
> > DSQ usage.
> > 
> > > And with a proper ops.dequeue() semantics, PELT can be driven by the BPF
> > > scheduler's own placement and the scx callbacks, not by the specific rq a
> > > task is on.
> > > 
> > > If all of the above makes sense for everyone, I agree that we don't need to
> > > notify all the internal migrations.
> > 
> > Yeah, I think we're on the same page. BTW, I wonder whether we could use
> > p->scx.sticky_cpu to detect internal migrations. It's only used for internal
> > migrations, so maybe it can be used for detection.
> 
> Perfect. And yes, I think if we set p->scx.sticky_cpu before
> deactivate_task() in move_remote_task_to_local_dsq(), then in ops_dequeue()
> we should be able to catch the internal migrations by checking
> task_on_rq_migrating(p) && p->scx.sticky_cpu >= 0.
> 
> I'll run some tests with that.

I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu.

In particular, I don't see how to handle this scenario using only
p->scx.sticky_cpu: a task starts an internal migration, a sched_change
occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0.

So I'm back to the idea of introducing an SCX_TASK_MIGRATING_INTERNAL
flag...

-Andrea


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-14 10:16                           ` Andrea Righi
@ 2026-02-14 17:56                             ` Tejun Heo
  2026-02-14 19:32                               ` Andrea Righi
  0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2026-02-14 17:56 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

Hello, Andrea.

On Sat, Feb 14, 2026 at 11:16:34AM +0100, Andrea Righi wrote:
> I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu.
> 
> In particular, I don't see how to handle this scenario using only
> p->scx.sticky_cpu: a task starts an internal migration, a sched_change
> occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0.

Oh, that shouldn't happen, so move_remote_task_to_local_dsq() does the
following:

	deactivate_task(src_rq, p, 0);
	set_task_cpu(p, cpu_of(dst_rq));
	p->scx.sticky_cpu = cpu_of(dst_rq);

	raw_spin_rq_unlock(src_rq);
	raw_spin_rq_lock(dst_rq);
        ...
	activate_task(dst_rq, p, 0);

It *looks* like something can get in while the locks are switched; however,
the above deactivate_task() does WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING)
and task_rq_lock() does the following:

	for (;;) {
		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
		rq = task_rq(p);
		raw_spin_rq_lock(rq);
		/*
		 *	move_queued_task()		task_rq_lock()
		 *
		 *	ACQUIRE (rq->lock)
		 *	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
		 *	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
		 *	[S] ->cpu = new_cpu		[L] task_rq()
		 *					[L] ->on_rq
		 *	RELEASE (rq->lock)
		 *
		 * If we observe the old CPU in task_rq_lock(), the acquire of
		 * the old rq->lock will fully serialize against the stores.
		 *
		 * If we observe the new CPU in task_rq_lock(), the address
		 * dependency headed by '[L] rq = task_rq()' and the acquire
		 * will pair with the WMB to ensure we then also see migrating.
		 */
		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
			rq_pin_lock(rq, rf);
			return rq;
		}
		raw_spin_rq_unlock(rq);
		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);

		while (unlikely(task_on_rq_migrating(p)))
			cpu_relax();
	}

i.e. TASK_ON_RQ_MIGRATING works like a separate lock that protects the task
while it's switching the RQs, so any operations that use task_rq_lock(),
which includes any property changes, can't get in between.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
  2026-02-14 17:56                             ` Tejun Heo
@ 2026-02-14 19:32                               ` Andrea Righi
  0 siblings, 0 replies; 83+ messages in thread
From: Andrea Righi @ 2026-02-14 19:32 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christian Loehle, David Vernet, Changwoo Min, Kuba Piecuch,
	Emil Tsalapatis, Daniel Hodges, sched-ext, linux-kernel

On Sat, Feb 14, 2026 at 07:56:12AM -1000, Tejun Heo wrote:
> Hello, Andrea.
> 
> On Sat, Feb 14, 2026 at 11:16:34AM +0100, Andrea Righi wrote:
> > I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu.
> > 
> > In particular, I don't see how to handle this scenario using only
> > p->scx.sticky_cpu: a task starts an internal migration, a sched_change
> > occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0.
> 
> Oh, that shouldn't happen, so move_remote_task_to_local_dsq() does the
> following:
> 
> 	deactivate_task(src_rq, p, 0);
> 	set_task_cpu(p, cpu_of(dst_rq));
> 	p->scx.sticky_cpu = cpu_of(dst_rq);
> 
> 	raw_spin_rq_unlock(src_rq);
> 	raw_spin_rq_lock(dst_rq);
>         ...
> 	activate_task(dst_rq, p, 0);
> 
> It *looks* like something can get in while the locks are switched; however,
> the above deactivate_task() does WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING)
> and task_rq_lock() does the following:
> 
> 	for (;;) {
> 		raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
> 		rq = task_rq(p);
> 		raw_spin_rq_lock(rq);
> 		/*
> 		 *	move_queued_task()		task_rq_lock()
> 		 *
> 		 *	ACQUIRE (rq->lock)
> 		 *	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
> 		 *	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
> 		 *	[S] ->cpu = new_cpu		[L] task_rq()
> 		 *					[L] ->on_rq
> 		 *	RELEASE (rq->lock)
> 		 *
> 		 * If we observe the old CPU in task_rq_lock(), the acquire of
> 		 * the old rq->lock will fully serialize against the stores.
> 		 *
> 		 * If we observe the new CPU in task_rq_lock(), the address
> 		 * dependency headed by '[L] rq = task_rq()' and the acquire
> 		 * will pair with the WMB to ensure we then also see migrating.
> 		 */
> 		if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
> 			rq_pin_lock(rq, rf);
> 			return rq;
> 		}
> 		raw_spin_rq_unlock(rq);
> 		raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> 
> 		while (unlikely(task_on_rq_migrating(p)))
> 			cpu_relax();
> 	}
> 
> i.e. TASK_ON_RQ_MIGRATING works like a separate lock that protects the task
> while it's switching the RQs, so any operations that use task_rq_lock(),
> which includes any property changes, can't get in between.

Yeah, that makes sense, so the scenario I thought was happening can't
happen. I guess I'm missing some ops.dequeue() events then, or there's
a race somewhere, because I can see tasks being enqueued without a
corresponding ops.dequeue(). I'll add some debugging and keep
investigating.

Thanks!
-Andrea


end of thread, other threads:[~2026-02-14 19:32 UTC | newest]

Thread overview: 83+ messages
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-12 17:15   ` Christian Loehle
2026-02-12 18:25     ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-06 13:54 [PATCHSET v7] sched_ext: Fix " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox