Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Kuba Piecuch <jpiecuch@google.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Thu, 5 Feb 2026 10:26:11 +0100	[thread overview]
Message-ID: <aYRiM5leoHcWiXw3@gpd4> (raw)
In-Reply-To: <aYPE0GRJEu5gdBVl@slm.duckdns.org>

Hi Tejun,

On Wed, Feb 04, 2026 at 12:14:40PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 04, 2026 at 05:05:58PM +0100, Andrea Righi wrote:
> > Currently, ops.dequeue() is only invoked when the sched_ext core knows
> > that a task resides in BPF-managed data structures, which causes it to
> > miss scheduling property change events. In addition, ops.dequeue()
> > callbacks are completely skipped when tasks are dispatched to non-local
> > DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
> > track task state.
> > 
> > Fix this by guaranteeing that each task entering the BPF scheduler's
> > custody triggers exactly one ops.dequeue() call when it leaves that
> > custody, whether the exit is due to a dispatch (regular or via a core
> > scheduling pick) or to a scheduling property change (e.g.
> > sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
> > balancing, etc.).
> > 
> > BPF scheduler custody concept: a task is considered to be in "BPF
> > scheduler's custody" when it has been queued in user-created DSQs and
> > the BPF scheduler is responsible for its lifecycle. Custody ends when
> > the task is dispatched to a terminal DSQ (local DSQ or SCX_DSQ_GLOBAL),
> > selected by core scheduling, or removed due to a property change.
> > 
> > Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
> > entirely and are not in its custody. Terminal DSQs include:
> >  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> >    where tasks go directly to execution.
> >  - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
> >    BPF scheduler is considered "done" with the task.
> > 
> > As a result, ops.dequeue() is not invoked for tasks dispatched to
> > terminal DSQs, as the BPF scheduler no longer retains custody of them.
> > 
> > To identify dequeues triggered by scheduling property changes, introduce
> > the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
> > the dequeue was caused by a scheduling property change.
> > 
> ...
> > +   **Property Change Notifications for Running Tasks**:
> > +
> > +   For tasks that have left BPF custody (running or on terminal DSQs),
> > +   property changes can be intercepted through the dedicated callbacks:
> 
> I'm not sure this section is necessary. The way it's phrased makes it sound
> like schedulers would use DEQ_SCHED_CHANGE to process property changes but
> that's not the case. Relevant property changes will be notified in whatever
> ways they're notified and a task being dequeued for SCHED_CHANGE doesn't
> necessarily mean there will be an associated property change event either.
> e.g. We don't do anything re. on sched_setnuma().

Agreed, this section is a bit misleading, DEQ_SCHED_CHANGE is an
informational flag indicating the ops.dequeue() wasn't due to dispatch,
schedulers shouldn't use it to process property changes. I'll remove it.

> 
> > @@ -1102,6 +1122,18 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> >  	dsq_mod_nr(dsq, 1);
> >  	p->scx.dsq = dsq;
> >  
> > +	/*
> > +	 * Mark task as in BPF scheduler's custody if being queued to a
> > +	 * non-builtin (user) DSQ. Builtin DSQs (local, global, bypass) are
> > +	 * terminal: tasks on them have left BPF custody.
> > +	 *
> > +	 * Don't touch the flag if already set (e.g., by
> > +	 * mark_direct_dispatch() or direct_dispatch()/finish_dispatch()
> > +	 * for user DSQs).
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && !(dsq->id & SCX_DSQ_FLAG_BUILTIN))
> > +		p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> 
> given that this is tied to dequeue, maybe a more direct name would be less
> confusing? e.g. something like SCX_TASK_NEED_DEQ?

Ack.

> 
> > @@ -1274,6 +1306,24 @@ static void mark_direct_dispatch(struct scx_sched *sch,
> >  
> >  	p->scx.ddsp_dsq_id = dsq_id;
> >  	p->scx.ddsp_enq_flags = enq_flags;
> > +
> > +	/*
> > +	 * Mark the task as entering BPF scheduler's custody if it's being
> > +	 * dispatched to a non-terminal DSQ (i.e., custom user DSQs). This
> > +	 * handles the case where ops.select_cpu() directly dispatches - even
> > +	 * though ops.enqueue() won't be called, the task enters BPF custody
> > +	 * if dispatched to a user DSQ and should get ops.dequeue() when it
> > +	 * leaves.
> > +	 *
> > +	 * For terminal DSQs (local DSQs and SCX_DSQ_GLOBAL), ensure the flag
> > +	 * is clear since the BPF scheduler is done with the task.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id))
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		else
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> 
> Hmm... I'm a bit confused on why this needs to be in mark_direct_dispatch()
> AND dispatch_enqueue(). The flag should be clear when off SCX. The only
> places where it could be set is from the enqueue path - when a task is
> direct dispatched to a non-terminal DSQ or BPF. Both cases can be reliably
> captured in do_enqueue_task(), no?

You're right. I was incorrectly assuming we needed this in
mark_direct_dispatch() to catch direct dispatches to user DSQs from
ops.select_cpu(), but that's not true. All paths go through
do_enqueue_task() which funnels to dispatch_enqueue(), so we can handle it
all in one place.

> 
> >  static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> > @@ -1287,6 +1337,41 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
> ...
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq->id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
> 
> And when would direct_dispatch() need to call ops.dequeue()?
> direct_dispatch() is only used from do_enqueue_task() and there can only be
> one direct dispatch attempt on any given enqueue event. A task being
> enqueued shouldn't have the OPS_ENQUEUED set and would get dispatched once
> to either a terminal or non-terminal DSQ. If terminal, there's nothing to
> do. If non-terminal, the flag would need to be set. Am I missing something?

Nah, you're right, direct_dispatch() doesn't need to call ops.dequeue() or
manage the flag. I'll remove all the flag management from direct_dispatch()
and centralize it in dispatch_enqueue().

> 
> > @@ -1523,6 +1608,31 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> ...
> > +		if (SCX_HAS_OP(sch, dequeue) &&
> > +		    p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> 
> nit: () around & expression.
> 
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> >  		break;
> >  	case SCX_OPSS_QUEUEING:
> >  		/*
> > @@ -1531,9 +1641,24 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  		 */
> >  		BUG();
> >  	case SCX_OPSS_QUEUED:
> > -		if (SCX_HAS_OP(sch, dequeue))
> > -			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > -					 p, deq_flags);
> > +		/*
> > +		 * Task is still on the BPF scheduler (not dispatched yet).
> > +		 * Call ops.dequeue() to notify. Add %SCX_DEQ_SCHED_CHANGE
> > +		 * only for property changes, not for core-sched picks or
> > +		 * sleep.
> > +		 *
> > +		 * Clear the flag after calling ops.dequeue(): the task is
> > +		 * leaving BPF scheduler's custody.
> > +		 */
> > +		if (SCX_HAS_OP(sch, dequeue)) {
> > +			u64 flags = deq_flags;
> > +
> > +			if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +				flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> 
> I wonder whether this and the above block can be factored somehow.

Ack, we can add a helper for this.

> 
> > @@ -1630,6 +1755,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  					 struct scx_dispatch_q *src_dsq,
> >  					 struct rq *dst_rq)
> >  {
> > +	struct scx_sched *sch = scx_root;
> >  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >  
> >  	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1638,6 +1764,15 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  
> >  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >  
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local DSQ. Call
> > +	 * ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue) && (p->scx.flags & SCX_TASK_OPS_ENQUEUED)) {
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +	}
> > +
> >  	if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
> >  		list_add(&p->scx.dsq_list.node, &dst_dsq->list);
> >  	else
> > @@ -2107,6 +2242,36 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> >  
> >  	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
> >  
> > +	/*
> > +	 * Handle ops.dequeue() based on destination DSQ.
> > +	 *
> > +	 * Dispatch to terminal DSQs (local DSQs and SCX_DSQ_GLOBAL): the BPF
> > +	 * scheduler is done with the task. Call ops.dequeue() if it was in
> > +	 * BPF custody, then clear the %SCX_TASK_OPS_ENQUEUED flag.
> > +	 *
> > +	 * Dispatch to user DSQs: task is in BPF scheduler's custody.
> > +	 * Mark it so ops.dequeue() will be called when it leaves.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		if (!is_terminal_dsq(dsq_id)) {
> > +			p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> > +		} else {
> 
> Let's do "if (COND) { A } else { B }" instead of "if (!COND) { B } else { A
> }". Continuing from earlier, I don't understand why we'd need to set
> OPS_ENQUEUED here. Given that a transition to a terminal DSQ is terminal, I
> can't think of conditions where we'd need to set OPS_ENQUEUED from
> ops.dispatch().

Right, a task that reaches ops.dispatch() is already in QUEUED state, if
it's in a user DSQ the flag is already set from when it was enqueued, so
there's no need to set the flag in finish_dispatch().

> 
> > +			/*
> > +			 * Locking: we're holding the @rq lock (the
> > +			 * dispatch CPU's rq), but not necessarily
> > +			 * task_rq(p), since @p may be from a remote CPU.
> > +			 *
> > +			 * This is safe because SCX_OPSS_DISPATCHING state
> > +			 * prevents racing dequeues, any concurrent
> > +			 * ops_dequeue() will wait for this state to clear.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> > +				SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
> > +
> > +			p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +		}
> > +	}
> 
> I'm not sure finish_dispatch() is the right place to do this. e.g.
> scx_bpf_dsq_move() can also move tasks from a user DSQ to a terminal DSQ and
> the above wouldn't cover it. Wouldn't it make more sense to do this in
> dispatch_enqueue()?

Agreed.

> 
> > @@ -2894,6 +3059,14 @@ static void scx_enable_task(struct task_struct *p)
> >  
> >  	lockdep_assert_rq_held(rq);
> >  
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when enabling the task.
> > +	 * This ensures a clean state when the task enters SCX. Only needed
> > +	 * if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> > +
> >  	/*
> >  	 * Set the weight before calling ops.enable() so that the scheduler
> >  	 * doesn't see a stale value if they inspect the task struct.
> > @@ -2925,6 +3098,13 @@ static void scx_disable_task(struct task_struct *p)
> >  	if (SCX_HAS_OP(sch, disable))
> >  		SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
> >  	scx_set_task_state(p, SCX_TASK_READY);
> > +
> > +	/*
> > +	 * Clear enqueue/dequeue tracking flags when disabling the task.
> > +	 * Only needed if ops.dequeue() is implemented.
> > +	 */
> > +	if (SCX_HAS_OP(sch, dequeue))
> > +		p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> 
> If we make the flag transitions consistent, we shouldn't need these, right?
> We can add WARN_ON_ONCE() at the head of enqueue maybe.

Correct.

Thanks for the review! I'll post a new version.

-Andrea

next prev parent reply	other threads:[~2026-02-05  9:26 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-04 16:05 [PATCHSET v5] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi [this message]
2026-02-04 16:05 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix " Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-06 13:54 [PATCHSET v7] " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYRiM5leoHcWiXw3@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.