public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Kuba Piecuch <jpiecuch@google.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Wed, 11 Feb 2026 17:06:20 +0100	[thread overview]
Message-ID: <aYyo_ALV_BT6WaE0@gpd4> (raw)
In-Reply-To: <aYu9KzKTMfNIFDwD@slm.duckdns.org>

Hi Tejun,

On Tue, Feb 10, 2026 at 01:20:11PM -1000, Tejun Heo wrote:
> On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> > +/**
> > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> > + * @dsq_id: DSQ ID to check
> > + *
> > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> > + * scheduler is considered "done" with the task.
> > + *
> > + * Builtin DSQs include:
> > + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > + *    where tasks go directly to execution,
> > + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> > + *  - Bypass DSQ: used during bypass mode.
> > + *
> > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> > + * trigger ops.dequeue() when they are later consumed.
> > + */
> > +static inline bool is_terminal_dsq(u64 dsq_id)
> > +{
> > +	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
> > +}
> 
> Please use () do clarify ordering between & and &&. It's just visually
> confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
> @dsq_id and then it can just do:
> 
>         return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> 
> because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> id, and the above code positively identifies what's terminal.

Ok, but we also need to include SCX_DSQ_BYPASS, in that case maybe checking
SCX_DSQ_FLAG_BUILTIN is more generic?

> 
> > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > +			     struct scx_dispatch_q *dsq,
> >  			     struct task_struct *p, u64 enq_flags)
> 
> While minor, this patch would be easier to read if the @rq addition were
> done in a separate patch.

Ack. I'll split that out.

> 
> > +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> > +			      struct task_struct *p, u64 deq_flags,
> > +			      bool is_sched_change)
> 
> Isn't @is_sched_change a bit of misnomer given that it needs to exclude
> SCX_DEQ_CORE_SCHED_EXEC. I wonder whether it'd be easier if @deq_flags
> handling is separated out. This part is ops_dequeue() specific, right?
> Everyone else statically knows what DEQ flags to use. That might make
> ops_dequeue() calculate flags unnecessarily but ops_dequeue() is not
> particularly hot, so I don't think that'd matter.

Ack, I'll handle deq_flags in ops_dequeue() and simplify
call_task_dequeue() accordingly.

> 
> > +{
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		/*
> > +		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
> > +		 * property change (not sleep or core-sched pick).
> > +		 */
> > +		if (is_sched_change &&
> > +		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +			deq_flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
> > +	}
> > +	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> 
> Let's move flag clearing to the call sites. It's a bit confusing w/ the
> function name.

Ack.

> 
> >  static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  {
> >  	struct scx_sched *sch = scx_root;
> > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * If the task is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> 
> Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> responsible for clearing IN_CUSTODY too?

The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
to a remote CPU's local DSQ, we only set ops_state to NONE, so a concurrent
dequeue can proceed, but we only clear IN_CUSTODY when we later enqueue or
move the task. So we can see NONE + IN_CUSTODY here and need to handle it.
And we can't clear IN_CUSTODY at the same time we set NONE there, because
we don't hold the task's rq lock yet and we can't trigger ops.dequeue().

> 
> > @@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  					 struct scx_dispatch_q *src_dsq,
> >  					 struct rq *dst_rq)
> >  {
> > +	struct scx_sched *sch = scx_root;
> >  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >  
> >  	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  
> >  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >  
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +	 * Call ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
> > +		if (SCX_HAS_OP(sch, dequeue))
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > +	}
> 
> I think a better place to put this would be inside local_dsq_post_enq() so
> that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
> path. This would mean breaking out local and global cases in
> dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> 
>         if (is_local) {
>                 local_dsq_post_enq(...);
>         } else {
>                 if (dsq->id == SCX_DSQ_GLOBAL)
>                         global_dsq_post_enq(...);       /* or open code with comment */
>                 raw_spin_unlock(&dsq->lock);
>         }

Agreed, I'll move this into local_dsq_post_enq() and introduce
a global_dsq_post_enq().

> 
> > @@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
> >  		!WARN_ON_ONCE(src_rq != task_rq(p));
> >  }
> >  
> > -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> > -				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > +			       struct task_struct *p,
> > +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
> >  {
> >  	raw_spin_rq_unlock(this_rq);
> >  
> >  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > +		/*
> > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +		 * Call ops.dequeue() if the task was in BPF custody.
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
> and reactivates the task. The deactivation invokes ops_dequeue() but that
> should suppress dequeue invocation as that's internal transfer (this is
> discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
> gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
> trigger dequeue invocation, right?

Should we trigger ops.dequeue() when the task is dequeued inside
move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
deactivate_task() there) instead of suppressing it and invoking on the
target in local_dsq_post_enq()?

That way the BPF sees dequeue on the source and then enqueue on the target,
we avoid special-casing SCX_TASK_IN_CUSTODY in do_enqueue_task() and the
"when to call dequeue" logic stays consistent in ops_dequeue and the
terminal local/global post_enq paths.

Does it make sense or would you rather suppress it and only invoke on the
target when the task lands on the local DSQ??

> 
> > @@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  						     src_dsq, dst_rq);
> >  			raw_spin_unlock(&src_dsq->lock);
> >  		} else {
> > +			/*
> > +			 * Moving to a local DSQ, dispatch_enqueue() is not
> > +			 * used, so call ops.dequeue() here if the task was
> > +			 * in BPF scheduler's custody.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +				call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and then this becomes unnecessary too.

Ack + same comment about consume_remote_task().

> 
> > @@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> >  		 */
> >  		if (src_rq == dst_rq) {
> >  			p->scx.holding_cpu = -1;
> > -			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> > +			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
> >  					 enq_flags);
> >  		} else {
> > +			/*
> > +			 * Moving to a local DSQ, dispatch_enqueue() is not
> > +			 * used, so call ops.dequeue() here if the task was
> > +			 * in BPF scheduler's custody.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +				call_task_dequeue(sch, src_rq, p, 0, false);
> 
> ditto.

Ack + same as above.

Thanks,
-Andrea

  reply	other threads:[~2026-02-11 16:06 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi [this message]
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-12 17:15   ` Christian Loehle
2026-02-12 18:25     ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-06 13:54 [PATCHSET v7] sched_ext: Fix " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYyo_ALV_BT6WaE0@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox