All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	Kuba Piecuch <jpiecuch@google.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Daniel Hodges <hodgesd@meta.com>,
	sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
Date: Wed, 11 Feb 2026 17:06:20 +0100	[thread overview]
Message-ID: <aYyo_ALV_BT6WaE0@gpd4> (raw)
In-Reply-To: <aYu9KzKTMfNIFDwD@slm.duckdns.org>

Hi Tejun,

On Tue, Feb 10, 2026 at 01:20:11PM -1000, Tejun Heo wrote:
> On Tue, Feb 10, 2026 at 10:26:04PM +0100, Andrea Righi wrote:
> > +/**
> > + * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
> > + * @dsq_id: DSQ ID to check
> > + *
> > + * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
> > + * scheduler is considered "done" with the task.
> > + *
> > + * Builtin DSQs include:
> > + *  - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
> > + *    where tasks go directly to execution,
> > + *  - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
> > + *  - Bypass DSQ: used during bypass mode.
> > + *
> > + * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
> > + * trigger ops.dequeue() when they are later consumed.
> > + */
> > +static inline bool is_terminal_dsq(u64 dsq_id)
> > +{
> > +	return dsq_id & SCX_DSQ_FLAG_BUILTIN && dsq_id != SCX_DSQ_INVALID;
> > +}
> 
> Please use () do clarify ordering between & and &&. It's just visually
> confusing. I wonder whether it'd be cleaner to make it take @dsq instead of
> @dsq_id and then it can just do:
> 
>         return dsq->id == SCX_DSQ_LOCAL || dsq->id == SCX_DSQ_GLOBAL;
> 
> because SCX_DSQ_LOCAL_ON is only used as the designator not as actual DSQ
> id, and the above code positively identifies what's terminal.

Ok, but we also need to include SCX_DSQ_BYPASS, in that case maybe checking
SCX_DSQ_FLAG_BUILTIN is more generic?

> 
> > -static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
> > +static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > +			     struct scx_dispatch_q *dsq,
> >  			     struct task_struct *p, u64 enq_flags)
> 
> While minor, this patch would be easier to read if the @rq addition were
> done in a separate patch.

Ack. I'll split that out.

> 
> > +static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
> > +			      struct task_struct *p, u64 deq_flags,
> > +			      bool is_sched_change)
> 
> Isn't @is_sched_change a bit of misnomer given that it needs to exclude
> SCX_DEQ_CORE_SCHED_EXEC. I wonder whether it'd be easier if @deq_flags
> handling is separated out. This part is ops_dequeue() specific, right?
> Everyone else statically knows what DEQ flags to use. That might make
> ops_dequeue() calculate flags unnecessarily but ops_dequeue() is not
> particularly hot, so I don't think that'd matter.

Ack, I'll handle deq_flags in ops_dequeue() and simplify
call_task_dequeue() accordingly.

> 
> > +{
> > +	if (SCX_HAS_OP(sch, dequeue)) {
> > +		/*
> > +		 * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
> > +		 * property change (not sleep or core-sched pick).
> > +		 */
> > +		if (is_sched_change &&
> > +		    !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> > +			deq_flags |= SCX_DEQ_SCHED_CHANGE;
> > +
> > +		SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
> > +	}
> > +	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> 
> Let's move flag clearing to the call sites. It's a bit confusing w/ the
> function name.

Ack.

> 
> >  static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  {
> >  	struct scx_sched *sch = scx_root;
> > @@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >  
> >  	switch (opss & SCX_OPSS_STATE_MASK) {
> >  	case SCX_OPSS_NONE:
> > +		/*
> > +		 * If the task is still in BPF scheduler's custody
> > +		 * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, rq, p, deq_flags, true);
> 
> Hmm... why is this path necessary? Shouldn't the one that cleared OPSS be
> responsible for clearing IN_CUSTODY too?

The path that clears OPSS to NONE doesn't always clear IN_CUSTODY: in
dispatch_to_local_dsq(), when we're moving a task that was in DISPATCHING
to a remote CPU's local DSQ, we only set ops_state to NONE, so a concurrent
dequeue can proceed, but we only clear IN_CUSTODY when we later enqueue or
move the task. So we can see NONE + IN_CUSTODY here and need to handle it.
And we can't clear IN_CUSTODY at the same time we set NONE there, because
we don't hold the task's rq lock yet and we can't trigger ops.dequeue().

> 
> > @@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  					 struct scx_dispatch_q *src_dsq,
> >  					 struct rq *dst_rq)
> >  {
> > +	struct scx_sched *sch = scx_root;
> >  	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
> >  
> >  	/* @dsq is locked and @p is on @dst_rq */
> > @@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
> >  
> >  	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> >  
> > +	/*
> > +	 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +	 * Call ops.dequeue() if the task was in BPF custody.
> > +	 */
> > +	if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
> > +		if (SCX_HAS_OP(sch, dequeue))
> > +			SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
> > +		p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > +	}
> 
> I think a better place to put this would be inside local_dsq_post_enq() so
> that dispatch_enqueue() and move_local_task_to_local_dsq() can share the
> path. This would mean breaking out local and global cases in
> dispatch_enqueue(). ie. at the end of dispatch_enqueue():
> 
>         if (is_local) {
>                 local_dsq_post_enq(...);
>         } else {
>                 if (dsq->id == SCX_DSQ_GLOBAL)
>                         global_dsq_post_enq(...);       /* or open code with comment */
>                 raw_spin_unlock(&dsq->lock);
>         }

Agreed, I'll move this into local_dsq_post_enq() and introduce
a global_dsq_post_enq().

> 
> > @@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
> >  		!WARN_ON_ONCE(src_rq != task_rq(p));
> >  }
> >  
> > -static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
> > -				struct scx_dispatch_q *dsq, struct rq *src_rq)
> > +static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
> > +			       struct task_struct *p,
> > +			       struct scx_dispatch_q *dsq, struct rq *src_rq)
> >  {
> >  	raw_spin_rq_unlock(this_rq);
> >  
> >  	if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
> > +		/*
> > +		 * Task is moving from a non-local DSQ to a local (terminal) DSQ.
> > +		 * Call ops.dequeue() if the task was in BPF custody.
> > +		 */
> > +		if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +			call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and this shouldn't be necessary. move_remote_task_to_local_dsq() deactivates
> and reactivates the task. The deactivation invokes ops_dequeue() but that
> should suppress dequeue invocation as that's internal transfer (this is
> discernable from p->on_rq being set to TASK_ON_RQ_MIGRATING) and when it
> gets enqueued on the target CPU, dispatch_enqueue() on the local DSQ should
> trigger dequeue invocation, right?

Should we trigger ops.dequeue() when the task is dequeued inside
move_remote_task_to_local_dsq() (in ops_dequeue() on the path triggered by
deactivate_task() there) instead of suppressing it and invoking on the
target in local_dsq_post_enq()?

That way the BPF sees dequeue on the source and then enqueue on the target,
we avoid special-casing SCX_TASK_IN_CUSTODY in do_enqueue_task() and the
"when to call dequeue" logic stays consistent in ops_dequeue and the
terminal local/global post_enq paths.

Does it make sense or would you rather suppress it and only invoke on the
target when the task lands on the local DSQ??

> 
> > @@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
> >  						     src_dsq, dst_rq);
> >  			raw_spin_unlock(&src_dsq->lock);
> >  		} else {
> > +			/*
> > +			 * Moving to a local DSQ, dispatch_enqueue() is not
> > +			 * used, so call ops.dequeue() here if the task was
> > +			 * in BPF scheduler's custody.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +				call_task_dequeue(sch, src_rq, p, 0, false);
> 
> and then this becomes unnecessary too.

Ack + same comment about consume_remote_task().

> 
> > @@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
> >  		 */
> >  		if (src_rq == dst_rq) {
> >  			p->scx.holding_cpu = -1;
> > -			dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
> > +			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
> >  					 enq_flags);
> >  		} else {
> > +			/*
> > +			 * Moving to a local DSQ, dispatch_enqueue() is not
> > +			 * used, so call ops.dequeue() here if the task was
> > +			 * in BPF scheduler's custody.
> > +			 */
> > +			if (p->scx.flags & SCX_TASK_IN_CUSTODY)
> > +				call_task_dequeue(sch, src_rq, p, 0, false);
> 
> ditto.

Ack + same as above.

Thanks,
-Andrea

  reply	other threads:[~2026-02-11 16:06 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-10 21:26 [PATCHSET v8] sched_ext: Fix ops.dequeue() semantics Andrea Righi
2026-02-10 21:26 ` [PATCH 1/2] " Andrea Righi
2026-02-10 23:20   ` Tejun Heo
2026-02-11 16:06     ` Andrea Righi [this message]
2026-02-11 19:47       ` Tejun Heo
2026-02-11 22:34         ` Andrea Righi
2026-02-11 22:37           ` Tejun Heo
2026-02-11 22:48             ` Andrea Righi
2026-02-12 10:16             ` Andrea Righi
2026-02-12 14:32               ` Christian Loehle
2026-02-12 15:45                 ` Andrea Righi
2026-02-12 17:07                   ` Tejun Heo
2026-02-12 18:14                     ` Andrea Righi
2026-02-12 18:35                       ` Tejun Heo
2026-02-12 22:30                         ` Andrea Righi
2026-02-14 10:16                           ` Andrea Righi
2026-02-14 17:56                             ` Tejun Heo
2026-02-14 19:32                               ` Andrea Righi
2026-02-10 23:54   ` Tejun Heo
2026-02-11 16:07     ` Andrea Righi
2026-02-10 21:26 ` [PATCH 2/2] selftests/sched_ext: Add test to validate " Andrea Righi
2026-02-12 17:15   ` Christian Loehle
2026-02-12 18:25     ` Andrea Righi
  -- strict thread matches above, loose matches on Subject: below --
2026-02-06 13:54 [PATCHSET v7] sched_ext: Fix " Andrea Righi
2026-02-06 13:54 ` [PATCH 1/2] " Andrea Righi
2026-02-06 20:35   ` Emil Tsalapatis
2026-02-07  9:26     ` Andrea Righi
2026-02-09 17:28       ` Tejun Heo
2026-02-09 19:06         ` Andrea Righi
2026-02-05 15:32 [PATCHSET v6] " Andrea Righi
2026-02-05 15:32 ` [PATCH 1/2] " Andrea Righi
2026-02-05 19:29   ` Kuba Piecuch
2026-02-05 21:32     ` Andrea Righi
2026-02-04 16:05 [PATCHSET v5] " Andrea Righi
2026-02-04 16:05 ` [PATCH 1/2] " Andrea Righi
2026-02-04 22:14   ` Tejun Heo
2026-02-05  9:26     ` Andrea Righi
2026-02-01  9:08 [PATCHSET v4 sched_ext/for-6.20] " Andrea Righi
2026-02-01  9:08 ` [PATCH 1/2] " Andrea Righi
2026-02-01 22:47   ` Christian Loehle
2026-02-02  7:45     ` Andrea Righi
2026-02-02  9:26       ` Andrea Righi
2026-02-02 10:02         ` Christian Loehle
2026-02-02 15:32           ` Andrea Righi
2026-02-02 10:09       ` Christian Loehle
2026-02-02 13:59       ` Kuba Piecuch
2026-02-04  9:36         ` Andrea Righi
2026-02-04  9:51           ` Kuba Piecuch
2026-02-02 11:56   ` Kuba Piecuch
2026-02-04 10:11     ` Andrea Righi
2026-02-04 10:33       ` Kuba Piecuch
2026-01-26  8:41 [PATCHSET v3 sched_ext/for-6.20] " Andrea Righi
2026-01-26  8:41 ` [PATCH 1/2] " Andrea Righi
2026-01-27 16:38   ` Emil Tsalapatis
2026-01-27 16:41   ` Kuba Piecuch
2026-01-30  7:34     ` Andrea Righi
2026-01-30 13:14       ` Kuba Piecuch
2026-01-31  6:54         ` Andrea Righi
2026-01-31 16:45           ` Kuba Piecuch
2026-01-31 17:24             ` Andrea Righi
2026-01-28 21:21   ` Tejun Heo
2026-01-30 11:54     ` Kuba Piecuch
2026-01-31  9:02       ` Andrea Righi
2026-01-31 17:53         ` Kuba Piecuch
2026-01-31 20:26           ` Andrea Righi
2026-02-02 15:19             ` Tejun Heo
2026-02-02 15:30               ` Andrea Righi
2026-02-01 17:43       ` Tejun Heo
2026-02-02 15:52         ` Andrea Righi
2026-02-02 16:23           ` Kuba Piecuch
2026-01-21 12:25 [PATCHSET v2 sched_ext/for-6.20] " Andrea Righi
2026-01-21 12:25 ` [PATCH 1/2] " Andrea Righi
2026-01-21 12:54   ` Christian Loehle
2026-01-21 12:57     ` Andrea Righi
2026-01-22  9:28   ` Kuba Piecuch
2026-01-23 13:32     ` Andrea Righi
2025-12-19 22:43 [PATCH 0/2] sched_ext: Implement proper " Andrea Righi
2025-12-19 22:43 ` [PATCH 1/2] sched_ext: Fix " Andrea Righi
2025-12-28  3:20   ` Emil Tsalapatis
2025-12-29 16:36     ` Andrea Righi
2025-12-29 18:35       ` Emil Tsalapatis
2025-12-28 17:19   ` Tejun Heo
2025-12-28 23:28     ` Tejun Heo
2025-12-28 23:38       ` Tejun Heo
2025-12-29 17:07         ` Andrea Righi
2025-12-29 18:55           ` Emil Tsalapatis
2025-12-28 23:42   ` Tejun Heo
2025-12-29 17:17     ` Andrea Righi
2025-12-29  0:06   ` Tejun Heo
2025-12-29 18:56     ` Andrea Righi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aYyo_ALV_BT6WaE0@gpd4 \
    --to=arighi@nvidia.com \
    --cc=changwoo@igalia.com \
    --cc=christian.loehle@arm.com \
    --cc=emil@etsalapatis.com \
    --cc=hodgesd@meta.com \
    --cc=jpiecuch@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    --cc=void@manifault.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.