* [PATCH 01/14] sched: Employ sched_change guards
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-11 9:06 ` K Prateek Nayak
2025-10-06 15:21 ` Shrikanth Hegde
2025-09-10 15:44 ` [PATCH 02/14] sched: Re-arrange the {EN,DE}QUEUE flags Peter Zijlstra
` (14 subsequent siblings)
15 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
As proposed a long while ago -- and half done by scx -- wrap the
scheduler's 'change' pattern in a guard helper.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/cleanup.h | 5 +
kernel/sched/core.c | 156 +++++++++++++++++-------------------------------
kernel/sched/ext.c | 39 +++++-------
kernel/sched/sched.h | 21 +++---
kernel/sched/syscalls.c | 65 +++++++-------------
5 files changed, 114 insertions(+), 172 deletions(-)
--- a/include/linux/cleanup.h
+++ b/include/linux/cleanup.h
@@ -340,6 +340,11 @@ _label:
#define __DEFINE_CLASS_IS_CONDITIONAL(_name, _is_cond) \
static __maybe_unused const bool class_##_name##_is_conditional = _is_cond
+#define DEFINE_CLASS_IS_UNCONDITIONAL(_name) \
+ __DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
+ static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \
+ { return (void *)1; }
+
#define __GUARD_IS_ERR(_ptr) \
({ \
unsigned long _rc = (__force unsigned long)(_ptr); \
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7361,7 +7361,7 @@ void rt_mutex_post_schedule(void)
*/
void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
{
- int prio, oldprio, queued, running, queue_flag =
+ int prio, oldprio, queue_flag =
DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
const struct sched_class *prev_class, *next_class;
struct rq_flags rf;
@@ -7426,52 +7426,42 @@ void rt_mutex_setprio(struct task_struct
if (prev_class != next_class && p->se.sched_delayed)
dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
- queued = task_on_rq_queued(p);
- running = task_current_donor(rq, p);
- if (queued)
- dequeue_task(rq, p, queue_flag);
- if (running)
- put_prev_task(rq, p);
-
- /*
- * Boosting condition are:
- * 1. -rt task is running and holds mutex A
- * --> -dl task blocks on mutex A
- *
- * 2. -dl task is running and holds mutex A
- * --> -dl task blocks on mutex A and could preempt the
- * running task
- */
- if (dl_prio(prio)) {
- if (!dl_prio(p->normal_prio) ||
- (pi_task && dl_prio(pi_task->prio) &&
- dl_entity_preempt(&pi_task->dl, &p->dl))) {
- p->dl.pi_se = pi_task->dl.pi_se;
- queue_flag |= ENQUEUE_REPLENISH;
+ scoped_guard (sched_change, p, queue_flag) {
+ /*
+ * Boosting condition are:
+ * 1. -rt task is running and holds mutex A
+ * --> -dl task blocks on mutex A
+ *
+ * 2. -dl task is running and holds mutex A
+ * --> -dl task blocks on mutex A and could preempt the
+ * running task
+ */
+ if (dl_prio(prio)) {
+ if (!dl_prio(p->normal_prio) ||
+ (pi_task && dl_prio(pi_task->prio) &&
+ dl_entity_preempt(&pi_task->dl, &p->dl))) {
+ p->dl.pi_se = pi_task->dl.pi_se;
+ scope->flags |= ENQUEUE_REPLENISH;
+ } else {
+ p->dl.pi_se = &p->dl;
+ }
+ } else if (rt_prio(prio)) {
+ if (dl_prio(oldprio))
+ p->dl.pi_se = &p->dl;
+ if (oldprio < prio)
+ scope->flags |= ENQUEUE_HEAD;
} else {
- p->dl.pi_se = &p->dl;
+ if (dl_prio(oldprio))
+ p->dl.pi_se = &p->dl;
+ if (rt_prio(oldprio))
+ p->rt.timeout = 0;
}
- } else if (rt_prio(prio)) {
- if (dl_prio(oldprio))
- p->dl.pi_se = &p->dl;
- if (oldprio < prio)
- queue_flag |= ENQUEUE_HEAD;
- } else {
- if (dl_prio(oldprio))
- p->dl.pi_se = &p->dl;
- if (rt_prio(oldprio))
- p->rt.timeout = 0;
- }
- p->sched_class = next_class;
- p->prio = prio;
+ p->sched_class = next_class;
+ p->prio = prio;
- check_class_changing(rq, p, prev_class);
-
- if (queued)
- enqueue_task(rq, p, queue_flag);
- if (running)
- set_next_task(rq, p);
+ check_class_changing(rq, p, prev_class);
+ }
check_class_changed(rq, p, prev_class, oldprio);
out_unlock:
@@ -8119,26 +8109,9 @@ int migrate_task_to(struct task_struct *
*/
void sched_setnuma(struct task_struct *p, int nid)
{
- bool queued, running;
- struct rq_flags rf;
- struct rq *rq;
-
- rq = task_rq_lock(p, &rf);
- queued = task_on_rq_queued(p);
- running = task_current_donor(rq, p);
-
- if (queued)
- dequeue_task(rq, p, DEQUEUE_SAVE);
- if (running)
- put_prev_task(rq, p);
-
- p->numa_preferred_nid = nid;
-
- if (queued)
- enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
- if (running)
- set_next_task(rq, p);
- task_rq_unlock(rq, p, &rf);
+ guard(task_rq_lock)(p);
+ scoped_guard (sched_change, p, DEQUEUE_SAVE)
+ p->numa_preferred_nid = nid;
}
#endif /* CONFIG_NUMA_BALANCING */
@@ -9240,8 +9213,9 @@ static void sched_change_group(struct ta
*/
void sched_move_task(struct task_struct *tsk, bool for_autogroup)
{
- int queued, running, queue_flags =
+ unsigned int queue_flags =
DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ bool resched = false;
struct rq *rq;
CLASS(task_rq_lock, rq_guard)(tsk);
@@ -9249,28 +9223,12 @@ void sched_move_task(struct task_struct
update_rq_clock(rq);
- running = task_current_donor(rq, tsk);
- queued = task_on_rq_queued(tsk);
-
- if (queued)
- dequeue_task(rq, tsk, queue_flags);
- if (running)
- put_prev_task(rq, tsk);
-
- sched_change_group(tsk);
- if (!for_autogroup)
- scx_cgroup_move_task(tsk);
-
- if (queued)
- enqueue_task(rq, tsk, queue_flags);
- if (running) {
- set_next_task(rq, tsk);
- /*
- * After changing group, the running task may have joined a
- * throttled one but it's still the running task. Trigger a
- * resched to make sure that task can still run.
- */
- resched_curr(rq);
+ scoped_guard (sched_change, tsk, queue_flags) {
+ sched_change_group(tsk);
+ if (!for_autogroup)
+ scx_cgroup_move_task(tsk);
+ if (scope->running)
+ resched = true;
}
}
@@ -10929,37 +10887,39 @@ void sched_mm_cid_fork(struct task_struc
}
#endif /* CONFIG_SCHED_MM_CID */
-#ifdef CONFIG_SCHED_CLASS_EXT
-void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
- struct sched_enq_and_set_ctx *ctx)
+static DEFINE_PER_CPU(struct sched_change_ctx, sched_change_ctx);
+
+struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int flags)
{
+ struct sched_change_ctx *ctx = this_cpu_ptr(&sched_change_ctx);
struct rq *rq = task_rq(p);
lockdep_assert_rq_held(rq);
- *ctx = (struct sched_enq_and_set_ctx){
+ *ctx = (struct sched_change_ctx){
.p = p,
- .queue_flags = queue_flags,
+ .flags = flags,
.queued = task_on_rq_queued(p),
.running = task_current(rq, p),
};
- update_rq_clock(rq);
if (ctx->queued)
- dequeue_task(rq, p, queue_flags | DEQUEUE_NOCLOCK);
+ dequeue_task(rq, p, flags);
if (ctx->running)
put_prev_task(rq, p);
+
+ return ctx;
}
-void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
+void sched_change_end(struct sched_change_ctx *ctx)
{
- struct rq *rq = task_rq(ctx->p);
+ struct task_struct *p = ctx->p;
+ struct rq *rq = task_rq(p);
lockdep_assert_rq_held(rq);
if (ctx->queued)
- enqueue_task(rq, ctx->p, ctx->queue_flags | ENQUEUE_NOCLOCK);
+ enqueue_task(rq, p, ctx->flags | ENQUEUE_NOCLOCK);
if (ctx->running)
- set_next_task(rq, ctx->p);
+ set_next_task(rq, p);
}
-#endif /* CONFIG_SCHED_CLASS_EXT */
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4867,11 +4867,10 @@ static void scx_bypass(bool bypass)
*/
list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list,
scx.runnable_node) {
- struct sched_enq_and_set_ctx ctx;
-
/* cycling deq/enq is enough, see the function comment */
- sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
- sched_enq_and_set_task(&ctx);
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /* nothing */ ;
+ }
}
/* resched to restore ticks and idle state */
@@ -5003,17 +5002,16 @@ static void scx_disable_workfn(struct kt
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
- struct sched_enq_and_set_ctx ctx;
-
- if (old_class != new_class && p->se.sched_delayed)
- dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
- sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+ update_rq_clock(task_rq(p));
- p->sched_class = new_class;
- check_class_changing(task_rq(p), p, old_class);
+ if (old_class != new_class && p->se.sched_delayed)
+ dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
- sched_enq_and_set_task(&ctx);
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK) {
+ p->sched_class = new_class;
+ check_class_changing(task_rq(p), p, old_class);
+ }
check_class_changed(task_rq(p), p, old_class, p->prio);
scx_exit_task(p);
@@ -5747,21 +5745,20 @@ static int scx_enable(struct sched_ext_o
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
- struct sched_enq_and_set_ctx ctx;
if (!tryget_task_struct(p))
continue;
- if (old_class != new_class && p->se.sched_delayed)
- dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
-
- sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
+ update_rq_clock(task_rq(p));
- p->scx.slice = SCX_SLICE_DFL;
- p->sched_class = new_class;
- check_class_changing(task_rq(p), p, old_class);
+ if (old_class != new_class && p->se.sched_delayed)
+ dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
- sched_enq_and_set_task(&ctx);
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK) {
+ p->scx.slice = SCX_SLICE_DFL;
+ p->sched_class = new_class;
+ check_class_changing(task_rq(p), p, old_class);
+ }
check_class_changed(task_rq(p), p, old_class, p->prio);
put_task_struct(p);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3860,23 +3860,22 @@ extern void check_class_changed(struct r
extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
-#ifdef CONFIG_SCHED_CLASS_EXT
-/*
- * Used by SCX in the enable/disable paths to move tasks between sched_classes
- * and establish invariants.
- */
-struct sched_enq_and_set_ctx {
+struct sched_change_ctx {
struct task_struct *p;
- int queue_flags;
+ int flags;
bool queued;
bool running;
};
-void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
- struct sched_enq_and_set_ctx *ctx);
-void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
+struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int flags);
+void sched_change_end(struct sched_change_ctx *ctx);
-#endif /* CONFIG_SCHED_CLASS_EXT */
+DEFINE_CLASS(sched_change, struct sched_change_ctx *,
+ sched_change_end(_T),
+ sched_change_begin(p, flags),
+ struct task_struct *p, unsigned int flags)
+
+DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
#include "ext.h"
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -64,7 +64,6 @@ static int effective_prio(struct task_st
void set_user_nice(struct task_struct *p, long nice)
{
- bool queued, running;
struct rq *rq;
int old_prio;
@@ -90,22 +89,12 @@ void set_user_nice(struct task_struct *p
return;
}
- queued = task_on_rq_queued(p);
- running = task_current_donor(rq, p);
- if (queued)
- dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
- if (running)
- put_prev_task(rq, p);
-
- p->static_prio = NICE_TO_PRIO(nice);
- set_load_weight(p, true);
- old_prio = p->prio;
- p->prio = effective_prio(p);
-
- if (queued)
- enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
- if (running)
- set_next_task(rq, p);
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+ p->static_prio = NICE_TO_PRIO(nice);
+ set_load_weight(p, true);
+ old_prio = p->prio;
+ p->prio = effective_prio(p);
+ }
/*
* If the task increased its priority or is running and
@@ -515,7 +504,7 @@ int __sched_setscheduler(struct task_str
bool user, bool pi)
{
int oldpolicy = -1, policy = attr->sched_policy;
- int retval, oldprio, newprio, queued, running;
+ int retval, oldprio, newprio;
const struct sched_class *prev_class, *next_class;
struct balance_callback *head;
struct rq_flags rf;
@@ -698,33 +687,25 @@ int __sched_setscheduler(struct task_str
if (prev_class != next_class && p->se.sched_delayed)
dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
- queued = task_on_rq_queued(p);
- running = task_current_donor(rq, p);
- if (queued)
- dequeue_task(rq, p, queue_flags);
- if (running)
- put_prev_task(rq, p);
-
- if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
- __setscheduler_params(p, attr);
- p->sched_class = next_class;
- p->prio = newprio;
- }
- __setscheduler_uclamp(p, attr);
- check_class_changing(rq, p, prev_class);
+ scoped_guard (sched_change, p, queue_flags) {
- if (queued) {
- /*
- * We enqueue to tail when the priority of a task is
- * increased (user space view).
- */
- if (oldprio < p->prio)
- queue_flags |= ENQUEUE_HEAD;
+ if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
+ __setscheduler_params(p, attr);
+ p->sched_class = next_class;
+ p->prio = newprio;
+ }
+ __setscheduler_uclamp(p, attr);
+ check_class_changing(rq, p, prev_class);
- enqueue_task(rq, p, queue_flags);
+ if (scope->queued) {
+ /*
+ * We enqueue to tail when the priority of a task is
+ * increased (user space view).
+ */
+ if (oldprio < p->prio)
+ scope->flags |= ENQUEUE_HEAD;
+ }
}
- if (running)
- set_next_task(rq, p);
check_class_changed(rq, p, prev_class, oldprio);
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
@ 2025-09-11 9:06 ` K Prateek Nayak
2025-09-11 9:55 ` Peter Zijlstra
2025-10-06 15:21 ` Shrikanth Hegde
1 sibling, 1 reply; 68+ messages in thread
From: K Prateek Nayak @ 2025-09-11 9:06 UTC (permalink / raw)
To: Peter Zijlstra, tj
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello Peter,
On 9/10/2025 9:14 PM, Peter Zijlstra wrote:
> @@ -9240,8 +9213,9 @@ static void sched_change_group(struct ta
> */
> void sched_move_task(struct task_struct *tsk, bool for_autogroup)
> {
> - int queued, running, queue_flags =
> + unsigned int queue_flags =
> DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
nit.
Since we don't do a complete dequeue for delayed task in
sched_move_task(), can we get rid of that DEQUEUE_NOCLOCK and ...
> + bool resched = false;
> struct rq *rq;
>
> CLASS(task_rq_lock, rq_guard)(tsk);
> @@ -9249,28 +9223,12 @@ void sched_move_task(struct task_struct
>
> update_rq_clock(rq);
... this clock update and instead rely on sched_change_begin() to
handle it within the guard?
>
> - running = task_current_donor(rq, tsk);
> - queued = task_on_rq_queued(tsk);
> -
> - if (queued)
> - dequeue_task(rq, tsk, queue_flags);
> - if (running)
> - put_prev_task(rq, tsk);
> -
> - sched_change_group(tsk);
> - if (!for_autogroup)
> - scx_cgroup_move_task(tsk);
> -
> - if (queued)
> - enqueue_task(rq, tsk, queue_flags);
> - if (running) {
> - set_next_task(rq, tsk);
> - /*
> - * After changing group, the running task may have joined a
> - * throttled one but it's still the running task. Trigger a
> - * resched to make sure that task can still run.
> - */
> - resched_curr(rq);
> + scoped_guard (sched_change, tsk, queue_flags) {
> + sched_change_group(tsk);
> + if (!for_autogroup)
> + scx_cgroup_move_task(tsk);
> + if (scope->running)
> + resched = true;
> }
Also, are we missing a:
if (resched)
resched_curr(rq);
here after the guard? I don't see anything in sched_change_end() at this
point that would trigger a resched.
> }
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-09-11 9:06 ` K Prateek Nayak
@ 2025-09-11 9:55 ` Peter Zijlstra
2025-09-11 10:10 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-11 9:55 UTC (permalink / raw)
To: K Prateek Nayak
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 11, 2025 at 02:36:21PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 9/10/2025 9:14 PM, Peter Zijlstra wrote:
> > @@ -9240,8 +9213,9 @@ static void sched_change_group(struct ta
> > */
> > void sched_move_task(struct task_struct *tsk, bool for_autogroup)
> > {
> > - int queued, running, queue_flags =
> > + unsigned int queue_flags =
> > DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
>
> nit.
>
> Since we don't do a complete dequeue for delayed task in
> sched_move_task(), can we get rid of that DEQUEUE_NOCLOCK and ...
>
> > + bool resched = false;
> > struct rq *rq;
> >
> > CLASS(task_rq_lock, rq_guard)(tsk);
> > @@ -9249,28 +9223,12 @@ void sched_move_task(struct task_struct
> >
> > update_rq_clock(rq);
>
> ... this clock update and instead rely on sched_change_begin() to
> handle it within the guard?
Yeah, I suppose we could. But let me try and do that in a later patch,
on-top of all this.
> > - running = task_current_donor(rq, tsk);
> > - queued = task_on_rq_queued(tsk);
> > -
> > - if (queued)
> > - dequeue_task(rq, tsk, queue_flags);
> > - if (running)
> > - put_prev_task(rq, tsk);
> > -
> > - sched_change_group(tsk);
> > - if (!for_autogroup)
> > - scx_cgroup_move_task(tsk);
> > -
> > - if (queued)
> > - enqueue_task(rq, tsk, queue_flags);
> > - if (running) {
> > - set_next_task(rq, tsk);
> > - /*
> > - * After changing group, the running task may have joined a
> > - * throttled one but it's still the running task. Trigger a
> > - * resched to make sure that task can still run.
> > - */
> > - resched_curr(rq);
> > + scoped_guard (sched_change, tsk, queue_flags) {
> > + sched_change_group(tsk);
> > + if (!for_autogroup)
> > + scx_cgroup_move_task(tsk);
> > + if (scope->running)
> > + resched = true;
> > }
>
> Also, are we missing a:
>
> if (resched)
> resched_curr(rq);
>
> here after the guard? I don't see anything in sched_change_end() at this
> point that would trigger a resched.
Bah, yes. That hunk must've gone missing in one of the many rebases I
did while folding back fixes :/
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-09-11 9:55 ` Peter Zijlstra
@ 2025-09-11 10:10 ` Peter Zijlstra
2025-09-11 10:37 ` K Prateek Nayak
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-11 10:10 UTC (permalink / raw)
To: K Prateek Nayak
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 11, 2025 at 11:55:23AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2025 at 02:36:21PM +0530, K Prateek Nayak wrote:
> > Hello Peter,
> >
> > On 9/10/2025 9:14 PM, Peter Zijlstra wrote:
> > > @@ -9240,8 +9213,9 @@ static void sched_change_group(struct ta
> > > */
> > > void sched_move_task(struct task_struct *tsk, bool for_autogroup)
> > > {
> > > - int queued, running, queue_flags =
> > > + unsigned int queue_flags =
> > > DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
> >
> > nit.
> >
> > Since we don't do a complete dequeue for delayed task in
> > sched_move_task(), can we get rid of that DEQUEUE_NOCLOCK and ...
> >
> > > + bool resched = false;
> > > struct rq *rq;
> > >
> > > CLASS(task_rq_lock, rq_guard)(tsk);
> > > @@ -9249,28 +9223,12 @@ void sched_move_task(struct task_struct
> > >
> > > update_rq_clock(rq);
> >
> > ... this clock update and instead rely on sched_change_begin() to
> > handle it within the guard?
>
> Yeah, I suppose we could. But let me try and do that in a later patch,
> on-top of all this.
Something like so?
---
core.c | 33 +++++++++++----------------------
ext.c | 12 ++++--------
syscalls.c | 4 +---
3 files changed, 16 insertions(+), 33 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2359,10 +2359,8 @@ static void migrate_disable_switch(struc
if (p->cpus_ptr != &p->cpus_mask)
return;
- scoped_guard (task_rq_lock, p) {
- update_rq_clock(scope.rq);
+ scoped_guard (task_rq_lock, p)
do_set_cpus_allowed(p, &ac);
- }
}
void migrate_disable(void)
@@ -2716,9 +2714,7 @@ void set_cpus_allowed_common(struct task
static void
do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
{
- u32 flags = DEQUEUE_SAVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
-
- scoped_guard (sched_change, p, flags) {
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_LOCKED) {
p->sched_class->set_cpus_allowed(p, ctx);
mm_set_cpus_allowed(p->mm, ctx->new_mask);
}
@@ -2740,10 +2736,8 @@ void set_cpus_allowed_force(struct task_
struct rcu_head rcu;
};
- scoped_guard (__task_rq_lock, p) {
- update_rq_clock(scope.rq);
+ scoped_guard (__task_rq_lock, p)
do_set_cpus_allowed(p, &ac);
- }
/*
* Because this is called with p->pi_lock held, it is not possible
@@ -9159,16 +9153,13 @@ static void sched_change_group(struct ta
*/
void sched_move_task(struct task_struct *tsk, bool for_autogroup)
{
- unsigned int queue_flags =
- DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_LOCKED;
bool resched = false;
struct rq *rq;
CLASS(task_rq_lock, rq_guard)(tsk);
rq = rq_guard.rq;
- update_rq_clock(rq);
-
scoped_guard (sched_change, tsk, queue_flags) {
sched_change_group(tsk);
if (!for_autogroup)
@@ -10852,19 +10843,17 @@ struct sched_change_ctx *sched_change_be
}
#endif
+ if (!(flags & DEQUEUE_NOCLOCK)) {
+ update_rq_clock(rq);
+ flags |= DEQUEUE_NOCLOCK;
+ }
+
if (flags & DEQUEUE_CLASS) {
if (WARN_ON_ONCE(flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)))
flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
- if (p->sched_class->switching_from) {
- /*
- * switching_from_fair() assumes CLASS implies NOCLOCK;
- * fixing this assumption would mean switching_from()
- * would need to be able to change flags.
- */
- WARN_ON(!(flags & DEQUEUE_NOCLOCK));
+ if (p->sched_class->switching_from)
p->sched_class->switching_from(rq, p);
- }
}
*ctx = (struct sched_change_ctx){
@@ -10915,7 +10904,7 @@ void sched_change_end(struct sched_chang
p->sched_class->switching_to(rq, p);
if (ctx->queued)
- enqueue_task(rq, p, ctx->flags | ENQUEUE_NOCLOCK);
+ enqueue_task(rq, p, ctx->flags);
if (ctx->running)
set_next_task(rq, p, ctx->flags);
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5018,14 +5018,12 @@ static void scx_disable_workfn(struct kt
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
- unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE |
- DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
+ unsigned int queue_flags =
+ DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_LOCKED;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
- update_rq_clock(task_rq(p));
-
if (old_class != new_class) {
queue_flags |= DEQUEUE_CLASS;
queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
@@ -5763,8 +5761,8 @@ static int scx_enable(struct sched_ext_o
percpu_down_write(&scx_fork_rwsem);
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
- unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE |
- DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
+ unsigned int queue_flags =
+ DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_LOCKED;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
@@ -5772,8 +5770,6 @@ static int scx_enable(struct sched_ext_o
if (!tryget_task_struct(p))
continue;
- update_rq_clock(task_rq(p));
-
if (old_class != new_class) {
queue_flags |= DEQUEUE_CLASS;
queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -76,8 +76,6 @@ void set_user_nice(struct task_struct *p
CLASS(task_rq_lock, rq_guard)(p);
rq = rq_guard.rq;
- update_rq_clock(rq);
-
/*
* The RT priorities are set via sched_setscheduler(), but we still
* allow the 'normal' nice value to be set - but as expected
@@ -89,7 +87,7 @@ void set_user_nice(struct task_struct *p
return;
}
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED) {
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_LOCKED) {
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p, true);
old_prio = p->prio;
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-09-11 10:10 ` Peter Zijlstra
@ 2025-09-11 10:37 ` K Prateek Nayak
0 siblings, 0 replies; 68+ messages in thread
From: K Prateek Nayak @ 2025-09-11 10:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello Peter,
On 9/11/2025 3:40 PM, Peter Zijlstra wrote:
>> Yeah, I suppose we could. But let me try and do that in a later patch,
>> on-top of all this.
Sure thing.
>
> Something like so?
Yup! That whole lot looks better. Thank you.
>
> [..snip..]
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
2025-09-11 9:06 ` K Prateek Nayak
@ 2025-10-06 15:21 ` Shrikanth Hegde
2025-10-06 18:14 ` Peter Zijlstra
1 sibling, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-10-06 15:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, tj
On 9/10/25 9:14 PM, Peter Zijlstra wrote:
> As proposed a long while ago -- and half done by scx -- wrap the
> scheduler's 'change' pattern in a guard helper.
>
[...]> put_task_struct(p);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3860,23 +3860,22 @@ extern void check_class_changed(struct r
> extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
> extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
>
> -#ifdef CONFIG_SCHED_CLASS_EXT
> -/*
> - * Used by SCX in the enable/disable paths to move tasks between sched_classes
> - * and establish invariants.
> - */
> -struct sched_enq_and_set_ctx {
> +struct sched_change_ctx {
> struct task_struct *p;
> - int queue_flags;
> + int flags;
> bool queued;
> bool running;
> };
>
> -void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
> - struct sched_enq_and_set_ctx *ctx);
> -void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
> +struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int flags);
> +void sched_change_end(struct sched_change_ctx *ctx);
>
> -#endif /* CONFIG_SCHED_CLASS_EXT */
> +DEFINE_CLASS(sched_change, struct sched_change_ctx *,
> + sched_change_end(_T),
> + sched_change_begin(p, flags),
> + struct task_struct *p, unsigned int flags)
> +
> +DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>
> #include "ext.h"
>
could you please add a comment on matching flags on dequeue/enqueue
here?
Since the ctx->flags don't get cleared, one could be left wondering how
the enqueue happens (e.g. ENQUEUE_RESTORE) until they see it works
because the flags match.
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-10-06 15:21 ` Shrikanth Hegde
@ 2025-10-06 18:14 ` Peter Zijlstra
2025-10-07 5:12 ` Shrikanth Hegde
2025-10-16 9:33 ` [tip: sched/core] sched: Mandate shared flags for sched_change tip-bot2 for Peter Zijlstra
0 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-10-06 18:14 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, tj
On Mon, Oct 06, 2025 at 08:51:27PM +0530, Shrikanth Hegde wrote:
>
>
> On 9/10/25 9:14 PM, Peter Zijlstra wrote:
> > As proposed a long while ago -- and half done by scx -- wrap the
> > scheduler's 'change' pattern in a guard helper.
> >
> [...]> put_task_struct(p);
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -3860,23 +3860,22 @@ extern void check_class_changed(struct r
> > extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
> > extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
> > -#ifdef CONFIG_SCHED_CLASS_EXT
> > -/*
> > - * Used by SCX in the enable/disable paths to move tasks between sched_classes
> > - * and establish invariants.
> > - */
> > -struct sched_enq_and_set_ctx {
> > +struct sched_change_ctx {
> > struct task_struct *p;
> > - int queue_flags;
> > + int flags;
> > bool queued;
> > bool running;
> > };
> > -void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
> > - struct sched_enq_and_set_ctx *ctx);
> > -void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
> > +struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int flags);
> > +void sched_change_end(struct sched_change_ctx *ctx);
> > -#endif /* CONFIG_SCHED_CLASS_EXT */
> > +DEFINE_CLASS(sched_change, struct sched_change_ctx *,
> > + sched_change_end(_T),
> > + sched_change_begin(p, flags),
> > + struct task_struct *p, unsigned int flags)
> > +
> > +DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
> > #include "ext.h"
> could you please add a comment on matching flags on dequeue/enqueue
> here?
Would something like so be okay? This assumes at least the second patch
is applied as well.
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10783,6 +10783,12 @@ struct sched_change_ctx *sched_change_be
struct sched_change_ctx *ctx = this_cpu_ptr(&sched_change_ctx);
struct rq *rq = task_rq(p);
+ /*
+ * Must exclusively use matched flags since this is both dequeue and
+ * enqueue.
+ */
+ WARN_ON_ONCE(flags & 0xFFFF0000);
+
lockdep_assert_rq_held(rq);
if (!(flags & DEQUEUE_NOCLOCK)) {
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-10-06 18:14 ` Peter Zijlstra
@ 2025-10-07 5:12 ` Shrikanth Hegde
2025-10-07 9:34 ` Peter Zijlstra
2025-10-16 9:33 ` [tip: sched/core] sched: Mandate shared flags for sched_change tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-10-07 5:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, tj
On 10/6/25 11:44 PM, Peter Zijlstra wrote:
> On Mon, Oct 06, 2025 at 08:51:27PM +0530, Shrikanth Hegde wrote:
>>
>>
>> On 9/10/25 9:14 PM, Peter Zijlstra wrote:
>>> As proposed a long while ago -- and half done by scx -- wrap the
>>> scheduler's 'change' pattern in a guard helper.
>>>
>> [...]> put_task_struct(p);
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -3860,23 +3860,22 @@ extern void check_class_changed(struct r
>>> extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
>>> extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
>>> -#ifdef CONFIG_SCHED_CLASS_EXT
>>> -/*
>>> - * Used by SCX in the enable/disable paths to move tasks between sched_classes
>>> - * and establish invariants.
>>> - */
>>> -struct sched_enq_and_set_ctx {
>>> +struct sched_change_ctx {
>>> struct task_struct *p;
>>> - int queue_flags;
>>> + int flags;
>>> bool queued;
>>> bool running;
>>> };
>>> -void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
>>> - struct sched_enq_and_set_ctx *ctx);
>>> -void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
>>> +struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int flags);
>>> +void sched_change_end(struct sched_change_ctx *ctx);
>>> -#endif /* CONFIG_SCHED_CLASS_EXT */
>>> +DEFINE_CLASS(sched_change, struct sched_change_ctx *,
>>> + sched_change_end(_T),
>>> + sched_change_begin(p, flags),
>>> + struct task_struct *p, unsigned int flags)
>>> +
>>> +DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
>>> #include "ext.h"
>> could you please add a comment on matching flags on dequeue/enqueue
>> here?
>
> Would something like so be okay? This assumes at least the second patch
> is applied as well.
>
> ---
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10783,6 +10783,12 @@ struct sched_change_ctx *sched_change_be
> struct sched_change_ctx *ctx = this_cpu_ptr(&sched_change_ctx);
> struct rq *rq = task_rq(p);
>
> + /*
> + * Must exclusively use matched flags since this is both dequeue and
> + * enqueue.
> + */
yes. Something like that. Unless callsites explicitly change the flags using
the scope, enqueue will happen with matching flags.
> + WARN_ON_ONCE(flags & 0xFFFF0000);
> +
A mythical example:
scope_guard(sched_change, p, DEQUEUE_THROTTLE)
scope->flags &= ~DEQUEUE_THROTTLE;
scope->flags |= ENQUEUE_HEAD;
But, One could still do this right? for such users the warning may be wrong.
> lockdep_assert_rq_held(rq);
>
> if (!(flags & DEQUEUE_NOCLOCK)) {
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 01/14] sched: Employ sched_change guards
2025-10-07 5:12 ` Shrikanth Hegde
@ 2025-10-07 9:34 ` Peter Zijlstra
0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-10-07 9:34 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx, tj
On Tue, Oct 07, 2025 at 10:42:29AM +0530, Shrikanth Hegde wrote:
> On 10/6/25 11:44 PM, Peter Zijlstra wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -10783,6 +10783,12 @@ struct sched_change_ctx *sched_change_be
> > struct sched_change_ctx *ctx = this_cpu_ptr(&sched_change_ctx);
> > struct rq *rq = task_rq(p);
> > + /*
> > + * Must exclusively use matched flags since this is both dequeue and
> > + * enqueue.
> > + */
>
> yes. Something like that. Unless callsites explicitly change the flags using
> the scope, enqueue will happen with matching flags.
>
> > + WARN_ON_ONCE(flags & 0xFFFF0000);
> > +
>
> A mythical example:
> scope_guard(sched_change, p, DEQUEUE_THROTTLE)
> scope->flags &= ~DEQUEUE_THROTTLE;
> scope->flags |= ENQUEUE_HEAD;
>
> But, One could still do this right? for such users the warning may be wrong.
Right, I suppose this would be possible. Lets worry about it if/when it
ever comes up though.
^ permalink raw reply [flat|nested] 68+ messages in thread
* [tip: sched/core] sched: Mandate shared flags for sched_change
2025-10-06 18:14 ` Peter Zijlstra
2025-10-07 5:12 ` Shrikanth Hegde
@ 2025-10-16 9:33 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 68+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-10-16 9:33 UTC (permalink / raw)
To: linux-tip-commits
Cc: Shrikanth Hegde, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 73ec89a1ce4bce98f74b6520a95e64cd9986aae5
Gitweb: https://git.kernel.org/tip/73ec89a1ce4bce98f74b6520a95e64cd9986aae5
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 06 Oct 2025 20:12:34 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 16 Oct 2025 11:13:54 +02:00
sched: Mandate shared flags for sched_change
Shrikanth noted that the sched_change pattern relies on using shared
flags.
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3d5659f..e2199e4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10781,6 +10781,12 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
struct sched_change_ctx *ctx = this_cpu_ptr(&sched_change_ctx);
struct rq *rq = task_rq(p);
+ /*
+ * Must exclusively use matched flags since this is both dequeue and
+ * enqueue.
+ */
+ WARN_ON_ONCE(flags & 0xFFFF0000);
+
lockdep_assert_rq_held(rq);
if (!(flags & DEQUEUE_NOCLOCK)) {
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH 02/14] sched: Re-arrange the {EN,DE}QUEUE flags
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 03/14] sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern Peter Zijlstra
` (13 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Ensure the matched flags are in the low word while the unmatched flags
go into the second word.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/sched.h | 45 ++++++++++++++++++++++++---------------------
1 file changed, 24 insertions(+), 21 deletions(-)
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2337,27 +2337,30 @@ extern const u32 sched_prio_to_wmult[40
*
*/
-#define DEQUEUE_SLEEP 0x01 /* Matches ENQUEUE_WAKEUP */
-#define DEQUEUE_SAVE 0x02 /* Matches ENQUEUE_RESTORE */
-#define DEQUEUE_MOVE 0x04 /* Matches ENQUEUE_MOVE */
-#define DEQUEUE_NOCLOCK 0x08 /* Matches ENQUEUE_NOCLOCK */
-#define DEQUEUE_SPECIAL 0x10
-#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
-#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
-#define DEQUEUE_THROTTLE 0x800
-
-#define ENQUEUE_WAKEUP 0x01
-#define ENQUEUE_RESTORE 0x02
-#define ENQUEUE_MOVE 0x04
-#define ENQUEUE_NOCLOCK 0x08
-
-#define ENQUEUE_HEAD 0x10
-#define ENQUEUE_REPLENISH 0x20
-#define ENQUEUE_MIGRATED 0x40
-#define ENQUEUE_INITIAL 0x80
-#define ENQUEUE_MIGRATING 0x100
-#define ENQUEUE_DELAYED 0x200
-#define ENQUEUE_RQ_SELECTED 0x400
+#define DEQUEUE_SLEEP 0x0001 /* Matches ENQUEUE_WAKEUP */
+#define DEQUEUE_SAVE 0x0002 /* Matches ENQUEUE_RESTORE */
+#define DEQUEUE_MOVE 0x0004 /* Matches ENQUEUE_MOVE */
+#define DEQUEUE_NOCLOCK 0x0008 /* Matches ENQUEUE_NOCLOCK */
+
+#define DEQUEUE_MIGRATING 0x0010 /* Matches ENQUEUE_MIGRATING */
+#define DEQUEUE_DELAYED 0x0020 /* Matches ENQUEUE_DELAYED */
+
+#define DEQUEUE_SPECIAL 0x00010000
+#define DEQUEUE_THROTTLE 0x00020000
+
+#define ENQUEUE_WAKEUP 0x0001
+#define ENQUEUE_RESTORE 0x0002
+#define ENQUEUE_MOVE 0x0004
+#define ENQUEUE_NOCLOCK 0x0008
+
+#define ENQUEUE_MIGRATING 0x0010
+#define ENQUEUE_DELAYED 0x0020
+
+#define ENQUEUE_HEAD 0x00010000
+#define ENQUEUE_REPLENISH 0x00020000
+#define ENQUEUE_MIGRATED 0x00040000
+#define ENQUEUE_INITIAL 0x00080000
+#define ENQUEUE_RQ_SELECTED 0x00100000
#define RETRY_TASK ((void *)-1UL)
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 03/14] sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
2025-09-10 15:44 ` [PATCH 01/14] sched: Employ sched_change guards Peter Zijlstra
2025-09-10 15:44 ` [PATCH 02/14] sched: Re-arrange the {EN,DE}QUEUE flags Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 04/14] sched: Cleanup sched_delayed handling for class switches Peter Zijlstra
` (12 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Add {DE,EN}QUEUE_CLASS and fold the sched_class::switch* methods into
the change pattern. This completes and makes the pattern more
symmetric.
This changes the order of callbacks slightly:
|
| switching_from()
dequeue_task(); | dequeue_task()
put_prev_task(); | put_prev_task()
| switched_from()
|
... change task ... | ... change task ...
|
switching_to(); | switching_to()
enqueue_task(); | enqueue_task()
set_next_task(); | set_next_task()
prev_class->switched_from() |
switched_to() | switched_to()
|
Notably, it moves the switched_from() callback right after the
dequeue/put. Existing implementations don't appear to be affected by
this change in location -- specifically the task isn't enqueued on the
class in question in either location.
Make (CLASS)^(SAVE|MOVE), because there is nothing to save-restore
when changing scheduling classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 56 +++++++++++++++++++++--------------------------
kernel/sched/ext.c | 26 ++++++++++++++++-----
kernel/sched/idle.c | 4 +--
kernel/sched/rt.c | 2 -
kernel/sched/sched.h | 22 ++++++------------
kernel/sched/stop_task.c | 4 +--
kernel/sched/syscalls.c | 9 +++++--
7 files changed, 66 insertions(+), 57 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2167,34 +2167,9 @@ inline int task_curr(const struct task_s
return cpu_curr(task_cpu(p)) == p;
}
-/*
- * ->switching_to() is called with the pi_lock and rq_lock held and must not
- * mess with locking.
- */
-void check_class_changing(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class)
+void check_prio_changed(struct rq *rq, struct task_struct *p, int oldprio)
{
- if (prev_class != p->sched_class && p->sched_class->switching_to)
- p->sched_class->switching_to(rq, p);
-}
-
-/*
- * switched_from, switched_to and prio_changed must _NOT_ drop rq->lock,
- * use the balance_callback list if you want balancing.
- *
- * this means any call to check_class_changed() must be followed by a call to
- * balance_callback().
- */
-void check_class_changed(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class,
- int oldprio)
-{
- if (prev_class != p->sched_class) {
- if (prev_class->switched_from)
- prev_class->switched_from(rq, p);
-
- p->sched_class->switched_to(rq, p);
- } else if (oldprio != p->prio || dl_task(p))
+ if (oldprio != p->prio || dl_task(p))
p->sched_class->prio_changed(rq, p, oldprio);
}
@@ -7423,6 +7398,11 @@ void rt_mutex_setprio(struct task_struct
prev_class = p->sched_class;
next_class = __setscheduler_class(p->policy, prio);
+ if (prev_class != next_class) {
+ queue_flag |= DEQUEUE_CLASS;
+ queue_flag &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
+ }
+
if (prev_class != next_class && p->se.sched_delayed)
dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
@@ -7459,11 +7439,10 @@ void rt_mutex_setprio(struct task_struct
p->sched_class = next_class;
p->prio = prio;
-
- check_class_changing(rq, p, prev_class);
}
- check_class_changed(rq, p, prev_class, oldprio);
+ if (!(queue_flag & DEQUEUE_CLASS))
+ check_prio_changed(rq, p, oldprio);
out_unlock:
/* Avoid rq from going away on us: */
preempt_disable();
@@ -10896,6 +10875,14 @@ struct sched_change_ctx *sched_change_be
lockdep_assert_rq_held(rq);
+ if (flags & DEQUEUE_CLASS) {
+ if (WARN_ON_ONCE(flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)))
+ flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
+
+ if (p->sched_class->switching_from)
+ p->sched_class->switching_from(rq, p);
+ }
+
*ctx = (struct sched_change_ctx){
.p = p,
.flags = flags,
@@ -10908,6 +10895,9 @@ struct sched_change_ctx *sched_change_be
if (ctx->running)
put_prev_task(rq, p);
+ if ((flags & DEQUEUE_CLASS) && p->sched_class->switched_from)
+ p->sched_class->switched_from(rq, p);
+
return ctx;
}
@@ -10918,8 +10908,14 @@ void sched_change_end(struct sched_chang
lockdep_assert_rq_held(rq);
+ if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
+ p->sched_class->switching_to(rq, p);
+
if (ctx->queued)
enqueue_task(rq, p, ctx->flags | ENQUEUE_NOCLOCK);
if (ctx->running)
set_next_task(rq, p);
+
+ if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switched_to)
+ p->sched_class->switched_to(rq, p);
}
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4999,21 +4999,28 @@ static void scx_disable_workfn(struct kt
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
update_rq_clock(task_rq(p));
+ if (old_class != new_class) {
+ queue_flags |= DEQUEUE_CLASS;
+ queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
+ }
+
if (old_class != new_class && p->se.sched_delayed)
dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK) {
+ scoped_guard (sched_change, p, queue_flags) {
p->sched_class = new_class;
- check_class_changing(task_rq(p), p, old_class);
}
- check_class_changed(task_rq(p), p, old_class, p->prio);
+ if (!(queue_flags & DEQUEUE_CLASS))
+ check_prio_changed(task_rq(p), p, p->prio);
+
scx_exit_task(p);
}
scx_task_iter_stop(&sti);
@@ -5742,6 +5749,7 @@ static int scx_enable(struct sched_ext_o
percpu_down_write(&scx_fork_rwsem);
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
@@ -5751,16 +5759,22 @@ static int scx_enable(struct sched_ext_o
update_rq_clock(task_rq(p));
+ if (old_class != new_class) {
+ queue_flags |= DEQUEUE_CLASS;
+ queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
+ }
+
if (old_class != new_class && p->se.sched_delayed)
dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK) {
+ scoped_guard (sched_change, p, queue_flags) {
p->scx.slice = SCX_SLICE_DFL;
p->sched_class = new_class;
- check_class_changing(task_rq(p), p, old_class);
}
- check_class_changed(task_rq(p), p, old_class, p->prio);
+ if (!(queue_flags & DEQUEUE_CLASS))
+ check_prio_changed(task_rq(p), p, p->prio);
+
put_task_struct(p);
}
scx_task_iter_stop(&sti);
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -498,7 +498,7 @@ static void task_tick_idle(struct rq *rq
{
}
-static void switched_to_idle(struct rq *rq, struct task_struct *p)
+static void switching_to_idle(struct rq *rq, struct task_struct *p)
{
BUG();
}
@@ -536,6 +536,6 @@ DEFINE_SCHED_CLASS(idle) = {
.task_tick = task_tick_idle,
.prio_changed = prio_changed_idle,
- .switched_to = switched_to_idle,
+ .switching_to = switching_to_idle,
.update_curr = update_curr_idle,
};
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2589,8 +2589,8 @@ DEFINE_SCHED_CLASS(rt) = {
.get_rr_interval = get_rr_interval_rt,
- .prio_changed = prio_changed_rt,
.switched_to = switched_to_rt,
+ .prio_changed = prio_changed_rt,
.update_curr = update_curr_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -20,7 +20,6 @@
#include <linux/sched/task_flags.h>
#include <linux/sched/task.h>
#include <linux/sched/topology.h>
-
#include <linux/atomic.h>
#include <linux/bitmap.h>
#include <linux/bug.h>
@@ -2344,6 +2343,7 @@ extern const u32 sched_prio_to_wmult[40
#define DEQUEUE_MIGRATING 0x0010 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x0020 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_CLASS 0x0040 /* Matches ENQUEUE_CLASS */
#define DEQUEUE_SPECIAL 0x00010000
#define DEQUEUE_THROTTLE 0x00020000
@@ -2355,6 +2355,7 @@ extern const u32 sched_prio_to_wmult[40
#define ENQUEUE_MIGRATING 0x0010
#define ENQUEUE_DELAYED 0x0020
+#define ENQUEUE_CLASS 0x0040
#define ENQUEUE_HEAD 0x00010000
#define ENQUEUE_REPLENISH 0x00020000
@@ -2418,14 +2419,11 @@ struct sched_class {
void (*task_fork)(struct task_struct *p);
void (*task_dead)(struct task_struct *p);
- /*
- * The switched_from() call is allowed to drop rq->lock, therefore we
- * cannot assume the switched_from/switched_to pair is serialized by
- * rq->lock. They are however serialized by p->pi_lock.
- */
- void (*switching_to) (struct rq *this_rq, struct task_struct *task);
- void (*switched_from)(struct rq *this_rq, struct task_struct *task);
- void (*switched_to) (struct rq *this_rq, struct task_struct *task);
+ void (*switching_from)(struct rq *this_rq, struct task_struct *task);
+ void (*switched_from) (struct rq *this_rq, struct task_struct *task);
+ void (*switching_to) (struct rq *this_rq, struct task_struct *task);
+ void (*switched_to) (struct rq *this_rq, struct task_struct *task);
+
void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
const struct load_weight *lw);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
@@ -3854,11 +3852,7 @@ extern void set_load_weight(struct task_
extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
-extern void check_class_changing(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class);
-extern void check_class_changed(struct rq *rq, struct task_struct *p,
- const struct sched_class *prev_class,
- int oldprio);
+extern void check_prio_changed(struct rq *rq, struct task_struct *p, int oldprio);
extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -75,7 +75,7 @@ static void task_tick_stop(struct rq *rq
{
}
-static void switched_to_stop(struct rq *rq, struct task_struct *p)
+static void switching_to_stop(struct rq *rq, struct task_struct *p)
{
BUG(); /* its impossible to change to this class */
}
@@ -112,6 +112,6 @@ DEFINE_SCHED_CLASS(stop) = {
.task_tick = task_tick_stop,
.prio_changed = prio_changed_stop,
- .switched_to = switched_to_stop,
+ .switching_to = switching_to_stop,
.update_curr = update_curr_stop,
};
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -684,6 +684,11 @@ int __sched_setscheduler(struct task_str
prev_class = p->sched_class;
next_class = __setscheduler_class(policy, newprio);
+ if (prev_class != next_class) {
+ queue_flags |= DEQUEUE_CLASS;
+ queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
+ }
+
if (prev_class != next_class && p->se.sched_delayed)
dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
@@ -695,7 +700,6 @@ int __sched_setscheduler(struct task_str
p->prio = newprio;
}
__setscheduler_uclamp(p, attr);
- check_class_changing(rq, p, prev_class);
if (scope->queued) {
/*
@@ -707,7 +711,8 @@ int __sched_setscheduler(struct task_str
}
}
- check_class_changed(rq, p, prev_class, oldprio);
+ if (!(queue_flags & DEQUEUE_CLASS))
+ check_prio_changed(rq, p, oldprio);
/* Avoid rq from going away on us: */
preempt_disable();
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 04/14] sched: Cleanup sched_delayed handling for class switches
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (2 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 03/14] sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern Peter Zijlstra
` (11 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Use the new sched_class::switching_from() method to dequeue delayed
tasks before switching to another class.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 12 ++++++++----
kernel/sched/ext.c | 6 ------
kernel/sched/fair.c | 7 +++++++
kernel/sched/syscalls.c | 3 ---
4 files changed, 15 insertions(+), 13 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7403,9 +7403,6 @@ void rt_mutex_setprio(struct task_struct
queue_flag &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
}
- if (prev_class != next_class && p->se.sched_delayed)
- dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
-
scoped_guard (sched_change, p, queue_flag) {
/*
* Boosting condition are:
@@ -10879,8 +10876,15 @@ struct sched_change_ctx *sched_change_be
if (WARN_ON_ONCE(flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)))
flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
- if (p->sched_class->switching_from)
+ if (p->sched_class->switching_from) {
+ /*
+ * switching_from_fair() assumes CLASS implies NOCLOCK;
+ * fixing this assumption would mean switching_from()
+ * would need to be able to change flags.
+ */
+ WARN_ON(!(flags & DEQUEUE_NOCLOCK));
p->sched_class->switching_from(rq, p);
+ }
}
*ctx = (struct sched_change_ctx){
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5011,9 +5011,6 @@ static void scx_disable_workfn(struct kt
queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
}
- if (old_class != new_class && p->se.sched_delayed)
- dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
-
scoped_guard (sched_change, p, queue_flags) {
p->sched_class = new_class;
}
@@ -5764,9 +5761,6 @@ static int scx_enable(struct sched_ext_o
queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
}
- if (old_class != new_class && p->se.sched_delayed)
- dequeue_task(task_rq(p), p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
-
scoped_guard (sched_change, p, queue_flags) {
p->scx.slice = SCX_SLICE_DFL;
p->sched_class = new_class;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13221,6 +13221,12 @@ static void attach_task_cfs_rq(struct ta
attach_entity_cfs_rq(se);
}
+static void switching_from_fair(struct rq *rq, struct task_struct *p)
+{
+ if (p->se.sched_delayed)
+ dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
+}
+
static void switched_from_fair(struct rq *rq, struct task_struct *p)
{
detach_task_cfs_rq(p);
@@ -13622,6 +13628,7 @@ DEFINE_SCHED_CLASS(fair) = {
.reweight_task = reweight_task_fair,
.prio_changed = prio_changed_fair,
+ .switching_from = switching_from_fair,
.switched_from = switched_from_fair,
.switched_to = switched_to_fair,
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -689,9 +689,6 @@ int __sched_setscheduler(struct task_str
queue_flags &= ~(DEQUEUE_SAVE | DEQUEUE_MOVE);
}
- if (prev_class != next_class && p->se.sched_delayed)
- dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
-
scoped_guard (sched_change, p, queue_flags) {
if (!(attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)) {
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (3 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 04/14] sched: Cleanup sched_delayed handling for class switches Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-11 1:44 ` Tejun Heo
2025-09-10 15:44 ` [PATCH 06/14] sched: Fix migrate_disable_switch() locking Peter Zijlstra
` (10 subsequent siblings)
15 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Move sched_class::prio_changed() into the change pattern.
And while there, extend it with sched_class::get_prio() in order to
fix the deadline situation.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 24 +++++++++++++-----------
kernel/sched/deadline.c | 20 +++++++++++---------
kernel/sched/ext.c | 8 +-------
kernel/sched/fair.c | 8 ++++++--
kernel/sched/idle.c | 5 ++++-
kernel/sched/rt.c | 5 ++++-
kernel/sched/sched.h | 7 ++++---
kernel/sched/stop_task.c | 5 ++++-
kernel/sched/syscalls.c | 9 ---------
9 files changed, 47 insertions(+), 44 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2167,12 +2167,6 @@ inline int task_curr(const struct task_s
return cpu_curr(task_cpu(p)) == p;
}
-void check_prio_changed(struct rq *rq, struct task_struct *p, int oldprio)
-{
- if (oldprio != p->prio || dl_task(p))
- p->sched_class->prio_changed(rq, p, oldprio);
-}
-
void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
@@ -7437,9 +7431,6 @@ void rt_mutex_setprio(struct task_struct
p->sched_class = next_class;
p->prio = prio;
}
-
- if (!(queue_flag & DEQUEUE_CLASS))
- check_prio_changed(rq, p, oldprio);
out_unlock:
/* Avoid rq from going away on us: */
preempt_disable();
@@ -10894,6 +10885,13 @@ struct sched_change_ctx *sched_change_be
.running = task_current(rq, p),
};
+ if (!(flags & DEQUEUE_CLASS)) {
+ if (p->sched_class->get_prio)
+ ctx->prio = p->sched_class->get_prio(rq, p);
+ else
+ ctx->prio = p->prio;
+ }
+
if (ctx->queued)
dequeue_task(rq, p, flags);
if (ctx->running)
@@ -10920,6 +10918,10 @@ void sched_change_end(struct sched_chang
if (ctx->running)
set_next_task(rq, p);
- if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switched_to)
- p->sched_class->switched_to(rq, p);
+ if (ctx->flags & ENQUEUE_CLASS) {
+ if (p->sched_class->switched_to)
+ p->sched_class->switched_to(rq, p);
+ } else {
+ p->sched_class->prio_changed(rq, p, ctx->prio);
+ }
}
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -3071,23 +3071,24 @@ static void switched_to_dl(struct rq *rq
}
}
+static u64 get_prio_dl(struct rq *rq, struct task_struct *p)
+{
+ return p->dl.deadline;
+}
+
/*
* If the scheduling parameters of a -deadline task changed,
* a push or pull operation might be needed.
*/
-static void prio_changed_dl(struct rq *rq, struct task_struct *p,
- int oldprio)
+static void prio_changed_dl(struct rq *rq, struct task_struct *p, u64 old_deadline)
{
if (!task_on_rq_queued(p))
return;
- /*
- * This might be too much, but unfortunately
- * we don't have the old deadline value, and
- * we can't argue if the task is increasing
- * or lowering its prio, so...
- */
- if (!rq->dl.overloaded)
+ if (p->dl.deadline == old_deadline)
+ return;
+
+ if (dl_time_before(old_deadline, p->dl.deadline))
deadline_queue_pull_task(rq);
if (task_current_donor(rq, p)) {
@@ -3142,6 +3143,7 @@ DEFINE_SCHED_CLASS(dl) = {
.task_tick = task_tick_dl,
.task_fork = task_fork_dl,
+ .get_prio = get_prio_dl,
.prio_changed = prio_changed_dl,
.switched_from = switched_from_dl,
.switched_to = switched_to_dl,
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4023,7 +4023,7 @@ static void reweight_task_scx(struct rq
p, p->scx.weight);
}
-static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
+static void prio_changed_scx(struct rq *rq, struct task_struct *p, u64 oldprio)
{
}
@@ -5015,9 +5015,6 @@ static void scx_disable_workfn(struct kt
p->sched_class = new_class;
}
- if (!(queue_flags & DEQUEUE_CLASS))
- check_prio_changed(task_rq(p), p, p->prio);
-
scx_exit_task(p);
}
scx_task_iter_stop(&sti);
@@ -5766,9 +5763,6 @@ static int scx_enable(struct sched_ext_o
p->sched_class = new_class;
}
- if (!(queue_flags & DEQUEUE_CLASS))
- check_prio_changed(task_rq(p), p, p->prio);
-
put_task_struct(p);
}
scx_task_iter_stop(&sti);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13122,11 +13122,14 @@ static void task_fork_fair(struct task_s
* the current task.
*/
static void
-prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
+prio_changed_fair(struct rq *rq, struct task_struct *p, u64 oldprio)
{
if (!task_on_rq_queued(p))
return;
+ if (p->prio == oldprio)
+ return;
+
if (rq->cfs.nr_queued == 1)
return;
@@ -13138,8 +13141,9 @@ prio_changed_fair(struct rq *rq, struct
if (task_current_donor(rq, p)) {
if (p->prio > oldprio)
resched_curr(rq);
- } else
+ } else {
wakeup_preempt(rq, p, 0);
+ }
}
#ifdef CONFIG_FAIR_GROUP_SCHED
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -504,8 +504,11 @@ static void switching_to_idle(struct rq
}
static void
-prio_changed_idle(struct rq *rq, struct task_struct *p, int oldprio)
+prio_changed_idle(struct rq *rq, struct task_struct *p, u64 oldprio)
{
+ if (p->prio == oldprio)
+ return;
+
BUG();
}
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2437,11 +2437,14 @@ static void switched_to_rt(struct rq *rq
* us to initiate a push or pull.
*/
static void
-prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
+prio_changed_rt(struct rq *rq, struct task_struct *p, u64 oldprio)
{
if (!task_on_rq_queued(p))
return;
+ if (p->prio == oldprio)
+ return;
+
if (task_current_donor(rq, p)) {
/*
* If our priority decreases while running, we
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2426,8 +2426,10 @@ struct sched_class {
void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
const struct load_weight *lw);
+
+ u64 (*get_prio) (struct rq *this_rq, struct task_struct *task);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
- int oldprio);
+ u64 oldprio);
unsigned int (*get_rr_interval)(struct rq *rq,
struct task_struct *task);
@@ -3852,12 +3854,11 @@ extern void set_load_weight(struct task_
extern void enqueue_task(struct rq *rq, struct task_struct *p, int flags);
extern bool dequeue_task(struct rq *rq, struct task_struct *p, int flags);
-extern void check_prio_changed(struct rq *rq, struct task_struct *p, int oldprio);
-
extern struct balance_callback *splice_balance_callbacks(struct rq *rq);
extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
struct sched_change_ctx {
+ u64 prio;
struct task_struct *p;
int flags;
bool queued;
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -81,8 +81,11 @@ static void switching_to_stop(struct rq
}
static void
-prio_changed_stop(struct rq *rq, struct task_struct *p, int oldprio)
+prio_changed_stop(struct rq *rq, struct task_struct *p, u64 oldprio)
{
+ if (p->prio == oldprio)
+ return;
+
BUG(); /* how!?, what priority? */
}
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -95,12 +95,6 @@ void set_user_nice(struct task_struct *p
old_prio = p->prio;
p->prio = effective_prio(p);
}
-
- /*
- * If the task increased its priority or is running and
- * lowered its priority, then reschedule its CPU:
- */
- p->sched_class->prio_changed(rq, p, old_prio);
}
EXPORT_SYMBOL(set_user_nice);
@@ -708,9 +702,6 @@ int __sched_setscheduler(struct task_str
}
}
- if (!(queue_flags & DEQUEUE_CLASS))
- check_prio_changed(rq, p, oldprio);
-
/* Avoid rq from going away on us: */
preempt_disable();
head = splice_balance_callbacks(rq);
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern
2025-09-10 15:44 ` [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern Peter Zijlstra
@ 2025-09-11 1:44 ` Tejun Heo
0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2025-09-11 1:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Wed, Sep 10, 2025 at 05:44:14PM +0200, Peter Zijlstra wrote:
> Move sched_class::prio_changed() into the change pattern.
>
> And while there, extend it with sched_class::get_prio() in order to
> fix the deadline situation.
>
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
For 1-5 from sched_ext POV:
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 06/14] sched: Fix migrate_disable_switch() locking
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (4 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 05/14] sched: Move sched_class::prio_changed() into the change pattern Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking Peter Zijlstra
` (9 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
For some reason migrate_disable_switch() was more complicated than it
needs to be, resulting in mind bending locking of dubious quality.
Recognise that migrate_disable_switch() must be called before a
context switch, but any place before that switch is equally good.
Since the current place results in troubled locking, simply move the
thing before taking rq->lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 27 ++++++---------------------
1 file changed, 6 insertions(+), 21 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2344,10 +2344,10 @@ static void migrate_disable_switch(struc
if (p->cpus_ptr != &p->cpus_mask)
return;
- /*
- * Violates locking rules! See comment in __do_set_cpus_allowed().
- */
- __do_set_cpus_allowed(p, &ac);
+ scoped_guard (task_rq_lock, p) {
+ update_rq_clock(scope.rq);
+ __do_set_cpus_allowed(p, &ac);
+ }
}
void migrate_disable(void)
@@ -2702,22 +2702,7 @@ __do_set_cpus_allowed(struct task_struct
struct rq *rq = task_rq(p);
bool queued, running;
- /*
- * This here violates the locking rules for affinity, since we're only
- * supposed to change these variables while holding both rq->lock and
- * p->pi_lock.
- *
- * HOWEVER, it magically works, because ttwu() is the only code that
- * accesses these variables under p->pi_lock and only does so after
- * smp_cond_load_acquire(&p->on_cpu, !VAL), and we're in __schedule()
- * before finish_task().
- *
- * XXX do further audits, this smells like something putrid.
- */
- if (ctx->flags & SCA_MIGRATE_DISABLE)
- WARN_ON_ONCE(!p->on_cpu);
- else
- lockdep_assert_held(&p->pi_lock);
+ lockdep_assert_held(&p->pi_lock);
queued = task_on_rq_queued(p);
running = task_current_donor(rq, p);
@@ -6816,6 +6801,7 @@ static void __sched notrace __schedule(i
local_irq_disable();
rcu_note_context_switch(preempt);
+ migrate_disable_switch(rq, prev);
/*
* Make sure that signal_pending_state()->signal_pending() below
@@ -6922,7 +6908,6 @@ static void __sched notrace __schedule(i
*/
++*switch_count;
- migrate_disable_switch(rq, prev);
psi_account_irqtime(rq, prev, next);
psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
prev->se.sched_delayed);
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (5 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 06/14] sched: Fix migrate_disable_switch() locking Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-10-30 0:12 ` Mark Brown
2025-09-10 15:44 ` [PATCH 08/14] sched: Rename do_set_cpus_allowed() Peter Zijlstra
` (8 subsequent siblings)
15 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
All callers of do_set_cpus_allowed() only take p->pi_lock, which is
not sufficient to actually change the cpumask. Again, this is mostly
ok in these cases, but it results in unnecessarily complicated
reasoning.
Furthermore, there is no reason whatsoever to not just take all the
required locks, so do just that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/kthread.c | 15 +++++----------
kernel/sched/core.c | 21 +++++++--------------
kernel/sched/sched.h | 5 +++++
3 files changed, 17 insertions(+), 24 deletions(-)
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -593,18 +593,16 @@ EXPORT_SYMBOL(kthread_create_on_node);
static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mask, unsigned int state)
{
- unsigned long flags;
-
if (!wait_task_inactive(p, state)) {
WARN_ON(1);
return;
}
+ scoped_guard (raw_spinlock_irqsave, &p->pi_lock)
+ do_set_cpus_allowed(p, mask);
+
/* It's safe because the task is inactive. */
- raw_spin_lock_irqsave(&p->pi_lock, flags);
- do_set_cpus_allowed(p, mask);
p->flags |= PF_NO_SETAFFINITY;
- raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}
static void __kthread_bind(struct task_struct *p, unsigned int cpu, unsigned int state)
@@ -857,7 +855,6 @@ int kthread_affine_preferred(struct task
{
struct kthread *kthread = to_kthread(p);
cpumask_var_t affinity;
- unsigned long flags;
int ret = 0;
if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE) || kthread->started) {
@@ -882,10 +879,8 @@ int kthread_affine_preferred(struct task
list_add_tail(&kthread->hotplug_node, &kthreads_hotplug);
kthread_fetch_affinity(kthread, affinity);
- /* It's safe because the task is inactive. */
- raw_spin_lock_irqsave(&p->pi_lock, flags);
- do_set_cpus_allowed(p, affinity);
- raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+ scoped_guard (raw_spinlock_irqsave, &p->pi_lock)
+ do_set_cpus_allowed(p, affinity);
mutex_unlock(&kthreads_hotplug_lock);
out:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2703,18 +2703,14 @@ __do_set_cpus_allowed(struct task_struct
bool queued, running;
lockdep_assert_held(&p->pi_lock);
+ lockdep_assert_rq_held(rq);
queued = task_on_rq_queued(p);
running = task_current_donor(rq, p);
- if (queued) {
- /*
- * Because __kthread_bind() calls this on blocked tasks without
- * holding rq->lock.
- */
- lockdep_assert_rq_held(rq);
+ if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
- }
+
if (running)
put_prev_task(rq, p);
@@ -2743,7 +2739,10 @@ void do_set_cpus_allowed(struct task_str
struct rcu_head rcu;
};
- __do_set_cpus_allowed(p, &ac);
+ scoped_guard (__task_rq_lock, p) {
+ update_rq_clock(scope.rq);
+ __do_set_cpus_allowed(p, &ac);
+ }
/*
* Because this is called with p->pi_lock held, it is not possible
@@ -3518,12 +3517,6 @@ static int select_fallback_rq(int cpu, s
}
fallthrough;
case possible:
- /*
- * XXX When called from select_task_rq() we only
- * hold p->pi_lock and again violate locking order.
- *
- * More yuck to audit.
- */
do_set_cpus_allowed(p, task_cpu_fallback_mask(p));
state = fail;
break;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1822,6 +1822,11 @@ DEFINE_LOCK_GUARD_1(task_rq_lock, struct
task_rq_unlock(_T->rq, _T->lock, &_T->rf),
struct rq *rq; struct rq_flags rf)
+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+ _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+ __task_rq_unlock(_T->rq, &_T->rf),
+ struct rq *rq; struct rq_flags rf)
+
static inline void rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
__acquires(rq->lock)
{
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking
2025-09-10 15:44 ` [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking Peter Zijlstra
@ 2025-10-30 0:12 ` Mark Brown
2025-10-30 9:07 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Mark Brown @ 2025-10-30 0:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
[-- Attachment #1: Type: text/plain, Size: 4743 bytes --]
On Wed, Sep 10, 2025 at 05:44:16PM +0200, Peter Zijlstra wrote:
> All callers of do_set_cpus_allowed() only take p->pi_lock, which is
> not sufficient to actually change the cpumask. Again, this is mostly
> ok in these cases, but it results in unnecessarily complicated
> reasoning.
We're seeing lockups on some arm64 platforms in -next with the LTP
cpuhotplug02 test, the machine sits there repeatedly complaining that
RCU is stalled on IPIs:
Running tests.......
Name: cpuhotplug02
Date: Wed Oct 29 17:22:13 UTC 2025
Desc: What happens to a process when its CPU is offlined?
CPU is 1
<3>[ 89.915745] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
<3>[ 89.922145] rcu: 1-...0: (0 ticks this GP) idle=28f4/1/0x4000000000000000 softirq=2203/2203 fqs=195
<3>[ 89.931570] rcu: (detected by 4, t=5256 jiffies, g=10357, q=7 ncpus=6)
<6>[ 89.938465] Sending NMI from CPU 4 to CPUs 1:
<3>[ 99.944589] rcu: rcu_preempt kthread starved for 2629 jiffies! g10357 f0x0 RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=0
<3>[ 99.955226] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
<3>[ 99.964637] rcu: RCU grace-period kthread stack dump:
<6>[ 99.969957] task:rcu_preempt state:R running task stack:0 pid:15 tgid:15 ppid:2 task_flags:0x208040 flags:0x00000012
<6>[ 99.982871] Call trace:
<6>[ 99.985582] __switch_to+0xf0/0x1c0 (T)
<6>[ 99.989702] arch_send_call_function_single_ipi+0x30/0x3c
<6>[ 99.995389] __smp_call_single_queue+0xa0/0xb0
<6>[ 100.000118] irq_work_queue_on+0x78/0xd0
<6>[ 100.004323] rcu_watching_snap_recheck+0x304/0x350
<6>[ 100.009394] force_qs_rnp+0x1d0/0x364
<6>[ 100.013330] rcu_gp_fqs_loop+0x324/0x500
<6>[ 100.017527] rcu_gp_kthread+0x134/0x160
<6>[ 100.021640] kthread+0x12c/0x204
<6>[ 100.025146] ret_from_fork+0x10/0x20
<3>[ 100.029000] rcu: Stack dump where RCU GP kthread last ran:
with the same stack trace repeating ad infinitum. A bisect converges
fairly smoothly on this commit which looks plausible though I've not
really looked closely:
git bisect start
# status: waiting for both good and bad commits
# bad: [f9ba12abc5282bf992f9a9ae87ad814fd03a0270] Add linux-next specific files for 20251029
git bisect bad f9ba12abc5282bf992f9a9ae87ad814fd03a0270
# status: waiting for good commit(s), bad commit known
# good: [58efa7cdf77aae9e595b5f195d53d9abc37f1ecf] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
git bisect good 58efa7cdf77aae9e595b5f195d53d9abc37f1ecf
# good: [196c1d2131e9e2326e4a6a79eaa1ea54bdc90056] Merge branch 'libcrypto-next' of https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git
git bisect good 196c1d2131e9e2326e4a6a79eaa1ea54bdc90056
# good: [47af99b9fa06d7207d03f53099c58ab145819c20] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git
git bisect good 47af99b9fa06d7207d03f53099c58ab145819c20
# bad: [53ac14eeef9a69b4e881a5cd8d56ecf054a25dc3] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git
git bisect bad 53ac14eeef9a69b4e881a5cd8d56ecf054a25dc3
# good: [9ac3f65ed6bd03cc83d86c50e51caa1d223e9e76] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
git bisect good 9ac3f65ed6bd03cc83d86c50e51caa1d223e9e76
# good: [301e1f4a2740ba1f8f312412b88fb9aabe3be7ec] Merge branch into tip/master: 'objtool/core'
git bisect good 301e1f4a2740ba1f8f312412b88fb9aabe3be7ec
# bad: [cca14814f28673e48bab2f1db13db420c33a2848] Merge branch 'for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
git bisect bad cca14814f28673e48bab2f1db13db420c33a2848
# bad: [9b6f0572b2700cad5f3eaee3ca190ae960f56b80] Merge branch into tip/master: 'x86/entry'
git bisect bad 9b6f0572b2700cad5f3eaee3ca190ae960f56b80
# bad: [4c95380701f58b8112f0b891de8d160e4199e19d] sched/ext: Fold balance_scx() into pick_task_scx()
git bisect bad 4c95380701f58b8112f0b891de8d160e4199e19d
# good: [6455ad5346c9cf755fa9dda6e326c4028fb3c853] sched: Move sched_class::prio_changed() into the change pattern
git bisect good 6455ad5346c9cf755fa9dda6e326c4028fb3c853
# bad: [46a177fb01e52ec0e3f9eab9b217a0f7c8909eeb] sched: Add locking comments to sched_class methods
git bisect bad 46a177fb01e52ec0e3f9eab9b217a0f7c8909eeb
# bad: [abfc01077df66593f128d966fdad1d042facc9ac] sched: Fix do_set_cpus_allowed() locking
git bisect bad abfc01077df66593f128d966fdad1d042facc9ac
# good: [942b8db965006cf655d356162f7091a9238da94e] sched: Fix migrate_disable_switch() locking
git bisect good 942b8db965006cf655d356162f7091a9238da94e
# first bad commit: [abfc01077df66593f128d966fdad1d042facc9ac] sched: Fix do_set_cpus_allowed() locking
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking
2025-10-30 0:12 ` Mark Brown
@ 2025-10-30 9:07 ` Peter Zijlstra
2025-10-30 12:47 ` Mark Brown
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-10-30 9:07 UTC (permalink / raw)
To: Mark Brown
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Oct 30, 2025 at 12:12:01AM +0000, Mark Brown wrote:
> On Wed, Sep 10, 2025 at 05:44:16PM +0200, Peter Zijlstra wrote:
>
> > All callers of do_set_cpus_allowed() only take p->pi_lock, which is
> > not sufficient to actually change the cpumask. Again, this is mostly
> > ok in these cases, but it results in unnecessarily complicated
> > reasoning.
>
> We're seeing lockups on some arm64 platforms in -next with the LTP
> cpuhotplug02 test, the machine sits there repeatedly complaining that
> RCU is stalled on IPIs:
Did not this help?
https://lkml.kernel.org/r/20251027110133.GI3245006@noisy.programming.kicks-ass.net
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking
2025-10-30 9:07 ` Peter Zijlstra
@ 2025-10-30 12:47 ` Mark Brown
0 siblings, 0 replies; 68+ messages in thread
From: Mark Brown @ 2025-10-30 12:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
[-- Attachment #1: Type: text/plain, Size: 535 bytes --]
On Thu, Oct 30, 2025 at 10:07:15AM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2025 at 12:12:01AM +0000, Mark Brown wrote:
> > We're seeing lockups on some arm64 platforms in -next with the LTP
> > cpuhotplug02 test, the machine sits there repeatedly complaining that
> > RCU is stalled on IPIs:
> Did not this help?
> https://lkml.kernel.org/r/20251027110133.GI3245006@noisy.programming.kicks-ass.net
It looks like that hadn't landed in yesterday's -next - it showed up
today and things do indeed look a lot happier, thanks!
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 08/14] sched: Rename do_set_cpus_allowed()
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (6 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 07/14] sched: Fix do_set_cpus_allowed() locking Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 09/14] sched: Make __do_set_cpus_allowed() use the sched_change pattern Peter Zijlstra
` (7 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hopefully saner naming.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 4 ++--
kernel/cgroup/cpuset.c | 2 +-
kernel/kthread.c | 4 ++--
kernel/sched/core.c | 16 ++++++++--------
kernel/sched/sched.h | 2 +-
5 files changed, 14 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1860,8 +1860,8 @@ extern int task_can_attach(struct task_s
extern int dl_bw_alloc(int cpu, u64 dl_bw);
extern void dl_bw_free(int cpu, u64 dl_bw);
-/* do_set_cpus_allowed() - consider using set_cpus_allowed_ptr() instead */
-extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
+/* set_cpus_allowed_force() - consider using set_cpus_allowed_ptr() instead */
+extern void set_cpus_allowed_force(struct task_struct *p, const struct cpumask *new_mask);
/**
* set_cpus_allowed_ptr - set CPU affinity mask of a task
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4122,7 +4122,7 @@ bool cpuset_cpus_allowed_fallback(struct
rcu_read_lock();
cs_mask = task_cs(tsk)->cpus_allowed;
if (is_in_v2_mode() && cpumask_subset(cs_mask, possible_mask)) {
- do_set_cpus_allowed(tsk, cs_mask);
+ set_cpus_allowed_force(tsk, cs_mask);
changed = true;
}
rcu_read_unlock();
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -599,7 +599,7 @@ static void __kthread_bind_mask(struct t
}
scoped_guard (raw_spinlock_irqsave, &p->pi_lock)
- do_set_cpus_allowed(p, mask);
+ set_cpus_allowed_force(p, mask);
/* It's safe because the task is inactive. */
p->flags |= PF_NO_SETAFFINITY;
@@ -880,7 +880,7 @@ int kthread_affine_preferred(struct task
kthread_fetch_affinity(kthread, affinity);
scoped_guard (raw_spinlock_irqsave, &p->pi_lock)
- do_set_cpus_allowed(p, affinity);
+ set_cpus_allowed_force(p, affinity);
mutex_unlock(&kthreads_hotplug_lock);
out:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2329,7 +2329,7 @@ unsigned long wait_task_inactive(struct
}
static void
-__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx);
+do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx);
static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
{
@@ -2346,7 +2346,7 @@ static void migrate_disable_switch(struc
scoped_guard (task_rq_lock, p) {
update_rq_clock(scope.rq);
- __do_set_cpus_allowed(p, &ac);
+ do_set_cpus_allowed(p, &ac);
}
}
@@ -2697,7 +2697,7 @@ void set_cpus_allowed_common(struct task
}
static void
-__do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
+do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
{
struct rq *rq = task_rq(p);
bool queued, running;
@@ -2727,7 +2727,7 @@ __do_set_cpus_allowed(struct task_struct
* Used for kthread_bind() and select_fallback_rq(), in both cases the user
* affinity (if any) should be destroyed too.
*/
-void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
+void set_cpus_allowed_force(struct task_struct *p, const struct cpumask *new_mask)
{
struct affinity_context ac = {
.new_mask = new_mask,
@@ -2741,7 +2741,7 @@ void do_set_cpus_allowed(struct task_str
scoped_guard (__task_rq_lock, p) {
update_rq_clock(scope.rq);
- __do_set_cpus_allowed(p, &ac);
+ do_set_cpus_allowed(p, &ac);
}
/*
@@ -2780,7 +2780,7 @@ int dup_user_cpus_ptr(struct task_struct
* Use pi_lock to protect content of user_cpus_ptr
*
* Though unlikely, user_cpus_ptr can be reset to NULL by a concurrent
- * do_set_cpus_allowed().
+ * set_cpus_allowed_force().
*/
raw_spin_lock_irqsave(&src->pi_lock, flags);
if (src->user_cpus_ptr) {
@@ -3108,7 +3108,7 @@ static int __set_cpus_allowed_ptr_locked
goto out;
}
- __do_set_cpus_allowed(p, ctx);
+ do_set_cpus_allowed(p, ctx);
return affine_move_task(rq, p, rf, dest_cpu, ctx->flags);
@@ -3517,7 +3517,7 @@ static int select_fallback_rq(int cpu, s
}
fallthrough;
case possible:
- do_set_cpus_allowed(p, task_cpu_fallback_mask(p));
+ set_cpus_allowed_force(p, task_cpu_fallback_mask(p));
state = fail;
break;
case fail:
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2592,7 +2592,7 @@ static inline bool task_allowed_on_cpu(s
static inline cpumask_t *alloc_user_cpus_ptr(int node)
{
/*
- * See do_set_cpus_allowed() above for the rcu_head usage.
+ * See set_cpus_allowed_force() above for the rcu_head usage.
*/
int size = max_t(int, cpumask_size(), sizeof(struct rcu_head));
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 09/14] sched: Make __do_set_cpus_allowed() use the sched_change pattern
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (7 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 08/14] sched: Rename do_set_cpus_allowed() Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 10/14] sched: Add locking comments to sched_class methods Peter Zijlstra
` (6 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Now that do_set_cpus_allowed() holds all the regular locks, convert it
to use the sched_change pattern helper.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 26 +++++---------------------
1 file changed, 5 insertions(+), 21 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2697,28 +2697,12 @@ void set_cpus_allowed_common(struct task
static void
do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
{
- struct rq *rq = task_rq(p);
- bool queued, running;
+ u32 flags = DEQUEUE_SAVE | DEQUEUE_NOCLOCK;
- lockdep_assert_held(&p->pi_lock);
- lockdep_assert_rq_held(rq);
-
- queued = task_on_rq_queued(p);
- running = task_current_donor(rq, p);
-
- if (queued)
- dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
-
- if (running)
- put_prev_task(rq, p);
-
- p->sched_class->set_cpus_allowed(p, ctx);
- mm_set_cpus_allowed(p->mm, ctx->new_mask);
-
- if (queued)
- enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
- if (running)
- set_next_task(rq, p);
+ scoped_guard (sched_change, p, flags) {
+ p->sched_class->set_cpus_allowed(p, ctx);
+ mm_set_cpus_allowed(p->mm, ctx->new_mask);
+ }
}
/*
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 10/14] sched: Add locking comments to sched_class methods
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (8 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 09/14] sched: Make __do_set_cpus_allowed() use the sched_change pattern Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 11/14] sched: Add flags to {put_prev,set_next}_task() methods Peter Zijlstra
` (5 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
'Document' the locking context the various sched_class methods are
called under.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 6 +-
kernel/sched/sched.h | 106 ++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 103 insertions(+), 9 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -581,8 +581,8 @@ EXPORT_SYMBOL(__trace_set_current_state)
*
* p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
*
- * is set by activate_task() and cleared by deactivate_task(), under
- * rq->lock. Non-zero indicates the task is runnable, the special
+ * is set by activate_task() and cleared by deactivate_task()/block_task(),
+ * under rq->lock. Non-zero indicates the task is runnable, the special
* ON_RQ_MIGRATING state is used for migration without holding both
* rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
*
@@ -4193,7 +4193,7 @@ int try_to_wake_up(struct task_struct *p
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* Form a control-dep-acquire with p->on_rq == 0 above, to ensure
- * schedule()'s deactivate_task() has 'happened' and p will no longer
+ * schedule()'s block_task() has 'happened' and p will no longer
* care about it's own p->state. See the comment in __schedule().
*/
smp_acquire__after_ctrl_dep();
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2320,8 +2320,7 @@ extern const u32 sched_prio_to_wmult[40
/*
* {de,en}queue flags:
*
- * DEQUEUE_SLEEP - task is no longer runnable
- * ENQUEUE_WAKEUP - task just became runnable
+ * SLEEP/WAKEUP - task is no-longer/just-became runnable
*
* SAVE/RESTORE - an otherwise spurious dequeue/enqueue, done to ensure tasks
* are in a known state which allows modification. Such pairs
@@ -2334,6 +2333,11 @@ extern const u32 sched_prio_to_wmult[40
*
* MIGRATION - p->on_rq == TASK_ON_RQ_MIGRATING (used for DEADLINE)
*
+ * DELAYED - de/re-queue a sched_delayed task
+ *
+ * CLASS - going to update p->sched_class; makes sched_change call the
+ * various switch methods.
+ *
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
@@ -2384,14 +2388,50 @@ struct sched_class {
int uclamp_enabled;
#endif
+ /*
+ * move_queued_task/activate_task/enqueue_task: rq->lock
+ * ttwu_do_activate/activate_task/enqueue_task: rq->lock
+ * wake_up_new_task/activate_task/enqueue_task: task_rq_lock
+ * ttwu_runnable/enqueue_task: task_rq_lock
+ * proxy_task_current: rq->lock
+ * sched_change_end
+ */
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
+ /*
+ * move_queued_task/deactivate_task/dequeue_task: rq->lock
+ * __schedule/block_task/dequeue_task: rq->lock
+ * proxy_task_current: rq->lock
+ * wait_task_inactive: task_rq_lock
+ * sched_change_begin
+ */
bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
+
+ /*
+ * do_sched_yield: rq->lock
+ */
void (*yield_task) (struct rq *rq);
+ /*
+ * yield_to: rq->lock (double)
+ */
bool (*yield_to_task)(struct rq *rq, struct task_struct *p);
+ /*
+ * move_queued_task: rq->lock
+ * __migrate_swap_task: rq->lock
+ * ttwu_do_activate: rq->lock
+ * ttwu_runnable: task_rq_lock
+ * wake_up_new_task: task_rq_lock
+ */
void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
+ /*
+ * schedule/pick_next_task/prev_balance: rq->lock
+ */
int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
+
+ /*
+ * schedule/pick_next_task: rq->lock
+ */
struct task_struct *(*pick_task)(struct rq *rq);
/*
* Optional! When implemented pick_next_task() should be equivalent to:
@@ -2404,48 +2444,102 @@ struct sched_class {
*/
struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
+ /*
+ * sched_change:
+ * __schedule: rq->lock
+ */
void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
+ /*
+ * select_task_rq: p->pi_lock
+ * sched_exec: p->pi_lock
+ */
int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
+ /*
+ * set_task_cpu: p->pi_lock || rq->lock (ttwu like)
+ */
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
+ /*
+ * ttwu_do_activate: rq->lock
+ * wake_up_new_task: task_rq_lock
+ */
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
+ /*
+ * do_set_cpus_allowed: task_rq_lock + sched_change
+ */
void (*set_cpus_allowed)(struct task_struct *p, struct affinity_context *ctx);
+ /*
+ * sched_set_rq_{on,off}line: rq->lock
+ */
void (*rq_online)(struct rq *rq);
void (*rq_offline)(struct rq *rq);
+ /*
+ * push_cpu_stop: p->pi_lock && rq->lock
+ */
struct rq *(*find_lock_rq)(struct task_struct *p, struct rq *rq);
+ /*
+ * hrtick: rq->lock
+ * sched_tick: rq->lock
+ * sched_tick_remote: rq->lock
+ */
void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
+ /*
+ * sched_cgroup_fork: p->pi_lock
+ */
void (*task_fork)(struct task_struct *p);
+ /*
+ * finish_task_switch: no locks
+ */
void (*task_dead)(struct task_struct *p);
+ /*
+ * sched_change
+ */
void (*switching_from)(struct rq *this_rq, struct task_struct *task);
void (*switched_from) (struct rq *this_rq, struct task_struct *task);
void (*switching_to) (struct rq *this_rq, struct task_struct *task);
void (*switched_to) (struct rq *this_rq, struct task_struct *task);
-
- void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
- const struct load_weight *lw);
-
u64 (*get_prio) (struct rq *this_rq, struct task_struct *task);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
u64 oldprio);
+ /*
+ * set_load_weight: task_rq_lock + sched_change
+ * __setscheduler_parms: task_rq_lock + sched_change
+ */
+ void (*reweight_task)(struct rq *this_rq, struct task_struct *task,
+ const struct load_weight *lw);
+
+ /*
+ * sched_rr_get_interval: task_rq_lock
+ */
unsigned int (*get_rr_interval)(struct rq *rq,
struct task_struct *task);
+ /*
+ * task_sched_runtime: task_rq_lock
+ */
void (*update_curr)(struct rq *rq);
#ifdef CONFIG_FAIR_GROUP_SCHED
+ /*
+ * sched_change_group: task_rq_lock + sched_change
+ */
void (*task_change_group)(struct task_struct *p);
#endif
#ifdef CONFIG_SCHED_CORE
+ /*
+ * pick_next_task: rq->lock
+ * try_steal_cookie: rq->lock (double)
+ */
int (*task_is_throttled)(struct task_struct *p, int cpu);
#endif
};
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 11/14] sched: Add flags to {put_prev,set_next}_task() methods
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (9 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 10/14] sched: Add locking comments to sched_class methods Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 15:44 ` [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Peter Zijlstra
` (4 subsequent siblings)
15 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 4 ++--
kernel/sched/deadline.c | 6 ++++--
kernel/sched/ext.c | 4 ++--
kernel/sched/fair.c | 8 +++++---
kernel/sched/idle.c | 5 +++--
kernel/sched/rt.c | 6 ++++--
kernel/sched/sched.h | 18 ++++++++++--------
kernel/sched/stop_task.c | 5 +++--
8 files changed, 33 insertions(+), 23 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10857,7 +10857,7 @@ struct sched_change_ctx *sched_change_be
if (ctx->queued)
dequeue_task(rq, p, flags);
if (ctx->running)
- put_prev_task(rq, p);
+ put_prev_task(rq, p, flags);
if ((flags & DEQUEUE_CLASS) && p->sched_class->switched_from)
p->sched_class->switched_from(rq, p);
@@ -10878,7 +10878,7 @@ void sched_change_end(struct sched_chang
if (ctx->queued)
enqueue_task(rq, p, ctx->flags | ENQUEUE_NOCLOCK);
if (ctx->running)
- set_next_task(rq, p);
+ set_next_task(rq, p, ctx->flags);
if (ctx->flags & ENQUEUE_CLASS) {
if (p->sched_class->switched_to)
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2340,10 +2340,11 @@ static void start_hrtick_dl(struct rq *r
}
#endif /* !CONFIG_SCHED_HRTICK */
-static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
+static void set_next_task_dl(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_dl_entity *dl_se = &p->dl;
struct dl_rq *dl_rq = &rq->dl;
+ bool first = flags & ENQUEUE_FIRST;
p->se.exec_start = rq_clock_task(rq);
if (on_dl_rq(&p->dl))
@@ -2413,7 +2414,8 @@ static struct task_struct *pick_task_dl(
return __pick_task_dl(rq);
}
-static void put_prev_task_dl(struct rq *rq, struct task_struct *p, struct task_struct *next)
+static void put_prev_task_dl(struct rq *rq, struct task_struct *p,
+ struct task_struct *next, int flags)
{
struct sched_dl_entity *dl_se = &p->dl;
struct dl_rq *dl_rq = &rq->dl;
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3243,7 +3243,7 @@ static void process_ddsp_deferred_locals
}
}
-static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
+static void set_next_task_scx(struct rq *rq, struct task_struct *p, int flags)
{
struct scx_sched *sch = scx_root;
@@ -3346,7 +3346,7 @@ static void switch_class(struct rq *rq,
}
static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
- struct task_struct *next)
+ struct task_struct *next, int flags)
{
struct scx_sched *sch = scx_root;
update_curr_scx(rq);
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8839,7 +8839,7 @@ static struct task_struct *pick_task_fai
}
static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
-static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
+static void set_next_task_fair(struct rq *rq, struct task_struct *p, int flags);
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
@@ -8955,7 +8955,8 @@ void fair_server_init(struct rq *rq)
/*
* Account for a descheduled task:
*/
-static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next)
+static void put_prev_task_fair(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next, int flags)
{
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
@@ -13286,9 +13287,10 @@ static void __set_next_task_fair(struct
* This routine is mostly called to set cfs_rq->curr field when a task
* migrates between groups/classes.
*/
-static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
+static void set_next_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_entity *se = &p->se;
+ bool first = flags & ENQUEUE_FIRST;
for_each_sched_entity(se) {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -452,13 +452,14 @@ static void wakeup_preempt_idle(struct r
resched_curr(rq);
}
-static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next)
+static void put_prev_task_idle(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next, int flags)
{
dl_server_update_idle_time(rq, prev);
scx_update_idle(rq, false, true);
}
-static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
+static void set_next_task_idle(struct rq *rq, struct task_struct *next, int flags)
{
update_idle_core(rq);
scx_update_idle(rq, true, true);
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1636,10 +1636,11 @@ static void wakeup_preempt_rt(struct rq
check_preempt_equal_prio(rq, p);
}
-static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool first)
+static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
struct rt_rq *rt_rq = &rq->rt;
+ bool first = flags & ENQUEUE_FIRST;
p->se.exec_start = rq_clock_task(rq);
if (on_rt_rq(&p->rt))
@@ -1707,7 +1708,8 @@ static struct task_struct *pick_task_rt(
return p;
}
-static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *next)
+static void put_prev_task_rt(struct rq *rq, struct task_struct *p,
+ struct task_struct *next, int flags)
{
struct sched_rt_entity *rt_se = &p->rt;
struct rt_rq *rt_rq = &rq->rt;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2370,7 +2370,9 @@ extern const u32 sched_prio_to_wmult[40
#define ENQUEUE_REPLENISH 0x00020000
#define ENQUEUE_MIGRATED 0x00040000
#define ENQUEUE_INITIAL 0x00080000
+
#define ENQUEUE_RQ_SELECTED 0x00100000
+#define ENQUEUE_FIRST 0x00200000
#define RETRY_TASK ((void *)-1UL)
@@ -2448,8 +2450,8 @@ struct sched_class {
* sched_change:
* __schedule: rq->lock
*/
- void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
- void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
+ void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next, int flags);
+ void (*set_next_task)(struct rq *rq, struct task_struct *p, int flags);
/*
* select_task_rq: p->pi_lock
@@ -2544,15 +2546,15 @@ struct sched_class {
#endif
};
-static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
+static inline void put_prev_task(struct rq *rq, struct task_struct *prev, int flags)
{
WARN_ON_ONCE(rq->donor != prev);
- prev->sched_class->put_prev_task(rq, prev, NULL);
+ prev->sched_class->put_prev_task(rq, prev, NULL, flags);
}
-static inline void set_next_task(struct rq *rq, struct task_struct *next)
+static inline void set_next_task(struct rq *rq, struct task_struct *next, int flags)
{
- next->sched_class->set_next_task(rq, next, false);
+ next->sched_class->set_next_task(rq, next, flags);
}
static inline void
@@ -2576,8 +2578,8 @@ static inline void put_prev_set_next_tas
if (next == prev)
return;
- prev->sched_class->put_prev_task(rq, prev, next);
- next->sched_class->set_next_task(rq, next, true);
+ prev->sched_class->put_prev_task(rq, prev, next, 0);
+ next->sched_class->set_next_task(rq, next, ENQUEUE_FIRST);
}
/*
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -27,7 +27,7 @@ wakeup_preempt_stop(struct rq *rq, struc
/* we're never preempted */
}
-static void set_next_task_stop(struct rq *rq, struct task_struct *stop, bool first)
+static void set_next_task_stop(struct rq *rq, struct task_struct *stop, int flags)
{
stop->se.exec_start = rq_clock_task(rq);
}
@@ -58,7 +58,8 @@ static void yield_task_stop(struct rq *r
BUG(); /* the stop task should never yield, its pointless. */
}
-static void put_prev_task_stop(struct rq *rq, struct task_struct *prev, struct task_struct *next)
+static void put_prev_task_stop(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next, int flags)
{
update_curr_common(rq);
}
^ permalink raw reply [flat|nested] 68+ messages in thread* [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (10 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 11/14] sched: Add flags to {put_prev,set_next}_task() methods Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-12 0:19 ` Tejun Heo
2025-09-10 15:44 ` [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED Peter Zijlstra
` (3 subsequent siblings)
15 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
In order to fix the whole SCHED_EXT balance/pick mess, and avoid
further complicating all this, make the regular:
p->pi_lock
rq->lock
dsq->lock
order work. Notably, while sched_class::pick_task() is called with
rq->lock held, and pick_task_scx() takes dsq->lock, and while the
normal sched_change pattern goes into dequeue/enqueue and thus takes
dsq->lock, various other things like task_call_func() /
sched_setaffinity() do not necessarily do so.
Therefore, add a per task spinlock pointer that can be set to
reference the shared runqueue lock where appropriate and teach
__task_rq_lock() to take this lock along with rq->lock.
This ensures all 'normal' scheduling operations serialize against the
shared lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/sched.h | 2 +-
kernel/sched/core.c | 27 ++++++++++++++++++++++-----
kernel/sched/sched.h | 10 ++++++----
kernel/sched/stats.h | 2 +-
4 files changed, 30 insertions(+), 11 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1225,8 +1225,8 @@ struct task_struct {
/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
spinlock_t alloc_lock;
- /* Protection of the PI data structures: */
raw_spinlock_t pi_lock;
+ raw_spinlock_t *srq_lock;
struct wake_q_node wake_q;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -703,17 +703,24 @@ void double_rq_lock(struct rq *rq1, stru
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
__acquires(rq->lock)
{
+ raw_spinlock_t *slock;
struct rq *rq;
lockdep_assert_held(&p->pi_lock);
for (;;) {
rq = task_rq(p);
+ slock = p->srq_lock;
raw_spin_rq_lock(rq);
- if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
+ if (slock)
+ raw_spin_lock(slock);
+ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p) &&
+ (!slock || p->srq_lock == slock))) {
rq_pin_lock(rq, rf);
return rq;
}
+ if (slock)
+ raw_spin_unlock(slock);
raw_spin_rq_unlock(rq);
while (unlikely(task_on_rq_migrating(p)))
@@ -728,12 +735,16 @@ struct rq *task_rq_lock(struct task_stru
__acquires(p->pi_lock)
__acquires(rq->lock)
{
+ raw_spinlock_t *slock;
struct rq *rq;
for (;;) {
raw_spin_lock_irqsave(&p->pi_lock, rf->flags);
rq = task_rq(p);
+ slock = p->srq_lock;
raw_spin_rq_lock(rq);
+ if (slock)
+ raw_spin_lock(slock);
/*
* move_queued_task() task_rq_lock()
*
@@ -751,10 +762,14 @@ struct rq *task_rq_lock(struct task_stru
* dependency headed by '[L] rq = task_rq()' and the acquire
* will pair with the WMB to ensure we then also see migrating.
*/
- if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
+ if (likely(rq == task_rq(p) && !task_on_rq_migrating(p) &&
+ (!slock || p->srq_lock == slock))) {
rq_pin_lock(rq, rf);
return rq;
}
+
+ if (slock)
+ raw_spin_unlock(slock);
raw_spin_rq_unlock(rq);
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
@@ -2617,7 +2632,8 @@ static int migration_cpu_stop(void *data
*/
WARN_ON_ONCE(!pending->stop_pending);
preempt_disable();
- task_rq_unlock(rq, p, &rf);
+ rq_unlock(rq, &rf);
+ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
&pending->arg, &pending->stop_work);
preempt_enable();
@@ -2626,7 +2642,8 @@ static int migration_cpu_stop(void *data
out:
if (pending)
pending->stop_pending = false;
- task_rq_unlock(rq, p, &rf);
+ rq_unlock(rq, &rf);
+ raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
if (complete)
complete_all(&pending->done);
@@ -3743,7 +3760,7 @@ static int ttwu_runnable(struct task_str
ttwu_do_wakeup(p);
ret = 1;
}
- __task_rq_unlock(rq, &rf);
+ __task_rq_unlock(rq, p, &rf);
return ret;
}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1800,10 +1800,13 @@ struct rq *task_rq_lock(struct task_stru
__acquires(p->pi_lock)
__acquires(rq->lock);
-static inline void __task_rq_unlock(struct rq *rq, struct rq_flags *rf)
+static inline void
+__task_rq_unlock(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
__releases(rq->lock)
{
rq_unpin_lock(rq, rf);
+ if (p->srq_lock)
+ raw_spin_unlock(p->srq_lock);
raw_spin_rq_unlock(rq);
}
@@ -1812,8 +1815,7 @@ task_rq_unlock(struct rq *rq, struct tas
__releases(rq->lock)
__releases(p->pi_lock)
{
- rq_unpin_lock(rq, rf);
- raw_spin_rq_unlock(rq);
+ __task_rq_unlock(rq, p, rf);
raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
}
@@ -1824,7 +1826,7 @@ DEFINE_LOCK_GUARD_1(task_rq_lock, struct
DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
_T->rq = __task_rq_lock(_T->lock, &_T->rf),
- __task_rq_unlock(_T->rq, &_T->rf),
+ __task_rq_unlock(_T->rq, _T->lock, &_T->rf),
struct rq *rq; struct rq_flags rf)
static inline void rq_lock_irqsave(struct rq *rq, struct rq_flags *rf)
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -206,7 +206,7 @@ static inline void psi_ttwu_dequeue(stru
rq = __task_rq_lock(p, &rf);
psi_task_change(p, p->psi_flags, 0);
- __task_rq_unlock(rq, &rf);
+ __task_rq_unlock(rq, p, &rf);
}
}
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-10 15:44 ` [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Peter Zijlstra
@ 2025-09-12 0:19 ` Tejun Heo
2025-09-12 11:54 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-12 0:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Wed, Sep 10, 2025 at 05:44:21PM +0200, Peter Zijlstra wrote:
> @@ -703,17 +703,24 @@ void double_rq_lock(struct rq *rq1, stru
> struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> __acquires(rq->lock)
> {
> + raw_spinlock_t *slock;
> struct rq *rq;
>
> lockdep_assert_held(&p->pi_lock);
>
> for (;;) {
> rq = task_rq(p);
> + slock = p->srq_lock;
> raw_spin_rq_lock(rq);
> - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
> + if (slock)
> + raw_spin_lock(slock);
> + if (likely(rq == task_rq(p) && !task_on_rq_migrating(p) &&
> + (!slock || p->srq_lock == slock))) {
> rq_pin_lock(rq, rf);
> return rq;
> }
With the !slock condition, the following scenario is possible:
__task_rq_lock()
slock = p->srq_lock; /* NULL */
dispatch_enqueue()
p->srq_lock = &dsq->lock;
enqueue finishes
raw_spin_rq_lock(rq);
rq is the same, $slock is NULL, return
do something assuming p is locked down p gets dispatched to another rq
I'm unclear on when p->srq_lock would be safe to set and clear, so the goal
is that whoever does [__]task_rq_lock() ends up waiting on the dsq lock that
the task is queued on, and if we can exclude other sched operations that
way, we don't have to hold source rq lock when moving the task to another rq
for execution, right?
In the last patch, it's set on dispatch_enqueue() and cleared when the task
leaves the DSQ. Let's consider a simple scenario where a task gets enqueued,
gets put on a non-local DSQ and then dispatched to a local DSQ. Assuming
everything works out and we don't have to lock the source rq for migration,
we'd be depending on task_rq_lock() reliably hitting p->srq_lock to avoid
races, but I'm not sure how this would work. Let's say p is currently
associated with CPU1 on a non-local DSQ w/ p->srq_lock set to its source
DSQ.
pick_task_ext() on CPU0 task property change on CPU1
locks the DSQ
picks p
task_unlink_from_dsq() task_rq_lock();
p->srq_lock = NULL; lock rq on CPU1
p is moved to local DSQ sees p->srq_lock == NULL
return
p starts running
anything can happen
proceed with property change
What am I missing?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-12 0:19 ` Tejun Heo
@ 2025-09-12 11:54 ` Peter Zijlstra
2025-09-12 14:11 ` Peter Zijlstra
2025-09-12 17:56 ` Tejun Heo
0 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-12 11:54 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 11, 2025 at 02:19:57PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 10, 2025 at 05:44:21PM +0200, Peter Zijlstra wrote:
> > @@ -703,17 +703,24 @@ void double_rq_lock(struct rq *rq1, stru
> > struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> > __acquires(rq->lock)
> > {
> > + raw_spinlock_t *slock;
> > struct rq *rq;
> >
> > lockdep_assert_held(&p->pi_lock);
> >
> > for (;;) {
> > rq = task_rq(p);
> > + slock = p->srq_lock;
> > raw_spin_rq_lock(rq);
> > - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
> > + if (slock)
> > + raw_spin_lock(slock);
> > + if (likely(rq == task_rq(p) && !task_on_rq_migrating(p) &&
> > + (!slock || p->srq_lock == slock))) {
> > rq_pin_lock(rq, rf);
> > return rq;
> > }
Yeah, I think that needs to change a little. Perhaps something like:
slock2 = p->srq_lock;
if (... && (!slock2 || slock2 == slock))
> With the !slock condition, the following scenario is possible:
>
> __task_rq_lock()
> slock = p->srq_lock; /* NULL */
> dispatch_enqueue()
> p->srq_lock = &dsq->lock;
> enqueue finishes
> raw_spin_rq_lock(rq);
> rq is the same, $slock is NULL, return
> do something assuming p is locked down p gets dispatched to another rq
>
> I'm unclear on when p->srq_lock would be safe to set and clear, so the goal
> is that whoever does [__]task_rq_lock() ends up waiting on the dsq lock that
> the task is queued on, and if we can exclude other sched operations that
> way, we don't have to hold source rq lock when moving the task to another rq
> for execution, right?
Indeed. If !p->srq_lock then task_rq(p)->lock must be sufficient.
So for enqueue, which sets p->srq_lock, this must be done while holding
task_rq(p)->lock.
So the above example should be serialized on task_rq(p)->lock, since
__task_rq_lock() holds it, enqueue cannot happen. Conversely, if enqueue
holds task_rq(p)->lock, then __task_rq_lock() will have to wait for
that, and then observe the newly set p->srq_lock and cycle to take that.
> In the last patch, it's set on dispatch_enqueue() and cleared when the task
> leaves the DSQ. Let's consider a simple scenario where a task gets enqueued,
> gets put on a non-local DSQ and then dispatched to a local DSQ, Assuming
> everything works out and we don't have to lock the source rq for migration,
> we'd be depending on task_rq_lock() reliably hitting p->srq_lock to avoid
> races, but I'm not sure how this would work. Let's say p is currently
> associated with CPU1 on a non-local DSQ w/ p->srq_lock set to its source
> DSQ.
>
> pick_task_ext() on CPU0 task property change on CPU1
> locks the DSQ
> picks p
> task_unlink_from_dsq() task_rq_lock();
> p->srq_lock = NULL; lock rq on CPU1
> p is moved to local DSQ sees p->src_lock == NULL
> return
> p starts running
> anything can happen
> proceed with property change
Hmm, the thinking was that if !p->srq_lock then task_rq(p)->lock should
be sufficient.
We must do set_task_cpu(0) before task_unlink_from_dsq() (and I got this
order wrong in yesterday's email).
pick_task_ext() on CPU0
lock DSQ
pick p
set_task_cpu(0) task_rq_lock()
task_unlink_from_dsq() if !p->srq_lock, then task_rq(p) == 0
p->srq_lock = NULL;
p is moved to local DSQ
Perhaps the p->srq_lock store should be store-release, so that the cpu
store is before.
Then if we observe p->srq_lock, we'll serialize against DSQ and all is
well, if we observe !p->srq_lock then we must also observe task_rq(p) ==
0 and then we'll serialize on rq->lock.
Now let me see if there isn't an ABA issue here, consider:
pre: task_cpu(p) != 2, p->srq_lock = NULL
CPU0 CPU1 CPU2
__task_rq_lock() enqueue_task_scx() pick_task_scx()
rq = task_rq(p);
LOCK rq->lock
rq = task_rq(p)
LOCK rq->lock
.. waits
LOCK dsq->lock
enqueue on dsq
p->srq_lock = &dsq->lock
UNLOCK dsq->lock
LOCK dsq->lock
pick p
UNLOCK rq->lock
set_task_cpu(2)
task_unlink_from_dsq()
p->srq_lock = NULL;
UNLOCK dsq->lock
.. resumes
At this point our CPU0's __task_rq_lock():
- if it observes p->srq_lock, it will cycle taking that, only to then
find out p->srq_lock is no longer set, but then it must also see
task_rq() has changed, so the next cycle will block on CPU2's
rq->lock.
- if it observes !p->srq_lock, then it cannot be the initial NULL,
since the initial task_rq(p)->lock ordering prohibits this. So it
must be the second NULL, which then also mandates we see the CPU
change and we'll cycle to take CPU2's rq->lock.
That is, I _think_ we're okay :-)
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-12 11:54 ` Peter Zijlstra
@ 2025-09-12 14:11 ` Peter Zijlstra
2025-09-12 17:56 ` Tejun Heo
1 sibling, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-12 14:11 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Fri, Sep 12, 2025 at 01:54:59PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2025 at 02:19:57PM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Sep 10, 2025 at 05:44:21PM +0200, Peter Zijlstra wrote:
> > > @@ -703,17 +703,24 @@ void double_rq_lock(struct rq *rq1, stru
> > > struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
> > > __acquires(rq->lock)
> > > {
> > > + raw_spinlock_t *slock;
> > > struct rq *rq;
> > >
> > > lockdep_assert_held(&p->pi_lock);
> > >
> > > for (;;) {
> > > rq = task_rq(p);
> > > + slock = p->srq_lock;
> > > raw_spin_rq_lock(rq);
> > > - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p))) {
> > > + if (slock)
> > > + raw_spin_lock(slock);
> > > + if (likely(rq == task_rq(p) && !task_on_rq_migrating(p) &&
> > > + (!slock || p->srq_lock == slock))) {
> > > rq_pin_lock(rq, rf);
> > > return rq;
> > > }
>
> Yeah, I think that needs to change a little. Perhaps something like:
>
> slock2 = p->srq_lock;
> if (... && (!slock2 || slock2 == slock))
I'm being stupid, all that wants is: && (p->srq_lock == slock). If there
is a mis-match, unlock and re-try.
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-12 11:54 ` Peter Zijlstra
2025-09-12 14:11 ` Peter Zijlstra
@ 2025-09-12 17:56 ` Tejun Heo
2025-09-15 8:38 ` Peter Zijlstra
1 sibling, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-12 17:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Fri, Sep 12, 2025 at 01:54:59PM +0200, Peter Zijlstra wrote:
> > With the !slock condition, the following scenario is possible:
> >
> > __task_rq_lock()
> > slock = p->srq_lock; /* NULL */
> > dispatch_enqueue()
> > p->srq_lock = &dsq->lock;
> > enqueue finishes
> > raw_spin_rq_lock(rq);
> > rq is the same, $slock is NULL, return
> > do something assuming p is locked down p gets dispatched to another rq
> >
> > I'm unclear on when p->srq_lock would be safe to set and clear, so the goal
> > is that whoever does [__]task_rq_lock() ends up waiting on the dsq lock that
> > the task is queued on, and if we can exclude other sched operations that
> > way, we don't have to hold source rq lock when moving the task to another rq
> > for execution, right?
>
> Indeed. If !p->srq_lock then task_rq(p)->lock must be sufficient.
>
> So for enqueue, which sets p->srq_lock, this must be done while holding
> task_rq(p)->lock.
>
> So the above example should be serialized on task_rq(p)->lock, since
> __task_rq_lock() holds it, enqueue cannot happen. Conversely, if enqueue
> holds task_rq(p)->lock, then __task_rq_lock() will have to wait for
> that, and then observe the newly set p->srq_lock and cycle to take that.
For that to work, [__]task_rq_lock() would have to read p->srq_lock while
holding rq_lock. Simple reordering, but yeah it'd help to have
setter/getter for explicit locking rules.
...
> We must do set_task_cpu(0) before task_unlink_from_dsq() (and I got this
> order wrong in yesterday's email).
>
> pick_task_ext() on CPU0
> lock DSQ
> pick p
> set_task_cpu(0) task_rq_lock()
> task_unlink_from_dsq() if !p->srq_lock, then task_rq(p) == 0
> p->srq_lock = NULL;
> p is moved to local DSQ
>
> Perhaps the p->srq_lock store should be store-release, so that the cpu
> store is before.
>
> Then if we observe p->srq_lock, we'll serialize against DSQ and all is
> well, if we observe !p->srq_lock then we must also observe task_rq(p) ==
> 0 and then we'll serialize on rq->lock.
I see, so the interlocking is between task_rq(p) and p->srq_lock - either
one sees the updated CPU or non-NULL srq_lock. As long as the one that
clears ->srq_lock has both the destination rq and DSQ locked, task_rq_lock()
either ends up waiting on ->srq_lock or sees updated CPU and has to loop
over and wait on the destination rq.
> Now let me see if there isn't an ABA issue here, consider:
>
> pre: task_cpu(p) != 2, p->srq_lock = NULL
>
> CPU0 CPU1 CPU2
>
> __task_rq_lock() enqueue_task_scx() pick_task_scx()
>
> rq = task_rq(p);
> LOCK rq->lock
> rq = task_rq(p)
> LOCK rq->lock
> .. waits
> LOCK dsq->lock
> enqueue on dsq
> p->srq_lock = &dsq->lock
> UNLOCK dsq->lock
> LOCK dsq->lock
> pick p
> UNLOCK rq->lock
> set_task_cpu(2)
> task_unlink_from_dsq()
> p->srq_lock = NULL;
> UNLOCK dsq->lock
> .. resumes
>
> At this point our CPU0's __task_rq_lock():
>
> - if it observes p->srq_lock, it will cycle taking that, only to then
> find out p->srq_lock is no longer set, but then it must also see
> task_rq() has changed, so the next cycle will block on CPU2's
> rq->lock.
>
> - if it observes !p->srq_lock, then it cannot be the initial NULL,
> since the initial task_rq(p)->lock ordering prohibits this. So it
> must be the second NULL, which then also mandates we see the CPU
> change and we'll cycle to take CPU2's rq->lock.
>
> That is, I _think_ we're okay :-)
It *seems* that way to me. There are two other scenarios tho.
- A task can move from a non-local DSQ to another non-local DSQ at any time
while queued. As this doesn't cause rq migration, we can probably just
overwrite p->srq_lock to the new one. Need to think about it a bit more.
- A task can be queued on a BPF data structure and thus may not be on any
DSQ. I think this can be handled by adding a raw_spinlock to task_struct
and treating the task as if it's on its own DSQ by pointing to that one,
and grabbing that lock when transferring that task from BPF side.
So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
doing something simpler and just add a per-task lock which nests inside rq
lock which is always grabbed by [__]task_rq_lock() and optionally grabbed by
sched classes that want to migrate tasks without grabbing the source rq
lock? That way, we don't need to do the lock pointer dancing while achieving
about the same result. From sched_ext's POV, grabbing that per-task lock is
likely going to be cheaper than doing the rq lock switching, so it's way
simpler and nothing gets worse.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-12 17:56 ` Tejun Heo
@ 2025-09-15 8:38 ` Peter Zijlstra
2025-09-16 22:29 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-15 8:38 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Fri, Sep 12, 2025 at 07:56:21AM -1000, Tejun Heo wrote:
> It *seems* that way to me. There are two other scenarios tho.
>
> - A task can move from a non-local DSQ to another non-local DSQ at any time
> while queued. As this doesn't cause rq migration, we can probably just
> overwrite p->srq_lock to the new one. Need to think about it a bit more.
It can use task_on_rq_migrating(), exactly like 'normal' rq-to-rq
migration:
LOCK src_dsq->lock
p->on_rq = TASK_ON_RQ_MIGRATING;
task_unlink_from_dsq();
UNLOCK src_dsq->lock
LOCK dst_dsq->lock
dispatch_enqueue()
p->on_rq = TASK_ON_RQ_QUEUED;
UNLOCK dst_dsq->lock
Same reasoning as for the pick_task_scx() migration, if it observes
!p->srq_lock, then it must observe MIGRATING and we'll spin-wait until
QUEUED. At which point we'll see the new srq_lock.
> - A task can be queued on a BPF data structure and thus may not be on any
> DSQ. I think this can be handled by adding a raw_spinlock to task_struct
> and treating the task as if it's on its own DSQ by pointing to that one,
> and grabbing that lock when transferring that task from BPF side.
Hmm, and BPF data structures cannot have a lock associated with them?
I'm thinking they must, something is serializing all that.
> So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
> doing something simpler and just add a per-task lock which nests inside rq
> lock which is always grabbed by [__]task_rq_lock() and optionally grabbed by
> sched classes that want to migrate tasks without grabbing the source rq
> lock? That way, we don't need to the lock pointer dancing while achieving
> about the same result. From sched_ext's POV, grabbing that per-task lock is
> likely going to be cheaper than doing the rq lock switching, so it's way
> simpler and nothing gets worse.
I *really* don't like that. Fundamentally a runqueue is 'rich' data
structure. It has a container (list, tree, whatever) but also a pile of
statistics (time, vtime, counts, load-sums, averages). Adding/removing a
task from a runqueue needs all that serialized. A per-task lock simply
cannot do this.
If you've hidden this lock inside BPF such that C cannot access it, then
your abstraction needs fixing. Surely it is possible to have a C DSQ to
mirror whatever the BPF thing does. Add a few helpers for BPF to
create/destroy DSQs (IDs) and a callback to map a task to a DSQ. Then
the C part can use the DSQ lock, and hold it while calling into whatever
BPF.
Additionally, it can sanity check the BPF thing, tasks cannot go
'missing' without C knowing wtf they went -- which is that bypass
problem, no?
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-15 8:38 ` Peter Zijlstra
@ 2025-09-16 22:29 ` Tejun Heo
2025-09-16 22:41 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-16 22:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Mon, Sep 15, 2025 at 10:38:15AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 12, 2025 at 07:56:21AM -1000, Tejun Heo wrote:
> > It *seems* that way to me. There are two other scenarios tho.
> >
> > - A task can move from a non-local DSQ to another non-local DSQ at any time
> > while queued. As this doesn't cause rq migration, we can probably just
> > overwrite p->srq_lock to the new one. Need to think about it a bit more.
>
> It can use task_on_rq_migrating(), exactly like 'normal' rq-to-rq
> migration:
>
> LOCK src_dsq->lock
> p->on_rq = TASK_ON_RQ_MIGRATING;
> task_unlink_from_dsq();
> UNLOCK src_dsq->lock
>
> LOCK dst_dsq->lock
> dispatch_enqueue()
> p->on_rq = TASK_ON_RQ_QUEUED;
> UNLOCK dst_dsq->lock
>
> Same reasoning as for the pick_task_scx() migration, if it observes
> !p->srq_lock, then it must observe MIGRATING and we'll spin-wait until
> QUEUED. At which point we'll see the new srq_lock.
I see.
> > - A task can be queued on a BPF data structure and thus may not be on any
> > DSQ. I think this can be handled by adding a raw_spinlock to task_struct
> > and treating the task as if it's on its own DSQ by pointing to that one,
> > and grabbing that lock when transferring that task from BPF side.
>
> Hmm, and BPF data structures cannot have a lock associated with them?
> I'm thinking they must, something is serializing all that.
>
> > So, it *seems* solvable but I'm afraid it's becoming too subtle. How about
> > doing something simpler and just add a per-task lock which nests inside rq
> > lock which is always grabbed by [__]task_rq_lock() and optionally grabbed by
> > sched classes that want to migrate tasks without grabbing the source rq
> > lock? That way, we don't need to do the lock pointer dancing while achieving
> > about the same result. From sched_ext's POV, grabbing that per-task lock is
> > likely going to be cheaper than doing the rq lock switching, so it's way
> > simpler and nothing gets worse.
>
> I *really* don't like that. Fundamentally a runqueue is 'rich' data
> structure. It has a container (list, tree, whatever) but also a pile of
> statistics (time, vtime, counts, load-sums, averages). Adding/removing a
> task from a runqueue needs all that serialized. A per-task lock simply
> cannot do this.
>
> If you've hidden this lock inside BPF such that C cannot access it, then
> your abstraction needs fixing. Surely it is possible to have a C DSQ to
> mirror whatever the BPF thing does. Add a few helpers for BPF to
> create/destroy DSQs (IDs) and a callback to map a task to a DSQ. Then
> the C part can use the DSQ lock, and hold it while calling into whatever
> BPF.
Most current schedulers (except for scx_qmap which is there just to demo how
to use BPF side queueing) use DSQs to handle tasks the way you're
describing. However, BPF arena is becoming more accessible and gaining wider
usage, paired with purely BPF side synchronization constructs (spinlock or
some lockless data structure).
Long term, I think maintaining flexibility is of higher importance for
sched_ext than e.g. small performance improvements or even design or
implementation aesthetics. The primary purpose is enabling trying out new,
sometimes wild, things after all. As such, I don't think it'd be a good idea
to put strict restrictions on how the BPF side operates unless it affects
the ability to recover the system from a malfunctioning BPF scheduler, of
course.
> Additionally, it can sanity check the BPF thing, tasks cannot go
> 'missing' without C knowing wtf they went -- which is that bypass
> problem, no?
They are orthogonal. Even if all tasks are on DSQs, the scheduler may still
fail to dispatch some DSQs for too long, mess up the ordering inside, cause
excessive bouncing across them, and what not. So, the kernel side still
needs to be able to detect and contain failures. The only difference between
a task being on a DSQ or BPF side is that there needs to be extra per-task
state tracking to ensure that tasks only transit states in orderly fashion
(ie. don't dispatch the same task twice).
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-16 22:29 ` Tejun Heo
@ 2025-09-16 22:41 ` Tejun Heo
2025-09-25 8:35 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-16 22:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello, again.
On Tue, Sep 16, 2025 at 12:29:57PM -1000, Tejun Heo wrote:
...
> Long term, I think maintaining flexibility is of higher importance for
> sched_ext than e.g. small performance improvements or even design or
> implementation aesthetics. The primary purpose is enabling trying out new,
> sometimes wild, things after all. As such, I don't think it'd be a good idea
> to put strict restrictions on how the BPF side operates unless it affects
> the ability to recover the system from a malfunctioning BPF scheduler, of
> course.
Thinking a bit more about it. I wonder if the status-quo is actually an okay
balance. All in-kernel sched classes are per-CPU rich rq design, which
meshes well with the current locking scheme, for obvious reasons.
sched_ext is an oddball in that it may want to hot-migrate tasks at the last
minute because who knows what the BPF side wants to do. However, this just
boils down to having to always call balance() before any pick_task()
attempts (including DL server case). Yeah, it's a niggle, especially as
there needs to be a secondary hook to handle losing the race between
balance() and pick_task(), but it's pretty contained conceptually and not a
lot of code.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-16 22:41 ` Tejun Heo
@ 2025-09-25 8:35 ` Peter Zijlstra
2025-09-25 21:43 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-25 8:35 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hi! Sorry for the delay,
On Tue, Sep 16, 2025 at 12:41:54PM -1000, Tejun Heo wrote:
> On Tue, Sep 16, 2025 at 12:29:57PM -1000, Tejun Heo wrote:
> ...
> > Long term, I think maintaining flexibility is of higher importance for
> > sched_ext than e.g. small performance improvements or even design or
> > implementation aesthetics. The primary purpose is enabling trying out new,
> > sometimes wild, things after all. As such, I don't think it'd be a good idea
> > to put strict restrictions on how the BPF side operates unless it affects
> > the ability to recover the system from a malfunctioning BPF scheduler, of
> > course.
>
> Thinking a bit more about it. I wonder if the status-quo is actually an okay
> balance. All in-kernel sched classes are per-CPU rich rq design, which
> meshes well with the current locking scheme, for obvious reasons.
>
> sched_ext is an oddball in that it may want to hot-migrate tasks at the last
> minute because who knows what the BPF side wants to do. However, this just
> boils down to having to always call balance() before any pick_task()
> attempts (including DL server case). Yeah, it's a niggle, especially as
> there needs to be a secondary hook to handle losing the race between
> balance() and pick_task(), but it's pretty contained conceptually and not a
> lot of code.
Status quo isn't sufficient; there is that guy that wants to fix some RT
interaction, and there is that dl_server series.
The only viable option other than overhauling the locking, is pushing rf
into pick_task() and have that do all the lock dancing. This gets rid of
that balance abuse (which is needed for dl_server) and also allows
fixing that rt thing.
It just makes a giant mess of pick_task_scx() which might have to drop
locks and retry/abort -- which you weren't very keen on, but yeah, it
should work.
As to letting BPF do wild experiments; that's fine of course, but not
exposing the actual locking requirements is like denying reality. You
can't do lock-break in pick_task_scx() and then claim lockless or
advanced locking -- that's just not true.
Also, you cannot claim bpf-sched author is clever enough to implement
advanced locking, but then somehow not clever enough to deal with a
simple interface to express locking to the core code. That feels
disingenuous.
For all the DSQ based schedulers, this new locking really is an
improvement, but if you don't want to constrain bpf-sched authors to
reality, then perhaps only do the lock break dance for them?
Anyway, I'll go poke at this series again -- the latest queue.git
version seemed to work reliably for me (I could run stress-ng while
having scx_simple loaded), but the robot seems to have found an issue.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-25 8:35 ` Peter Zijlstra
@ 2025-09-25 21:43 ` Tejun Heo
2025-09-26 9:59 ` Peter Zijlstra
2025-09-26 10:36 ` Peter Zijlstra
0 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2025-09-25 21:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Thu, Sep 25, 2025 at 10:35:33AM +0200, Peter Zijlstra wrote:
> > sched_ext is an oddball in that it may want to hot-migrate tasks at the last
> > minute because who knows what the BPF side wants to do. However, this just
> > boils down to having to always call balance() before any pick_task()
> > attempts (including DL server case). Yeah, it's a niggle, especially as
> > there needs to be a secondary hook to handle losing the race between
> > balance() and pick_task(), but it's pretty contained conceptually and not a
> > lot of code.
>
> Status quo isn't sufficient; there is that guy that wants to fix some RT
> interaction, and there is that dl_server series.
Can you point me to the RT interaction issue?
Just for context, from sched_ext side, the two pending issues are:
- An extra hook after the next task to run is picked regardless of sched
class to fix ops.cpu_acquire/release().
- Invoking sched_ext's balance() if its DL server is going to run. This will
be the same place as the balance() calling for pick_task() and it
shouldn't be too difficult to package them together so that they're a bit
less crufty.
Both can be addressed in a neater way if we can pick_task() atomically, and
that will likely make other things easier too. However, it also isn't like
the benefits are overwhelming depending on how the overall tradeoff comes
out to be.
> The only viable option other than overhauling the locking, is pushing rf
> into pick_task() and have that do all the lock dancing. This gets rid of
> that balance abuse (which is needed for dl_server) and also allows
> fixing that rt thing.
>
> It just makes a giant mess of pick_task_scx() which might have to drop
> locks and retry/abort -- which you weren't very keen on, but yeah, it
> should work.
It does feel really fragile tho. Introducing an extra inner locking layer
makes sense to me. I feel nervous about interlocking around dynamic lock
pointer. It feels too easy to make subtle mistakes in terms of update and
visibility rules. It seems too smart to me. I'd much prefer it to be a bit
dumber.
> As to letting BPF do wild experiments; that's fine of course, but not
> exposing the actual locking requirements is like denying reality. You
> can't do lock-break in pick_task_scx() and then claim lockless or
> advanced locking -- that's just not true.
>
> Also, you cannot claim bpf-sched author is clever enough to implement
> advanced locking, but then somehow not clever enough to deal with a
> simple interface to express locking to the core code. That feels
> disingenuous.
It's not about cleverness but more about the gap between the two execution
environments. For example, the following is pure BPF spinlock implementation
that some started using:
https://github.com/sched-ext/scx/blob/main/scheds/include/scx/bpf_arena_spin_lock.h
Kernel isn't involved in any way. It's BPF code doing atomic ops, managing
waiters and also giving up if reasonable forward progress can't be made.
It's all on BPF arena memory, which is not only readable but also writable
from userspace too. All of this is completely opaque to the kernel. It is
all safe from BPF side, but I don't see how we could interlock with
something like this from kernel side, and we do want BPF to be able to do
things like this.
> For all the DSQ based schedulers, this new locking really is an
> improvement, but if you don't want to constrain bpf-sched authors to
> reality, then perhaps only do the lock break dance for them?
Yes, I was on a similar train of thought. The only reasonable way that I can
think of for solving this for BPF managed tasks is giving each task its own
inner sched lock, which makes sense as all sched operations (except for
things like watchdog) are per-task and we don't really need wider scope
locking.
So, let's say we do that for BPF tasks. Then, why not do the same thing for
DSQ tasks too? It will provide all the necessary synchronization guarantees.
The only downside for tasks on DSQ is that we'll grab one more lock instead
of piggy-backing on the DSQ locks. However, going back to my queasiness
about dynamic lock pointers, I'd rather go for the dumber thing even if it's
slightly less efficient and I'm really doubtful that we'd notice the extra
lock overhead in any practical way.
Thank you.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-25 21:43 ` Tejun Heo
@ 2025-09-26 9:59 ` Peter Zijlstra
2025-09-26 16:48 ` Tejun Heo
2025-09-26 10:36 ` Peter Zijlstra
1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-26 9:59 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> Can you point me to the RT interaction issue?
https://lkml.kernel.org/r/fca528bb34394de3a7e87a873fadd9df@honor.com
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-26 9:59 ` Peter Zijlstra
@ 2025-09-26 16:48 ` Tejun Heo
0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2025-09-26 16:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Fri, Sep 26, 2025 at 11:59:34AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
>
> > Can you point me to the RT interaction issue?
>
> https://lkml.kernel.org/r/fca528bb34394de3a7e87a873fadd9df@honor.com
Ah, that one. Thought there was another RT proper issue. That's the one
which can be solved by having post-pick_task() hook.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-25 21:43 ` Tejun Heo
2025-09-26 9:59 ` Peter Zijlstra
@ 2025-09-26 10:36 ` Peter Zijlstra
2025-09-26 21:39 ` Tejun Heo
1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-26 10:36 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> Yes, I was on a similar train of thought. The only reasonable way that I can
> think of for solving this for BPF managed tasks is giving each task its own
> inner sched lock, which makes sense as all sched operations (except for
> things like watchdog) are per-task and we don't really need wider scope
> locking.
Like I've said before; I really don't understand how that would be
helpful at all.
How can you migrate a task by holding a per-task lock?
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-26 10:36 ` Peter Zijlstra
@ 2025-09-26 21:39 ` Tejun Heo
2025-09-29 10:06 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-26 21:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Fri, Sep 26, 2025 at 12:36:28PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> > Yes, I was on a similar train of thought. The only reasonable way that I can
> > think of for solving this for BPF managed tasks is giving each task its own
> > inner sched lock, which makes sense as all sched operations (except for
> > things like watchdog) are per-task and we don't really need wider scope
> > locking.
>
> Like I've said before; I really don't understand how that would be
> helpful at all.
>
> How can you migrate a task by holding a per-task lock?
Let's see whether I'm completely confused. Let's say we have p->sub_lock
which is optionally grabbed by task_rq_lock() if requested by the current
sched class (maybe it's a sched_class flag). Then, whoever is holding the
sub_lock would exclude property and other changes to the task.
In sched_ext, let's say p->sub_lock nests inside dsq locks. Also, right now,
we're piggy backing on rq lock for local DSQs. We'd need to make local DSQs
use their own locks like user DSQs. Then,
- If a task needs to be migrated either during enqueue through
process_ddsp_deferred_locals() or during dispatch from BPF through
finish_dispatch(): Leave rq locks alone. Grab sub_lock inside
dispatch_to_local_dsq() after grabbing the target DSQ's lock.
- scx_bpf_dsq_move_to_local() from dispatch: This is a bit tricky as we need
to scan the tasks on the source DSQ to find the task to dispatch. However,
there's a patch being worked on to add rcu protected pointer to the first
task which would be the task to be consumed in vast majority of cases, so
the fast path wouldn't be complicated - grab sub_lock, do the moving. If
the first task isn't a good candidate, we'd have to grab DSQ lock, iterate
looking for the right candidate, unlock DSQ and grab sub_lock (or
trylock), and see if the task is still on the DSQ and then relock and
remove.
- scx_bpf_dsq_move() during BPF iteration: DSQ is unlocked during each
iteration visit, so this is straightforward. Grab sub-lock and do the rest
the same.
Wouldn't something like the above provide equivalent synchronization as the
dynamic lock approach? Whoever is holding sub_lock would be guaranteed that
the task won't be migrating while the lock is held.
However, thinking more about it. I'm unsure how e.g. the actual migration
would work. The actual migration is done by: deactivate_task() ->
set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing
steps have operations that depend on rq lock - psi updates, uclamp updates
and so on. How would they work?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-26 21:39 ` Tejun Heo
@ 2025-09-29 10:06 ` Peter Zijlstra
2025-09-30 23:49 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-29 10:06 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Fri, Sep 26, 2025 at 11:39:21AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 26, 2025 at 12:36:28PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 25, 2025 at 11:43:18AM -1000, Tejun Heo wrote:
> > > Yes, I was on a similar train of thought. The only reasonable way that I can
> > > think of for solving this for BPF managed tasks is giving each task its own
> > > inner sched lock, which makes sense as all sched operations (except for
> > > things like watchdog) are per-task and we don't really need wider scope
> > > locking.
> >
> > Like I've said before; I really don't understand how that would be
> > helpful at all.
> >
> > How can you migrate a task by holding a per-task lock?
>
> Let's see whether I'm completely confused. Let's say we have p->sub_lock
> which is optionally grabbed by task_rq_lock() if requested by the current
> sched class (maybe it's a sched_class flag). Then, whoever is holding the
> sub_lock would exclude property and other changes to the task.
>
> In sched_ext, let's say p->sub_lock nests inside dsq locks. Also, right now,
> we're piggy backing on rq lock for local DSQs. We'd need to make local DSQs
> use their own locks like user DSQs. Then,
>
> - If a task needs to be migrated either during enqueue through
> process_ddsp_deferred_locals() or during dispatch from BPF through
> finish_dispatch(): Leave rq locks alone. Grab sub_lock inside
> dispatch_to_local_dsq() after grabbing the target DSQ's lock.
>
> - scx_bpf_dsq_move_to_local() from dispatch: This is a bit tricky as we need
> to scan the tasks on the source DSQ to find the task to dispatch. However,
> there's a patch being worked on to add rcu protected pointer to the first
> task which would be the task to be consumed in vast majority of cases, so
> the fast path wouldn't be complicated - grab sub_lock, do the moving. If
> the first task isn't a good candidate, we'd have to grab DSQ lock, iterate
> looking for the right candidate, unlock DSQ and grab sub_lock (or
> trylock), and see if the task is still on the DSQ and then relock and
> remove.
>
> - scx_bpf_dsq_move() during BPF iteration: DSQ is unlocked during each
> iteration visit, so this is straightforward. Grab sub-lock and do the rest
> the same.
>
> Wouldn't something like the above provide equivalent synchronization as the
> dynamic lock approach? Whoever is holding sub_lock would be guaranteed that
> the task won't be migrating while the lock is held.
>
> However, thinking more about it. I'm unsure how e.g. the actual migration
> would work. The actual migration is done by: deactivate_task() ->
> set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing
> steps have operations that depend on rq lock - psi updates, uclamp updates
> and so on. How would they work?
Suppose __task_rq_lock() will take rq->lock and p->sub_lock, in that
order, such that task_rq_lock() will take p->pi_lock, rq->lock and
p->sub_lock.
Then something like:
guard(task_rq_lock)(p);
scoped_guard (sched_change, p, ...) {
// change me
}
Will end up doing something like:
// task_rq_lock
IRQ-DISABLE
LOCK pi->lock
1:
rq = task_rq(p);
LOCK rq->lock;
if (rq != task_rq(p)) {
UNLOCK rq->lock
goto 1;
}
LOCK p->sub_lock
// sched_change
dequeue_task() := dequeue_task_scx()
LOCK dsq->lock
While at the same time, above you argued p->sub_lock should be inside
dsq->lock. Because:
__schedule()
rq = this_rq();
LOCK rq->lock
next = pick_next() := pick_next_scx()
LOCK dsq->lock
p = find_task(dsq);
LOCK p->sub_lock
dequeue(dsq, p);
UNLOCK dsq->lock
Because if you did something like:
__schedule()
rq = this_rq();
LOCK rq->lock
next = pick_next() := pick_next_scx()
LOCK dsq->lock (or RCU, doesn't matter)
p = find_task(dsq);
UNLOCK dsq->lock
migrate:
LOCK p->pi_lock
rq = task_rq(p)
LOCK rq->lock
(verify bla bla)
LOCK p->sub_lock
LOCK dsq->lock
dequeue(dsq, p)
UNLOCK dsq->lock
set_task_cpu(n);
UNLOCK rq->lock
rq = cpu_rq(n);
LOCK rq->lock (inversion vs p->sub_lock)
LOCK dsq2->lock
enqueue(dsq2, p)
UNLOCK dsq2->lock
LOCK p->sub_lock
LOCK dsq->lock (whoopsie, p is on dsq2)
dequeue(dsq, p)
set_task_cpu(here);
UNLOCK dsq->lock
That is, either way around: dsq->lock outside, p->sub_lock inside, or
the other way around, I end up with inversions and race conditions that
are not fun.
Also, if you do put p->sub_lock inside dsq->lock, this means
__task_rq_lock() cannot take it and it needs to be pushed deep into scx
(possibly into bpf ?) and that means I'm not sure how to do the change
pattern sanely.
Having __task_rq_lock() take p->dsq->lock solves all these problems,
except for that one weird case where BPF wants to do things their own
way. The longer I'm thinking about it, the more I dislike that. I just
don't see *ANY* upside from allowing BPF to do this while it is making
everything else quite awkward.
The easy fix is to have these BPF managed things have a single global
lock. That works and is correct. Then if they want something better,
they can use DSQs :-)
Fundamentally, we need the DSQ->lock to cover all CPUs that will pick
from it, there is no wiggle room there. Also note that while we change
only the attributes of a single task with the change pattern, that
affects the whole RQ, since a runqueue is an aggregate of all tasks.
This is very much why dequeue/enqueue around the change pattern, to keep
the runqueue aggregates updated.
Use the BPF thing to play with scheduling policies, but leave the
locking to the core code.
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-29 10:06 ` Peter Zijlstra
@ 2025-09-30 23:49 ` Tejun Heo
2025-10-01 11:54 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-30 23:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello, Peter.
On Mon, Sep 29, 2025 at 12:06:58PM +0200, Peter Zijlstra wrote:
...
> > However, thinking more about it. I'm unsure how e.g. the actual migration
> > would work. The actual migration is done by: deactivate_task() ->
> > set_task_cpu() -> switch rq locks -> activate_task(). Enqueueing/dequeueing
> > steps have operations that depend on rq lock - psi updates, uclamp updates
> > and so on. How would they work?
>
> Suppose __task_rq_lock() will take rq->lock and p->sub_lock, in that
> order, such that task_rq_lock() will take p->pi_lock, rq->lock and
> p->sub_lock.
>
...
> While at the same time, above you argued p->sub_lock should be inside
> dsq->lock. Because:
>
> __schedule()
> rq = this_rq();
> LOCK rq->lock
> next = pick_next() := pick_next_scx()
> LOCK dsq->lock
> p = find_task(dsq);
> LOCK p->sub_lock
> dequeue(dsq, p);
> UNLOCK dsq->lock
I was going back and forth with the locking order. Note that even if
sub_lock is nested outside DSQ locks, the hot path for the pick_task() path
wouldn't be that different - it just needs RCU protected first_task pointer
and the DSQ association needs to be verified after grabbing the sub_lock
(much like how task_rq_lock() needs to retry).
> Because if you did something like:
>
> __schedule()
> rq = this_rq();
> LOCK rq->lock
> next = pick_next() := pick_next_scx()
> LOCK dsq->lock (or RCU, doesn't matter)
> p = find_task(dsq);
> UNLOCK dsq->lock
> migrate:
> LOCK p->pi_lock
> rq = task_rq(p)
> LOCK rq->lock
> (verify bla bla)
> LOCK p->sub_lock
> LOCK dsq->lock
> dequeue(dsq, p)
> UNLOCK dsq->lock
> set_task_cpu(n);
> UNLOCK rq->lock
> rq = cpu_rq(n);
> LOCK rq->lock (inversion vs p->sub_lock)
> LOCK dsq2->lock
> enqueue(dsq2, p)
> UNLOCK dsq2->lock
>
> LOCK p->sub_lock
> LOCK dsq->lock (whoopsie, p is on dsq2)
> dequeue(dsq, p)
> set_task_cpu(here);
> UNLOCK dsq->lock
I suppose the above is showing that p->sub_lock is nested outside dsq->lock,
right? If so, the right sequence would be:
LOCK p->sub_lock
LOCK src_dsq->lock
verify p is still on src_dsq, if not retry
remove from src_dsq
UNLOCK src_dsq->lock
LOCK dst_dsq->lock
insert into dst_dsq
UNLOCK dst_dsq->lock
UNLOCK p->sub_lock
> That is, either way around: dsq->lock outside, p->sub_lock inside, or
> > the other way around, I end up with inversions and race conditions that
> are not fun.
It's not straightforward for sure but none of these approaches are. They are
all complicated in some areas. From sched_ext POV, I think what matter most
are providing as much latitude as possible to BPF scheduler implementations
and having lower likelihood of really subtle issues.
> Also, if you do put p->sub_lock inside dsq->lock, this means
> __task_rq_lock() cannot take it and it needs to be pushed deep into scx
> (possibly into bpf ?) and that means I'm not sure how to do the change
> pattern sanely.
I'm not quite following why task_rq_lock() wouldn't be able to take it.
Whether p->sub_lock nests under or over DSQ locks should only matter to
sched_ext proper. From core's POV, the only thing that matters is that as
long as p->sub_lock is held, the task won't be migrating and is safe to
change states for (more on this later).
> Having __task_rq_lock() take p->dsq->lock solves all these problems,
> except for that one weird case where BPF wants to do things their own
> way. The longer I'm thinking about it, the more I dislike that. I just
> don't see *ANY* upside from allowing BPF to do this while it is making
> everything else quite awkward.
I'm failing to see the problems you're seeing and have to disagree that
allowing more capabilities on BPF side doesn't bring any upsides. DSQs are
useful but it quickly becomes too restrictive - e.g. people often want to
put the same task on multiple queues and other data structures, which is a
lot more straightforward to do if the data structure and locking are managed
in BPF. In general, I don't think it's a good direction to be prescriptive
about how schedulers should be implemented or behave. Even if we might not
be able to think up something neat right now, someone will.
> The easy fix is to have these BPF managed things have a single global
> lock. That works and is correct. Then if they want something better,
> they can use DSQs :-)
>
> Fundamentally, we need the DSQ->lock to cover all CPUs that will pick
> from it, there is no wiggle room there. Also note that while we change
> only the attributes of a single task with the change pattern, that
> affects the whole RQ, since a runqueue is an aggregate of all tasks.
> This is very much why dequeue/enqueue around the change pattern, to keep
> the runqueue aggregates updated.
>
> Use the BPF thing to play with scheduling policies, but leave the
> locking to the core code.
I have two questions:
- Let's say something works, whether that's holding dsq->lock or
p->sub_lock. I still don't understand how things would be safe w.r.t.
things like PSI and uclamp updates. How would they cope with
set_task_cpu() happening without the rq locked?
- This all started from two proposed core changes. One additional hook after
task pick regardless of the picked task's class (this is a regression that
I missed during the pick_task() conversion) and balance() call for
deadline server, which I think can be combined with existing special case
for sched_ext. While it'd be nice to be able to migrate without holding rq
locks, that path seems very invasive and to have a lot of proposed
capability impacts. This doesn't seem like a particularly productive
direction to me.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-09-30 23:49 ` Tejun Heo
@ 2025-10-01 11:54 ` Peter Zijlstra
2025-10-02 23:32 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-10-01 11:54 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Tue, Sep 30, 2025 at 01:49:12PM -1000, Tejun Heo wrote:
> > That is, either way around: dsq->lock outside, p->sub_lock inside, or
> > the other way around, I end up with inversions and race conditions that
> > are not fun.
>
> It's not straightforward for sure but none of these approaches are. They are
all complicated in some areas. From sched_ext POV, I think what matters most
is providing as much latitude as possible to BPF scheduler implementations
> and having lower likelihood of really subtle issues.
So I don't mind complicated per se. There is some definitely non-trivial
code in the scheduler already, some needed TLA+ to be proven correct.
The srq_lock thing doesn't come close to that.
OTOH I see absolutely no future in allowing BPF to have any say what so
ever in the locking -- that's just not sane.
> > Also, if you do put p->sub_lock inside dsq->lock, this means
> > __task_rq_lock() cannot take it and it needs to be pushed deep into scx
> > (possibly into bpf ?) and that means I'm not sure how to do the change
> > pattern sanely.
>
> I'm not quite following why task_rq_lock() wouldn't be able to take it.
> Whether p->sub_lock nests under or over DSQ locks should only matter to
> sched_ext proper. From core's POV, the only thing that matters is that as
> long as p->sub_lock is held, the task won't be migrating and is safe to
> change states for (more on this later).
The change pattern does dequeue+enqueue, both require dsq->lock.
If you want __task_rq_lock() to take p->sub_lock, the only possible
order is p->sub_lock outside of dsq->lock. But that has definite
problems elsewhere -- picking a task comes to mind.
If you want p->sub_lock inside dsq->lock, then dequeue must take the
lock (and enqueue release it).
But this has problems when you're switching dsq, since you cannot do:
LOCK dsq1->lock
LOCK p->sub_lock
UNLOCK dsq1->lock
LOCK dsq2->lock
UNLOCK p->sub_lock
UNLOCK dsq2->lock
Again, no matter which way around it I flip this, it's not nice. Worse,
if you want to allow BPF to manage the dsq->lock, you then also have to
have BPF manage the p->sub_lock. This means all of the kernel's
scheduler locking is going to depend on BPF and I really, as in *REALLY*
object to that.
> > Having __task_rq_lock() take p->dsq->lock solves all these problems,
> > except for that one weird case where BPF wants to do things their own
> > way. The longer I'm thinking about it, the more I dislike that. I just
> > don't see *ANY* upside from allowing BPF to do this while it is making
> > everything else quite awkward.
>
> I'm failing to see the problems you're seeing and have to disagree that
> allowing more capabilities on BPF side doesn't bring any upsides. DSQs are
> useful but it quickly becomes too restrictive - e.g. people often want to
> put the same task on multiple queues and other data structures, which is a
> lot more straightforward to do if the data structure and locking are managed
> in BPF. In general, I don't think it's a good direction to be prescriptive
> about how schedulers should be implemented or behave. Even if we might not
> be able to think up something neat right now, someone will.
Once they do, they can come and work through the locking. We're not
going to do insane things just in case. Seriously, stop this nonsense.
You cannot have one task on multiple queues without serialization. Every
enqueue,dequeue,pick will have to lock all those runqueues. The moment
you *have* implemented the required locking such that it is both correct
and faster than a single larger runqueue we can talk.
Up to that point, it's vapourware.
And no, BPF is not the only possible way to test-drive crazy ideas. You
can implement such things in userspace just fine. We did most of the
qspinlock development in userspace.
> > The easy fix is to have these BPF managed things have a single global
> > lock. That works and is correct. Then if they want something better,
> > they can use DSQs :-)
> >
> > Fundamentally, we need the DSQ->lock to cover all CPUs that will pick
> > from it, there is no wiggle room there. Also note that while we change
> > only the attributes of a single task with the change pattern, that
> > affects the whole RQ, since a runqueue is an aggregate of all tasks.
> > This is very much why dequeue/enqueue around the change pattern, to keep
> > the runqueue aggregates updated.
> >
> > Use the BPF thing to play with scheduling policies, but leave the
> > locking to the core code.
>
> I have two questions:
>
> - Let's say something works, whether that's holding dsq->lock or
> p->sub_lock. I still don't understand how things would be safe w.r.t.
> things like PSI and uclamp updates. How would they cope with
> set_task_cpu() happening without the rq locked?
I'm not quite sure what the psi problem is; but uclamp update -- like
all updates -- is done with task_rq_lock(). Nobody is proposing to ever
do set_task_cpu() without rq locks held, that would be broken.
So with my proposal updates (task_rq_lock) would:
LOCK p->pi_lock
LOCK rq->lock
LOCK *p->srq_lock (== &p->scx.dsq->lock)
Taking p->pi_lock serializes against the task getting woken up.
Taking rq->lock serializes against schedule(), it pins the task if it is
current.
Taking rq->lock serializes the local runqueue.
Taking *p->srq_lock serializes the non-local runqueue.
The constraint is that *p->srq_lock/dsq->lock must fully serialize the
non-local state (eg. scx.runnable_list is out).
Any migration would need to take rq->lock and (optionally, when
relevant) *p->srq_lock to dequeue the task before doing set_task_cpu().
This is both explicit migration and pick based migration. It is
therefore properly serialized against updates.
To be specific, normal migration does:
rq = __task_rq_lock(p); // acquires rq->lock and dsq->lock
p->on_rq = MIGRATING;
dequeue_task(p, LOCK);
p->scx.dsq = NULL;
task_rq_set_shared(p, NULL);
raw_spin_unlock(&dsq->lock);
set_task_cpu(new);
rq_unlock(rq);
rq = cpu_rq(new);
rq_lock(rq);
enqueue_task(p, LOCK);
raw_spin_lock(&dsq->lock);
p->scx.dsq = dsq;
task_rq_set_shared(p, &dsq->lock);
p->on_rq = QUEUED;
__task_rq_unlock(rq, p); // release dsq->lock and rq->lock
while pick_task_scx() based migration is like:
__schedule()
rq_lock();
...
next = pick_task() := pick_task_scx()
raw_spin_lock(&dsq->lock);
p = dsq_pick(dsq);
set_task_cpu(p, here);
raw_spin_unlock(&dsq->lock);
...
rq_unlock();
Yes, there is a little bit of tricky in __task_rq_lock(), but that is
entirely manageable.
This is all nicely serialized.
> - This all started from two proposed core changes. One additional hook after
> task pick regardless of the picked task's class (this is a regression that
> I missed during the pick_task() conversion) and balance() call for
> deadline server, which I think can be combined with existing special case
> for sched_ext. While it'd be nice to be able to migrate without holding rq
> locks, that path seems very invasive and to have a lot of proposed
> capability impacts. This doesn't seem like a particularly productive
> direction to me.
So you're asking me to puke all over the core code for maybes and
mights -- I'm not having it.
Here are the options:
- push rf into pick_task; delete that balance hack and you do all
the lock dancing in pick_task_scx() and return RETRY_TASK on races.
This allows your BPF managed vapourware to do whatever, the price is
you get to deal with the races in pick_task_scx().
- do the p->srq_lock thing; delete that balance hack and have
pick_task_scx() be nice and simple. This means a single global lock
for BPF managed nonsense (or just delete that option entirely).
- have someone do a detailed analysis of all cases and present a coherent
alternative -- no handwaving, no maybes.
You didn't much like the first option, so I invested time in an
alternative, which is the second option. At which point you pulled out
the might and maybes and BPF needs to do its own locking nonsense.
What is not an option is sprinkling on more hacks.
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock()
2025-10-01 11:54 ` Peter Zijlstra
@ 2025-10-02 23:32 ` Tejun Heo
0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2025-10-02 23:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Wed, Oct 01, 2025 at 01:54:52PM +0200, Peter Zijlstra wrote:
...
> So I don't mind complicated per se. There is some definitely non-trivial
> code in the scheduler already, some needed TLA+ to be proven correct.
> The srq_lock thing doesn't come close to that.
>
> OTOH I see absolutely no future in allowing BPF to have any say what so
> ever in the locking -- that's just not sane.
We for sure don't want BPF locking to be intertwined with kernel locking.
Whatever happens in BPF happens in BPF. The kernel should track the task
states and keep its own locking so that whatever happens in BPF doesn't
affect kernel's integrity or overly complicate what kernel side needs to do.
The goal isn't prescribing what BPF can or cannot do but ensuring that there
is proper isolation and protection.
> > I'm not quite following why task_rq_lock() wouldn't be able to take it.
> > Whether p->sub_lock nests under or over DSQ locks should only matter to
> > sched_ext proper. From core's POV, the only thing that matters is that as
> > long as p->sub_lock is held, the task won't be migrating and is safe to
> > change states for (more on this later).
>
> The change pattern does dequeue+enqueue, both require dsq->lock.
>
> If you want __task_rq_lock() to take p->sub_lock, the only possible
> order is p->sub_lock outside of dsq->lock. But that has definite
> problems elsewhere -- picking a task comes to mind.
Yeah, I need to think more about this but this isn't going to be completely
straightforward regardless of the nesting order. However, the picking side,
the fast path can snoop the first task, which is usually the task to be
picked anyway, locklessly, so DSQ locks nesting inside p->sub_lock likely is
more straightforward although it'd require some lock dancing when the first
task isn't suitable (only happens in global DSQs which are used for
fallbacks anyway).
> If you want p->sub_lock inside dsq->lock, then dequeue must take the
> lock (and enqueue release it).
>
> But this has problems when you're switching dsq, since you cannot do:
>
> LOCK dsq1->lock
> LOCK p->sub_lock
> UNLOCK dsq1->lock
>
> LOCK dsq2->lock
> UNLOCK p->sub_lock
> UNLOCK dsq2->lock
Yeah, this is nastier. I'd have to double lock DSQ locks and then grab the
sub_lock when migrating.
> Again, no matter which way around it I flip this, it's not nice. Worse,
> if you want to allow BPF to manage the dsq->lock, you then also have to
> have BPF manage the p->sub_lock. This means all of the kernel's
> scheduler locking is going to depend on BPF and I really, as in *REALLY*
> object to that.
I wonder whether this is where disagreement is coming from. BPF side has no
need and is not going to have any control over p->sub_lock just like it
currently doesn't have any control over rq locks or DSQ locks. This will all
be kernel side synchronization.
> > I'm failing to see the problems you're seeing and have to disagree that
> > allowing more capabilities on BPF side doesn't bring any upsides. DSQs are
> > useful but it quickly becomes too restrictive - e.g. people often want to
> > put the same task on multiple queues and other data structures, which is a
> > lot more straightforward to do if the data structure and locking are managed
> > in BPF. In general, I don't think it's a good direction to be prescriptive
> > about how schedulers should be implemented or behave. Even if we might not
> > be able to think up something neat right now, someone will.
>
> Once they do, they can come and work through the locking. We're not
> going to do insane things just in case. Seriously, stop this nonsense.
I don't think any of the ideas being discussed were particularly more or
less insane than others. Which part is so insane?
> You cannot have one task on multiple queues without serialization. Every
> enqueue,dequeue,pick will have to lock all those runqueues. The moment
> you *have* implemented the required locking such that it is both correct
> and faster than a single larger runqueue we can talk.
Yes, that's exactly what the BPF arena spinlock can be used for. The point
is that the kernel shouldn't have to worry about what BPF side is doing as
long as they don't clearly misbehave in terms of scheduling decisions.
> Up to that point, it's vapourware.
The ability to queue a task on multiple data structures is something which
has been requested from early on. We tried that with BPF managed data
structures - bpf_list and bpf_rbtree - but it turned out too restricted and
finicky to use. Now with BPF arena, implementing complex data structures is
becoming feasible. This is still in early development but the following is
queueing structure implemented in BPF arena proper:
https://github.com/sched-ext/scx/blob/main/lib/atq.bpf.c
It doesn't do anything fancy yet but provides the flexibility necessary for
managing a task in multiple data structures. While it's not used widely
right now, it's definitely more than vapor.
> And no, BPF is not the only possible way to test-drive crazy ideas. You
> can implement such things in userspace just fine. We did most of the
> qspinlock development in userspace.
Of course, there are multiple ways to do anything, but, here, my point is
that we don't want to restrict what sched_ext BPF schedulers can do if
reasonably possible.
> > I have two questions:
> >
> > - Let's say something works, whether that's holding dsq->lock or
> > p->sub_lock. I still don't understand how things would be safe w.r.t.
> > things like PSI and uclamp updates. How would they cope with
> > set_task_cpu() happening without the rq locked?
>
> I'm not quite sure what the psi problem is; but uclamp update -- like
> all updates -- is done with task_rq_lock(). Nobody is proposing to ever
> do set_task_cpu() without rq locks held, that would be broken.
...
Everything upto this point makes sense to me.
> while pick_task_scx() based migration is like:
>
> __schedule()
> rq_lock();
> ...
> next = pick_task() := pick_task_scx()
> raw_spin_lock(&dsq->lock);
> p = dsq_pick(dsq);
> set_task_cpu(p, here);
> raw_spin_unlock(&dsq->lock);
> ...
> rq_unlock();
>
> Yes, there is a little bit of tricky in __task_rq_lock(), but that is
> entirely manageable.
>
> This is all nicely serialized.
This is where I am confused, so the above is doing set_task_cpu() while only
holding the destination rq lock, but this doesn't dequeue the task from the
previous CPU. dequeue_task() calls psi_dequeue() and uclamp_rq_dec(). Both
mechanisms track per-CPU states which are protected by the task's current rq
lock. When a task moves from one CPU to another, the task should discount
itself from the previous CPU and add itself into the new CPU. This also
applies to sched_core_dequeue() and sched_info_dequeue().
> So you're asking me to puke all over the core code for maybes and
> mights -- I'm not having it.
This is an exaggeration. There are two hooks on discussion in this thread:
- balance() promotion. All it does is ensuring that balance() is always
called before pick_task() for sched_ext. What we want to add for DL
support is doing the same promotion for DL server. Whether to promote or
not can be packaged on the sched_ext side, so that it doesn't make the
code messier than now.
- An additional hook to tell sched_ext the next picked task. We had this in
v6.12 when sched_ext was merged. During the refactoring of sched_class
switch ops afterwards, it was incorrectly moved into sched_ext's switch
ops, which I failed to notice.
Sure, they aren't the prettiest things but they're contained and don't have
significant impact in terms of performance or readability. If the dedicated
hooks are bothersome, we can generalize them too.
I'm not against making it prettier but let's not make up an emergency.
> Here are the options:
>
> - push rf into pick_task; delete that balance hack and you do all
> the lock dancing in pick_task_scx() and return RETRY_TASK on races.
> This allows your BPF managed vapourware to do whatever, the price is
> you get to deal with the races in pick_task_scx().
>
> - do the p->srq_lock thing; delete that balance hack and have
> pick_task_scx() be nice and simple. This means a single global lock
> for BPF managed nonsense (or just delete that option entirely).
>
> - have someone do a detailed analysis of all cases and present a coherent
> alternative -- no handwaving, no maybes.
>
> You didn't much like the first option, so I invested time in an
> alternative, which is the second option. At which point you pulled out
> the might and maybes and BPF needs to do its own locking nonsense.
While I appreciate your effort, this is an unfair representation of how the
thread developed. This isn't a productive way to discuss anything. Let's
take a step back and see where we're diverging:
- As much as possible, I don't want to put restrictions on what BPF
schedulers can do. I don't see why this should be a point of fundamental
disagreement as long as locking issues are resolved in kernel proper.
- Whether it's RETRY_TASK, p->srq_lock or p->sub_lock, I still don't see how
it would work w.r.t. [un]charging per-cpu states. You tried to explain but
I still don't see how this could work. What am I missing?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (11 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 12/14] sched: Add shared runqueue locking to __task_rq_lock() Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-11 2:01 ` Tejun Heo
2025-09-10 15:44 ` [PATCH 14/14] sched/ext: Implement p->srq_lock support Peter Zijlstra
` (2 subsequent siblings)
15 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Provide a LOCKED queue flag, indicating that the {en,de}queue()
operation is in task_rq_lock() context.
Note: the sched_change in scx_bypass() is the only one that does not
use task_rq_lock(). If that were fixed, we could have sched_change
imply LOCKED.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 31 +++++++++++++++++++++++++------
kernel/sched/sched.h | 7 +++++++
kernel/sched/syscalls.c | 4 ++--
3 files changed, 34 insertions(+), 8 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2716,7 +2716,7 @@ void set_cpus_allowed_common(struct task
static void
do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
{
- u32 flags = DEQUEUE_SAVE | DEQUEUE_NOCLOCK;
+ u32 flags = DEQUEUE_SAVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
scoped_guard (sched_change, p, flags) {
p->sched_class->set_cpus_allowed(p, ctx);
@@ -3749,7 +3749,7 @@ static int ttwu_runnable(struct task_str
if (task_on_rq_queued(p)) {
update_rq_clock(rq);
if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED | ENQUEUE_LOCKED);
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
@@ -4816,7 +4816,7 @@ void wake_up_new_task(struct task_struct
update_rq_clock(rq);
post_init_entity_util_avg(p);
- activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL);
+ activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL | ENQUEUE_LOCKED);
trace_sched_wakeup_new(p);
wakeup_preempt(rq, p, wake_flags);
if (p->sched_class->task_woken) {
@@ -7310,7 +7310,7 @@ void rt_mutex_post_schedule(void)
void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
{
int prio, oldprio, queue_flag =
- DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
const struct sched_class *prev_class, *next_class;
struct rq_flags rf;
struct rq *rq;
@@ -8056,7 +8056,7 @@ int migrate_task_to(struct task_struct *
void sched_setnuma(struct task_struct *p, int nid)
{
guard(task_rq_lock)(p);
- scoped_guard (sched_change, p, DEQUEUE_SAVE)
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_LOCKED)
p->numa_preferred_nid = nid;
}
#endif /* CONFIG_NUMA_BALANCING */
@@ -9160,7 +9160,7 @@ static void sched_change_group(struct ta
void sched_move_task(struct task_struct *tsk, bool for_autogroup)
{
unsigned int queue_flags =
- DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
bool resched = false;
struct rq *rq;
@@ -10841,6 +10841,13 @@ struct sched_change_ctx *sched_change_be
struct rq *rq = task_rq(p);
lockdep_assert_rq_held(rq);
+#ifdef CONFIG_PROVE_LOCKING
+ if (flags & DEQUEUE_LOCKED) {
+ lockdep_assert_held(&p->pi_lock);
+ if (p->srq_lock)
+ lockdep_assert_held(p->srq_lock);
+ }
+#endif
if (flags & DEQUEUE_CLASS) {
if (WARN_ON_ONCE(flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)))
@@ -10862,6 +10869,9 @@ struct sched_change_ctx *sched_change_be
.flags = flags,
.queued = task_on_rq_queued(p),
.running = task_current(rq, p),
+#ifdef CONFIG_PROVE_LOCKING
+ .srq_lock = p->srq_lock,
+#endif
};
if (!(flags & DEQUEUE_CLASS)) {
@@ -10888,6 +10898,15 @@ void sched_change_end(struct sched_chang
struct rq *rq = task_rq(p);
lockdep_assert_rq_held(rq);
+#ifdef CONFIG_PROVE_LOCKING
+ if (ctx->flags & ENQUEUE_LOCKED) {
+ lockdep_assert_held(&p->pi_lock);
+ if (p->srq_lock)
+ lockdep_assert_held(p->srq_lock);
+ if (ctx->srq_lock && ctx->srq_lock != p->srq_lock)
+ lockdep_assert_not_held(ctx->srq_lock);
+ }
+#endif
if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
p->sched_class->switching_to(rq, p);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2340,6 +2340,8 @@ extern const u32 sched_prio_to_wmult[40
* CLASS - going to update p->sched_class; makes sched_change call the
* various switch methods.
*
+ * LOCKED - task_rq_lock() context, implies p->srq_lock taken when set.
+ *
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
@@ -2355,6 +2357,7 @@ extern const u32 sched_prio_to_wmult[40
#define DEQUEUE_MIGRATING 0x0010 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x0020 /* Matches ENQUEUE_DELAYED */
#define DEQUEUE_CLASS 0x0040 /* Matches ENQUEUE_CLASS */
+#define DEQUEUE_LOCKED 0x0080 /* Matches ENQUEUE_LOCKED */
#define DEQUEUE_SPECIAL 0x00010000
#define DEQUEUE_THROTTLE 0x00020000
@@ -2367,6 +2370,7 @@ extern const u32 sched_prio_to_wmult[40
#define ENQUEUE_MIGRATING 0x0010
#define ENQUEUE_DELAYED 0x0020
#define ENQUEUE_CLASS 0x0040
+#define ENQUEUE_LOCKED 0x0080
#define ENQUEUE_HEAD 0x00010000
#define ENQUEUE_REPLENISH 0x00020000
@@ -3963,6 +3967,9 @@ extern void balance_callbacks(struct rq
struct sched_change_ctx {
u64 prio;
struct task_struct *p;
+#ifdef CONFIG_PROVE_LOCKING
+ raw_spinlock_t *srq_lock;
+#endif
int flags;
bool queued;
bool running;
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -89,7 +89,7 @@ void set_user_nice(struct task_struct *p
return;
}
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK) {
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED) {
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p, true);
old_prio = p->prio;
@@ -503,7 +503,7 @@ int __sched_setscheduler(struct task_str
struct balance_callback *head;
struct rq_flags rf;
int reset_on_fork;
- int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
struct rq *rq;
bool cpuset_locked = false;
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-10 15:44 ` [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED Peter Zijlstra
@ 2025-09-11 2:01 ` Tejun Heo
2025-09-11 9:42 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-11 2:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello, Peter.
On Wed, Sep 10, 2025 at 05:44:22PM +0200, Peter Zijlstra wrote:
> Provide a LOCKED queue flag, indicating that the {en,de}queue()
> operation is in task_rq_lock() context.
>
> Note: the sched_change in scx_bypass() is the only one that does not
> use task_rq_lock(). If that were fixed, we could have sched_change
> imply LOCKED.
I don't see any harm in doing task_rq_lock() in the scx_bypass() loop.
Please feel free to switch that for simplicity.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-11 2:01 ` Tejun Heo
@ 2025-09-11 9:42 ` Peter Zijlstra
2025-09-11 20:40 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-11 9:42 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Wed, Sep 10, 2025 at 04:01:55PM -1000, Tejun Heo wrote:
> Hello, Peter.
>
> On Wed, Sep 10, 2025 at 05:44:22PM +0200, Peter Zijlstra wrote:
> > Provide a LOCKED queue flag, indicating that the {en,de}queue()
> > operation is in task_rq_lock() context.
> >
> > Note: the sched_change in scx_bypass() is the only one that does not
> > use task_rq_lock(). If that were fixed, we could have sched_change
> > imply LOCKED.
>
> I don't see any harm in doing task_rq_lock() in the scx_bypass() loop.
> Please feel free to switch that for simplicity.
I didn't immediately see how to do that. Doesn't that
list_for_each_entry_safe_reverse() rely on rq->lock to retain integrity?
Moreover, since the goal is to allow:
__schedule()
lock(rq->lock);
next = pick_task() := pick_task_scx()
lock(dsq->lock);
p = some_dsq_task(dsq);
task_unlink_from_dsq(p, dsq);
set_task_cpu(p, cpu_of(rq));
move_task_to_local_dsq(p, ...);
return p;
without dropping rq->lock, by relying on dsq->lock to serialize things,
I don't see how we can retain the runnable list at all.
And at this point, I'm not sure I understand ext well enough to know
what this bypass stuff does at all, let alone suggest means to
re-architect this.
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-11 9:42 ` Peter Zijlstra
@ 2025-09-11 20:40 ` Tejun Heo
2025-09-12 14:19 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-11 20:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Thu, Sep 11, 2025 at 11:42:40AM +0200, Peter Zijlstra wrote:
...
> I didn't immediately see how to do that. Doesn't that
> list_for_each_entry_safe_reverse() rely on rq->lock to retain integrity?
Ah, sorry, I was thinking it was iterating scx_tasks list. Yes, as
implemented, it needs to hold rq lock throughout.
> Moreover, since the goal is to allow:
>
> __schedule()
> lock(rq->lock);
> next = pick_task() := pick_task_scx()
> lock(dsq->lock);
> p = some_dsq_task(dsq);
> task_unlink_from_dsq(p, dsq);
> set_task_cpu(p, cpu_of(rq));
> move_task_to_local_dsq(p, ...);
> return p;
>
> without dropping rq->lock, by relying on dsq->lock to serialize things,
> I don't see how we can retain the runnable list at all.
>
> And at this point, I'm not sure I understand ext well enough to know
> what this bypass stuff does at all, let alone suggest means to
> re architect this.
Bypass mode is enabled when the kernel side can't trust the BPF scheduling
anymore and wants to fall back to dumb FIFO scheduling to guarantee forward
progress (e.g. so that we can switch back to fair).
It comes down to flipping scx_rq_bypassing() on, which makes scheduling
paths bypass most BPF parts and fall back to FIFO behavior, and then making
sure every thread is on FIFO behavior. The latter part is what the loop is
doing. It scans all currently runnable tasks and dequeues and re-enqueues
them. As scx_rq_bypass() is true at this point, if a task were queued on the
BPF side, the cycling takes it out of the BPF side and puts it on the
fallback FIFO queue.
If we want to get rid of the locking requirement:
- Walk scx_tasks list which is iterated with a cursor and allows dropping
locks while iterating. However, on some hardware, there are cases where
CPUs are extremely slowed down from BPF scheduler making bad decisions and
causing a lot of sync cacheline pingponging across e.g. NUMA nodes. As
scx_bypass() is what's supposed to extricate the system from this state,
walking all tasks while going through each's locking probably isn't going
to be great.
- We can update ->runnable_list iteration to allow dropping rq lock e.g.
with a cursor based iteration. Maybe some code can be shared with
scx_tasks iteration. Cycling through locks still isn't going to be great
but here it's likely a lot fewer of them at least.
Neither option is great. Leave it as-is for now?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-11 20:40 ` Tejun Heo
@ 2025-09-12 14:19 ` Peter Zijlstra
2025-09-12 16:32 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-12 14:19 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 11, 2025 at 10:40:06AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 11, 2025 at 11:42:40AM +0200, Peter Zijlstra wrote:
> ...
> > I didn't immediately see how to do that. Doesn't that
> > list_for_each_entry_safe_reverse() rely on rq->lock to retain integrity?
>
> Ah, sorry, I was thinking it was iterating scx_tasks list. Yes, as
> implemented, it needs to hold rq lock throughout.
>
> > Moreover, since the goal is to allow:
> >
> > __schedule()
> > lock(rq->lock);
> > next = pick_task() := pick_task_scx()
> > lock(dsq->lock);
> > p = some_dsq_task(dsq);
> > task_unlink_from_dsq(p, dsq);
> > set_task_cpu(p, cpu_of(rq));
> > move_task_to_local_dsq(p, ...);
> > return p;
> >
> > without dropping rq->lock, by relying on dsq->lock to serialize things,
> > I don't see how we can retain the runnable list at all.
> >
> > And at this point, I'm not sure I understand ext well enough to know
> > what this bypass stuff does at all, let alone suggest means to
> > re architect this.
>
> Bypass mode is enabled when the kernel side can't trust the BPF scheduling
> anymore and wants to fall back to dumb FIFO scheduling to guarantee forward
> progress (e.g. so that we can switch back to fair).
>
> It comes down to flipping scx_rq_bypassing() on, which makes scheduling
> paths bypass most BPF parts and fall back to FIFO behavior, and then making
> sure every thread is on FIFO behavior. The latter part is what the loop is
> doing. It scans all currently runnable tasks and dequeues and re-enqueues
> them. As scx_rq_bypass() is true at this point, if a task were queued on the
> BPF side, the cycling takes it out of the BPF side and puts it on the
> fallback FIFO queue.
>
> If we want to get rid of the locking requirement:
>
> - Walk scx_tasks list which is iterated with a cursor and allows dropping
> locks while iterating. However, on some hardware, there are cases where
> CPUs are extremely slowed down from BPF scheduler making bad decisions and
> causing a lot of sync cacheline pingponging across e.g. NUMA nodes. As
> scx_bypass() is what's supposed to extricate the system from this state,
> walking all tasks while going through each's locking probably isn't going
> to be great.
>
> - We can update ->runnable_list iteration to allow dropping rq lock e.g.
> with a cursor based iteration. Maybe some code can be shared with
> scx_tasks iteration. Cycling through locks still isn't going to be great
> but here it's likely a lot fewer of them at least.
>
> Neither option is great. Leave it as-is for now?
Ah, but I think we *have* to change it :/ The thing is that with the new
pick you can change 'rq' without holding the source rq->lock. So we
can't maintain this list.
Could something like so work?
scoped_guard (rcu) for_each_process_thread(g, p) {
if (p->flags & PF_EXITING || p->sched_class != ext_sched_class)
continue;
guard(task_rq_lock)(p);
scoped_guard (sched_change, p) {
/* no-op */
}
}
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-12 14:19 ` Peter Zijlstra
@ 2025-09-12 16:32 ` Tejun Heo
2025-09-13 22:32 ` Tejun Heo
2025-09-25 13:10 ` Peter Zijlstra
0 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2025-09-12 16:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Fri, Sep 12, 2025 at 04:19:04PM +0200, Peter Zijlstra wrote:
...
> Ah, but I think we *have* to change it :/ The thing is that with the new
> pick you can change 'rq' without holding the source rq->lock. So we
> can't maintain this list.
>
> Could something like so work?
>
> scoped_guard (rcu) for_each_process_thread(g, p) {
> if (p->flags & PF_EXITING || p->sched_class != ext_sched_class)
> continue;
>
> guard(task_rq_lock)(p);
> scoped_guard (sched_change, p) {
> /* no-op */
> }
> }
Yeah, or I can make scx_tasks iteration smarter so that it can skip through
the list for tasks which aren't runnable. As long as it doesn't do lock ops
on every task, it should be fine. I think this is solvable one way or
another. Let's continue in the other subthread.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-12 16:32 ` Tejun Heo
@ 2025-09-13 22:32 ` Tejun Heo
2025-09-15 8:48 ` Peter Zijlstra
2025-09-25 13:10 ` Peter Zijlstra
1 sibling, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-13 22:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Fri, Sep 12, 2025 at 06:32:32AM -1000, Tejun Heo wrote:
> Yeah, or I can make scx_tasks iteration smarter so that it can skip through
> the list for tasks which aren't runnable. As long as it doesn't do lock ops
> on every task, it should be fine. I think this is solvable one way or
> another. Let's continue in the other subthread.
Thought more about it. There's another use case for this runnable list,
which is the watchdog. As in the migration synchronization, I think the
right thing to do here is just adding a nested lock. That doesn't add any
overhead or complications to other sched classes and from sched_ext POV
given how expensive migrations can be, if we make that a bit cheaper (and I
believe we will with changes being discussed), added up, the outcome would
likely be lower overhead.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-13 22:32 ` Tejun Heo
@ 2025-09-15 8:48 ` Peter Zijlstra
0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-15 8:48 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Sat, Sep 13, 2025 at 12:32:27PM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 12, 2025 at 06:32:32AM -1000, Tejun Heo wrote:
> > Yeah, or I can make scx_tasks iteration smarter so that it can skip through
> > the list for tasks which aren't runnable. As long as it doesn't do lock ops
> > on every task, it should be fine. I think this is solvable one way or
> > another. Let's continue in the other subthread.
>
> Thought more about it. There's another use case for this runnable list,
> which is the watchdog. As in the migration synchronization, I think the
> right thing to do here is just adding a nested lock. That doesn't add any
> overhead or complications to other sched classes and from sched_ext POV
> given how expensive migrations can be, if we make that a bit cheaper (and I
> believe we will with changes being discussed), added up, the outcome would
> likely be lower overhead.
I really don't see how you could possibly retain that runnable_list.
pick_next_task() must be able to migrate a task from a shared runqueue
to a local runqueue. It must do this without taking a random other
per-cpu runqueue. Therefore, a task on a DSQ must have no 'local' state.
This very much means the runnable_list cannot be per cpu.
No per-task lock is going to help with that.
The watchdog will have to go iterate DSQs or something like that.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-12 16:32 ` Tejun Heo
2025-09-13 22:32 ` Tejun Heo
@ 2025-09-25 13:10 ` Peter Zijlstra
2025-09-25 15:40 ` Tejun Heo
1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-25 13:10 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Fri, Sep 12, 2025 at 06:32:32AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 12, 2025 at 04:19:04PM +0200, Peter Zijlstra wrote:
> ...
> > Ah, but I think we *have* to change it :/ The thing is that with the new
> > pick you can change 'rq' without holding the source rq->lock. So we
> > can't maintain this list.
> >
> > Could something like so work?
> >
> > scoped_guard (rcu) for_each_process_thread(g, p) {
> > if (p->flags & PF_EXITING || p->sched_class != ext_sched_class)
> > continue;
> >
> > guard(task_rq_lock)(p);
> > scoped_guard (sched_change, p) {
> > /* no-op */
> > }
> > }
>
> Yeah, or I can make scx_tasks iteration smarter so that it can skip through
> the list for tasks which aren't runnable. As long as it doesn't do lock ops
> on every task, it should be fine. I think this is solvable one way or
> another. Let's continue in the other subthread.
Well, either this or scx_tasks iterator will result in lock ops for
every task, this is unavoidable if we want the normal p->pi_lock,
rq->lock (dsq->lock) taken for every sched_change caller.
I have the below which I would like to include in the series such that I
can clean up all that DEQUEUE_LOCKED stuff a bit, this being the only
sched_change that's 'weird'.
Added 'bonus' is of course one less user of the runnable_list.
(also, I have to note, for_each_cpu with preemption disabled is asking
for trouble, the enormous core count machines are no longer super
esoteric)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4817,6 +4817,7 @@ static void scx_bypass(bool bypass)
{
static DEFINE_RAW_SPINLOCK(bypass_lock);
static unsigned long bypass_timestamp;
+ struct task_struct *g, *p;
struct scx_sched *sch;
unsigned long flags;
int cpu;
@@ -4849,16 +4850,16 @@ static void scx_bypass(bool bypass)
* queued tasks are re-queued according to the new scx_rq_bypassing()
* state. As an optimization, walk each rq's runnable_list instead of
* the scx_tasks list.
- *
- * This function can't trust the scheduler and thus can't use
- * cpus_read_lock(). Walk all possible CPUs instead of online.
+ */
+
+ /*
+ * XXX online_mask is stable due to !preempt (per bypass_lock)
+ * so could this be for_each_online_cpu() ?
*/
for_each_possible_cpu(cpu) {
struct rq *rq = cpu_rq(cpu);
- struct task_struct *p, *n;
raw_spin_rq_lock(rq);
-
if (bypass) {
WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING);
rq->scx.flags |= SCX_RQ_BYPASSING;
@@ -4866,36 +4867,33 @@ static void scx_bypass(bool bypass)
WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING));
rq->scx.flags &= ~SCX_RQ_BYPASSING;
}
+ raw_spin_rq_unlock(rq);
+ }
+
+ /* implicit RCU section due to bypass_lock */
+ for_each_process_thread(g, p) {
+ unsigned int state;
- /*
- * We need to guarantee that no tasks are on the BPF scheduler
- * while bypassing. Either we see enabled or the enable path
- * sees scx_rq_bypassing() before moving tasks to SCX.
- */
- if (!scx_enabled()) {
- raw_spin_rq_unlock(rq);
+ guard(raw_spinlock)(&p->pi_lock);
+ if (p->flags & PF_EXITING || p->sched_class != &ext_sched_class)
+ continue;
+
+ state = READ_ONCE(p->__state);
+ if (state != TASK_RUNNING && state != TASK_WAKING)
continue;
- }
- /*
- * The use of list_for_each_entry_safe_reverse() is required
- * because each task is going to be removed from and added back
- * to the runnable_list during iteration. Because they're added
- * to the tail of the list, safe reverse iteration can still
- * visit all nodes.
- */
- list_for_each_entry_safe_reverse(p, n, &rq->scx.runnable_list,
- scx.runnable_node) {
- /* cycling deq/enq is enough, see the function comment */
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- /* nothing */ ;
- }
+ guard(__task_rq_lock)(p);
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /* nothing */ ;
}
+ }
- /* resched to restore ticks and idle state */
- if (cpu_online(cpu) || cpu == smp_processor_id())
- resched_curr(rq);
+ /* implicit !preempt section due to bypass_lock */
+ for_each_online_cpu(cpu) {
+ struct rq *rq = cpu_rq(cpu);
+ raw_spin_rq_lock(rq);
+ resched_curr(cpu_rq(cpu));
raw_spin_rq_unlock(rq);
}
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-25 13:10 ` Peter Zijlstra
@ 2025-09-25 15:40 ` Tejun Heo
2025-09-25 15:53 ` Peter Zijlstra
0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2025-09-25 15:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Thu, Sep 25, 2025 at 03:10:25PM +0200, Peter Zijlstra wrote:
...
> Well, either this or scx_tasks iterator will result in lock ops for
> every task, this is unavoidable if we want the normal p->pi_lock,
> rq->lock (dsq->lock) taken for every sched_change caller.
>
> I have the below which I would like to include in the series such that I
> can clean up all that DEQUEUE_LOCKED stuff a bit, this being the only
> sched_change that's 'weird'.
>
> Added 'bonus' is of course one less user of the runnable_list.
>
> (also, I have to note, for_each_cpu with preemption disabled is asking
> for trouble, the enormous core count machines are no longer super
> esoteric)
Oh yeah, we can break up every N CPUs. There's no cross-CPU atomicity
requirement.
> + /*
> + * XXX online_mask is stable due to !preempt (per bypass_lock)
> + * so could this be for_each_online_cpu() ?
> */
CPUs can go on and offline while CPUs are being bypassed. We can handle that
in hotplug ops but I'm not sure the complexity is justified in this case.
> for_each_possible_cpu(cpu) {
> struct rq *rq = cpu_rq(cpu);
> - struct task_struct *p, *n;
>
> raw_spin_rq_lock(rq);
> -
> if (bypass) {
> WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING);
> rq->scx.flags |= SCX_RQ_BYPASSING;
> @@ -4866,36 +4867,33 @@ static void scx_bypass(bool bypass)
> WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING));
> rq->scx.flags &= ~SCX_RQ_BYPASSING;
I may be using BYPASSING being set as all tasks having been cycled. Will
check. We may need an extra state to note that bypass switching is complete.
Hmm... the switching is not synchronized against scheduling operations
anymore - ie. we can end up mixing regular op and bypassed operation for the
same scheduling event (e.g. enqueue vs. task state transitions), which can
lead subtle state inconsistencies on the BPF scheduler side. Either the
bypassing state should become per-task, which likely has system
recoverability issues under lock storm conditions, or maybe we can just
shift it to the scheduling path - e.g. decide whether to bypass or not at
the beginning of enqueue path and then stick to it until the whole operation
is finished.
> }
> + raw_spin_rq_unlock(rq);
> + }
> +
> + /* implicit RCU section due to bypass_lock */
> + for_each_process_thread(g, p) {
I don't think this is safe. p->tasks is unlinked from __unhash_process() but
tasks can schedule between being unhashed and the final preempt_disable() in
do_exit() and thus the above iteration can miss tasks which may currently be
runnable.
> + unsigned int state;
>
> + guard(raw_spinlock)(&p->pi_lock);
> + if (p->flags & PF_EXITING || p->sched_class != &ext_sched_class)
> + continue;
> +
> + state = READ_ONCE(p->__state);
> + if (state != TASK_RUNNING && state != TASK_WAKING)
> continue;
>
> + guard(__task_rq_lock)(p);
> + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
> + /* nothing */ ;
> }
> + }
This is significantly more expensive. On large systems, the number of
threads can easily reach six digits. Iterating all of them while doing
locking ops on each of them might become problematic depending on what the
rest of the system is doing (unfortunately, it's not too difficult to cause
meltdowns on some NUMA systems with cross-node traffic). I don't think
p->tasks iterations can be broken up either.
I'm sure there's a solution for all these. Maybe once bypass is set and the
per-task iteration can be broken up, this is no longer a problem, or maybe
there's some other way to maintain runnable list in a way that's decoupled
from rq lock. The interlocking requirement is relaxed on the removal side.
There must be a way to visit all runnable tasks but visiting some tasks
spuriously is not a problem, so there's some leeway too.
As with everything, this part is a bit tricky and will need non-trivial
amount of testing to verify that it can recover the system from BPF
scheduler induced death spirals (e.g. migrating tasks too frequently across
NUMA boundaries on some systems). The change guard cleanups make sense
regardless of how the rest develops. Would it make sense to land them first?
Once we know what to do with the core scheduling locking, I'm sure we can
find a way to make this work accordingly.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-25 15:40 ` Tejun Heo
@ 2025-09-25 15:53 ` Peter Zijlstra
2025-09-25 18:44 ` Tejun Heo
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-25 15:53 UTC (permalink / raw)
To: Tejun Heo
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 25, 2025 at 05:40:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 25, 2025 at 03:10:25PM +0200, Peter Zijlstra wrote:
> ...
> > Well, either this or scx_tasks iterator will result in lock ops for
> > every task, this is unavoidable if we want the normal p->pi_lock,
> > rq->lock (dsq->lock) taken for every sched_change caller.
> >
> > I have the below which I would like to include in the series such that I
> > can clean up all that DEQUEUE_LOCKED stuff a bit, this being the only
> > sched_change that's 'weird'.
> >
> > Added 'bonus' is of course one less user of the runnable_list.
> >
> > (also, I have to note, for_each_cpu with preemption disabled is asking
> > for trouble, the enormous core count machines are no longer super
> > esoteric)
>
> Oh yeah, we can break up every N CPUs. There's no cross-CPU atomicity
> requirement.
Right.
> > + /*
> > + * XXX online_mask is stable due to !preempt (per bypass_lock)
> > + * so could this be for_each_online_cpu() ?
> > */
>
> CPUs can go on and offline while CPUs are being bypassed. We can handle that
> in hotplug ops but I'm not sure the complexity is justified in this case.
Well, not in the current code, since the CPU running this has IRQs and
preemption disabled (per bypass_lock) and thus stop_machine, as used in
hotplug can't make progress.
That is; disabling preemption serializes against hotplug. This is
something that the scheduler relies on in quite a few places.
> > for_each_possible_cpu(cpu) {
> > struct rq *rq = cpu_rq(cpu);
> > - struct task_struct *p, *n;
> >
> > raw_spin_rq_lock(rq);
> > -
> > if (bypass) {
> > WARN_ON_ONCE(rq->scx.flags & SCX_RQ_BYPASSING);
> > rq->scx.flags |= SCX_RQ_BYPASSING;
> > @@ -4866,36 +4867,33 @@ static void scx_bypass(bool bypass)
> > WARN_ON_ONCE(!(rq->scx.flags & SCX_RQ_BYPASSING));
> > rq->scx.flags &= ~SCX_RQ_BYPASSING;
>
> I may be using BYPASSING being set as all tasks having been cycled. Will
> check. We may need an extra state to note that bypass switching is complete.
> Hmm... the switching is not synchronized against scheduling operations
> anymore - ie. we can end up mixing regular op and bypassed operation for the
> same scheduling event (e.g. enqueue vs. task state transitions), which can
> lead subtle state inconsistencies on the BPF scheduler side. Either the
> bypassing state should become per-task, which likely has system
> recoverability issues under lock storm conditions, or maybe we can just
> shift it to the scheduling path - e.g. decide whether to bypass or not at
> the beginning of enqueue path and then stick to it until the whole operation
> is finished.
Makes sense.
> > }
> > + raw_spin_rq_unlock(rq);
> > + }
> > +
> > + /* implicit RCU section due to bypass_lock */
> > + for_each_process_thread(g, p) {
>
> I don't think this is safe. p->tasks is unlinked from __unhash_process() but
> tasks can schedule between being unhashed and the final preempt_disable() in
> do_exit() and thus the above iteration can miss tasks which may currently be
> runnable.
Bah, you're quite right :/
> > + unsigned int state;
> >
> > + guard(raw_spinlock)(&p->pi_lock);
> > + if (p->flags & PF_EXITING || p->sched_class != &ext_sched_class)
> > + continue;
> > +
> > + state = READ_ONCE(p->__state);
> > + if (state != TASK_RUNNING && state != TASK_WAKING)
> > continue;
> >
> > + guard(__task_rq_lock)(p);
> > + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
> > + /* nothing */ ;
> > }
> > + }
>
> This is significantly more expensive. On large systems, the number of
> threads can easily reach six digits. Iterating all of them while doing
> locking ops on each of them might become problematic depending on what the
> rest of the system is doing (unfortunately, it's not too difficult to cause
> meltdowns on some NUMA systems with cross-node traffic). I don't think
> p->tasks iterations can be broken up either.
I thought to have understood that bypass isn't something that happens
when the system is happy. As long as it completes at some point all this
should be fine right?
I mean, yeah, it'll take a while, but meh.
Also, we could run the thing at fair or FIFO-1 or something, to be
outside of ext itself. Possibly we can freeze all the ext tasks on
return to user to limit the amount of noise they generate.
> The change guard cleanups make sense
> regardless of how the rest develops. Would it make sense to land them first?
> Once we know what to do with the core scheduling locking, I'm sure we can
> find a way to make this work accordingly.
Yeah, definitely. Thing is, if we can get all sched_change users to be
the same, that all cleans up better.
But if cleaning this up gets to be too vexing, we can postpone that.
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED
2025-09-25 15:53 ` Peter Zijlstra
@ 2025-09-25 18:44 ` Tejun Heo
0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2025-09-25 18:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Hello,
On Thu, Sep 25, 2025 at 05:53:23PM +0200, Peter Zijlstra wrote:
> > CPUs can go on and offline while CPUs are being bypassed. We can handle that
> > in hotplug ops but I'm not sure the complexity is justified in this case.
>
> Well, not in the current code, since the CPU running this has IRQs and
> preemption disabled (per bypass_lock) and thus stop_machine, as used in
> hotplug can't make progress.
>
> That is; disabling preemption serializes against hotplug. This is
> something that the scheduler relies on in quite a few places.
Oh, I meant something like:
CPU X goes down
scx_bypass(true);
stuff happening in bypass mode.
tasks are scheduling, sleeping and CPU X comes up
everything.
scx_bypass(false);
When CPU X comes up, it should come up in bypass mode, which can easily be
done in online callback, but it's just a bit simpler to keep them always in
sync.
> > This is significantly more expensive. On large systems, the number of
> > threads can easily reach six digits. Iterating all of them while doing
> > locking ops on each of them might become problematic depending on what the
> > rest of the system is doing (unfortunately, it's not too difficult to cause
> > meltdowns on some NUMA systems with cross-node traffic). I don't think
> > p->tasks iterations can be broken up either.
>
> I thought to have understood that bypass isn't something that happens
> when the system is happy. As long as it completes at some point all this
> should be fine right?
>
> I mean, yeah, it'll take a while, but meh.
>
> Also, we could run the thing at fair or FIFO-1 or something, to be
> outside of ext itself. Possibly we can freeze all the ext tasks on
> return to user to limit the amount of noise they generate.
One problem scenario that we saw with sapphire rapids multi socket machines
is that when there are a lot of cross-socket locking operations (same locks
getting hammered on from two sockets), forward progress slows down to the
point where hard lockup triggers really easily. We saw two problems in such
scenarios - the total throughput of locking operations was low and the
distribution of successes across CPUs was pretty skewed. Combining the two
factors, the slowest CPU on sapphire rapids ran about two orders of
magnitude slower than a similarly sized AMD machine doing the same thing.
The benchmark became a part of stress-ng, the --flipflop.
Anyways, what this comes down to is that on some machines, scx_bypass(true)
has to be pretty careful to avoid these hard lockup scenarios as that's
what's expected to recover the system when such situations develop.
> > The change guard cleanups make sense
> > regardless of how the rest develops. Would it make sense to land them first?
> > Once we know what to do with the core scheduling locking, I'm sure we can
> > find a way to make this work accordingly.
>
> Yeah, definitely. Thing is, if we can get all sched_change users to be
> the same, that all cleans up better.
>
> But if cleaning this up gets to be too vexing, we can postpone that.
Yeah, I think it's just going to be a bit more involved and it'd be easier
if we don't make it block other stuff.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH 14/14] sched/ext: Implement p->srq_lock support
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (12 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 13/14] sched: Add {DE,EN}QUEUE_LOCKED Peter Zijlstra
@ 2025-09-10 15:44 ` Peter Zijlstra
2025-09-10 16:07 ` Peter Zijlstra
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
2025-09-18 15:15 ` Christian Loehle
15 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 15:44 UTC (permalink / raw)
To: tj
Cc: linux-kernel, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
Have enqueue set p->srq_lock to &dsq->lock and have dequeue clear it,
when dst is non-local.
When enqueue sees ENQUEUE_LOCKED, it must lock dsq->lock (since
p->srq_lock will be NULL on entry) but must not unlock on exit when it
sets p->srq_lock.
When dequeue sees DEQUEUE_LOCKED, it must not lock dsq->lock when
p->srq_lock is set (instead it must verify they are the same), but it
must unlock on exit, since it will have cleared p->srq_lock.
For DEQUEUE_SAVE/ENQUEUE_RESTORE it can retain p->srq_lock, since
the extra unlock+lock cycle is pointless.
Note: set_next_task_scx() relies on LOCKED to avoid self-recursion on
dsq->lock in the enqueue_task/set_next_task case.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/ext.c | 68 ++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 47 insertions(+), 21 deletions(-)
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1952,13 +1952,16 @@ static void dispatch_enqueue(struct scx_
struct task_struct *p, u64 enq_flags)
{
bool is_local = dsq->id == SCX_DSQ_LOCAL;
+ bool locked = enq_flags & ENQUEUE_LOCKED;
+ bool restore = enq_flags & ENQUEUE_RESTORE;
WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) ||
!RB_EMPTY_NODE(&p->scx.dsq_priq));
if (!is_local) {
- raw_spin_lock(&dsq->lock);
+ if (!locked || !p->srq_lock)
+ raw_spin_lock(&dsq->lock);
if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
scx_error(sch, "attempting to dispatch to a destroyed dsq");
/* fall back to the global dsq */
@@ -2028,6 +2031,10 @@ static void dispatch_enqueue(struct scx_
dsq_mod_nr(dsq, 1);
p->scx.dsq = dsq;
+ if (!is_local) {
+ WARN_ON_ONCE(locked && restore && p->srq_lock && p->srq_lock != &dsq->lock);
+ p->srq_lock = &dsq->lock;
+ }
/*
* scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
@@ -2059,13 +2066,17 @@ static void dispatch_enqueue(struct scx_
rq->curr->sched_class))
resched_curr(rq);
} else {
- raw_spin_unlock(&dsq->lock);
+ if (!locked)
+ raw_spin_unlock(&dsq->lock);
}
}
static void task_unlink_from_dsq(struct task_struct *p,
- struct scx_dispatch_q *dsq)
+ struct scx_dispatch_q *dsq,
+ int deq_flags)
{
+ bool save = deq_flags & DEQUEUE_SAVE;
+
WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) {
@@ -2076,12 +2087,15 @@ static void task_unlink_from_dsq(struct
list_del_init(&p->scx.dsq_list.node);
dsq_mod_nr(dsq, -1);
+ if (!save)
+ p->srq_lock = NULL;
}
-static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
+static void dispatch_dequeue(struct rq *rq, struct task_struct *p, int deq_flags)
{
struct scx_dispatch_q *dsq = p->scx.dsq;
bool is_local = dsq == &rq->scx.local_dsq;
+ bool locked = deq_flags & DEQUEUE_LOCKED;
if (!dsq) {
/*
@@ -2103,8 +2117,10 @@ static void dispatch_dequeue(struct rq *
return;
}
- if (!is_local)
- raw_spin_lock(&dsq->lock);
+ if (!is_local) {
+ if (!locked)
+ raw_spin_lock(&dsq->lock);
+ }
/*
* Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_* can't
@@ -2112,7 +2128,8 @@ static void dispatch_dequeue(struct rq *
*/
if (p->scx.holding_cpu < 0) {
/* @p must still be on @dsq, dequeue */
- task_unlink_from_dsq(p, dsq);
+ WARN_ON_ONCE(!is_local && !p->srq_lock);
+ task_unlink_from_dsq(p, dsq, deq_flags);
} else {
/*
* We're racing against dispatch_to_local_dsq() which already
@@ -2125,8 +2142,10 @@ static void dispatch_dequeue(struct rq *
}
p->scx.dsq = NULL;
- if (!is_local)
- raw_spin_unlock(&dsq->lock);
+ if (!is_local) {
+ if (!locked || !p->srq_lock)
+ raw_spin_unlock(&dsq->lock);
+ }
}
static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
@@ -2372,7 +2391,7 @@ static void clr_task_runnable(struct tas
p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
}
-static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first);
+static void __set_next_task_scx(struct rq *rq, struct task_struct *p, u32 qf);
static void enqueue_task_scx(struct rq *rq, struct task_struct *p, u32 enq_flags)
{
@@ -2421,7 +2440,7 @@ static void enqueue_task_scx(struct rq *
__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
if (enq_flags & ENQUEUE_CURR)
- set_next_task_scx(rq, p, false);
+ __set_next_task_scx(rq, p, enq_flags);
}
static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
@@ -2516,7 +2535,7 @@ static bool dequeue_task_scx(struct rq *
rq->scx.nr_running--;
sub_nr_running(rq, 1);
- dispatch_dequeue(rq, p);
+ dispatch_dequeue(rq, p, deq_flags);
out:
if (deq_flags & DEQUEUE_CURR)
@@ -2710,7 +2729,7 @@ static bool unlink_dsq_and_lock_src_rq(s
lockdep_assert_held(&dsq->lock);
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- task_unlink_from_dsq(p, dsq);
+ task_unlink_from_dsq(p, dsq, 0);
p->scx.holding_cpu = cpu;
raw_spin_unlock(&dsq->lock);
@@ -2782,7 +2801,7 @@ static struct rq *move_task_between_dsqs
if (dst_dsq->id == SCX_DSQ_LOCAL) {
/* @p is going from a non-local DSQ to a local DSQ */
if (src_rq == dst_rq) {
- task_unlink_from_dsq(p, src_dsq);
+ task_unlink_from_dsq(p, src_dsq, 0);
move_local_task_to_local_dsq(p, enq_flags,
src_dsq, dst_rq);
raw_spin_unlock(&src_dsq->lock);
@@ -2796,7 +2815,7 @@ static struct rq *move_task_between_dsqs
* @p is going from a non-local DSQ to a non-local DSQ. As
* $src_dsq is already locked, do an abbreviated dequeue.
*/
- task_unlink_from_dsq(p, src_dsq);
+ task_unlink_from_dsq(p, src_dsq, 0);
p->scx.dsq = NULL;
raw_spin_unlock(&src_dsq->lock);
@@ -2862,7 +2881,7 @@ static bool consume_dispatch_q(struct sc
struct rq *task_rq = task_rq(p);
if (rq == task_rq) {
- task_unlink_from_dsq(p, dsq);
+ task_unlink_from_dsq(p, dsq, 0);
move_local_task_to_local_dsq(p, 0, dsq, rq);
raw_spin_unlock(&dsq->lock);
return true;
@@ -3256,7 +3275,7 @@ static void process_ddsp_deferred_locals
}
}
-static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
+static void __set_next_task_scx(struct rq *rq, struct task_struct *p, u32 qf)
{
struct scx_sched *sch = scx_root;
@@ -3266,7 +3285,7 @@ static void set_next_task_scx(struct rq
* dispatched. Call ops_dequeue() to notify the BPF scheduler.
*/
ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
- dispatch_dequeue(rq, p);
+ dispatch_dequeue(rq, p, qf);
}
p->se.exec_start = rq_clock_task(rq);
@@ -3300,6 +3319,11 @@ static void set_next_task_scx(struct rq
}
}
+static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
+{
+ __set_next_task_scx(rq, p, 0);
+}
+
static enum scx_cpu_preempt_reason
preempt_reason_from_class(const struct sched_class *class)
{
@@ -5012,7 +5036,8 @@ static void scx_disable_workfn(struct kt
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
- unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE |
+ DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
@@ -5756,7 +5781,8 @@ static int scx_enable(struct sched_ext_o
percpu_down_write(&scx_fork_rwsem);
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
- unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE |
+ DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
@@ -6808,7 +6834,7 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(
if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
continue;
- dispatch_dequeue(rq, p);
+ dispatch_dequeue(rq, p, 0);
list_add_tail(&p->scx.dsq_list.node, &tasks);
}
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 14/14] sched/ext: Implement p->srq_lock support
2025-09-10 15:44 ` [PATCH 14/14] sched/ext: Implement p->srq_lock support Peter Zijlstra
@ 2025-09-10 16:07 ` Peter Zijlstra
0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 16:07 UTC (permalink / raw)
To: tj
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Wed, Sep 10, 2025 at 05:44:23PM +0200, Peter Zijlstra wrote:
> Have enqueue set p->srq_lock to &dsq->lock and have dequeue clear it,
> when dst is non-local.
>
> When enqueue sees ENQUEUE_LOCKED, it must lock dsq->lock (since
> p->srq_lock will be NULL on entry) but must not unlock on exit when it
> sets p->srq_lock.
>
> When dequeue sees DEQUEUE_LOCKED, it must not lock dsq->lock when
> p->srq_lock is set (instead it must verify they are the same), but it
> must unlock on exit, since it will have cleared p->srq_lock.
>
> For DEQUEUE_SAVE/ENQUEUE_RESTORE it can retain p->srq_lock, since
> the extra unlock+lock cycle is pointless.
>
> Note: set_next_task_scx() relies on LOCKED to avoid self-recursion on
> dsq->lock in the enqueue_task/set_next_task case.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
*groan* and obviously I lost a refresh on this patch...
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1952,13 +1952,16 @@ static void dispatch_enqueue(struct scx_
struct task_struct *p, u64 enq_flags)
{
bool is_local = dsq->id == SCX_DSQ_LOCAL;
+ bool locked = enq_flags & ENQUEUE_LOCKED;
+ bool restore = enq_flags & ENQUEUE_RESTORE;
WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) ||
!RB_EMPTY_NODE(&p->scx.dsq_priq));
if (!is_local) {
- raw_spin_lock(&dsq->lock);
+ if (!locked || !p->srq_lock)
+ raw_spin_lock(&dsq->lock);
if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
scx_error(sch, "attempting to dispatch to a destroyed dsq");
/* fall back to the global dsq */
@@ -2028,6 +2031,10 @@ static void dispatch_enqueue(struct scx_
dsq_mod_nr(dsq, 1);
p->scx.dsq = dsq;
+ if (!is_local) {
+ WARN_ON_ONCE(locked && restore && p->srq_lock && p->srq_lock != &dsq->lock);
+ p->srq_lock = &dsq->lock;
+ }
/*
* scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
@@ -2059,13 +2066,17 @@ static void dispatch_enqueue(struct scx_
rq->curr->sched_class))
resched_curr(rq);
} else {
- raw_spin_unlock(&dsq->lock);
+ if (!locked)
+ raw_spin_unlock(&dsq->lock);
}
}
static void task_unlink_from_dsq(struct task_struct *p,
- struct scx_dispatch_q *dsq)
+ struct scx_dispatch_q *dsq,
+ int deq_flags)
{
+ bool save = deq_flags & DEQUEUE_SAVE;
+
WARN_ON_ONCE(list_empty(&p->scx.dsq_list.node));
if (p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) {
@@ -2076,12 +2087,15 @@ static void task_unlink_from_dsq(struct
list_del_init(&p->scx.dsq_list.node);
dsq_mod_nr(dsq, -1);
+ if (!save)
+ p->srq_lock = NULL;
}
-static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
+static void dispatch_dequeue(struct rq *rq, struct task_struct *p, int deq_flags)
{
struct scx_dispatch_q *dsq = p->scx.dsq;
bool is_local = dsq == &rq->scx.local_dsq;
+ bool locked = deq_flags & DEQUEUE_LOCKED;
if (!dsq) {
/*
@@ -2103,8 +2117,10 @@ static void dispatch_dequeue(struct rq *
return;
}
- if (!is_local)
- raw_spin_lock(&dsq->lock);
+ if (!is_local) {
+ if (!locked)
+ raw_spin_lock(&dsq->lock);
+ }
/*
* Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_* can't
@@ -2112,7 +2128,8 @@ static void dispatch_dequeue(struct rq *
*/
if (p->scx.holding_cpu < 0) {
/* @p must still be on @dsq, dequeue */
- task_unlink_from_dsq(p, dsq);
+ WARN_ON_ONCE(!is_local && !p->srq_lock);
+ task_unlink_from_dsq(p, dsq, deq_flags);
} else {
/*
* We're racing against dispatch_to_local_dsq() which already
@@ -2125,8 +2142,10 @@ static void dispatch_dequeue(struct rq *
}
p->scx.dsq = NULL;
- if (!is_local)
- raw_spin_unlock(&dsq->lock);
+ if (!is_local) {
+ if (!locked || !p->srq_lock)
+ raw_spin_unlock(&dsq->lock);
+ }
}
static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
@@ -2508,7 +2527,7 @@ static bool dequeue_task_scx(struct rq *
rq->scx.nr_running--;
sub_nr_running(rq, 1);
- dispatch_dequeue(rq, p);
+ dispatch_dequeue(rq, p, deq_flags);
return true;
}
@@ -2697,7 +2716,7 @@ static bool unlink_dsq_and_lock_src_rq(s
lockdep_assert_held(&dsq->lock);
WARN_ON_ONCE(p->scx.holding_cpu >= 0);
- task_unlink_from_dsq(p, dsq);
+ task_unlink_from_dsq(p, dsq, 0);
p->scx.holding_cpu = cpu;
raw_spin_unlock(&dsq->lock);
@@ -2769,7 +2788,7 @@ static struct rq *move_task_between_dsqs
if (dst_dsq->id == SCX_DSQ_LOCAL) {
/* @p is going from a non-local DSQ to a local DSQ */
if (src_rq == dst_rq) {
- task_unlink_from_dsq(p, src_dsq);
+ task_unlink_from_dsq(p, src_dsq, 0);
move_local_task_to_local_dsq(p, enq_flags,
src_dsq, dst_rq);
raw_spin_unlock(&src_dsq->lock);
@@ -2783,7 +2802,7 @@ static struct rq *move_task_between_dsqs
* @p is going from a non-local DSQ to a non-local DSQ. As
* $src_dsq is already locked, do an abbreviated dequeue.
*/
- task_unlink_from_dsq(p, src_dsq);
+ task_unlink_from_dsq(p, src_dsq, 0);
p->scx.dsq = NULL;
raw_spin_unlock(&src_dsq->lock);
@@ -2849,7 +2868,7 @@ static bool consume_dispatch_q(struct sc
struct rq *task_rq = task_rq(p);
if (rq == task_rq) {
- task_unlink_from_dsq(p, dsq);
+ task_unlink_from_dsq(p, dsq, 0);
move_local_task_to_local_dsq(p, 0, dsq, rq);
raw_spin_unlock(&dsq->lock);
return true;
@@ -3253,7 +3272,7 @@ static void set_next_task_scx(struct rq
* dispatched. Call ops_dequeue() to notify the BPF scheduler.
*/
ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
- dispatch_dequeue(rq, p);
+ dispatch_dequeue(rq, p, flags);
}
p->se.exec_start = rq_clock_task(rq);
@@ -4999,7 +5018,8 @@ static void scx_disable_workfn(struct kt
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
- unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE |
+ DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
@@ -5743,7 +5763,8 @@ static int scx_enable(struct sched_ext_o
percpu_down_write(&scx_fork_rwsem);
scx_task_iter_start(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
- unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
+ unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE |
+ DEQUEUE_NOCLOCK | DEQUEUE_LOCKED;
const struct sched_class *old_class = p->sched_class;
const struct sched_class *new_class =
__setscheduler_class(p->policy, p->prio);
@@ -6795,7 +6816,7 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(
if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
continue;
- dispatch_dequeue(rq, p);
+ dispatch_dequeue(rq, p, 0);
list_add_tail(&p->scx.dsq_list.node, &tasks);
}
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (13 preceding siblings ...)
2025-09-10 15:44 ` [PATCH 14/14] sched/ext: Implement p->srq_lock support Peter Zijlstra
@ 2025-09-10 17:32 ` Andrea Righi
2025-09-10 18:19 ` Peter Zijlstra
` (2 more replies)
2025-09-18 15:15 ` Christian Loehle
15 siblings, 3 replies; 68+ messages in thread
From: Andrea Righi @ 2025-09-10 17:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
Hi Peter,
thanks for jumping on this. Comment below.
On Wed, Sep 10, 2025 at 05:44:09PM +0200, Peter Zijlstra wrote:
> Hi,
>
> As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
>
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
>
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally it should be possible to fix that RT issue,
> and the dl_server issue without further propagating lock breaks.
>
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
>
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
>
> Patches are also available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
>
>
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net
I've done a quick test with this patch set applied and I was able to
trigger this:
[ 49.746281] ============================================
[ 49.746457] WARNING: possible recursive locking detected
[ 49.746559] 6.17.0-rc4-virtme #85 Not tainted
[ 49.746666] --------------------------------------------
[ 49.746763] stress-ng-race-/5818 is trying to acquire lock:
[ 49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
[ 49.747052]
[ 49.747052] but task is already holding lock:
[ 49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
[ 49.747416]
[ 49.747416] other info that might help us debug this:
[ 49.747557] Possible unsafe locking scenario:
[ 49.747557]
[ 49.747689] CPU0
[ 49.747740] ----
[ 49.747793] lock(&dsq->lock);
[ 49.747867] lock(&dsq->lock);
[ 49.747950]
[ 49.747950] *** DEADLOCK ***
[ 49.747950]
[ 49.748086] May be due to missing lock nesting notation
[ 49.748086]
[ 49.748197] 3 locks held by stress-ng-race-/5818:
[ 49.748335] #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
[ 49.748474] #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
[ 49.748652] #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
Reproducer:
$ cd tools/sched_ext
$ make scx_simple
$ sudo ./build/bin/scx_simple
... and in another shell
$ stress-ng --race-sched 0
I added an explicit BUG_ON() to see where the double locking is happening:
[ 15.160400] Call Trace:
[ 15.160706] dequeue_task_scx+0x14a/0x270
[ 15.160857] move_queued_task+0x7d/0x2d0
[ 15.160952] affine_move_task+0x6ca/0x700
[ 15.161210] __set_cpus_allowed_ptr+0x64/0xa0
[ 15.161348] __sched_setaffinity+0x72/0x100
[ 15.161459] sched_setaffinity+0x261/0x2f0
[ 15.161569] __x64_sys_sched_setaffinity+0x50/0x80
[ 15.161705] do_syscall_64+0xbb/0x370
[ 15.161816] entry_SYSCALL_64_after_hwframe+0x77/0x7f
Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
@ 2025-09-10 18:19 ` Peter Zijlstra
2025-09-10 18:35 ` Peter Zijlstra
2025-09-11 14:00 ` Peter Zijlstra
2 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 18:19 UTC (permalink / raw)
To: Andrea Righi
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> I've done a quick test with this patch set applied and I was able to
> trigger this:
>
> [ 49.746281] ============================================
> [ 49.746457] WARNING: possible recursive locking detected
> [ 49.746559] 6.17.0-rc4-virtme #85 Not tainted
> [ 49.746666] --------------------------------------------
> [ 49.746763] stress-ng-race-/5818 is trying to acquire lock:
> [ 49.746856] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: dispatch_dequeue+0x125/0x1f0
> [ 49.747052]
> [ 49.747052] but task is already holding lock:
> [ 49.747234] ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
> [ 49.747416]
> [ 49.747416] other info that might help us debug this:
> [ 49.747557] Possible unsafe locking scenario:
> [ 49.747557]
> [ 49.747689] CPU0
> [ 49.747740] ----
> [ 49.747793] lock(&dsq->lock);
> [ 49.747867] lock(&dsq->lock);
> [ 49.747950]
> [ 49.747950] *** DEADLOCK ***
> [ 49.747950]
> [ 49.748086] May be due to missing lock nesting notation
> [ 49.748086]
> [ 49.748197] 3 locks held by stress-ng-race-/5818:
> [ 49.748335] #0: ffff890e0f0fce70 (&p->pi_lock){-.-.}-{2:2}, at: task_rq_lock+0x38/0x170
> [ 49.748474] #1: ffff890e3b6bcc98 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested+0x20/0xa0
> [ 49.748652] #2: ffff890e0adacc18 (&dsq->lock){-.-.}-{2:2}, at: task_rq_lock+0x6c/0x170
>
> Reproducer:
>
> $ cd tools/sched_ext
> $ make scx_simple
> $ sudo ./build/bin/scx_simple
> ... and in another shell
> $ stress-ng --race-sched 0
Heh, the selftests thing was bound to not cover everything. I'll have a
poke at it. Thanks!
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
2025-09-10 18:19 ` Peter Zijlstra
@ 2025-09-10 18:35 ` Peter Zijlstra
2025-09-10 19:00 ` Andrea Righi
2025-09-11 9:58 ` Peter Zijlstra
2025-09-11 14:00 ` Peter Zijlstra
2 siblings, 2 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-10 18:35 UTC (permalink / raw)
To: Andrea Righi
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> [ 15.160400] Call Trace:
> [ 15.160706] dequeue_task_scx+0x14a/0x270
> [ 15.160857] move_queued_task+0x7d/0x2d0
> [ 15.160952] affine_move_task+0x6ca/0x700
> [ 15.161210] __set_cpus_allowed_ptr+0x64/0xa0
> [ 15.161348] __sched_setaffinity+0x72/0x100
> [ 15.161459] sched_setaffinity+0x261/0x2f0
> [ 15.161569] __x64_sys_sched_setaffinity+0x50/0x80
> [ 15.161705] do_syscall_64+0xbb/0x370
> [ 15.161816] entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?
Yeah, the affine_move_task->move_queued_task path is messed up. It
relied on raw_spin_lock_irqsave(&p->pi_lock); rq_lock(rq); being
equivalent to task_rq_lock(), which is no longer true.
I fixed a few such sites earlier today but missed this one.
I'll go untangle it, but probably something for tomorrow, I'm bound to
make a mess of it now :-)
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 18:35 ` Peter Zijlstra
@ 2025-09-10 19:00 ` Andrea Righi
2025-09-11 9:58 ` Peter Zijlstra
1 sibling, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2025-09-10 19:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
>
> > [ 15.160400] Call Trace:
> > [ 15.160706] dequeue_task_scx+0x14a/0x270
> > [ 15.160857] move_queued_task+0x7d/0x2d0
> > [ 15.160952] affine_move_task+0x6ca/0x700
> > [ 15.161210] __set_cpus_allowed_ptr+0x64/0xa0
> > [ 15.161348] __sched_setaffinity+0x72/0x100
> > [ 15.161459] sched_setaffinity+0x261/0x2f0
> > [ 15.161569] __x64_sys_sched_setaffinity+0x50/0x80
> > [ 15.161705] do_syscall_64+0xbb/0x370
> > [ 15.161816] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >
> > Are we missing a DEQUEUE_LOCKED in the sched_setaffinity() path?
>
> Yeah, the affine_move_task->move_queued_task path is messed up. It
> relied on raw_spin_lock_irqsave(&p->pi_lock); rq_lock(rq); being
> equivalent to task_rq_lock(), which is no longer true.
>
> I fixed a few such sites earlier today but missed this one.
>
> I'll go untangle it, but probably something for tomorrow, I'm bound to
> make a mess of it now :-)
Sure! I’ll run more tests in the meantime. For now, that's the only issue
I've found. :)
Thanks!
-Andrea
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 18:35 ` Peter Zijlstra
2025-09-10 19:00 ` Andrea Righi
@ 2025-09-11 9:58 ` Peter Zijlstra
2025-09-11 14:51 ` Andrea Righi
1 sibling, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-11 9:58 UTC (permalink / raw)
To: Andrea Righi
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:
> I'll go untangle it, but probably something for tomorrow, I'm bound to
> make a mess of it now :-)
Best I could come up with is something like this. I tried a few other
approaches, but they all turned into a bigger mess.
Let me go try and run this.
---
kernel/sched/core.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2481,11 +2481,11 @@ static inline bool is_cpu_allowed(struct
* Returns (locked) new rq. Old rq's lock is released.
*/
static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p, int new_cpu)
+ struct task_struct *p, int new_cpu, int flags)
{
lockdep_assert_rq_held(rq);
- deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ deactivate_task(rq, p, flags | DEQUEUE_NOCLOCK);
set_task_cpu(p, new_cpu);
rq_unlock(rq, rf);
@@ -2493,7 +2493,7 @@ static struct rq *move_queued_task(struc
rq_lock(rq, rf);
WARN_ON_ONCE(task_cpu(p) != new_cpu);
- activate_task(rq, p, 0);
+ activate_task(rq, p, flags);
wakeup_preempt(rq, p, 0);
return rq;
@@ -2533,7 +2533,7 @@ static struct rq *__migrate_task(struct
if (!is_cpu_allowed(p, dest_cpu))
return rq;
- rq = move_queued_task(rq, rf, p, dest_cpu);
+ rq = move_queued_task(rq, rf, p, dest_cpu, 0);
return rq;
}
@@ -3007,7 +3007,7 @@ static int affine_move_task(struct rq *r
if (!is_migration_disabled(p)) {
if (task_on_rq_queued(p))
- rq = move_queued_task(rq, rf, p, dest_cpu);
+ rq = move_queued_task(rq, rf, p, dest_cpu, DEQUEUE_LOCKED);
if (!pending->stop_pending) {
p->migration_pending = NULL;
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-11 9:58 ` Peter Zijlstra
@ 2025-09-11 14:51 ` Andrea Righi
0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2025-09-11 14:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Thu, Sep 11, 2025 at 11:58:45AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 08:35:55PM +0200, Peter Zijlstra wrote:
>
> > I'll go untangle it, but probably something for tomorrow, I'm bound to
> > make a mess of it now :-)
>
> Best I could come up with is something like this. I tried a few other
> approaches, but they all turned into a bigger mess.
>
> Let me go try and run this.
With this one it's complaining about lockdep_assert_held(p->srq_lock):
[ 19.055730] WARNING: CPU: 2 PID: 368 at kernel/sched/core.c:10840 sched_change_begin+0x2ac/0x3e0
...
[ 19.056468] RIP: 0010:sched_change_begin+0x2ac/0x3e0
...
[ 19.057217] RSP: 0018:ffffa9f7805bbde8 EFLAGS: 00010046
[ 19.057359] RAX: 0000000000000000 RBX: ffff97ae04880000 RCX: 0000000000000001
[ 19.057464] RDX: 0000000000000046 RSI: ffff97ae01715518 RDI: ffff97ae027f0b68
[ 19.057568] RBP: 0000000000000082 R08: 0000000000000001 R09: 0000000000000001
[ 19.057706] R10: 0000000000000001 R11: 0000000000000001 R12: ffff97ae3bdbcc80
[ 19.057833] R13: ffff97ae93c48000 R14: ffff97ae3b717f20 R15: 0000000000000000
[ 19.057973] FS: 00007f18999edb00(0000) GS:ffff97ae93c48000(0000) knlGS:0000000000000000
[ 19.058112] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 19.058223] CR2: 000055e1e6b0246c CR3: 0000000102ce8000 CR4: 0000000000750ef0
[ 19.058460] PKRU: 55555554
[ 19.058561] Call Trace:
[ 19.058604] <TASK>
[ 19.058675] __set_cpus_allowed_ptr_locked+0x17c/0x230
[ 19.058769] __set_cpus_allowed_ptr+0x64/0xa0
[ 19.058853] __sched_setaffinity+0x72/0x100
[ 19.058920] sched_setaffinity+0x261/0x2f0
[ 19.058985] __x64_sys_sched_setaffinity+0x50/0x80
[ 19.059084] do_syscall_64+0xbb/0x370
[ 19.059158] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 19.059236] RIP: 0033:0x7f189a3bd25b
Thanks,
-Andrea
>
> ---
> kernel/sched/core.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2481,11 +2481,11 @@ static inline bool is_cpu_allowed(struct
> * Returns (locked) new rq. Old rq's lock is released.
> */
> static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
> - struct task_struct *p, int new_cpu)
> + struct task_struct *p, int new_cpu, int flags)
> {
> lockdep_assert_rq_held(rq);
>
> - deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + deactivate_task(rq, p, flags | DEQUEUE_NOCLOCK);
> set_task_cpu(p, new_cpu);
> rq_unlock(rq, rf);
>
> @@ -2493,7 +2493,7 @@ static struct rq *move_queued_task(struc
>
> rq_lock(rq, rf);
> WARN_ON_ONCE(task_cpu(p) != new_cpu);
> - activate_task(rq, p, 0);
> + activate_task(rq, p, flags);
> wakeup_preempt(rq, p, 0);
>
> return rq;
> @@ -2533,7 +2533,7 @@ static struct rq *__migrate_task(struct
> if (!is_cpu_allowed(p, dest_cpu))
> return rq;
>
> - rq = move_queued_task(rq, rf, p, dest_cpu);
> + rq = move_queued_task(rq, rf, p, dest_cpu, 0);
>
> return rq;
> }
> @@ -3007,7 +3007,7 @@ static int affine_move_task(struct rq *r
>
> if (!is_migration_disabled(p)) {
> if (task_on_rq_queued(p))
> - rq = move_queued_task(rq, rf, p, dest_cpu);
> + rq = move_queued_task(rq, rf, p, dest_cpu, DEQUEUE_LOCKED);
>
> if (!pending->stop_pending) {
> p->migration_pending = NULL;
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
2025-09-10 18:19 ` Peter Zijlstra
2025-09-10 18:35 ` Peter Zijlstra
@ 2025-09-11 14:00 ` Peter Zijlstra
2025-09-11 14:30 ` Peter Zijlstra
2 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-11 14:00 UTC (permalink / raw)
To: Andrea Righi
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> Reproducer:
>
> $ cd tools/sched_ext
> $ make scx_simple
FWIW, I only have one machine where this works. Most of my machines this
results in an endless stream of build fail; same for the selftest stuff.
No clues given, just endless build fail :-(
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-11 14:00 ` Peter Zijlstra
@ 2025-09-11 14:30 ` Peter Zijlstra
2025-09-11 14:48 ` Andrea Righi
0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-11 14:30 UTC (permalink / raw)
To: Andrea Righi
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Thu, Sep 11, 2025 at 04:00:22PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
>
> > Reproducer:
> >
> > $ cd tools/sched_ext
> > $ make scx_simple
>
> FWIW, I only have one machine where this works. Most of my machines this
> results in an endless stream of build fail; same for the selftest stuff.
>
> No clues given, just endless build fail :-(
Ah, I need to do: make O=/build-path/. The one machine it worked on had
an actual test kernel installed and booted.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-11 14:30 ` Peter Zijlstra
@ 2025-09-11 14:48 ` Andrea Righi
0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2025-09-11 14:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, changwoo, cgroups, sched-ext, liuwenfang,
tglx
On Thu, Sep 11, 2025 at 04:30:00PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2025 at 04:00:22PM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 10, 2025 at 07:32:12PM +0200, Andrea Righi wrote:
> >
> > > Reproducer:
> > >
> > > $ cd tools/sched_ext
> > > $ make scx_simple
> >
> > FWIW, I only have one machine where this works. Most of my machines this
> > results in an endless stream of build fail; same for the selftest stuff.
> >
> > No clues given, just endless build fail :-(
>
> Ah, I need to do: make O=/build-path/. The one machine it worked on had
> an actual test kernel installed and booted.
Maybe you need this?
https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/commit/?h=for-6.18&id=de68c05189cc4508c3ac4e1e44da1ddb16b1bceb
In case you're getting build failures with the likely() macro.
-Andrea
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-10 15:44 [PATCH 00/14] sched: Support shared runqueue locking Peter Zijlstra
` (14 preceding siblings ...)
2025-09-10 17:32 ` [PATCH 00/14] sched: Support shared runqueue locking Andrea Righi
@ 2025-09-18 15:15 ` Christian Loehle
2025-09-25 9:00 ` Peter Zijlstra
15 siblings, 1 reply; 68+ messages in thread
From: Christian Loehle @ 2025-09-18 15:15 UTC (permalink / raw)
To: Peter Zijlstra, tj
Cc: linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On 9/10/25 16:44, Peter Zijlstra wrote:
> Hi,
>
> As mentioned [1], a fair amount of sched ext weirdness (current and proposed)
> is down to the core code not quite working right for shared runqueue stuff.
>
> Instead of endlessly hacking around that, bite the bullet and fix it all up.
>
> With these patches, it should be possible to clean up pick_task_scx() to not
> rely on balance_scx(). Additionally it should be possible to fix that RT issue,
> and the dl_server issue without further propagating lock breaks.
>
> As is, these patches boot and run/pass selftests/sched_ext with lockdep on.
>
> I meant to do more sched_ext cleanups, but since this has all already taken
> longer than I would've liked (real life interrupted :/), I figured I should
> post this as is and let TJ/Andrea poke at it.
>
> Patches are also available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/cleanup
>
>
> [1] https://lkml.kernel.org/r/20250904202858.GN4068168@noisy.programming.kicks-ass.net
>
>
> ---
> include/linux/cleanup.h | 5 +
> include/linux/sched.h | 6 +-
> kernel/cgroup/cpuset.c | 2 +-
> kernel/kthread.c | 15 +-
> kernel/sched/core.c | 370 +++++++++++++++++++++--------------------------
> kernel/sched/deadline.c | 26 ++--
> kernel/sched/ext.c | 104 +++++++------
> kernel/sched/fair.c | 23 ++-
> kernel/sched/idle.c | 14 +-
> kernel/sched/rt.c | 13 +-
> kernel/sched/sched.h | 225 ++++++++++++++++++++--------
> kernel/sched/stats.h | 2 +-
> kernel/sched/stop_task.c | 14 +-
> kernel/sched/syscalls.c | 80 ++++------
> 14 files changed, 495 insertions(+), 404 deletions(-)
>
>
Hi Peter, A couple of issues popped up when testing this [1] (that don't trigger on [2]):
When booting (arm64 orion o6) I get:
[ 1.298020] sched: DL replenish lagged too much
[ 1.298364] ------------[ cut here ]------------
[ 1.298377] WARNING: CPU: 4 PID: 0 at kernel/sched/deadline.c:239 inactive_task_timer+0x3d0/0x474
[ 1.298413] Modules linked in:
[ 1.298436] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G S 6.17.0-rc4-cix-build+ #56 PREEMPT
[ 1.298455] Tainted: [S]=CPU_OUT_OF_SPEC
[ 1.298463] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
[ 1.298473] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1.298486] pc : inactive_task_timer+0x3d0/0x474
[ 1.298505] lr : inactive_task_timer+0x394/0x474
[ 1.298522] sp : ffff800083d4be00
[ 1.298530] x29: ffff800083d4be20 x28: ffff00008362d888 x27: ffff800082ab1f88
[ 1.298561] x26: ffff800082ab4a98 x25: ffff0001fef50c18 x24: 0000000000019999
[ 1.298589] x23: 000000000000cccc x22: ffff0001fef51708 x21: ffff00008362d640
[ 1.298616] x20: ffff0001fef50c00 x19: ffff00008362d7f0 x18: fffffffffff0b580
[ 1.298642] x17: ffff80017c966000 x16: ffff800083d48000 x15: 0000000000000028
[ 1.298669] x14: 0000000000000000 x13: 00000000000c4000 x12: 00000000000000c5
[ 1.298695] x11: 0000000000004bb8 x10: 0000000000004bb8 x9 : 0000000000000000
[ 1.298722] x8 : 0000000000000000 x7 : 0000000000000011 x6 : ffff0001fef51bc0
[ 1.298747] x5 : ffff0001fef50c00 x4 : 00000000000000cc x3 : 0000000000000000
[ 1.298773] x2 : ffff80017c966000 x1 : 0000000000000000 x0 : ffffffffffff3333
[ 1.298800] Call trace:
[ 1.298808] inactive_task_timer+0x3d0/0x474 (P)
[ 1.298830] __hrtimer_run_queues+0x3c4/0x440
[ 1.298852] hrtimer_interrupt+0xe4/0x244
[ 1.298871] arch_timer_handler_phys+0x2c/0x44
[ 1.298893] handle_percpu_devid_irq+0xa8/0x1f0
[ 1.298916] handle_irq_desc+0x40/0x58
[ 1.298933] generic_handle_domain_irq+0x1c/0x28
[ 1.298950] gic_handle_irq+0x4c/0x11c
[ 1.298965] call_on_irq_stack+0x30/0x48
[ 1.298982] do_interrupt_handler+0x80/0x84
[ 1.299001] el1_interrupt+0x34/0x64
[ 1.299022] el1h_64_irq_handler+0x18/0x24
[ 1.299037] el1h_64_irq+0x6c/0x70
[ 1.299052] finish_task_switch.isra.0+0xac/0x2bc (P)
[ 1.299070] __schedule+0x45c/0xffc
[ 1.299088] schedule_idle+0x28/0x48
[ 1.299104] do_idle+0x184/0x27c
[ 1.299121] cpu_startup_entry+0x34/0x3c
[ 1.299137] secondary_start_kernel+0x134/0x154
[ 1.299158] __secondary_switched+0xc0/0xc4
[ 1.299179] irq event stamp: 1634
[ 1.299189] hardirqs last enabled at (1633): [<ffff800081486354>] el1_interrupt+0x54/0x64
[ 1.299210] hardirqs last disabled at (1634): [<ffff800081486324>] el1_interrupt+0x24/0x64
[ 1.299229] softirqs last enabled at (1614): [<ffff8000800bf7b0>] handle_softirqs+0x4a0/0x4b8
[ 1.299248] softirqs last disabled at (1609): [<ffff800080010600>] __do_softirq+0x14/0x20
[ 1.299262] ---[ end trace 0000000000000000 ]---
and when running actual tests (e.g. iterating through all scx schedulers under load):
[ 146.532691] ================================
[ 146.536947] WARNING: inconsistent lock state
[ 146.541204] 6.17.0-rc4-cix-build+ #58 Tainted: G S W
[ 146.547457] --------------------------------
[ 146.551713] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 146.557705] rcu_tasks_trace/79 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 146.563438] ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[ 146.571179] {IN-HARDIRQ-W} state was registered at:
[ 146.576042] lock_acquire+0x1c8/0x338
[ 146.579788] _raw_spin_lock+0x48/0x60
[ 146.583536] dispatch_enqueue+0x130/0x3e8
[ 146.587632] do_enqueue_task+0x2f0/0x464
[ 146.591629] enqueue_task_scx+0x1b0/0x290
[ 146.595712] enqueue_task+0x84/0x18c
[ 146.599360] ttwu_do_activate+0x84/0x25c
[ 146.603361] try_to_wake_up+0x310/0x5f8
[ 146.607271] wake_up_process+0x18/0x24
[ 146.611094] kick_pool+0x9c/0x17c
[ 146.614483] __queue_work+0x544/0x7a8
[ 146.618223] __queue_delayed_work+0x118/0x15c
[ 146.622653] mod_delayed_work_on+0xcc/0xe0
[ 146.626823] kblockd_mod_delayed_work_on+0x20/0x30
[ 146.631696] blk_mq_kick_requeue_list+0x1c/0x28
[ 146.636307] blk_flush_complete_seq+0xd4/0x2a4
[ 146.640824] flush_end_io+0x1e0/0x3f4
[ 146.644559] blk_mq_end_request+0x60/0x154
[ 146.648733] nvme_end_req+0x30/0x78
[ 146.652306] nvme_complete_rq+0x7c/0x218
[ 146.656302] nvme_pci_complete_rq+0x98/0x110
[ 146.660650] nvme_poll_cq+0x1cc/0x3b4
[ 146.664385] nvme_irq+0x34/0x88
[ 146.667600] __handle_irq_event_percpu+0x88/0x304
[ 146.672384] handle_irq_event+0x4c/0xa8
[ 146.676293] handle_fasteoi_irq+0x108/0x20c
[ 146.680555] handle_irq_desc+0x40/0x58
[ 146.684378] generic_handle_domain_irq+0x1c/0x28
[ 146.689068] gic_handle_irq+0x4c/0x11c
[ 146.692891] call_on_irq_stack+0x30/0x48
[ 146.696891] do_interrupt_handler+0x80/0x84
[ 146.701151] el1_interrupt+0x34/0x64
[ 146.704810] el1h_64_irq_handler+0x18/0x24
[ 146.708979] el1h_64_irq+0x6c/0x70
[ 146.712453] cpuidle_enter_state+0x12c/0x53c
[ 146.716796] cpuidle_enter+0x38/0x50
[ 146.720458] do_idle+0x204/0x27c
[ 146.723759] cpu_startup_entry+0x38/0x3c
[ 146.727755] secondary_start_kernel+0x134/0x154
[ 146.732370] __secondary_switched+0xc0/0xc4
[ 146.736638] irq event stamp: 1754
[ 146.739938] hardirqs last enabled at (1753): [<ffff800081497184>] _raw_spin_unlock_irqrestore+0x6c/0x70
[ 146.749405] hardirqs last disabled at (1754): [<ffff8000814965e4>] _raw_spin_lock_irqsave+0x84/0x88
[ 146.758437] softirqs last enabled at (1664): [<ffff800080195598>] rcu_tasks_invoke_cbs+0x100/0x394
[ 146.767476] softirqs last disabled at (1660): [<ffff800080195598>] rcu_tasks_invoke_cbs+0x100/0x394
[ 146.776506]
[ 146.776506] other info that might help us debug this:
[ 146.783019] Possible unsafe locking scenario:
[ 146.783019]
[ 146.788923] CPU0
[ 146.791356] ----
[ 146.793788] lock(&dsq->lock);
[ 146.796915] <Interrupt>
[ 146.799521] lock(&dsq->lock);
[ 146.802821]
[ 146.802821] *** DEADLOCK ***
[ 146.802821]
[ 146.808725] 3 locks held by rcu_tasks_trace/79:
[ 146.813242] #0: ffff800082e6e650 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x328/0x570
[ 146.823403] #1: ffff800082adc1f0 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x10/0x1c
[ 146.832014] #2: ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[ 146.840178]
[ 146.813242] #0: ffff800082e6e650 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x328/0x570
[ 146.823403] #1: ffff800082adc1f0 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x10/0x1c
[ 146.832014] #2: ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
[ 146.840178]
[ 146.840178] stack backtrace:
[ 146.844521] CPU: 10 UID: 0 PID: 79 Comm: rcu_tasks_trace Tainted: G S W 6.17.0-rc4-cix-build+ #58 PREEMPT
[ 146.855463] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[ 146.860240] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
[ 146.872136] Sched_ext: simple (enabled+all), task: runnable_at=-4ms
[ 146.872138] Call trace:
[ 146.880822] show_stack+0x18/0x24 (C)
[ 146.884471] dump_stack_lvl+0x90/0xd0
[ 146.888131] dump_stack+0x18/0x24
[ 146.891432] print_usage_bug.part.0+0x29c/0x364
[ 146.895950] mark_lock+0x778/0x978
[ 146.899338] mark_held_locks+0x58/0x90
[ 146.903074] lockdep_hardirqs_on_prepare+0x100/0x210
[ 146.908025] trace_hardirqs_on+0x5c/0x1cc
[ 146.912025] _raw_spin_unlock_irqrestore+0x6c/0x70
[ 146.916803] task_call_func+0x110/0x164
[ 146.920625] trc_wait_for_one_reader.part.0+0x5c/0x3b8
[ 146.925750] check_all_holdout_tasks_trace+0x124/0x480
[ 146.930874] rcu_tasks_wait_gp+0x1f0/0x3b4
[ 146.934957] rcu_tasks_one_gp+0x4a4/0x570
[ 146.938953] rcu_tasks_kthread+0xd4/0xe0
[ 146.942862] kthread+0x148/0x208
[ 146.946079] ret_from_fork+0x10/0x20
(This actually locks up the system without producing any further output, FWIW.)
I'll keep testing and start debugging now, but if I can help you with anything immediately, please
do shout.
[1]
This refers to sched/cleanup at the time of writing:
e127838bf8f9 sched: Cleanup NOCLOCK
ce024feefe1c sched/ext: Implement p->srq_lock support
6ef342071dd7 sched: Add {DE,EN}QUEUE_LOCKED
ed738ce6f9fb sched: Add shared runqueue locking to __task_rq_lock()
94f197f28834 sched: Add flags to {put_prev,set_next}_task() methods
254d43c94105 sched: Add locking comments to sched_class methods
f8864b505a17 sched: Make __do_set_cpus_allowed() use the sched_change pattern
d0e9cfb835d3 sched: Rename do_set_cpus_allowed()
cfcabf45249d sched: Fix do_set_cpus_allowed() locking
f7b9b39041fb sched: Fix migrate_disable_switch() locking
91128b33456a sched: Move sched_class::prio_changed() into the change pattern
c59dc6ce071b sched: Cleanup sched_delayed handling for class switches
13ea43940095 sched: Fold sched_class::switch{ing,ed}_{to,from}() into the change pattern
f0b336327a1b sched: Re-arrange the {EN,DE}QUEUE flags
b55442cb4ec1 sched: Employ sched_change guards
[2]
5b726e9bf954 sched/fair: Get rid of throttled_lb_pair()
^ permalink raw reply [flat|nested] 68+ messages in thread* Re: [PATCH 00/14] sched: Support shared runqueue locking
2025-09-18 15:15 ` Christian Loehle
@ 2025-09-25 9:00 ` Peter Zijlstra
0 siblings, 0 replies; 68+ messages in thread
From: Peter Zijlstra @ 2025-09-25 9:00 UTC (permalink / raw)
To: Christian Loehle
Cc: tj, linux-kernel, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, longman,
hannes, mkoutny, void, arighi, changwoo, cgroups, sched-ext,
liuwenfang, tglx
On Thu, Sep 18, 2025 at 04:15:45PM +0100, Christian Loehle wrote:
> Hi Peter, A couple of issues popped up when testing this [1] (that don't trigger on [2]):
>
> When booting (arm64 orion o6) I get:
>
> [ 1.298020] sched: DL replenish lagged too much
> [ 1.298364] ------------[ cut here ]------------
> [ 1.298377] WARNING: CPU: 4 PID: 0 at kernel/sched/deadline.c:239 inactive_task_timer+0x3d0/0x474
Ah, right. The robot reported this one too. I'll look into it. It could be
that one of the patches in sched/urgent cures it, but who knows. I'll have a
poke.
> and when running actual tests (e.g. iterating through all scx schedulers under load):
>
> [ 146.532691] ================================
> [ 146.536947] WARNING: inconsistent lock state
> [ 146.541204] 6.17.0-rc4-cix-build+ #58 Tainted: G S W
> [ 146.547457] --------------------------------
> [ 146.551713] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> [ 146.557705] rcu_tasks_trace/79 [HC0[0]:SC0[0]:HE0:SE1] takes:
> [ 146.563438] ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
> [ 146.840178]
> [ 146.813242] #0: ffff800082e6e650 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{4:4}, at: rcu_tasks_one_gp+0x328/0x570
> [ 146.823403] #1: ffff800082adc1f0 (cpu_hotplug_lock){++++}-{0:0}, at: cpus_read_lock+0x10/0x1c
> [ 146.832014] #2: ffff000089c90e58 (&dsq->lock){?.-.}-{2:2}, at: __task_rq_lock+0x88/0x194
>
> [ 146.840178] stack backtrace:
> [ 146.844521] CPU: 10 UID: 0 PID: 79 Comm: rcu_tasks_trace Tainted: G S W 6.17.0-rc4-cix-build+ #58 PREEMPT
> [ 146.855463] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
> [ 146.860240] Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 0.3.0-1 2025-04-28T03:35:34+00:00
> [ 146.872136] Sched_ext: simple (enabled+all), task: runnable_at=-4ms
> [ 146.872138] Call trace:
> [ 146.880822] show_stack+0x18/0x24 (C)
> [ 146.884471] dump_stack_lvl+0x90/0xd0
> [ 146.888131] dump_stack+0x18/0x24
> [ 146.891432] print_usage_bug.part.0+0x29c/0x364
> [ 146.895950] mark_lock+0x778/0x978
> [ 146.899338] mark_held_locks+0x58/0x90
> [ 146.903074] lockdep_hardirqs_on_prepare+0x100/0x210
> [ 146.908025] trace_hardirqs_on+0x5c/0x1cc
> [ 146.912025] _raw_spin_unlock_irqrestore+0x6c/0x70
> [ 146.916803] task_call_func+0x110/0x164
Ooh, yeah, that's buggered. Let me go fix!
Thanks for testing!
^ permalink raw reply [flat|nested] 68+ messages in thread