* [PATCH 0/5] sched: Random collection of patches
@ 2025-11-27 15:39 Peter Zijlstra
2025-11-27 15:39 ` [PATCH 1/5] sched/fair: Fold the sched_avg update Peter Zijlstra
` (4 more replies)
0 siblings, 5 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-27 15:39 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
Hi!
Here is a fairly eclectic mix of patches.
The first is a very old patch I recently re-discovered in my patch cabinet.
The next two are cleanups that came about from the recent newidle patches.
And the final two are from an earlier discussion with TJ about tracking wakeups
across classes. This informs a sched_class that it no longer has access to the
CPU, so it can look at pushing tasks away. The case in point was sched_ext, but
IIRC this very same point once came up for sched_rt as well.
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 1/5] sched/fair: Fold the sched_avg update
2025-11-27 15:39 [PATCH 0/5] sched: Random collection of patches Peter Zijlstra
@ 2025-11-27 15:39 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper tip-bot2 for Peter Zijlstra
2025-11-27 15:39 ` [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle() Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-27 15:39 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
Nine (and a half) instances of the same pattern is just silly; fold the lot.
Notably, the half instance in enqueue_load_avg() is right after setting
cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg).
Since get_pelt_divider() >= PELT_MIN_DIVIDER, this ends up being a no-op
change.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
include/linux/compiler_types.h | 19 +++++++
kernel/sched/fair.c | 108 ++++++++++++-----------------------------
2 files changed, 51 insertions(+), 76 deletions(-)
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -558,6 +558,25 @@ struct ftrace_likely_data {
__scalar_type_to_expr_cases(long long), \
default: (x)))
+/*
+ * __signed_scalar_typeof(x) - Declare a signed scalar type, leaving
+ * non-scalar types unchanged.
+ */
+
+#define __scalar_type_to_signed_cases(type) \
+ unsigned type: (signed type)0, \
+ signed type: (signed type)0
+
+#define __signed_scalar_typeof(x) typeof( \
+ _Generic((x), \
+ char: (signed char)0, \
+ __scalar_type_to_signed_cases(char), \
+ __scalar_type_to_signed_cases(short), \
+ __scalar_type_to_signed_cases(int), \
+ __scalar_type_to_signed_cases(long), \
+ __scalar_type_to_signed_cases(long long), \
+ default: (x)))
+
/* Is this type a native word size -- useful for atomic operations */
#define __native_word(t) \
(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3693,7 +3693,7 @@ account_entity_dequeue(struct cfs_rq *cf
*/
#define add_positive(_ptr, _val) do { \
typeof(_ptr) ptr = (_ptr); \
- typeof(_val) val = (_val); \
+ __signed_scalar_typeof(*ptr) val = (_val); \
typeof(*ptr) res, var = READ_ONCE(*ptr); \
\
res = var + val; \
@@ -3705,23 +3705,6 @@ account_entity_dequeue(struct cfs_rq *cf
} while (0)
/*
- * Unsigned subtract and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define sub_positive(_ptr, _val) do { \
- typeof(_ptr) ptr = (_ptr); \
- typeof(*ptr) val = (_val); \
- typeof(*ptr) res, var = READ_ONCE(*ptr); \
- res = var - val; \
- if (res > var) \
- res = 0; \
- WRITE_ONCE(*ptr, res); \
-} while (0)
-
-/*
* Remove and clamp on negative, from a local variable.
*
* A variant of sub_positive(), which does not use explicit load-store
@@ -3732,21 +3715,37 @@ account_entity_dequeue(struct cfs_rq *cf
*ptr -= min_t(typeof(*ptr), *ptr, _val); \
} while (0)
+
+/*
+ * Because of rounding, se->util_sum might end up being +1 more than
+ * cfs->util_sum. Although this is not a problem by itself, detaching
+ * a lot of tasks with the rounding problem between 2 updates of
+ * util_avg (~1ms) can make cfs->util_sum become zero while
+ * cfs->util_avg is not.
+ *
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * the minimum possible divider.
+ */
+#define __update_sa(sa, name, delta_avg, delta_sum) do { \
+ add_positive(&(sa)->name##_avg, delta_avg); \
+ add_positive(&(sa)->name##_sum, delta_sum); \
+ (sa)->name##_sum = max_t(typeof((sa)->name##_sum), \
+ (sa)->name##_sum, \
+ (sa)->name##_avg * PELT_MIN_DIVIDER); \
+} while (0)
+
static inline void
enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- cfs_rq->avg.load_avg += se->avg.load_avg;
- cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+ __update_sa(&cfs_rq->avg, load, se->avg.load_avg, se->avg.load_sum);
}
static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
- sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
- cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, load, -se->avg.load_avg, -se->avg.load_sum);
}
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
@@ -4239,7 +4238,6 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
*/
divider = get_pelt_divider(&cfs_rq->avg);
-
/* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg;
new_sum = se->avg.util_avg * divider;
@@ -4247,12 +4245,7 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
se->avg.util_sum = new_sum;
/* Update parent cfs_rq utilization */
- add_positive(&cfs_rq->avg.util_avg, delta_avg);
- add_positive(&cfs_rq->avg.util_sum, delta_sum);
-
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
- cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, util, delta_avg, delta_sum);
}
static inline void
@@ -4278,11 +4271,7 @@ update_tg_cfs_runnable(struct cfs_rq *cf
se->avg.runnable_sum = new_sum;
/* Update parent cfs_rq runnable */
- add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
- add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
- cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, runnable, delta_avg, delta_sum);
}
static inline void
@@ -4346,11 +4335,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
se->avg.load_sum = runnable_sum;
se->avg.load_avg = load_avg;
- add_positive(&cfs_rq->avg.load_avg, delta_avg);
- add_positive(&cfs_rq->avg.load_sum, delta_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
- cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, load, delta_avg, delta_sum);
}
static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -4549,33 +4534,13 @@ update_cfs_rq_load_avg(u64 now, struct c
raw_spin_unlock(&cfs_rq->removed.lock);
r = removed_load;
- sub_positive(&sa->load_avg, r);
- sub_positive(&sa->load_sum, r * divider);
- /* See sa->util_sum below */
- sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * PELT_MIN_DIVIDER);
+ __update_sa(sa, load, -r, -r*divider);
r = removed_util;
- sub_positive(&sa->util_avg, r);
- sub_positive(&sa->util_sum, r * divider);
- /*
- * Because of rounding, se->util_sum might ends up being +1 more than
- * cfs->util_sum. Although this is not a problem by itself, detaching
- * a lot of tasks with the rounding problem between 2 updates of
- * util_avg (~1ms) can make cfs->util_sum becoming null whereas
- * cfs_util_avg is not.
- * Check that util_sum is still above its lower bound for the new
- * util_avg. Given that period_contrib might have moved since the last
- * sync, we are only sure that util_sum must be above or equal to
- * util_avg * minimum possible divider
- */
- sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * PELT_MIN_DIVIDER);
+ __update_sa(sa, util, -r, -r*divider);
r = removed_runnable;
- sub_positive(&sa->runnable_avg, r);
- sub_positive(&sa->runnable_sum, r * divider);
- /* See sa->util_sum above */
- sa->runnable_sum = max_t(u32, sa->runnable_sum,
- sa->runnable_avg * PELT_MIN_DIVIDER);
+ __update_sa(sa, runnable, -r, -r*divider);
/*
* removed_runnable is the unweighted version of removed_load so we
@@ -4660,17 +4625,8 @@ static void attach_entity_load_avg(struc
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
dequeue_load_avg(cfs_rq, se);
- sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
- sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
- cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);
-
- sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
- sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
- cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, util, -se->avg.util_avg, -se->avg.util_sum);
+ __update_sa(&cfs_rq->avg, runnable, -se->avg.runnable_avg, -se->avg.runnable_sum);
add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()
2025-11-27 15:39 [PATCH 0/5] sched: Random collection of patches Peter Zijlstra
2025-11-27 15:39 ` [PATCH 1/5] sched/fair: Fold the sched_avg update Peter Zijlstra
@ 2025-11-27 15:39 ` Peter Zijlstra
2025-11-29 18:59 ` Shrikanth Hegde
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-11-27 15:39 ` [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-27 15:39 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9902,15 +9902,11 @@ static unsigned long task_h_load(struct
}
#endif /* !CONFIG_FAIR_GROUP_SCHED */
-static void sched_balance_update_blocked_averages(int cpu)
+static void __sched_balance_update_blocked_averages(struct rq *rq)
{
bool decayed = false, done = true;
- struct rq *rq = cpu_rq(cpu);
- struct rq_flags rf;
- rq_lock_irqsave(rq, &rf);
update_blocked_load_tick(rq);
- update_rq_clock(rq);
decayed |= __update_blocked_others(rq, &done);
decayed |= __update_blocked_fair(rq, &done);
@@ -9918,7 +9914,15 @@ static void sched_balance_update_blocked
update_blocked_load_status(rq, !done);
if (decayed)
cpufreq_update_util(rq, 0);
- rq_unlock_irqrestore(rq, &rf);
+}
+
+static void sched_balance_update_blocked_averages(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ guard(rq_lock_irqsave)(rq);
+ update_rq_clock(rq);
+ __sched_balance_update_blocked_averages(rq);
}
/********** Helpers for sched_balance_find_src_group ************************/
@@ -12865,12 +12869,17 @@ static int sched_balance_newidle(struct
}
rcu_read_unlock();
+ /*
+ * Include sched_balance_update_blocked_averages() in the cost
+ * calculation because it can be quite costly -- this ensures we skip
+ * it when avg_idle gets to be very low.
+ */
+ t0 = sched_clock_cpu(this_cpu);
+ __sched_balance_update_blocked_averages(this_rq);
+
rq_modified_clear(this_rq);
raw_spin_rq_unlock(this_rq);
- t0 = sched_clock_cpu(this_cpu);
- sched_balance_update_blocked_averages(this_cpu);
-
rcu_read_lock();
for_each_domain(this_cpu, sd) {
u64 domain_cost;
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched
2025-11-27 15:39 [PATCH 0/5] sched: Random collection of patches Peter Zijlstra
2025-11-27 15:39 ` [PATCH 1/5] sched/fair: Fold the sched_avg update Peter Zijlstra
2025-11-27 15:39 ` [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle() Peter Zijlstra
@ 2025-11-27 15:39 ` Peter Zijlstra
2025-11-28 10:57 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() tip-bot2 for Peter Zijlstra
2025-11-27 15:39 ` [PATCH 4/5] sched: Add assertions to QUEUE_CLASS Peter Zijlstra
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
4 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-27 15:39 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
By changing rcu_dereference_check_sched_domain() to use
rcu_dereference_sched_check() it also considers preempt_disable() to
be equivalent to rcu_read_lock().
Since rcu fully implies rcu_sched, this causes absolutely no change in
behaviour, but it does allow removing a bunch of otherwise redundant
rcu_read_lock() noise.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 9 +--------
kernel/sched/sched.h | 2 +-
2 files changed, 2 insertions(+), 9 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12853,21 +12853,16 @@ static int sched_balance_newidle(struct
*/
rq_unpin_lock(this_rq, rf);
- rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
- if (!sd) {
- rcu_read_unlock();
+ if (!sd)
goto out;
- }
if (!get_rd_overloaded(this_rq->rd) ||
this_rq->avg_idle < sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
- rcu_read_unlock();
goto out;
}
- rcu_read_unlock();
/*
* Include sched_balance_update_blocked_averages() in the cost
@@ -12880,7 +12875,6 @@ static int sched_balance_newidle(struct
rq_modified_clear(this_rq);
raw_spin_rq_unlock(this_rq);
- rcu_read_lock();
for_each_domain(this_cpu, sd) {
u64 domain_cost;
@@ -12930,7 +12924,6 @@ static int sched_balance_newidle(struct
if (pulled_task || !continue_balancing)
break;
}
- rcu_read_unlock();
raw_spin_rq_lock(this_rq);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2009,7 +2009,7 @@ queue_balance_callback(struct rq *rq,
}
#define rcu_dereference_check_sched_domain(p) \
- rcu_dereference_check((p), lockdep_is_held(&sched_domains_mutex))
+ rcu_dereference_sched_check((p), lockdep_is_held(&sched_domains_mutex))
/*
* The domain tree (rq->sd) is protected by RCU's quiescent state transition.
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 4/5] sched: Add assertions to QUEUE_CLASS
2025-11-27 15:39 [PATCH 0/5] sched: Random collection of patches Peter Zijlstra
` (2 preceding siblings ...)
2025-11-27 15:39 ` [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched Peter Zijlstra
@ 2025-11-27 15:39 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
2025-12-18 10:09 ` [PATCH 4/5] sched: " Marek Szyprowski
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
4 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-27 15:39 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
Add some checks to the sched_change pattern to validate assumptions
around changing classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 13 +++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 14 insertions(+)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10806,6 +10806,7 @@ struct sched_change_ctx *sched_change_be
*ctx = (struct sched_change_ctx){
.p = p,
+ .class = p->sched_class,
.flags = flags,
.queued = task_on_rq_queued(p),
.running = task_current_donor(rq, p),
@@ -10836,6 +10837,11 @@ void sched_change_end(struct sched_chang
lockdep_assert_rq_held(rq);
+ /*
+ * Changing class without *QUEUE_CLASS is bad.
+ */
+ WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
+
if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
p->sched_class->switching_to(rq, p);
@@ -10847,6 +10853,13 @@ void sched_change_end(struct sched_chang
if (ctx->flags & ENQUEUE_CLASS) {
if (p->sched_class->switched_to)
p->sched_class->switched_to(rq, p);
+
+ /*
+ * If this was a degradation in class someone should have set
+ * need_resched by now.
+ */
+ WARN_ON_ONCE(sched_class_above(ctx->class, p->sched_class) &&
+ !test_tsk_need_resched(p));
} else {
p->sched_class->prio_changed(rq, p, ctx->prio);
}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4027,6 +4027,7 @@ extern void balance_callbacks(struct rq
struct sched_change_ctx {
u64 prio;
struct task_struct *p;
+ const struct sched_class *class;
int flags;
bool queued;
bool running;
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 [PATCH 0/5] sched: Random collection of patches Peter Zijlstra
` (3 preceding siblings ...)
2025-11-27 15:39 ` [PATCH 4/5] sched: Add assertions to QUEUE_CLASS Peter Zijlstra
@ 2025-11-27 15:39 ` Peter Zijlstra
2025-11-28 13:26 ` Kuba Piecuch
` (7 more replies)
4 siblings, 8 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-27 15:39 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, peterz, juri.lelli, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task is of a
higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have set_next_task() re-set the value to
the current class.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/core.c | 32 +++++++++++++++++++++++---------
kernel/sched/deadline.c | 14 +++++++++-----
kernel/sched/ext.c | 9 ++++-----
kernel/sched/fair.c | 17 ++++++++++-------
kernel/sched/idle.c | 3 ---
kernel/sched/rt.c | 9 ++++++---
kernel/sched/sched.h | 26 ++------------------------
kernel/sched/stop_task.c | 3 ---
8 files changed, 54 insertions(+), 59 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2090,7 +2090,6 @@ void enqueue_task(struct rq *rq, struct
*/
uclamp_rq_inc(rq, p, flags);
- rq->queue_mask |= p->sched_class->queue_mask;
p->sched_class->enqueue_task(rq, p, flags);
psi_enqueue(p, flags);
@@ -2123,7 +2122,6 @@ inline bool dequeue_task(struct rq *rq,
* and mark the task ->sched_delayed.
*/
uclamp_rq_dec(rq, p);
- rq->queue_mask |= p->sched_class->queue_mask;
return p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struc
{
struct task_struct *donor = rq->donor;
- if (p->sched_class == donor->sched_class)
- donor->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, donor->sched_class))
+ if (p->sched_class == rq->next_class) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
+
+ } else if (sched_class_above(p->sched_class, rq->next_class)) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
resched_curr(rq);
+ rq->next_class = p->sched_class;
+ }
/*
* A queue event has occurred, and we're going to schedule. In
@@ -6797,6 +6799,7 @@ static void __sched notrace __schedule(i
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
+ rq->next_class = next->sched_class;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next)
@@ -8646,6 +8649,8 @@ void __init sched_init(void)
rq->rt.rt_runtime = global_rt_runtime();
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
+ rq->next_class = &idle_sched_class;
+
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
@@ -10771,10 +10776,8 @@ struct sched_change_ctx *sched_change_be
flags |= DEQUEUE_NOCLOCK;
}
- if (flags & DEQUEUE_CLASS) {
- if (p->sched_class->switching_from)
- p->sched_class->switching_from(rq, p);
- }
+ if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from)
+ p->sched_class->switching_from(rq, p);
*ctx = (struct sched_change_ctx){
.p = p,
@@ -10827,6 +10830,17 @@ void sched_change_end(struct sched_chang
p->sched_class->switched_to(rq, p);
/*
+ * If this was a class promotion; let the old class know it
+ * got preempted. Note that none of the switch*_from() methods
+ * know the new class and none of the switch*_to() methods
+ * know the old class.
+ */
+ if (ctx->running && sched_class_above(p->sched_class, ctx->class)) {
+ rq->next_class->wakeup_preempt(rq, p, 0);
+ rq->next_class = p->sched_class;
+ }
+
+ /*
* If this was a degradation in class someone should have set
* need_resched by now.
*/
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2499,9 +2499,16 @@ static int balance_dl(struct rq *rq, str
* Only called when both the current and waking task are -deadline
* tasks.
*/
-static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
- int flags)
+static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
{
+	/*
+	 * Can only get preempted by the stop class, and those should be
+	 * few and short-lived; it doesn't really make sense to push
+	 * anything away for that.
+	 */
+ if (p->sched_class != &dl_sched_class)
+ return;
+
if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
resched_curr(rq);
return;
@@ -3304,9 +3311,6 @@ static int task_is_throttled_dl(struct t
#endif
DEFINE_SCHED_CLASS(dl) = {
-
- .queue_mask = 8,
-
.enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl,
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2338,12 +2338,12 @@ static struct task_struct *pick_task_scx
bool keep_prev, kick_idle = false;
struct task_struct *p;
- rq_modified_clear(rq);
+ rq->next_class = &ext_sched_class;
rq_unpin_lock(rq, rf);
balance_one(rq, prev);
rq_repin_lock(rq, rf);
maybe_queue_balance_callback(rq);
- if (rq_modified_above(rq, &ext_sched_class))
+ if (sched_class_above(rq->next_class, &ext_sched_class))
return RETRY_TASK;
keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
@@ -2967,7 +2967,8 @@ static void switched_from_scx(struct rq
scx_disable_task(p);
}
-static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
+static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
+
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
int scx_check_setscheduler(struct task_struct *p, int policy)
@@ -3216,8 +3217,6 @@ static void scx_cgroup_unlock(void) {}
* their current sched_class. Call them directly from sched core instead.
*/
DEFINE_SCHED_CLASS(ext) = {
- .queue_mask = 1,
-
.enqueue_task = enqueue_task_scx,
.dequeue_task = dequeue_task_scx,
.yield_task = yield_task_scx,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8697,7 +8697,7 @@ preempt_sync(struct rq *rq, int wake_fla
/*
* Preempt the current task with a newly woken task if needed:
*/
-static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
+static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
@@ -8705,6 +8705,12 @@ static void check_preempt_wakeup_fair(st
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ /*
+ * XXX Getting preempted by higher class, try and find idle CPU?
+ */
+ if (p->sched_class != &fair_sched_class)
+ return;
+
if (unlikely(se == pse))
return;
@@ -12872,7 +12878,7 @@ static int sched_balance_newidle(struct
t0 = sched_clock_cpu(this_cpu);
__sched_balance_update_blocked_averages(this_rq);
- rq_modified_clear(this_rq);
+ this_rq->next_class = &fair_sched_class;
raw_spin_rq_unlock(this_rq);
for_each_domain(this_cpu, sd) {
@@ -12939,7 +12945,7 @@ static int sched_balance_newidle(struct
pulled_task = 1;
/* If a higher prio class was modified, restart the pick */
- if (rq_modified_above(this_rq, &fair_sched_class))
+ if (sched_class_above(this_rq->next_class, &fair_sched_class))
pulled_task = -1;
out:
@@ -13837,15 +13843,12 @@ static unsigned int get_rr_interval_fair
* All the scheduling class methods:
*/
DEFINE_SCHED_CLASS(fair) = {
-
- .queue_mask = 2,
-
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
.yield_to_task = yield_to_task_fair,
- .wakeup_preempt = check_preempt_wakeup_fair,
+ .wakeup_preempt = wakeup_preempt_fair,
.pick_task = pick_task_fair,
.pick_next_task = pick_next_task_fair,
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -534,9 +534,6 @@ static void update_curr_idle(struct rq *
* Simple, special scheduling class for the per-CPU idle tasks:
*/
DEFINE_SCHED_CLASS(idle) = {
-
- .queue_mask = 0,
-
/* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq
{
struct task_struct *donor = rq->donor;
+ /*
+ * XXX If we're preempted by DL, queue a push?
+ */
+ if (p->sched_class != &rt_sched_class)
+ return;
+
if (p->prio < donor->prio) {
resched_curr(rq);
return;
@@ -2568,9 +2574,6 @@ static int task_is_throttled_rt(struct t
#endif /* CONFIG_SCHED_CORE */
DEFINE_SCHED_CLASS(rt) = {
-
- .queue_mask = 4,
-
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1119,7 +1119,6 @@ struct rq {
raw_spinlock_t __lock;
/* Per class runqueue modification mask; bits in class order. */
- unsigned int queue_mask;
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -1179,6 +1178,7 @@ struct rq {
struct sched_dl_entity *dl_server;
struct task_struct *idle;
struct task_struct *stop;
+ const struct sched_class *next_class;
unsigned long next_balance;
struct mm_struct *prev_mm;
@@ -2426,15 +2426,6 @@ struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
#endif
- /*
- * idle: 0
- * ext: 1
- * fair: 2
- * rt: 4
- * dl: 8
- * stop: 16
- */
- unsigned int queue_mask;
/*
* move_queued_task/activate_task/enqueue_task: rq->lock
@@ -2593,20 +2584,6 @@ struct sched_class {
#endif
};
-/*
- * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
- */
-static inline void rq_modified_clear(struct rq *rq)
-{
- rq->queue_mask = 0;
-}
-
-static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
-{
- unsigned int mask = class->queue_mask;
- return rq->queue_mask & ~((mask << 1) - 1);
-}
-
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
@@ -3899,6 +3876,7 @@ void move_queued_task_locked(struct rq *
deactivate_task(src_rq, task, 0);
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
+ wakeup_preempt(dst_rq, task, 0);
}
static inline
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -97,9 +97,6 @@ static void update_curr_stop(struct rq *
* Simple, special scheduling class for the per-CPU stop tasks:
*/
DEFINE_SCHED_CLASS(stop) = {
-
- .queue_mask = 16,
-
.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
.yield_task = yield_task_stop,
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched
2025-11-27 15:39 ` [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched Peter Zijlstra
@ 2025-11-28 10:57 ` Peter Zijlstra
2025-11-28 11:04 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() tip-bot2 for Peter Zijlstra
1 sibling, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-28 10:57 UTC (permalink / raw)
To: mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
On Thu, Nov 27, 2025 at 04:39:46PM +0100, Peter Zijlstra wrote:
> By changing rcu_dereference_check_sched_domain() to use
> rcu_dereference_sched_check() it also considers preempt_disable() to
> be equivalent to rcu_read_lock().
>
> Since rcu fully implies rcu_sched this has absolutely no change in
> behaviour, but it does allow removing a bunch of otherwise redundant
> rcu_read_lock() noise.
This goes sideways with NUMABALANCING=y; it needs a little more. I'll
have a poke.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched
2025-11-28 10:57 ` Peter Zijlstra
@ 2025-11-28 11:04 ` Peter Zijlstra
2025-11-28 11:21 ` Paul E. McKenney
0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-28 11:04 UTC (permalink / raw)
To: mingo, vincent.guittot, Paul McKenney
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
On Fri, Nov 28, 2025 at 11:57:23AM +0100, Peter Zijlstra wrote:
> On Thu, Nov 27, 2025 at 04:39:46PM +0100, Peter Zijlstra wrote:
> > By changing rcu_dereference_check_sched_domain() to use
> > rcu_dereference_sched_check() it also considers preempt_disable() to
> > be equivalent to rcu_read_lock().
> >
> > Since rcu fully implies rcu_sched this has absolutely no change in
> > behaviour, but it does allow removing a bunch of otherwise redundant
> > rcu_read_lock() noise.
>
> This goes sideways with NUMABALANCING=y, it needs a little more. I'll
> have a poke.
Bah, so I overlooked that rcu_dereference_sched() checks
rcu_sched_lock_map while rcu_dereference() checks rcu_lock_map.
Paul, with RCU being unified, how much sense does it make that the rcu
validation stuff is still fully separated?
Case at hand, I'm trying to remove a bunch of
rcu_read_lock()/rcu_read_unlock() noise from deep inside the scheduler
where I know IRQs are disabled.
But the rcu checking thing is still living in the separated universe and
giving me pain.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched
2025-11-28 11:04 ` Peter Zijlstra
@ 2025-11-28 11:21 ` Paul E. McKenney
2025-11-28 11:37 ` Peter Zijlstra
0 siblings, 1 reply; 36+ messages in thread
From: Paul E. McKenney @ 2025-11-28 11:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext
On Fri, Nov 28, 2025 at 12:04:16PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 28, 2025 at 11:57:23AM +0100, Peter Zijlstra wrote:
> > On Thu, Nov 27, 2025 at 04:39:46PM +0100, Peter Zijlstra wrote:
> > > By changing rcu_dereference_check_sched_domain() to use
> > > rcu_dereference_sched_check() it also considers preempt_disable() to
> > > be equivalent to rcu_read_lock().
> > >
> > > Since rcu fully implies rcu_sched this has absolutely no change in
> > > behaviour, but it does allow removing a bunch of otherwise redundant
> > > rcu_read_lock() noise.
> >
> > This goes sideways with NUMABALANCING=y, it needs a little more. I'll
> > have a poke.
>
> Bah, so I overlooked that rcu_dereference_sched() checks
> rcu_sched_lock_map while rcu_dereference() checks rcu_lock_map.
>
> Paul, with RCU being unified, how much sense does it make that the rcu
> validation stuff is still fully separated?
>
> Case at hand, I'm trying to remove a bunch of
> rcu_read_lock()/rcu_read_unlock() noise from deep inside the scheduler
> where I know IRQs are disabled.
>
> But the rcu checking thing is still living in the separated universe and
> giving me pain.
Would rcu_dereference_all_check() do what you need? It is happy with an
online CPU that RCU is watching, as long as either preemption is disabled
(which includes IRQs being disabled) or any/all of rcu_read_lock(),
rcu_read_lock_bh(), and rcu_read_lock_sched() are held.
Thanx, Paul
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched
2025-11-28 11:21 ` Paul E. McKenney
@ 2025-11-28 11:37 ` Peter Zijlstra
0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-28 11:37 UTC (permalink / raw)
To: Paul E. McKenney
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext
On Fri, Nov 28, 2025 at 03:21:41AM -0800, Paul E. McKenney wrote:
> Would rcu_dereference_all_check() do what you need?
Yes, clearly I should have read more of that file.
Let me go try that.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
@ 2025-11-28 13:26 ` Kuba Piecuch
2025-11-28 13:36 ` Peter Zijlstra
2025-11-28 22:29 ` Andrea Righi
` (6 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: Kuba Piecuch @ 2025-11-28 13:26 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext
Hi Peter,
On Thu Nov 27, 2025 at 3:39 PM UTC, Peter Zijlstra wrote:
> Additionally have set_next_task() re-set the value to the current class.
I don't see this part reflected in the patch. Is something missing?
Best,
Kuba
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-28 13:26 ` Kuba Piecuch
@ 2025-11-28 13:36 ` Peter Zijlstra
2025-11-28 13:44 ` Peter Zijlstra
0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-28 13:36 UTC (permalink / raw)
To: Kuba Piecuch
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext
On Fri, Nov 28, 2025 at 01:26:30PM +0000, Kuba Piecuch wrote:
> Hi Peter,
>
> On Thu Nov 27, 2025 at 3:39 PM UTC, Peter Zijlstra wrote:
> > Additionally have set_next_task() re-set the value to the current class.
>
> I don't see this part reflected in the patch. Is something missing?
Hmm, that does appear to have gone walkabout :/
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-28 13:36 ` Peter Zijlstra
@ 2025-11-28 13:44 ` Peter Zijlstra
0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-28 13:44 UTC (permalink / raw)
To: Kuba Piecuch
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext
On Fri, Nov 28, 2025 at 02:36:38PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 28, 2025 at 01:26:30PM +0000, Kuba Piecuch wrote:
> > Hi Peter,
> >
> > On Thu Nov 27, 2025 at 3:39 PM UTC, Peter Zijlstra wrote:
> > > Additionally have set_next_task() re-set the value to the current class.
> >
> > I don't see this part reflected in the patch. Is something missing?
>
> Hmm, that does appear to have gone walk-about :/
Aah, here:
@@ -6797,6 +6799,7 @@ static void __sched notrace __schedule(i
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
+ rq->next_class = next->sched_class;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next)
Will fix the changelog. Had to do the above instead of set_next_task()
because of the proxy stuff.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
2025-11-28 13:26 ` Kuba Piecuch
@ 2025-11-28 22:29 ` Andrea Righi
2025-11-29 18:08 ` Shrikanth Hegde
` (5 subsequent siblings)
7 siblings, 0 replies; 36+ messages in thread
From: Andrea Righi @ 2025-11-28 22:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
changwoo, sched-ext
Hi Peter,
On Thu, Nov 27, 2025 at 04:39:48PM +0100, Peter Zijlstra wrote:
...
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1119,7 +1119,6 @@ struct rq {
> raw_spinlock_t __lock;
>
> /* Per class runqueue modification mask; bits in class order. */
We should probably remove this comment as well along with queue_mask.
Thanks,
-Andrea
> - unsigned int queue_mask;
> unsigned int nr_running;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> @@ -1179,6 +1178,7 @@ struct rq {
> struct sched_dl_entity *dl_server;
> struct task_struct *idle;
> struct task_struct *stop;
> + const struct sched_class *next_class;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
2025-11-28 13:26 ` Kuba Piecuch
2025-11-28 22:29 ` Andrea Righi
@ 2025-11-29 18:08 ` Shrikanth Hegde
2025-11-30 11:32 ` Peter Zijlstra
2025-12-02 23:27 ` Tejun Heo
` (4 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: Shrikanth Hegde @ 2025-11-29 18:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext, mingo,
vincent.guittot
On 11/27/25 9:09 PM, Peter Zijlstra wrote:
> Change sched_class::wakeup_preempt() to also get called for
> cross-class wakeups, specifically those where the woken task is of a
> higher class than the previous highest class.
>
> In order to do this, track the current highest class of the runqueue
> in rq::next_class and have wakeup_preempt() track this upwards for
> each new wakeup. Additionally have set_next_task() re-set the value to
> the current class.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 32 +++++++++++++++++++++++---------
> kernel/sched/deadline.c | 14 +++++++++-----
> kernel/sched/ext.c | 9 ++++-----
> kernel/sched/fair.c | 17 ++++++++++-------
> kernel/sched/idle.c | 3 ---
> kernel/sched/rt.c | 9 ++++++---
> kernel/sched/sched.h | 26 ++------------------------
> kernel/sched/stop_task.c | 3 ---
> 8 files changed, 54 insertions(+), 59 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2090,7 +2090,6 @@ void enqueue_task(struct rq *rq, struct
> */
> uclamp_rq_inc(rq, p, flags);
>
> - rq->queue_mask |= p->sched_class->queue_mask;
> p->sched_class->enqueue_task(rq, p, flags);
>
> psi_enqueue(p, flags);
> @@ -2123,7 +2122,6 @@ inline bool dequeue_task(struct rq *rq,
> * and mark the task ->sched_delayed.
> */
> uclamp_rq_dec(rq, p);
> - rq->queue_mask |= p->sched_class->queue_mask;
> return p->sched_class->dequeue_task(rq, p, flags);
> }
>
> @@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struc
> {
> struct task_struct *donor = rq->donor;
>
> - if (p->sched_class == donor->sched_class)
> - donor->sched_class->wakeup_preempt(rq, p, flags);
> - else if (sched_class_above(p->sched_class, donor->sched_class))
> + if (p->sched_class == rq->next_class) {
> + rq->next_class->wakeup_preempt(rq, p, flags);
> +
> + } else if (sched_class_above(p->sched_class, rq->next_class)) {
> + rq->next_class->wakeup_preempt(rq, p, flags);
What's the logic of calling wakeup_preempt here?
Say rq was running CFS, now RT is waking up, but the first thing we do is return if not
fair_sched_class. It is effectively resched_curr, right?
> resched_curr(rq);
> + rq->next_class = p->sched_class;
Since resched will happen and __schedule can set the next_class, is it necessary to set it
even earlier?
> + }
>
> /*
> * A queue event has occurred, and we're going to schedule. In
> @@ -6797,6 +6799,7 @@ static void __sched notrace __schedule(i
> pick_again:
> next = pick_next_task(rq, rq->donor, &rf);
> rq_set_donor(rq, next);
> + rq->next_class = next->sched_class;
> if (unlikely(task_is_blocked(next))) {
> next = find_proxy_task(rq, next, &rf);
> if (!next)
> @@ -8646,6 +8649,8 @@ void __init sched_init(void)
> rq->rt.rt_runtime = global_rt_runtime();
> init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
> #endif
> + rq->next_class = &idle_sched_class;
> +
> rq->sd = NULL;
> rq->rd = NULL;
> rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> @@ -10771,10 +10776,8 @@ struct sched_change_ctx *sched_change_be
> flags |= DEQUEUE_NOCLOCK;
> }
>
> - if (flags & DEQUEUE_CLASS) {
> - if (p->sched_class->switching_from)
> - p->sched_class->switching_from(rq, p);
> - }
> + if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from)
> + p->sched_class->switching_from(rq, p);
>
> *ctx = (struct sched_change_ctx){
> .p = p,
> @@ -10827,6 +10830,17 @@ void sched_change_end(struct sched_chang
> p->sched_class->switched_to(rq, p);
>
> /*
> + * If this was a class promotion; let the old class know it
> + * got preempted. Note that none of the switch*_from() methods
> + * know the new class and none of the switch*_to() methods
> + * know the old class.
> + */
> + if (ctx->running && sched_class_above(p->sched_class, ctx->class)) {
> + rq->next_class->wakeup_preempt(rq, p, 0);
> + rq->next_class = p->sched_class;
> + }
> +
> + /*
> * If this was a degradation in class someone should have set
> * need_resched by now.
> */
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2499,9 +2499,16 @@ static int balance_dl(struct rq *rq, str
> * Only called when both the current and waking task are -deadline
> * tasks.
> */
> -static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
> - int flags)
> +static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
> {
> + /*
> + * Can only get preempted by stop-class, and those should be
> + * few and short lived, doesn't really make sense to push
> + * anything away for that.
> + */
> + if (p->sched_class != &dl_sched_class)
> + return;
> +
> if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
> resched_curr(rq);
> return;
> @@ -3304,9 +3311,6 @@ static int task_is_throttled_dl(struct t
> #endif
>
> DEFINE_SCHED_CLASS(dl) = {
> -
> - .queue_mask = 8,
> -
> .enqueue_task = enqueue_task_dl,
> .dequeue_task = dequeue_task_dl,
> .yield_task = yield_task_dl,
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2338,12 +2338,12 @@ static struct task_struct *pick_task_scx
> bool keep_prev, kick_idle = false;
> struct task_struct *p;
>
> - rq_modified_clear(rq);
> + rq->next_class = &ext_sched_class;
> rq_unpin_lock(rq, rf);
> balance_one(rq, prev);
> rq_repin_lock(rq, rf);
> maybe_queue_balance_callback(rq);
> - if (rq_modified_above(rq, &ext_sched_class))
> + if (sched_class_above(rq->next_class, &ext_sched_class))
> return RETRY_TASK;
>
> keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
> @@ -2967,7 +2967,8 @@ static void switched_from_scx(struct rq
> scx_disable_task(p);
> }
>
> -static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
> +static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
> +
> static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
>
> int scx_check_setscheduler(struct task_struct *p, int policy)
> @@ -3216,8 +3217,6 @@ static void scx_cgroup_unlock(void) {}
> * their current sched_class. Call them directly from sched core instead.
> */
> DEFINE_SCHED_CLASS(ext) = {
> - .queue_mask = 1,
> -
> .enqueue_task = enqueue_task_scx,
> .dequeue_task = dequeue_task_scx,
> .yield_task = yield_task_scx,
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8697,7 +8697,7 @@ preempt_sync(struct rq *rq, int wake_fla
> /*
> * Preempt the current task with a newly woken task if needed:
> */
> -static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> +static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> {
> enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> struct task_struct *donor = rq->donor;
> @@ -8705,6 +8705,12 @@ static void check_preempt_wakeup_fair(st
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
>
> + /*
> + * XXX Getting preempted by higher class, try and find idle CPU?
> + */
> + if (p->sched_class != &fair_sched_class)
> + return;
> +
> if (unlikely(se == pse))
> return;
>
> @@ -12872,7 +12878,7 @@ static int sched_balance_newidle(struct
> t0 = sched_clock_cpu(this_cpu);
> __sched_balance_update_blocked_averages(this_rq);
>
> - rq_modified_clear(this_rq);
> + this_rq->next_class = &fair_sched_class;
> raw_spin_rq_unlock(this_rq);
>
> for_each_domain(this_cpu, sd) {
> @@ -12939,7 +12945,7 @@ static int sched_balance_newidle(struct
> pulled_task = 1;
>
> /* If a higher prio class was modified, restart the pick */
> - if (rq_modified_above(this_rq, &fair_sched_class))
> + if (sched_class_above(this_rq->next_class, &fair_sched_class))
> pulled_task = -1;
>
> out:
> @@ -13837,15 +13843,12 @@ static unsigned int get_rr_interval_fair
> * All the scheduling class methods:
> */
> DEFINE_SCHED_CLASS(fair) = {
> -
> - .queue_mask = 2,
> -
> .enqueue_task = enqueue_task_fair,
> .dequeue_task = dequeue_task_fair,
> .yield_task = yield_task_fair,
> .yield_to_task = yield_to_task_fair,
>
> - .wakeup_preempt = check_preempt_wakeup_fair,
> + .wakeup_preempt = wakeup_preempt_fair,
>
> .pick_task = pick_task_fair,
> .pick_next_task = pick_next_task_fair,
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -534,9 +534,6 @@ static void update_curr_idle(struct rq *
> * Simple, special scheduling class for the per-CPU idle tasks:
> */
> DEFINE_SCHED_CLASS(idle) = {
> -
> - .queue_mask = 0,
> -
> /* no enqueue/yield_task for idle tasks */
>
> /* dequeue is not valid, we print a debug message there: */
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq
> {
> struct task_struct *donor = rq->donor;
>
> + /*
> + * XXX If we're preempted by DL, queue a push?
> + */
> + if (p->sched_class != &rt_sched_class)
> + return;
> +
> if (p->prio < donor->prio) {
> resched_curr(rq);
> return;
> @@ -2568,9 +2574,6 @@ static int task_is_throttled_rt(struct t
> #endif /* CONFIG_SCHED_CORE */
>
> DEFINE_SCHED_CLASS(rt) = {
> -
> - .queue_mask = 4,
> -
> .enqueue_task = enqueue_task_rt,
> .dequeue_task = dequeue_task_rt,
> .yield_task = yield_task_rt,
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1119,7 +1119,6 @@ struct rq {
> raw_spinlock_t __lock;
>
> /* Per class runqueue modification mask; bits in class order. */
> - unsigned int queue_mask;
> unsigned int nr_running;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> @@ -1179,6 +1178,7 @@ struct rq {
> struct sched_dl_entity *dl_server;
> struct task_struct *idle;
> struct task_struct *stop;
> + const struct sched_class *next_class;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
>
> @@ -2426,15 +2426,6 @@ struct sched_class {
> #ifdef CONFIG_UCLAMP_TASK
> int uclamp_enabled;
> #endif
> - /*
> - * idle: 0
> - * ext: 1
> - * fair: 2
> - * rt: 4
> - * dl: 8
> - * stop: 16
> - */
> - unsigned int queue_mask;
>
> /*
> * move_queued_task/activate_task/enqueue_task: rq->lock
> @@ -2593,20 +2584,6 @@ struct sched_class {
> #endif
> };
>
> -/*
> - * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
> - */
> -static inline void rq_modified_clear(struct rq *rq)
> -{
> - rq->queue_mask = 0;
> -}
> -
> -static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
> -{
> - unsigned int mask = class->queue_mask;
> - return rq->queue_mask & ~((mask << 1) - 1);
> -}
> -
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> {
> WARN_ON_ONCE(rq->donor != prev);
> @@ -3899,6 +3876,7 @@ void move_queued_task_locked(struct rq *
> deactivate_task(src_rq, task, 0);
> set_task_cpu(task, dst_rq->cpu);
> activate_task(dst_rq, task, 0);
> + wakeup_preempt(dst_rq, task, 0);
What's the need for wakeup_preempt here?
In all places, move_queued_task_locked is followed by resched_curr,
except in __migrate_swap_task, which does the same wakeup_preempt.
> }
>
> static inline
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -97,9 +97,6 @@ static void update_curr_stop(struct rq *
> * Simple, special scheduling class for the per-CPU stop tasks:
> */
> DEFINE_SCHED_CLASS(stop) = {
> -
> - .queue_mask = 16,
> -
> .enqueue_task = enqueue_task_stop,
> .dequeue_task = dequeue_task_stop,
> .yield_task = yield_task_stop,
>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()
2025-11-27 15:39 ` [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle() Peter Zijlstra
@ 2025-11-29 18:59 ` Shrikanth Hegde
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 36+ messages in thread
From: Shrikanth Hegde @ 2025-11-29 18:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext, mingo,
vincent.guittot
On 11/27/25 9:09 PM, Peter Zijlstra wrote:
> While poking at this code recently I noted we do a pointless
> unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
> we can balance) but then instantly grab the same rq->lock again in
> sched_balance_update_blocked_averages().
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/fair.c | 27 ++++++++++++++++++---------
> 1 file changed, 18 insertions(+), 9 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9902,15 +9902,11 @@ static unsigned long task_h_load(struct
> }
> #endif /* !CONFIG_FAIR_GROUP_SCHED */
>
> -static void sched_balance_update_blocked_averages(int cpu)
> +static void __sched_balance_update_blocked_averages(struct rq *rq)
> {
> bool decayed = false, done = true;
> - struct rq *rq = cpu_rq(cpu);
> - struct rq_flags rf;
>
> - rq_lock_irqsave(rq, &rf);
> update_blocked_load_tick(rq);
> - update_rq_clock(rq);
>
> decayed |= __update_blocked_others(rq, &done);
> decayed |= __update_blocked_fair(rq, &done);
> @@ -9918,7 +9914,15 @@ static void sched_balance_update_blocked
> update_blocked_load_status(rq, !done);
> if (decayed)
> cpufreq_update_util(rq, 0);
> - rq_unlock_irqrestore(rq, &rf);
> +}
> +
> +static void sched_balance_update_blocked_averages(int cpu)
> +{
> + struct rq *rq = cpu_rq(cpu);
> +
> + guard(rq_lock_irqsave)(rq);
> + update_rq_clock(rq);
> + __sched_balance_update_blocked_averages(rq);
> }
>
> /********** Helpers for sched_balance_find_src_group ************************/
> @@ -12865,12 +12869,17 @@ static int sched_balance_newidle(struct
> }
> rcu_read_unlock();
>
> + /*
> + * Include sched_balance_update_blocked_averages() in the cost
> + * calculation because it can be quite costly -- this ensures we skip
> + * it when avg_idle gets to be very low.
> + */
> + t0 = sched_clock_cpu(this_cpu);
> + __sched_balance_update_blocked_averages(this_rq);
> +
I think we do update_rq_clock earlier, as early as __schedule.
No warnings seen.
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-29 18:08 ` Shrikanth Hegde
@ 2025-11-30 11:32 ` Peter Zijlstra
2025-11-30 13:03 ` Shrikanth Hegde
0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-11-30 11:32 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext, mingo,
vincent.guittot
On Sat, Nov 29, 2025 at 11:38:49PM +0530, Shrikanth Hegde wrote:
> > @@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struc
> > {
> > struct task_struct *donor = rq->donor;
> > - if (p->sched_class == donor->sched_class)
> > - donor->sched_class->wakeup_preempt(rq, p, flags);
> > - else if (sched_class_above(p->sched_class, donor->sched_class))
> > + if (p->sched_class == rq->next_class) {
> > + rq->next_class->wakeup_preempt(rq, p, flags);
> > +
> > + } else if (sched_class_above(p->sched_class, rq->next_class)) {
> > + rq->next_class->wakeup_preempt(rq, p, flags);
>
> What's the logic of calling wakeup_preempt here?
>
> Say rq was running CFS, now RT is waking up, but the first thing we do is
> return if not fair_sched_class. It is effectively resched_curr, right?
Yes, as-is this patch seems silly, but that is mostly to preserve
current semantics :-)
The idea is that classes *could* do something else. Notably this was a
request from sched_ext. There are cases where they pull a task from
the global runqueue and stick it on the local runqueue, but then get
preempted by a higher priority class (say RT); they would want to stick
the task back on the global runqueue so that another CPU can select it
again, instead of having that task linger on a CPU that is not
available.
This issue has come up in the past as well but was never addressed.
Anyway, this is just foundational work. It would let a class respond to
losing the runqueue to a higher priority class.
I suppose I should go write a better changelog.
>
> > resched_curr(rq);
> > + rq->next_class = p->sched_class;
>
> Since resched will happen and __schedule can set the next_class, is it necessary to set it
> even earlier?
Yes, because we can have another wakeup before that schedule.
Imagine running a fair class, getting a fifo wakeup and then a dl
wakeup. You want the fair class, then the rt class to get a preemption
notification.
> > @@ -3899,6 +3876,7 @@ void move_queued_task_locked(struct rq *
> > deactivate_task(src_rq, task, 0);
> > set_task_cpu(task, dst_rq->cpu);
> > activate_task(dst_rq, task, 0);
> > + wakeup_preempt(dst_rq, task, 0);
>
> What's the need for wakeup_preempt here?
Everything that places a task on the runqueue should do a 'wakeup'
preemption to make sure the above mentioned class preemption stuff
works.
It doesn't really matter if the task is new due to an actual wakeup or
due to a migration, the task is 'new' to this CPU and stuff might need
to 'move'.
IIRC this was the only such place that missed the check.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-30 11:32 ` Peter Zijlstra
@ 2025-11-30 13:03 ` Shrikanth Hegde
0 siblings, 0 replies; 36+ messages in thread
From: Shrikanth Hegde @ 2025-11-30 13:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext, mingo,
vincent.guittot
On 11/30/25 5:02 PM, Peter Zijlstra wrote:
> On Sat, Nov 29, 2025 at 11:38:49PM +0530, Shrikanth Hegde wrote:
>
>>> @@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struc
>>> {
>>> struct task_struct *donor = rq->donor;
>>> - if (p->sched_class == donor->sched_class)
>>> - donor->sched_class->wakeup_preempt(rq, p, flags);
>>> - else if (sched_class_above(p->sched_class, donor->sched_class))
>>> + if (p->sched_class == rq->next_class) {
>>> + rq->next_class->wakeup_preempt(rq, p, flags);
>>> +
>>> + } else if (sched_class_above(p->sched_class, rq->next_class)) {
>>> + rq->next_class->wakeup_preempt(rq, p, flags);
>>
>> What's the logic of calling wakeup_preempt here?
>>
>> Say rq was running CFS, now RT is waking up, but the first thing we do is
>> return if not fair_sched_class. It is effectively resched_curr, right?
>
> Yes, as-is this patch seems silly, but that is mostly to preserve
> current semantics :-)
>
> The idea is that classes *could* do something else. Notably this was a
> request from sched_ext. There are cases where they pull a task from
> the global runqueue and stick it on the local runqueue, but then get
> preempted by a higher priority class (say RT); they would want to stick
> the task back on the global runqueue so that another CPU can select it
> again, instead of having that task linger on a CPU that is not
> available.
>
OK, this helps me understand.
> This issue has come up in the past as well but was never addressed.
>
> Anyway, this is just foundational work. It would let a class respond to
> losing the runqueue to a higher priority class.
>
> I suppose I should go write a better changelog.
>
>>
>>> resched_curr(rq);
>>> + rq->next_class = p->sched_class;
>>
>> Since resched will happen and __schedule can set the next_class, is it necessary to set it
>> even earlier?
>
> Yes, because we can have another wakeup before that schedule.
>
> Imagine running a fair class, getting a fifo wakeup and then a dl
> wakeup. You want the fair class, then the rt class to get a preemption
> notification.
>
>>> @@ -3899,6 +3876,7 @@ void move_queued_task_locked(struct rq *
>>> deactivate_task(src_rq, task, 0);
>>> set_task_cpu(task, dst_rq->cpu);
>>> activate_task(dst_rq, task, 0);
>>> + wakeup_preempt(dst_rq, task, 0);
>>
>> What's the need for wakeup_preempt here?
>
> Everything that places a task on the runqueue should do a 'wakeup'
> preemption to make sure the above mentioned class preemption stuff
> works.
>
> It doesn't really matter if the task is new due to an actual wakeup or
> due to a migration, the task is 'new' to this CPU and stuff might need
> to 'move'.
>
> IIRC this was the only such place that missed the check.
The point was, we might do resched_curr twice in this case:
once in wakeup_preempt and once by the explicit call following
move_queued_task_locked. Maybe remove the latter one?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
` (2 preceding siblings ...)
2025-11-29 18:08 ` Shrikanth Hegde
@ 2025-12-02 23:27 ` Tejun Heo
2025-12-14 7:46 ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
` (3 subsequent siblings)
7 siblings, 0 replies; 36+ messages in thread
From: Tejun Heo @ 2025-12-02 23:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, void,
arighi, changwoo, sched-ext
Hello,
On Thu, Nov 27, 2025 at 04:39:48PM +0100, Peter Zijlstra wrote:
> @@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struc
> {
> struct task_struct *donor = rq->donor;
>
> - if (p->sched_class == donor->sched_class)
> - donor->sched_class->wakeup_preempt(rq, p, flags);
> - else if (sched_class_above(p->sched_class, donor->sched_class))
> + if (p->sched_class == rq->next_class) {
> + rq->next_class->wakeup_preempt(rq, p, flags);
> +
> + } else if (sched_class_above(p->sched_class, rq->next_class)) {
> + rq->next_class->wakeup_preempt(rq, p, flags);
> resched_curr(rq);
> + rq->next_class = p->sched_class;
> + }
I wonder whether this is a bit subtle. Wouldn't it be clearer to add a
separate method which takes an explicit next_class argument for the second
case?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* [tip: sched/core] sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
` (3 preceding siblings ...)
2025-12-02 23:27 ` Tejun Heo
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
2025-12-15 6:07 ` error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()) Thorsten Leemhuis
` (2 subsequent siblings)
7 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 31ab17f00c810076333c26cb485ec4d778829a76
Gitweb: https://git.kernel.org/tip/31ab17f00c810076333c26cb485ec4d778829a76
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 10 Dec 2025 09:06:50 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
---
kernel/sched/core.c | 32 +++++++++++++++++++++++---------
kernel/sched/deadline.c | 14 +++++++++-----
kernel/sched/ext.c | 5 ++---
kernel/sched/fair.c | 17 ++++++++++-------
kernel/sched/idle.c | 3 ---
kernel/sched/rt.c | 9 ++++++---
kernel/sched/sched.h | 26 ++------------------------
kernel/sched/stop_task.c | 3 ---
8 files changed, 52 insertions(+), 57 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4479f7d..7d0a862 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2090,7 +2090,6 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
*/
uclamp_rq_inc(rq, p, flags);
- rq->queue_mask |= p->sched_class->queue_mask;
p->sched_class->enqueue_task(rq, p, flags);
psi_enqueue(p, flags);
@@ -2123,7 +2122,6 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
* and mark the task ->sched_delayed.
*/
uclamp_rq_dec(rq, p);
- rq->queue_mask |= p->sched_class->queue_mask;
return p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
- if (p->sched_class == donor->sched_class)
- donor->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, donor->sched_class))
+ if (p->sched_class == rq->next_class) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
+
+ } else if (sched_class_above(p->sched_class, rq->next_class)) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
resched_curr(rq);
+ rq->next_class = p->sched_class;
+ }
/*
* A queue event has occurred, and we're going to schedule. In
@@ -6804,6 +6806,7 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
+ rq->next_class = next->sched_class;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next)
@@ -8650,6 +8653,8 @@ void __init sched_init(void)
rq->rt.rt_runtime = global_rt_runtime();
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
+ rq->next_class = &idle_sched_class;
+
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
@@ -10775,10 +10780,8 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
flags |= DEQUEUE_NOCLOCK;
}
- if (flags & DEQUEUE_CLASS) {
- if (p->sched_class->switching_from)
- p->sched_class->switching_from(rq, p);
- }
+ if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from)
+ p->sched_class->switching_from(rq, p);
*ctx = (struct sched_change_ctx){
.p = p,
@@ -10831,6 +10834,17 @@ void sched_change_end(struct sched_change_ctx *ctx)
p->sched_class->switched_to(rq, p);
/*
+ * If this was a class promotion; let the old class know it
+ * got preempted. Note that none of the switch*_from() methods
+ * know the new class and none of the switch*_to() methods
+ * know the old class.
+ */
+ if (ctx->running && sched_class_above(p->sched_class, ctx->class)) {
+ rq->next_class->wakeup_preempt(rq, p, 0);
+ rq->next_class = p->sched_class;
+ }
+
+ /*
* If this was a degradation in class someone should have set
* need_resched by now.
*/
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 319439f..80c9559 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2499,9 +2499,16 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
* Only called when both the current and waking task are -deadline
* tasks.
*/
-static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
- int flags)
+static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
{
+ /*
+ * Can only get preempted by stop-class, and those should be
+ * few and short lived, doesn't really make sense to push
+ * anything away for that.
+ */
+ if (p->sched_class != &dl_sched_class)
+ return;
+
if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
resched_curr(rq);
return;
@@ -3346,9 +3353,6 @@ static int task_is_throttled_dl(struct task_struct *p, int cpu)
#endif
DEFINE_SCHED_CLASS(dl) = {
-
- .queue_mask = 8,
-
.enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49..8015ab6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3075,7 +3075,8 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p)
scx_disable_task(p);
}
-static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
+static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
+
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
int scx_check_setscheduler(struct task_struct *p, int policy)
@@ -3336,8 +3337,6 @@ static void scx_cgroup_unlock(void) {}
* their current sched_class. Call them directly from sched core instead.
*/
DEFINE_SCHED_CLASS(ext) = {
- .queue_mask = 1,
-
.enqueue_task = enqueue_task_scx,
.dequeue_task = dequeue_task_scx,
.yield_task = yield_task_scx,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f79951f..ea276d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8700,7 +8700,7 @@ preempt_sync(struct rq *rq, int wake_flags,
/*
* Preempt the current task with a newly woken task if needed:
*/
-static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
+static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
@@ -8708,6 +8708,12 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ /*
+ * XXX Getting preempted by higher class, try and find idle CPU?
+ */
+ if (p->sched_class != &fair_sched_class)
+ return;
+
if (unlikely(se == pse))
return;
@@ -12875,7 +12881,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
t0 = sched_clock_cpu(this_cpu);
__sched_balance_update_blocked_averages(this_rq);
- rq_modified_clear(this_rq);
+ this_rq->next_class = &fair_sched_class;
raw_spin_rq_unlock(this_rq);
for_each_domain(this_cpu, sd) {
@@ -12942,7 +12948,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
pulled_task = 1;
/* If a higher prio class was modified, restart the pick */
- if (rq_modified_above(this_rq, &fair_sched_class))
+ if (sched_class_above(this_rq->next_class, &fair_sched_class))
pulled_task = -1;
out:
@@ -13846,15 +13852,12 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
* All the scheduling class methods:
*/
DEFINE_SCHED_CLASS(fair) = {
-
- .queue_mask = 2,
-
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
.yield_to_task = yield_to_task_fair,
- .wakeup_preempt = check_preempt_wakeup_fair,
+ .wakeup_preempt = wakeup_preempt_fair,
.pick_task = pick_task_fair,
.pick_next_task = pick_next_task_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe..65eb8f8 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -536,9 +536,6 @@ static void update_curr_idle(struct rq *rq)
* Simple, special scheduling class for the per-CPU idle tasks:
*/
DEFINE_SCHED_CLASS(idle) = {
-
- .queue_mask = 0,
-
/* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe..0a9b2cd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
+ /*
+ * XXX If we're preempted by DL, queue a push?
+ */
+ if (p->sched_class != &rt_sched_class)
+ return;
+
if (p->prio < donor->prio) {
resched_curr(rq);
return;
@@ -2568,9 +2574,6 @@ static int task_is_throttled_rt(struct task_struct *p, int cpu)
#endif /* CONFIG_SCHED_CORE */
DEFINE_SCHED_CLASS(rt) = {
-
- .queue_mask = 4,
-
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a40582d..467ea31 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1121,7 +1121,6 @@ struct rq {
raw_spinlock_t __lock;
/* Per class runqueue modification mask; bits in class order. */
- unsigned int queue_mask;
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -1181,6 +1180,7 @@ struct rq {
struct sched_dl_entity *dl_server;
struct task_struct *idle;
struct task_struct *stop;
+ const struct sched_class *next_class;
unsigned long next_balance;
struct mm_struct *prev_mm;
@@ -2428,15 +2428,6 @@ struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
#endif
- /*
- * idle: 0
- * ext: 1
- * fair: 2
- * rt: 4
- * dl: 8
- * stop: 16
- */
- unsigned int queue_mask;
/*
* move_queued_task/activate_task/enqueue_task: rq->lock
@@ -2595,20 +2586,6 @@ struct sched_class {
#endif
};
-/*
- * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
- */
-static inline void rq_modified_clear(struct rq *rq)
-{
- rq->queue_mask = 0;
-}
-
-static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
-{
- unsigned int mask = class->queue_mask;
- return rq->queue_mask & ~((mask << 1) - 1);
-}
-
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
@@ -3901,6 +3878,7 @@ void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_s
deactivate_task(src_rq, task, 0);
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
+ wakeup_preempt(dst_rq, task, 0);
}
static inline
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4f9192b..f95798b 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -97,9 +97,6 @@ static void update_curr_stop(struct rq *rq)
* Simple, special scheduling class for the per-CPU stop tasks:
*/
DEFINE_SCHED_CLASS(stop) = {
-
- .queue_mask = 16,
-
.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
.yield_task = yield_task_stop,
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [tip: sched/core] sched/core: Add assertions to QUEUE_CLASS
2025-11-27 15:39 ` [PATCH 4/5] sched: Add assertions to QUEUE_CLASS Peter Zijlstra
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
2025-12-18 10:09 ` [PATCH 4/5] sched: " Marek Szyprowski
1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 47efe2ddccb1f285a02bfcf1e079f49bf7a9ccb3
Gitweb: https://git.kernel.org/tip/47efe2ddccb1f285a02bfcf1e079f49bf7a9ccb3
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 30 Oct 2025 12:56:34 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
sched/core: Add assertions to QUEUE_CLASS
Add some checks to the sched_change pattern to validate assumptions
around changing classes.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.771691954@infradead.org
---
kernel/sched/core.c | 13 +++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 14 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be..4479f7d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10782,6 +10782,7 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
*ctx = (struct sched_change_ctx){
.p = p,
+ .class = p->sched_class,
.flags = flags,
.queued = task_on_rq_queued(p),
.running = task_current_donor(rq, p),
@@ -10812,6 +10813,11 @@ void sched_change_end(struct sched_change_ctx *ctx)
lockdep_assert_rq_held(rq);
+ /*
+ * Changing class without *QUEUE_CLASS is bad.
+ */
+ WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
+
if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
p->sched_class->switching_to(rq, p);
@@ -10823,6 +10829,13 @@ void sched_change_end(struct sched_change_ctx *ctx)
if (ctx->flags & ENQUEUE_CLASS) {
if (p->sched_class->switched_to)
p->sched_class->switched_to(rq, p);
+
+ /*
+ * If this was a degradation in class someone should have set
+ * need_resched by now.
+ */
+ WARN_ON_ONCE(sched_class_above(ctx->class, p->sched_class) &&
+ !test_tsk_need_resched(p));
} else {
p->sched_class->prio_changed(rq, p, ctx->prio);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67cff7d..a40582d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3968,6 +3968,7 @@ extern void balance_callbacks(struct rq *rq, struct balance_callback *head);
struct sched_change_ctx {
u64 prio;
struct task_struct *p;
+ const struct sched_class *class;
int flags;
bool queued;
bool running;
* [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock()
2025-11-27 15:39 ` [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched Peter Zijlstra
2025-11-28 10:57 ` Peter Zijlstra
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: a03fee333a2f1e065a739bdbe5edbc5512fab9a4
Gitweb: https://git.kernel.org/tip/a03fee333a2f1e065a739bdbe5edbc5512fab9a4
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Fri, 14 Nov 2025 11:00:55 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
sched/fair: Remove superfluous rcu_read_lock()
With fair switched to rcu_dereference_all() validation, having IRQ or
preemption disabled is sufficient, remove the rcu_read_lock()
clutter.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.647502625@infradead.org
---
kernel/sched/fair.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44a359d..496a30a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12856,21 +12856,16 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);
- rcu_read_lock();
sd = rcu_dereference_sched_domain(this_rq->sd);
- if (!sd) {
- rcu_read_unlock();
+ if (!sd)
goto out;
- }
if (!get_rd_overloaded(this_rq->rd) ||
this_rq->avg_idle < sd->max_newidle_lb_cost) {
update_next_balance(sd, &next_balance);
- rcu_read_unlock();
goto out;
}
- rcu_read_unlock();
/*
* Include sched_balance_update_blocked_averages() in the cost
@@ -12883,7 +12878,6 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
rq_modified_clear(this_rq);
raw_spin_rq_unlock(this_rq);
- rcu_read_lock();
for_each_domain(this_cpu, sd) {
u64 domain_cost;
@@ -12933,7 +12927,6 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
if (pulled_task || !continue_balancing)
break;
}
- rcu_read_unlock();
raw_spin_rq_lock(this_rq);
* [tip: sched/core] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()
2025-11-27 15:39 ` [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle() Peter Zijlstra
2025-11-29 18:59 ` Shrikanth Hegde
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 45e09225085f70b856b7b9f26a18ea767a7e1563
Gitweb: https://git.kernel.org/tip/45e09225085f70b856b7b9f26a18ea767a7e1563
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 12 Nov 2025 16:08:23 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()
While poking at this code recently I noted we do a pointless
unlock+lock cycle in sched_balance_newidle(). We drop the rq->lock (so
we can balance) but then instantly grab the same rq->lock again in
sched_balance_update_blocked_averages().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.532469061@infradead.org
---
kernel/sched/fair.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aa033e4..708ad01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9905,15 +9905,11 @@ static unsigned long task_h_load(struct task_struct *p)
}
#endif /* !CONFIG_FAIR_GROUP_SCHED */
-static void sched_balance_update_blocked_averages(int cpu)
+static void __sched_balance_update_blocked_averages(struct rq *rq)
{
bool decayed = false, done = true;
- struct rq *rq = cpu_rq(cpu);
- struct rq_flags rf;
- rq_lock_irqsave(rq, &rf);
update_blocked_load_tick(rq);
- update_rq_clock(rq);
decayed |= __update_blocked_others(rq, &done);
decayed |= __update_blocked_fair(rq, &done);
@@ -9921,7 +9917,15 @@ static void sched_balance_update_blocked_averages(int cpu)
update_blocked_load_status(rq, !done);
if (decayed)
cpufreq_update_util(rq, 0);
- rq_unlock_irqrestore(rq, &rf);
+}
+
+static void sched_balance_update_blocked_averages(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ guard(rq_lock_irqsave)(rq);
+ update_rq_clock(rq);
+ __sched_balance_update_blocked_averages(rq);
}
/********** Helpers for sched_balance_find_src_group ************************/
@@ -12868,12 +12872,17 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
}
rcu_read_unlock();
+ /*
+ * Include sched_balance_update_blocked_averages() in the cost
+ * calculation because it can be quite costly -- this ensures we skip
+ * it when avg_idle gets to be very low.
+ */
+ t0 = sched_clock_cpu(this_cpu);
+ __sched_balance_update_blocked_averages(this_rq);
+
rq_modified_clear(this_rq);
raw_spin_rq_unlock(this_rq);
- t0 = sched_clock_cpu(this_cpu);
- sched_balance_update_blocked_averages(this_cpu);
-
rcu_read_lock();
for_each_domain(this_cpu, sd) {
u64 domain_cost;
* [tip: sched/core] sched/fair: Fold the sched_avg update
2025-11-27 15:39 ` [PATCH 1/5] sched/fair: Fold the sched_avg update Peter Zijlstra
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ingo Molnar, Linus Torvalds,
Dietmar Eggemann, Juri Lelli, Mel Gorman, Shrikanth Hegde,
Valentin Schneider, Vincent Guittot, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 089d84203ad42bc8fd6dbf41683e162ac6e848cd
Gitweb: https://git.kernel.org/tip/089d84203ad42bc8fd6dbf41683e162ac6e848cd
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 27 Nov 2025 16:39:44 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
sched/fair: Fold the sched_avg update
Nine (and a half) instances of the same pattern is just silly, fold the lot.
Notably, the half instance in enqueue_load_avg() is right after setting
cfs_rq->avg.load_sum to cfs_rq->avg.load_avg * get_pelt_divider(&cfs_rq->avg).
Since get_pelt_divider() >= PELT_MIN_DIVIDER, this ends up being a no-op
change.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
---
kernel/sched/fair.c | 108 ++++++++++++-------------------------------
1 file changed, 32 insertions(+), 76 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c31..aa033e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3693,7 +3693,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
#define add_positive(_ptr, _val) do { \
typeof(_ptr) ptr = (_ptr); \
- typeof(_val) val = (_val); \
+ __signed_scalar_typeof(*ptr) val = (_val); \
typeof(*ptr) res, var = READ_ONCE(*ptr); \
\
res = var + val; \
@@ -3705,23 +3705,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
} while (0)
/*
- * Unsigned subtract and clamp on underflow.
- *
- * Explicitly do a load-store to ensure the intermediate value never hits
- * memory. This allows lockless observations without ever seeing the negative
- * values.
- */
-#define sub_positive(_ptr, _val) do { \
- typeof(_ptr) ptr = (_ptr); \
- typeof(*ptr) val = (_val); \
- typeof(*ptr) res, var = READ_ONCE(*ptr); \
- res = var - val; \
- if (res > var) \
- res = 0; \
- WRITE_ONCE(*ptr, res); \
-} while (0)
-
-/*
* Remove and clamp on negative, from a local variable.
*
* A variant of sub_positive(), which does not use explicit load-store
@@ -3732,21 +3715,37 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
*ptr -= min_t(typeof(*ptr), *ptr, _val); \
} while (0)
+
+/*
+ * Because of rounding, se->util_sum might end up being +1 more than
+ * cfs->util_sum. Although this is not a problem by itself, detaching
+ * a lot of tasks with the rounding problem between 2 updates of
+ * util_avg (~1ms) can make cfs->util_sum become zero while
+ * cfs_util_avg is not.
+ *
+ * Check that util_sum is still above its lower bound for the new
+ * util_avg. Given that period_contrib might have moved since the last
+ * sync, we are only sure that util_sum must be above or equal to
+ * util_avg * minimum possible divider
+ */
+#define __update_sa(sa, name, delta_avg, delta_sum) do { \
+ add_positive(&(sa)->name##_avg, delta_avg); \
+ add_positive(&(sa)->name##_sum, delta_sum); \
+ (sa)->name##_sum = max_t(typeof((sa)->name##_sum), \
+ (sa)->name##_sum, \
+ (sa)->name##_avg * PELT_MIN_DIVIDER); \
+} while (0)
+
static inline void
enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- cfs_rq->avg.load_avg += se->avg.load_avg;
- cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+ __update_sa(&cfs_rq->avg, load, se->avg.load_avg, se->avg.load_sum);
}
static inline void
dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- sub_positive(&cfs_rq->avg.load_avg, se->avg.load_avg);
- sub_positive(&cfs_rq->avg.load_sum, se_weight(se) * se->avg.load_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
- cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, load, -se->avg.load_avg, -se->avg.load_sum);
}
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags);
@@ -4242,7 +4241,6 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
*/
divider = get_pelt_divider(&cfs_rq->avg);
-
/* Set new sched_entity's utilization */
se->avg.util_avg = gcfs_rq->avg.util_avg;
new_sum = se->avg.util_avg * divider;
@@ -4250,12 +4248,7 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
se->avg.util_sum = new_sum;
/* Update parent cfs_rq utilization */
- add_positive(&cfs_rq->avg.util_avg, delta_avg);
- add_positive(&cfs_rq->avg.util_sum, delta_sum);
-
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
- cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, util, delta_avg, delta_sum);
}
static inline void
@@ -4281,11 +4274,7 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
se->avg.runnable_sum = new_sum;
/* Update parent cfs_rq runnable */
- add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
- add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
- cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, runnable, delta_avg, delta_sum);
}
static inline void
@@ -4349,11 +4338,7 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
se->avg.load_sum = runnable_sum;
se->avg.load_avg = load_avg;
- add_positive(&cfs_rq->avg.load_avg, delta_avg);
- add_positive(&cfs_rq->avg.load_sum, delta_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.load_sum = max_t(u32, cfs_rq->avg.load_sum,
- cfs_rq->avg.load_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, load, delta_avg, delta_sum);
}
static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
@@ -4552,33 +4537,13 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
raw_spin_unlock(&cfs_rq->removed.lock);
r = removed_load;
- sub_positive(&sa->load_avg, r);
- sub_positive(&sa->load_sum, r * divider);
- /* See sa->util_sum below */
- sa->load_sum = max_t(u32, sa->load_sum, sa->load_avg * PELT_MIN_DIVIDER);
+ __update_sa(sa, load, -r, -r*divider);
r = removed_util;
- sub_positive(&sa->util_avg, r);
- sub_positive(&sa->util_sum, r * divider);
- /*
- * Because of rounding, se->util_sum might ends up being +1 more than
- * cfs->util_sum. Although this is not a problem by itself, detaching
- * a lot of tasks with the rounding problem between 2 updates of
- * util_avg (~1ms) can make cfs->util_sum becoming null whereas
- * cfs_util_avg is not.
- * Check that util_sum is still above its lower bound for the new
- * util_avg. Given that period_contrib might have moved since the last
- * sync, we are only sure that util_sum must be above or equal to
- * util_avg * minimum possible divider
- */
- sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * PELT_MIN_DIVIDER);
+ __update_sa(sa, util, -r, -r*divider);
r = removed_runnable;
- sub_positive(&sa->runnable_avg, r);
- sub_positive(&sa->runnable_sum, r * divider);
- /* See sa->util_sum above */
- sa->runnable_sum = max_t(u32, sa->runnable_sum,
- sa->runnable_avg * PELT_MIN_DIVIDER);
+ __update_sa(sa, runnable, -r, -r*divider);
/*
* removed_runnable is the unweighted version of removed_load so we
@@ -4663,17 +4628,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
dequeue_load_avg(cfs_rq, se);
- sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg);
- sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.util_sum = max_t(u32, cfs_rq->avg.util_sum,
- cfs_rq->avg.util_avg * PELT_MIN_DIVIDER);
-
- sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg);
- sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum);
- /* See update_cfs_rq_load_avg() */
- cfs_rq->avg.runnable_sum = max_t(u32, cfs_rq->avg.runnable_sum,
- cfs_rq->avg.runnable_avg * PELT_MIN_DIVIDER);
+ __update_sa(&cfs_rq->avg, util, -se->avg.util_avg, -se->avg.util_sum);
+ __update_sa(&cfs_rq->avg, runnable, -se->avg.runnable_avg, -se->avg.runnable_sum);
add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
* [tip: sched/core] <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
2025-11-27 15:39 ` [PATCH 1/5] sched/fair: Fold the sched_avg update Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
@ 2025-12-14 7:46 ` tip-bot2 for Peter Zijlstra
1 sibling, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-14 7:46 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra (Intel), Ingo Molnar, Dietmar Eggemann, Juri Lelli,
Linus Torvalds, Mel Gorman, Shrikanth Hegde, Valentin Schneider,
Vincent Guittot, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 38a68b982dd0b10e3da943f100e034598326eafe
Gitweb: https://git.kernel.org/tip/38a68b982dd0b10e3da943f100e034598326eafe
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 27 Nov 2025 16:39:44 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sun, 14 Dec 2025 08:25:02 +01:00
<linux/compiler_types.h>: Add the __signed_scalar_typeof() helper
Define __signed_scalar_typeof() to declare a signed scalar type, leaving
non-scalar types unchanged.
To be used to clean up the scheduler load-balancing code a bit.
[ mingo: Split off this patch from the scheduler patch. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20251127154725.413564507@infradead.org
---
include/linux/compiler_types.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 1280693..280b4ac 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -586,6 +586,25 @@ struct ftrace_likely_data {
__scalar_type_to_expr_cases(long long), \
default: (x)))
+/*
+ * __signed_scalar_typeof(x) - Declare a signed scalar type, leaving
+ * non-scalar types unchanged.
+ */
+
+#define __scalar_type_to_signed_cases(type) \
+ unsigned type: (signed type)0, \
+ signed type: (signed type)0
+
+#define __signed_scalar_typeof(x) typeof( \
+ _Generic((x), \
+ char: (signed char)0, \
+ __scalar_type_to_signed_cases(char), \
+ __scalar_type_to_signed_cases(short), \
+ __scalar_type_to_signed_cases(int), \
+ __scalar_type_to_signed_cases(long), \
+ __scalar_type_to_signed_cases(long long), \
+ default: (x)))
+
/* Is this type a native word size -- useful for atomic operations */
#define __native_word(t) \
(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
* error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
` (4 preceding siblings ...)
2025-12-14 7:46 ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
@ 2025-12-15 6:07 ` Thorsten Leemhuis
2025-12-15 7:12 ` Ingo Molnar
2025-12-15 7:59 ` [tip: sched/core] sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*() tip-bot2 for Peter Zijlstra
2025-12-17 10:02 ` tip-bot2 for Peter Zijlstra
7 siblings, 1 reply; 36+ messages in thread
From: Thorsten Leemhuis @ 2025-12-15 6:07 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext,
Linux Next Mailing List
On 11/27/25 16:39, Peter Zijlstra wrote:
> Change sched_class::wakeup_preempt() to also get called for
> cross-class wakeups, specifically those where the woken task is of a
> higher class than the previous highest class.
I suspect you might be aware of this already, but this patch afaics
broke compilation of today's -next for me; reverting it fixed things.
"""
In file included from kernel/sched/build_policy.c:62:
kernel/sched/ext.c: In function ‘do_pick_task_scx’:
kernel/sched/ext.c:2455:9: error: implicit declaration of function ‘rq_modified_clear’ [-Wimplicit-function-declaration]
2455 | rq_modified_clear(rq);
| ^~~~~~~~~~~~~~~~~
kernel/sched/ext.c:2470:27: error: implicit declaration of function ‘rq_modified_above’ [-Wimplicit-function-declaration]
2470 | if (!force_scx && rq_modified_above(rq, &ext_sched_class))
| ^~~~~~~~~~~~~~~~~
make[4]: *** [scripts/Makefile.build:287: kernel/sched/build_policy.o] Error 1
make[3]: *** [scripts/Makefile.build:556: kernel/sched] Error 2
make[2]: *** [scripts/Makefile.build:556: kernel] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [/builddir/build/BUILD/kernel-6.19.0-build/kernel-next-20251215/linux-6.19.0-0.0.next.20251215.414.vanilla.fc44.s390x/Makefile:2062: .] Error 2
make: *** [Makefile:256: __sub-make] Error 2
"""
Ciao, Thorsten
> In order to do this, track the current highest class of the runqueue
> in rq::next_class and have wakeup_preempt() track this upwards for
> each new wakeup. Additionally have set_next_task() re-set the value to
> the current class.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
> kernel/sched/core.c | 32 +++++++++++++++++++++++---------
> kernel/sched/deadline.c | 14 +++++++++-----
> kernel/sched/ext.c | 9 ++++-----
> kernel/sched/fair.c | 17 ++++++++++-------
> kernel/sched/idle.c | 3 ---
> kernel/sched/rt.c | 9 ++++++---
> kernel/sched/sched.h | 26 ++------------------------
> kernel/sched/stop_task.c | 3 ---
> 8 files changed, 54 insertions(+), 59 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2090,7 +2090,6 @@ void enqueue_task(struct rq *rq, struct
> */
> uclamp_rq_inc(rq, p, flags);
>
> - rq->queue_mask |= p->sched_class->queue_mask;
> p->sched_class->enqueue_task(rq, p, flags);
>
> psi_enqueue(p, flags);
> @@ -2123,7 +2122,6 @@ inline bool dequeue_task(struct rq *rq,
> * and mark the task ->sched_delayed.
> */
> uclamp_rq_dec(rq, p);
> - rq->queue_mask |= p->sched_class->queue_mask;
> return p->sched_class->dequeue_task(rq, p, flags);
> }
>
> @@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struc
> {
> struct task_struct *donor = rq->donor;
>
> - if (p->sched_class == donor->sched_class)
> - donor->sched_class->wakeup_preempt(rq, p, flags);
> - else if (sched_class_above(p->sched_class, donor->sched_class))
> + if (p->sched_class == rq->next_class) {
> + rq->next_class->wakeup_preempt(rq, p, flags);
> +
> + } else if (sched_class_above(p->sched_class, rq->next_class)) {
> + rq->next_class->wakeup_preempt(rq, p, flags);
> resched_curr(rq);
> + rq->next_class = p->sched_class;
> + }
>
> /*
> * A queue event has occurred, and we're going to schedule. In
> @@ -6797,6 +6799,7 @@ static void __sched notrace __schedule(i
> pick_again:
> next = pick_next_task(rq, rq->donor, &rf);
> rq_set_donor(rq, next);
> + rq->next_class = next->sched_class;
> if (unlikely(task_is_blocked(next))) {
> next = find_proxy_task(rq, next, &rf);
> if (!next)
> @@ -8646,6 +8649,8 @@ void __init sched_init(void)
> rq->rt.rt_runtime = global_rt_runtime();
> init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
> #endif
> + rq->next_class = &idle_sched_class;
> +
> rq->sd = NULL;
> rq->rd = NULL;
> rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> @@ -10771,10 +10776,8 @@ struct sched_change_ctx *sched_change_be
> flags |= DEQUEUE_NOCLOCK;
> }
>
> - if (flags & DEQUEUE_CLASS) {
> - if (p->sched_class->switching_from)
> - p->sched_class->switching_from(rq, p);
> - }
> + if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from)
> + p->sched_class->switching_from(rq, p);
>
> *ctx = (struct sched_change_ctx){
> .p = p,
> @@ -10827,6 +10830,17 @@ void sched_change_end(struct sched_chang
> p->sched_class->switched_to(rq, p);
>
> /*
> + * If this was a class promotion; let the old class know it
> + * got preempted. Note that none of the switch*_from() methods
> + * know the new class and none of the switch*_to() methods
> + * know the old class.
> + */
> + if (ctx->running && sched_class_above(p->sched_class, ctx->class)) {
> + rq->next_class->wakeup_preempt(rq, p, 0);
> + rq->next_class = p->sched_class;
> + }
> +
> + /*
> * If this was a degradation in class someone should have set
> * need_resched by now.
> */
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2499,9 +2499,16 @@ static int balance_dl(struct rq *rq, str
> * Only called when both the current and waking task are -deadline
> * tasks.
> */
> -static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
> - int flags)
> +static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
> {
> + /*
> + * Can only get preempted by stop-class, and those should be
> + * few and short lived, doesn't really make sense to push
> + * anything away for that.
> + */
> + if (p->sched_class != &dl_sched_class)
> + return;
> +
> if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
> resched_curr(rq);
> return;
> @@ -3304,9 +3311,6 @@ static int task_is_throttled_dl(struct t
> #endif
>
> DEFINE_SCHED_CLASS(dl) = {
> -
> - .queue_mask = 8,
> -
> .enqueue_task = enqueue_task_dl,
> .dequeue_task = dequeue_task_dl,
> .yield_task = yield_task_dl,
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2338,12 +2338,12 @@ static struct task_struct *pick_task_scx
> bool keep_prev, kick_idle = false;
> struct task_struct *p;
>
> - rq_modified_clear(rq);
> + rq->next_class = &ext_sched_class;
> rq_unpin_lock(rq, rf);
> balance_one(rq, prev);
> rq_repin_lock(rq, rf);
> maybe_queue_balance_callback(rq);
> - if (rq_modified_above(rq, &ext_sched_class))
> + if (sched_class_above(rq->next_class, &ext_sched_class))
> return RETRY_TASK;
>
> keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
> @@ -2967,7 +2967,8 @@ static void switched_from_scx(struct rq
> scx_disable_task(p);
> }
>
> -static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
> +static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
> +
> static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
>
> int scx_check_setscheduler(struct task_struct *p, int policy)
> @@ -3216,8 +3217,6 @@ static void scx_cgroup_unlock(void) {}
> * their current sched_class. Call them directly from sched core instead.
> */
> DEFINE_SCHED_CLASS(ext) = {
> - .queue_mask = 1,
> -
> .enqueue_task = enqueue_task_scx,
> .dequeue_task = dequeue_task_scx,
> .yield_task = yield_task_scx,
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8697,7 +8697,7 @@ preempt_sync(struct rq *rq, int wake_fla
> /*
> * Preempt the current task with a newly woken task if needed:
> */
> -static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> +static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
> {
> enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
> struct task_struct *donor = rq->donor;
> @@ -8705,6 +8705,12 @@ static void check_preempt_wakeup_fair(st
> struct cfs_rq *cfs_rq = task_cfs_rq(donor);
> int cse_is_idle, pse_is_idle;
>
> + /*
> + * XXX Getting preempted by higher class, try and find idle CPU?
> + */
> + if (p->sched_class != &fair_sched_class)
> + return;
> +
> if (unlikely(se == pse))
> return;
>
> @@ -12872,7 +12878,7 @@ static int sched_balance_newidle(struct
> t0 = sched_clock_cpu(this_cpu);
> __sched_balance_update_blocked_averages(this_rq);
>
> - rq_modified_clear(this_rq);
> + this_rq->next_class = &fair_sched_class;
> raw_spin_rq_unlock(this_rq);
>
> for_each_domain(this_cpu, sd) {
> @@ -12939,7 +12945,7 @@ static int sched_balance_newidle(struct
> pulled_task = 1;
>
> /* If a higher prio class was modified, restart the pick */
> - if (rq_modified_above(this_rq, &fair_sched_class))
> + if (sched_class_above(this_rq->next_class, &fair_sched_class))
> pulled_task = -1;
>
> out:
> @@ -13837,15 +13843,12 @@ static unsigned int get_rr_interval_fair
> * All the scheduling class methods:
> */
> DEFINE_SCHED_CLASS(fair) = {
> -
> - .queue_mask = 2,
> -
> .enqueue_task = enqueue_task_fair,
> .dequeue_task = dequeue_task_fair,
> .yield_task = yield_task_fair,
> .yield_to_task = yield_to_task_fair,
>
> - .wakeup_preempt = check_preempt_wakeup_fair,
> + .wakeup_preempt = wakeup_preempt_fair,
>
> .pick_task = pick_task_fair,
> .pick_next_task = pick_next_task_fair,
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -534,9 +534,6 @@ static void update_curr_idle(struct rq *
> * Simple, special scheduling class for the per-CPU idle tasks:
> */
> DEFINE_SCHED_CLASS(idle) = {
> -
> - .queue_mask = 0,
> -
> /* no enqueue/yield_task for idle tasks */
>
> /* dequeue is not valid, we print a debug message there: */
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq
> {
> struct task_struct *donor = rq->donor;
>
> + /*
> + * XXX If we're preempted by DL, queue a push?
> + */
> + if (p->sched_class != &rt_sched_class)
> + return;
> +
> if (p->prio < donor->prio) {
> resched_curr(rq);
> return;
> @@ -2568,9 +2574,6 @@ static int task_is_throttled_rt(struct t
> #endif /* CONFIG_SCHED_CORE */
>
> DEFINE_SCHED_CLASS(rt) = {
> -
> - .queue_mask = 4,
> -
> .enqueue_task = enqueue_task_rt,
> .dequeue_task = dequeue_task_rt,
> .yield_task = yield_task_rt,
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1119,7 +1119,6 @@ struct rq {
> raw_spinlock_t __lock;
>
> /* Per class runqueue modification mask; bits in class order. */
> - unsigned int queue_mask;
> unsigned int nr_running;
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> @@ -1179,6 +1178,7 @@ struct rq {
> struct sched_dl_entity *dl_server;
> struct task_struct *idle;
> struct task_struct *stop;
> + const struct sched_class *next_class;
> unsigned long next_balance;
> struct mm_struct *prev_mm;
>
> @@ -2426,15 +2426,6 @@ struct sched_class {
> #ifdef CONFIG_UCLAMP_TASK
> int uclamp_enabled;
> #endif
> - /*
> - * idle: 0
> - * ext: 1
> - * fair: 2
> - * rt: 4
> - * dl: 8
> - * stop: 16
> - */
> - unsigned int queue_mask;
>
> /*
> * move_queued_task/activate_task/enqueue_task: rq->lock
> @@ -2593,20 +2584,6 @@ struct sched_class {
> #endif
> };
>
> -/*
> - * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
> - */
> -static inline void rq_modified_clear(struct rq *rq)
> -{
> - rq->queue_mask = 0;
> -}
> -
> -static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
> -{
> - unsigned int mask = class->queue_mask;
> - return rq->queue_mask & ~((mask << 1) - 1);
> -}
> -
> static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
> {
> WARN_ON_ONCE(rq->donor != prev);
> @@ -3899,6 +3876,7 @@ void move_queued_task_locked(struct rq *
> deactivate_task(src_rq, task, 0);
> set_task_cpu(task, dst_rq->cpu);
> activate_task(dst_rq, task, 0);
> + wakeup_preempt(dst_rq, task, 0);
> }
>
> static inline
> --- a/kernel/sched/stop_task.c
> +++ b/kernel/sched/stop_task.c
> @@ -97,9 +97,6 @@ static void update_curr_stop(struct rq *
> * Simple, special scheduling class for the per-CPU stop tasks:
> */
> DEFINE_SCHED_CLASS(stop) = {
> -
> - .queue_mask = 16,
> -
> .enqueue_task = enqueue_task_stop,
> .dequeue_task = dequeue_task_stop,
> .yield_task = yield_task_stop,
>
>
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-12-15 6:07 ` error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()) Thorsten Leemhuis
@ 2025-12-15 7:12 ` Ingo Molnar
2025-12-15 11:51 ` Nathan Chancellor
0 siblings, 1 reply; 36+ messages in thread
From: Ingo Molnar @ 2025-12-15 7:12 UTC (permalink / raw)
To: Thorsten Leemhuis
Cc: Peter Zijlstra, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext, Linux Next Mailing List
* Thorsten Leemhuis <linux@leemhuis.info> wrote:
> On 11/27/25 16:39, Peter Zijlstra wrote:
> > Change sched_class::wakeup_preempt() to also get called for
> > cross-class wakeups, specifically those where the woken task is of a
> > higher class than the previous highest class.
>
> I suspect you might be aware of this already, but this patch afaics
> broke compilation of today's -next for me, as reverting fixed things.
Yeah, sorry about that, I fumbled a conflict resolution - should be
fixed for tomorrow's -next.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 36+ messages in thread
* [tip: sched/core] sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
` (5 preceding siblings ...)
2025-12-15 6:07 ` error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()) Thorsten Leemhuis
@ 2025-12-15 7:59 ` tip-bot2 for Peter Zijlstra
2025-12-17 10:02 ` tip-bot2 for Peter Zijlstra
7 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-15 7:59 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 5d1f0b2f278eb55aebe29210fbc8f352c53497d6
Gitweb: https://git.kernel.org/tip/5d1f0b2f278eb55aebe29210fbc8f352c53497d6
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 10 Dec 2025 09:06:50 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Mon, 15 Dec 2025 07:53:35 +01:00
sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
---
kernel/sched/core.c | 32 +++++++++++++++++++++++---------
kernel/sched/deadline.c | 14 +++++++++-----
kernel/sched/ext.c | 7 +++----
kernel/sched/fair.c | 17 ++++++++++-------
kernel/sched/idle.c | 3 ---
kernel/sched/rt.c | 9 ++++++---
kernel/sched/sched.h | 26 ++------------------------
kernel/sched/stop_task.c | 3 ---
8 files changed, 53 insertions(+), 58 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4479f7d..7d0a862 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2090,7 +2090,6 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
*/
uclamp_rq_inc(rq, p, flags);
- rq->queue_mask |= p->sched_class->queue_mask;
p->sched_class->enqueue_task(rq, p, flags);
psi_enqueue(p, flags);
@@ -2123,7 +2122,6 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
* and mark the task ->sched_delayed.
*/
uclamp_rq_dec(rq, p);
- rq->queue_mask |= p->sched_class->queue_mask;
return p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
- if (p->sched_class == donor->sched_class)
- donor->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, donor->sched_class))
+ if (p->sched_class == rq->next_class) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
+
+ } else if (sched_class_above(p->sched_class, rq->next_class)) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
resched_curr(rq);
+ rq->next_class = p->sched_class;
+ }
/*
* A queue event has occurred, and we're going to schedule. In
@@ -6804,6 +6806,7 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
+ rq->next_class = next->sched_class;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next)
@@ -8650,6 +8653,8 @@ void __init sched_init(void)
rq->rt.rt_runtime = global_rt_runtime();
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
+ rq->next_class = &idle_sched_class;
+
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
@@ -10775,10 +10780,8 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
flags |= DEQUEUE_NOCLOCK;
}
- if (flags & DEQUEUE_CLASS) {
- if (p->sched_class->switching_from)
- p->sched_class->switching_from(rq, p);
- }
+ if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from)
+ p->sched_class->switching_from(rq, p);
*ctx = (struct sched_change_ctx){
.p = p,
@@ -10831,6 +10834,17 @@ void sched_change_end(struct sched_change_ctx *ctx)
p->sched_class->switched_to(rq, p);
/*
+ * If this was a class promotion; let the old class know it
+ * got preempted. Note that none of the switch*_from() methods
+ * know the new class and none of the switch*_to() methods
+ * know the old class.
+ */
+ if (ctx->running && sched_class_above(p->sched_class, ctx->class)) {
+ rq->next_class->wakeup_preempt(rq, p, 0);
+ rq->next_class = p->sched_class;
+ }
+
+ /*
* If this was a degradation in class someone should have set
* need_resched by now.
*/
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 319439f..80c9559 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2499,9 +2499,16 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
* Only called when both the current and waking task are -deadline
* tasks.
*/
-static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
- int flags)
+static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
{
+ /*
+ * Can only get preempted by stop-class, and those should be
+ * few and short lived, doesn't really make sense to push
+ * anything away for that.
+ */
+ if (p->sched_class != &dl_sched_class)
+ return;
+
if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
resched_curr(rq);
return;
@@ -3346,9 +3353,6 @@ static int task_is_throttled_dl(struct task_struct *p, int cpu)
#endif
DEFINE_SCHED_CLASS(dl) = {
-
- .queue_mask = 8,
-
.enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49..3058777 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2431,7 +2431,7 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
/* see kick_cpus_irq_workfn() */
smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1);
- rq_modified_clear(rq);
+ rq->next_class = &fair_sched_class;
rq_unpin_lock(rq, rf);
balance_one(rq, prev);
@@ -3075,7 +3075,8 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p)
scx_disable_task(p);
}
-static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
+static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
+
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
int scx_check_setscheduler(struct task_struct *p, int policy)
@@ -3336,8 +3337,6 @@ static void scx_cgroup_unlock(void) {}
* their current sched_class. Call them directly from sched core instead.
*/
DEFINE_SCHED_CLASS(ext) = {
- .queue_mask = 1,
-
.enqueue_task = enqueue_task_scx,
.dequeue_task = dequeue_task_scx,
.yield_task = yield_task_scx,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d588eb8..76f5e4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8736,7 +8736,7 @@ preempt_sync(struct rq *rq, int wake_flags,
/*
* Preempt the current task with a newly woken task if needed:
*/
-static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
+static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
@@ -8744,6 +8744,12 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ /*
+ * XXX Getting preempted by higher class, try and find idle CPU?
+ */
+ if (p->sched_class != &fair_sched_class)
+ return;
+
if (unlikely(se == pse))
return;
@@ -12911,7 +12917,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
t0 = sched_clock_cpu(this_cpu);
__sched_balance_update_blocked_averages(this_rq);
- rq_modified_clear(this_rq);
+ this_rq->next_class = &fair_sched_class;
raw_spin_rq_unlock(this_rq);
for_each_domain(this_cpu, sd) {
@@ -12978,7 +12984,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
pulled_task = 1;
/* If a higher prio class was modified, restart the pick */
- if (rq_modified_above(this_rq, &fair_sched_class))
+ if (sched_class_above(this_rq->next_class, &fair_sched_class))
pulled_task = -1;
out:
@@ -13882,15 +13888,12 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
* All the scheduling class methods:
*/
DEFINE_SCHED_CLASS(fair) = {
-
- .queue_mask = 2,
-
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
.yield_to_task = yield_to_task_fair,
- .wakeup_preempt = check_preempt_wakeup_fair,
+ .wakeup_preempt = wakeup_preempt_fair,
.pick_task = pick_task_fair,
.pick_next_task = pick_next_task_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe..65eb8f8 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -536,9 +536,6 @@ static void update_curr_idle(struct rq *rq)
* Simple, special scheduling class for the per-CPU idle tasks:
*/
DEFINE_SCHED_CLASS(idle) = {
-
- .queue_mask = 0,
-
/* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe..0a9b2cd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
+ /*
+ * XXX If we're preempted by DL, queue a push?
+ */
+ if (p->sched_class != &rt_sched_class)
+ return;
+
if (p->prio < donor->prio) {
resched_curr(rq);
return;
@@ -2568,9 +2574,6 @@ static int task_is_throttled_rt(struct task_struct *p, int cpu)
#endif /* CONFIG_SCHED_CORE */
DEFINE_SCHED_CLASS(rt) = {
-
- .queue_mask = 4,
-
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab1bfa0..bdb1e74 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1119,7 +1119,6 @@ struct rq {
raw_spinlock_t __lock;
/* Per class runqueue modification mask; bits in class order. */
- unsigned int queue_mask;
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -1179,6 +1178,7 @@ struct rq {
struct sched_dl_entity *dl_server;
struct task_struct *idle;
struct task_struct *stop;
+ const struct sched_class *next_class;
unsigned long next_balance;
struct mm_struct *prev_mm;
@@ -2426,15 +2426,6 @@ struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
#endif
- /*
- * idle: 0
- * ext: 1
- * fair: 2
- * rt: 4
- * dl: 8
- * stop: 16
- */
- unsigned int queue_mask;
/*
* move_queued_task/activate_task/enqueue_task: rq->lock
@@ -2593,20 +2584,6 @@ struct sched_class {
#endif
};
-/*
- * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
- */
-static inline void rq_modified_clear(struct rq *rq)
-{
- rq->queue_mask = 0;
-}
-
-static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
-{
- unsigned int mask = class->queue_mask;
- return rq->queue_mask & ~((mask << 1) - 1);
-}
-
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
@@ -3899,6 +3876,7 @@ void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_s
deactivate_task(src_rq, task, 0);
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
+ wakeup_preempt(dst_rq, task, 0);
}
static inline
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4f9192b..f95798b 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -97,9 +97,6 @@ static void update_curr_stop(struct rq *rq)
* Simple, special scheduling class for the per-CPU stop tasks:
*/
DEFINE_SCHED_CLASS(stop) = {
-
- .queue_mask = 16,
-
.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
.yield_task = yield_task_stop,
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-12-15 7:12 ` Ingo Molnar
@ 2025-12-15 11:51 ` Nathan Chancellor
2025-12-16 7:02 ` Thorsten Leemhuis
0 siblings, 1 reply; 36+ messages in thread
From: Nathan Chancellor @ 2025-12-15 11:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thorsten Leemhuis, Peter Zijlstra, vincent.guittot, linux-kernel,
juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
tj, void, arighi, changwoo, sched-ext, Linux Next Mailing List
On Mon, Dec 15, 2025 at 08:12:13AM +0100, Ingo Molnar wrote:
>
> * Thorsten Leemhuis <linux@leemhuis.info> wrote:
>
> > On 11/27/25 16:39, Peter Zijlstra wrote:
> > > Change sched_class::wakeup_preempt() to also get called for
> > > cross-class wakeups, specifically those where the woken task is of a
> > > higher class than the previous highest class.
> >
> > I suspect you might be aware of this already, but this patch afaics
> > broke compilation of today's -next for me, as reverting fixed things.
>
> Yeah, sorry about that, I fumbled a conflict resolution - should be
> fixed for tomorrow's -next.
It looks like you cleared up the rq_modified_clear() error but
rq_modified_above() is still present in kernel/sched/ext.c.
Cheers,
Nathan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-12-15 11:51 ` Nathan Chancellor
@ 2025-12-16 7:02 ` Thorsten Leemhuis
2025-12-16 18:40 ` Tejun Heo
0 siblings, 1 reply; 36+ messages in thread
From: Thorsten Leemhuis @ 2025-12-16 7:02 UTC (permalink / raw)
To: Nathan Chancellor, Ingo Molnar
Cc: Peter Zijlstra, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext, Linux Next Mailing List
On 12/15/25 12:51, Nathan Chancellor wrote:
> On Mon, Dec 15, 2025 at 08:12:13AM +0100, Ingo Molnar wrote:
>>
>> * Thorsten Leemhuis <linux@leemhuis.info> wrote:
>>
>>> On 11/27/25 16:39, Peter Zijlstra wrote:
>>>> Change sched_class::wakeup_preempt() to also get called for
>>>> cross-class wakeups, specifically those where the woken task is of a
>>>> higher class than the previous highest class.
>>>
>>> I suspect you might be aware of this already, but this patch afaics
>>> broke compilation of today's -next for me, as reverting fixed things.
>>
>> Yeah, sorry about that, I fumbled a conflict resolution - should be
>> fixed for tomorrow's -next.
>
> It looks like you cleared up the rq_modified_clear() error but
> rq_modified_above() is still present in kernel/sched/ext.c.
...which afaics causes this build error in today's next:
In file included from kernel/sched/build_policy.c:62:
kernel/sched/ext.c: In function ‘do_pick_task_scx’:
kernel/sched/ext.c:2470:27: error: implicit declaration of function ‘rq_modified_above’ [-Wimplicit-function-declaration]
2470 | if (!force_scx && rq_modified_above(rq, &ext_sched_class))
| ^~~~~~~~~~~~~~~~~
make[4]: *** [scripts/Makefile.build:287: kernel/sched/build_policy.o] Error 1
make[3]: *** [scripts/Makefile.build:556: kernel/sched] Error 2
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.build:556: kernel] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [/builddir/build/BUILD/kernel-6.19.0-build/kernel-next-20251216/linux-6.19.0-0.0.next.20251216.415.vanilla.fc44.x86_64/Makefile:2062: .] Error 2
make: *** [Makefile:256: __sub-make] Error 2
Ciao, Thorsten
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-12-16 7:02 ` Thorsten Leemhuis
@ 2025-12-16 18:40 ` Tejun Heo
2025-12-16 21:42 ` Peter Zijlstra
0 siblings, 1 reply; 36+ messages in thread
From: Tejun Heo @ 2025-12-16 18:40 UTC (permalink / raw)
To: Thorsten Leemhuis
Cc: Nathan Chancellor, Ingo Molnar, Peter Zijlstra, vincent.guittot,
linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, void, arighi, changwoo, sched-ext,
Linux Next Mailing List
On Tue, Dec 16, 2025 at 08:02:50AM +0100, Thorsten Leemhuis wrote:
>
>
> On 12/15/25 12:51, Nathan Chancellor wrote:
> > On Mon, Dec 15, 2025 at 08:12:13AM +0100, Ingo Molnar wrote:
> >>
> >> * Thorsten Leemhuis <linux@leemhuis.info> wrote:
> >>
> >>> On 11/27/25 16:39, Peter Zijlstra wrote:
> >>>> Change sched_class::wakeup_preempt() to also get called for
> >>>> cross-class wakeups, specifically those where the woken task is of a
> >>>> higher class than the previous highest class.
> >>>
> >>> I suspect you might be aware of this already, but this patch afaics
> >>> broke compilation of today's -next for me, as reverting fixed things.
> >>
> >> Yeah, sorry about that, I fumbled a conflict resolution - should be
> >> fixed for tomorrow's -next.
> >
> > It looks like you cleared up the rq_modified_clear() error but
> > rq_modified_above() is still present in kernel/sched/ext.c.
>
> ...which afaics causes this build error in today's next:
>
> In file included from kernel/sched/build_policy.c:62:
> kernel/sched/ext.c: In function ‘do_pick_task_scx’:
> kernel/sched/ext.c:2470:27: error: implicit declaration of function ‘rq_modified_above’ [-Wimplicit-function-declaration]
> 2470 | if (!force_scx && rq_modified_above(rq, &ext_sched_class))
> | ^~~~~~~~~~~~~~~~~
> make[4]: *** [scripts/Makefile.build:287: kernel/sched/build_policy.o] Error 1
> make[3]: *** [scripts/Makefile.build:556: kernel/sched] Error 2
> make[3]: *** Waiting for unfinished jobs....
> make[2]: *** [scripts/Makefile.build:556: kernel] Error 2
> make[2]: *** Waiting for unfinished jobs....
> make[1]: *** [/builddir/build/BUILD/kernel-6.19.0-build/kernel-next-20251216/linux-6.19.0-0.0.next.20251216.415.vanilla.fc44.x86_64/Makefile:2062: .] Error 2
> make: *** [Makefile:256: __sub-make] Error 2
Ingo, Peter, I can pull tip and resolve this from sched_ext side too but it
would probably be cleaner to resolve from tip side?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-12-16 18:40 ` Tejun Heo
@ 2025-12-16 21:42 ` Peter Zijlstra
2025-12-17 9:58 ` Peter Zijlstra
0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2025-12-16 21:42 UTC (permalink / raw)
To: Tejun Heo
Cc: Thorsten Leemhuis, Nathan Chancellor, Ingo Molnar,
vincent.guittot, linux-kernel, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, void, arighi, changwoo,
sched-ext, Linux Next Mailing List
On Tue, Dec 16, 2025 at 08:40:36AM -1000, Tejun Heo wrote:
> Ingo, Peter, I can pull tip and resolve this from sched_ext side too but it
> would probably be cleaner to resolve from tip side?
Yeah, I'll fix it up tomorrow morning if Ingo hasn't yet. Sorry for the
mess.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*())
2025-12-16 21:42 ` Peter Zijlstra
@ 2025-12-17 9:58 ` Peter Zijlstra
0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-12-17 9:58 UTC (permalink / raw)
To: Tejun Heo
Cc: Thorsten Leemhuis, Nathan Chancellor, Ingo Molnar,
vincent.guittot, linux-kernel, juri.lelli, dietmar.eggemann,
rostedt, bsegall, mgorman, vschneid, void, arighi, changwoo,
sched-ext, Linux Next Mailing List
On Tue, Dec 16, 2025 at 10:42:29PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 16, 2025 at 08:40:36AM -1000, Tejun Heo wrote:
>
> > Ingo, Peter, I can pull tip and resolve this from sched_ext side too but it
> > would probably be cleaner to resolve from tip side?
>
> Yeah, I'll fix it up tomorrow morning if Ingo hasn't yet. Sorry for the
> mess.
Force pushed tip/sched/core -- this issue should now hopefully be laid
to rest. Again, sorry for the mess.
^ permalink raw reply [flat|nested] 36+ messages in thread
* [tip: sched/core] sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
` (6 preceding siblings ...)
2025-12-15 7:59 ` [tip: sched/core] sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*() tip-bot2 for Peter Zijlstra
@ 2025-12-17 10:02 ` tip-bot2 for Peter Zijlstra
7 siblings, 0 replies; 36+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-12-17 10:02 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), Ingo Molnar, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 704069649b5bfb7bf1fe32c0281fe9036806a59a
Gitweb: https://git.kernel.org/tip/704069649b5bfb7bf1fe32c0281fe9036806a59a
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 10 Dec 2025 09:06:50 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 17 Dec 2025 10:53:25 +01:00
sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*()
Change sched_class::wakeup_preempt() to also get called for
cross-class wakeups, specifically those where the woken task
is of a higher class than the previous highest class.
In order to do this, track the current highest class of the runqueue
in rq::next_class and have wakeup_preempt() track this upwards for
each new wakeup. Additionally have schedule() re-set the value on
pick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251127154725.901391274@infradead.org
---
kernel/sched/core.c | 32 +++++++++++++++++++++++---------
kernel/sched/deadline.c | 14 +++++++++-----
kernel/sched/ext.c | 9 ++++-----
kernel/sched/fair.c | 17 ++++++++++-------
kernel/sched/idle.c | 3 ---
kernel/sched/rt.c | 9 ++++++---
kernel/sched/sched.h | 27 ++-------------------------
kernel/sched/stop_task.c | 3 ---
8 files changed, 54 insertions(+), 60 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4479f7d..7d0a862 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2090,7 +2090,6 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
*/
uclamp_rq_inc(rq, p, flags);
- rq->queue_mask |= p->sched_class->queue_mask;
p->sched_class->enqueue_task(rq, p, flags);
psi_enqueue(p, flags);
@@ -2123,7 +2122,6 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
* and mark the task ->sched_delayed.
*/
uclamp_rq_dec(rq, p);
- rq->queue_mask |= p->sched_class->queue_mask;
return p->sched_class->dequeue_task(rq, p, flags);
}
@@ -2174,10 +2172,14 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
- if (p->sched_class == donor->sched_class)
- donor->sched_class->wakeup_preempt(rq, p, flags);
- else if (sched_class_above(p->sched_class, donor->sched_class))
+ if (p->sched_class == rq->next_class) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
+
+ } else if (sched_class_above(p->sched_class, rq->next_class)) {
+ rq->next_class->wakeup_preempt(rq, p, flags);
resched_curr(rq);
+ rq->next_class = p->sched_class;
+ }
/*
* A queue event has occurred, and we're going to schedule. In
@@ -6804,6 +6806,7 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
+ rq->next_class = next->sched_class;
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next)
@@ -8650,6 +8653,8 @@ void __init sched_init(void)
rq->rt.rt_runtime = global_rt_runtime();
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
+ rq->next_class = &idle_sched_class;
+
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
@@ -10775,10 +10780,8 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
flags |= DEQUEUE_NOCLOCK;
}
- if (flags & DEQUEUE_CLASS) {
- if (p->sched_class->switching_from)
- p->sched_class->switching_from(rq, p);
- }
+ if ((flags & DEQUEUE_CLASS) && p->sched_class->switching_from)
+ p->sched_class->switching_from(rq, p);
*ctx = (struct sched_change_ctx){
.p = p,
@@ -10831,6 +10834,17 @@ void sched_change_end(struct sched_change_ctx *ctx)
p->sched_class->switched_to(rq, p);
/*
+ * If this was a class promotion; let the old class know it
+ * got preempted. Note that none of the switch*_from() methods
+ * know the new class and none of the switch*_to() methods
+ * know the old class.
+ */
+ if (ctx->running && sched_class_above(p->sched_class, ctx->class)) {
+ rq->next_class->wakeup_preempt(rq, p, 0);
+ rq->next_class = p->sched_class;
+ }
+
+ /*
* If this was a degradation in class someone should have set
* need_resched by now.
*/
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 319439f..80c9559 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2499,9 +2499,16 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
* Only called when both the current and waking task are -deadline
* tasks.
*/
-static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
- int flags)
+static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, int flags)
{
+ /*
+ * Can only get preempted by stop-class, and those should be
+ * few and short lived, doesn't really make sense to push
+ * anything away for that.
+ */
+ if (p->sched_class != &dl_sched_class)
+ return;
+
if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
resched_curr(rq);
return;
@@ -3346,9 +3353,6 @@ static int task_is_throttled_dl(struct task_struct *p, int cpu)
#endif
DEFINE_SCHED_CLASS(dl) = {
-
- .queue_mask = 8,
-
.enqueue_task = enqueue_task_dl,
.dequeue_task = dequeue_task_dl,
.yield_task = yield_task_dl,
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49..3b32e64 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2431,7 +2431,7 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
/* see kick_cpus_irq_workfn() */
smp_store_release(&rq->scx.kick_sync, rq->scx.kick_sync + 1);
- rq_modified_clear(rq);
+ rq->next_class = &ext_sched_class;
rq_unpin_lock(rq, rf);
balance_one(rq, prev);
@@ -2446,7 +2446,7 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
* If @force_scx is true, always try to pick a SCHED_EXT task,
* regardless of any higher-priority sched classes activity.
*/
- if (!force_scx && rq_modified_above(rq, &ext_sched_class))
+ if (!force_scx && sched_class_above(rq->next_class, &ext_sched_class))
return RETRY_TASK;
keep_prev = rq->scx.flags & SCX_RQ_BAL_KEEP;
@@ -3075,7 +3075,8 @@ static void switched_from_scx(struct rq *rq, struct task_struct *p)
scx_disable_task(p);
}
-static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
+static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
+
static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
int scx_check_setscheduler(struct task_struct *p, int policy)
@@ -3336,8 +3337,6 @@ static void scx_cgroup_unlock(void) {}
* their current sched_class. Call them directly from sched core instead.
*/
DEFINE_SCHED_CLASS(ext) = {
- .queue_mask = 1,
-
.enqueue_task = enqueue_task_scx,
.dequeue_task = dequeue_task_scx,
.yield_task = yield_task_scx,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d588eb8..76f5e4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8736,7 +8736,7 @@ preempt_sync(struct rq *rq, int wake_flags,
/*
* Preempt the current task with a newly woken task if needed:
*/
-static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
+static void wakeup_preempt_fair(struct rq *rq, struct task_struct *p, int wake_flags)
{
enum preempt_wakeup_action preempt_action = PREEMPT_WAKEUP_PICK;
struct task_struct *donor = rq->donor;
@@ -8744,6 +8744,12 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
struct cfs_rq *cfs_rq = task_cfs_rq(donor);
int cse_is_idle, pse_is_idle;
+ /*
+ * XXX Getting preempted by higher class, try and find idle CPU?
+ */
+ if (p->sched_class != &fair_sched_class)
+ return;
+
if (unlikely(se == pse))
return;
@@ -12911,7 +12917,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
t0 = sched_clock_cpu(this_cpu);
__sched_balance_update_blocked_averages(this_rq);
- rq_modified_clear(this_rq);
+ this_rq->next_class = &fair_sched_class;
raw_spin_rq_unlock(this_rq);
for_each_domain(this_cpu, sd) {
@@ -12978,7 +12984,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
pulled_task = 1;
/* If a higher prio class was modified, restart the pick */
- if (rq_modified_above(this_rq, &fair_sched_class))
+ if (sched_class_above(this_rq->next_class, &fair_sched_class))
pulled_task = -1;
out:
@@ -13882,15 +13888,12 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
* All the scheduling class methods:
*/
DEFINE_SCHED_CLASS(fair) = {
-
- .queue_mask = 2,
-
.enqueue_task = enqueue_task_fair,
.dequeue_task = dequeue_task_fair,
.yield_task = yield_task_fair,
.yield_to_task = yield_to_task_fair,
- .wakeup_preempt = check_preempt_wakeup_fair,
+ .wakeup_preempt = wakeup_preempt_fair,
.pick_task = pick_task_fair,
.pick_next_task = pick_next_task_fair,
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe..65eb8f8 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -536,9 +536,6 @@ static void update_curr_idle(struct rq *rq)
* Simple, special scheduling class for the per-CPU idle tasks:
*/
DEFINE_SCHED_CLASS(idle) = {
-
- .queue_mask = 0,
-
/* no enqueue/yield_task for idle tasks */
/* dequeue is not valid, we print a debug message there: */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe..0a9b2cd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1615,6 +1615,12 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
+ /*
+ * XXX If we're preempted by DL, queue a push?
+ */
+ if (p->sched_class != &rt_sched_class)
+ return;
+
if (p->prio < donor->prio) {
resched_curr(rq);
return;
@@ -2568,9 +2574,6 @@ static int task_is_throttled_rt(struct task_struct *p, int cpu)
#endif /* CONFIG_SCHED_CORE */
DEFINE_SCHED_CLASS(rt) = {
-
- .queue_mask = 4,
-
.enqueue_task = enqueue_task_rt,
.dequeue_task = dequeue_task_rt,
.yield_task = yield_task_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab1bfa0..3ceaa9d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1118,8 +1118,6 @@ struct rq {
/* runqueue lock: */
raw_spinlock_t __lock;
- /* Per class runqueue modification mask; bits in class order. */
- unsigned int queue_mask;
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -1179,6 +1177,7 @@ struct rq {
struct sched_dl_entity *dl_server;
struct task_struct *idle;
struct task_struct *stop;
+ const struct sched_class *next_class;
unsigned long next_balance;
struct mm_struct *prev_mm;
@@ -2426,15 +2425,6 @@ struct sched_class {
#ifdef CONFIG_UCLAMP_TASK
int uclamp_enabled;
#endif
- /*
- * idle: 0
- * ext: 1
- * fair: 2
- * rt: 4
- * dl: 8
- * stop: 16
- */
- unsigned int queue_mask;
/*
* move_queued_task/activate_task/enqueue_task: rq->lock
@@ -2593,20 +2583,6 @@ struct sched_class {
#endif
};
-/*
- * Does not nest; only used around sched_class::pick_task() rq-lock-breaks.
- */
-static inline void rq_modified_clear(struct rq *rq)
-{
- rq->queue_mask = 0;
-}
-
-static inline bool rq_modified_above(struct rq *rq, const struct sched_class * class)
-{
- unsigned int mask = class->queue_mask;
- return rq->queue_mask & ~((mask << 1) - 1);
-}
-
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->donor != prev);
@@ -3899,6 +3875,7 @@ void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_s
deactivate_task(src_rq, task, 0);
set_task_cpu(task, dst_rq->cpu);
activate_task(dst_rq, task, 0);
+ wakeup_preempt(dst_rq, task, 0);
}
static inline
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4f9192b..f95798b 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -97,9 +97,6 @@ static void update_curr_stop(struct rq *rq)
* Simple, special scheduling class for the per-CPU stop tasks:
*/
DEFINE_SCHED_CLASS(stop) = {
-
- .queue_mask = 16,
-
.enqueue_task = enqueue_task_stop,
.dequeue_task = dequeue_task_stop,
.yield_task = yield_task_stop,
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 4/5] sched: Add assertions to QUEUE_CLASS
2025-11-27 15:39 ` [PATCH 4/5] sched: Add assertions to QUEUE_CLASS Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
@ 2025-12-18 10:09 ` Marek Szyprowski
2025-12-18 10:12 ` Peter Zijlstra
1 sibling, 1 reply; 36+ messages in thread
From: Marek Szyprowski @ 2025-12-18 10:09 UTC (permalink / raw)
To: Peter Zijlstra, mingo, vincent.guittot
Cc: linux-kernel, juri.lelli, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, tj, void, arighi, changwoo, sched-ext,
Heiko Stuebner, linux-rockchip
On 27.11.2025 16:39, Peter Zijlstra wrote:
> Add some checks to the sched_change pattern to validate assumptions
> around changing classes.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
This patch landed recently in linux-next as commit 47efe2ddccb1
("sched/core: Add assertions to QUEUE_CLASS"). In my tests it turned out
that it triggers the following warning during a simple 'rtcwake' test on
Hardkernel's Odroid-M1 board
(arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts):
root@target:~# time rtcwake -s5 -mon
rtcwake: wakeup using /dev/rtc0 at Thu Dec 18 10:01:28 2025
------------[ cut here ]------------
WARNING: kernel/sched/core.c:10837 at sched_change_end+0x160/0x168,
CPU#0: irq/38-rk817/79
Modules linked in: snd_soc_hdmi_codec dw_hdmi_i2s_audio dw_hdmi_cec
snd_soc_simple_card snd_soc_rk817 snd_soc_simple_card_utils
snd_soc_rockchip_i2s_tdm snd_soc_core hantro_vpu rockchip_rga v4l2_vp9
v4l2_h264 snd_compress v4l2_jpeg videobuf2_dma_sg videobuf2_dma_contig
v4l2_mem2mem videobuf2_memops snd_pcm_dmaengine videobuf2_v4l2 snd_pcm
gpio_ir_recv dwmac_rk display_connector stmmac_platform rockchip_saradc
rockchipdrm snd_timer videodev snd stmmac industrialio_triggered_buffer
kfifo_buf rockchip_thermal phy_rockchip_naneng_combphy videobuf2_common
spi_rockchip_sfc soundcore rk817_charger rockchip_dfi rtc_rk808
rk805_pwrkey pcs_xpcs panfrost dw_hdmi_qp analogix_dp dw_dp
drm_shmem_helper dw_mipi_dsi drm_dp_aux_bus gpu_sched dw_hdmi mc
drm_display_helper ahci_dwc ipv6 libsha1
CPU: 0 UID: 0 PID: 79 Comm: irq/38-rk817 Not tainted 6.19.0-rc1+ #16288
PREEMPT
Hardware name: Hardkernel ODROID-M1 (DT)
pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : sched_change_end+0x160/0x168
lr : sched_change_end+0xb0/0x168
...
Call trace:
sched_change_end+0x160/0x168 (P)
rt_mutex_setprio+0xc8/0x3a8
mark_wakeup_next_waiter+0xc0/0x258
rt_mutex_unlock+0x88/0x148
i2c_adapter_unlock_bus+0x14/0x20
i2c_transfer+0xac/0xf0
regmap_i2c_read+0x5c/0xa0
_regmap_raw_read+0xec/0x16c
_regmap_bus_read+0x44/0x7c
_regmap_read+0x64/0xf4
regmap_read+0x4c/0x78
read_irq_data+0x9c/0x460
regmap_irq_thread+0x64/0x2f0
irq_thread_fn+0x2c/0xa8
irq_thread+0x1a4/0x378
kthread+0x13c/0x214
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
real 0m5.547s
user 0m0.004s
sys 0m0.011s
root@target:~#
I don't see anything suspicious in this stacktrace. Let me know how I
can help debug this issue. This board is the only one in my test
farm that triggers this warning.
> ---
> kernel/sched/core.c | 13 +++++++++++++
> kernel/sched/sched.h | 1 +
> 2 files changed, 14 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10806,6 +10806,7 @@ struct sched_change_ctx *sched_change_be
>
> *ctx = (struct sched_change_ctx){
> .p = p,
> + .class = p->sched_class,
> .flags = flags,
> .queued = task_on_rq_queued(p),
> .running = task_current_donor(rq, p),
> @@ -10836,6 +10837,11 @@ void sched_change_end(struct sched_chang
>
> lockdep_assert_rq_held(rq);
>
> + /*
> + * Changing class without *QUEUE_CLASS is bad.
> + */
> + WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
> +
> if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
> p->sched_class->switching_to(rq, p);
>
> @@ -10847,6 +10853,13 @@ void sched_change_end(struct sched_chang
> if (ctx->flags & ENQUEUE_CLASS) {
> if (p->sched_class->switched_to)
> p->sched_class->switched_to(rq, p);
> +
> + /*
> + * If this was a degradation in class someone should have set
> + * need_resched by now.
> + */
> + WARN_ON_ONCE(sched_class_above(ctx->class, p->sched_class) &&
> + !test_tsk_need_resched(p));
> } else {
> p->sched_class->prio_changed(rq, p, ctx->prio);
> }
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -4027,6 +4027,7 @@ extern void balance_callbacks(struct rq
> struct sched_change_ctx {
> u64 prio;
> struct task_struct *p;
> + const struct sched_class *class;
> int flags;
> bool queued;
> bool running;
>
>
>
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 4/5] sched: Add assertions to QUEUE_CLASS
2025-12-18 10:09 ` [PATCH 4/5] sched: " Marek Szyprowski
@ 2025-12-18 10:12 ` Peter Zijlstra
0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2025-12-18 10:12 UTC (permalink / raw)
To: Marek Szyprowski
Cc: mingo, vincent.guittot, linux-kernel, juri.lelli,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, void,
arighi, changwoo, sched-ext, Heiko Stuebner, linux-rockchip
On Thu, Dec 18, 2025 at 11:09:13AM +0100, Marek Szyprowski wrote:
> On 27.11.2025 16:39, Peter Zijlstra wrote:
> > Add some checks to the sched_change pattern to validate assumptions
> > around changing classes.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
>
> This patch landed recently in linux-next as commit 47efe2ddccb1
> ("sched/core: Add assertions to QUEUE_CLASS"). In my tests it turned out
> that it triggers the following warning during simple 'rtcwake' test on
> Hardkernel's Odroid-M1 board
> (arch/arm64/boot/dts/rockchip/rk3568-odroid-m1.dts):
>
> root@target:~# time rtcwake -s5 -mon
> rtcwake: wakeup using /dev/rtc0 at Thu Dec 18 10:01:28 2025
> ------------[ cut here ]------------
> WARNING: kernel/sched/core.c:10837 at sched_change_end+0x160/0x168,
https://lkml.kernel.org/r/176596899373.510.17191516261088315233.tip-bot2@tip-bot2
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads:[~2025-12-18 10:12 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-27 15:39 [PATCH 0/5] sched: Random collection of patches Peter Zijlstra
2025-11-27 15:39 ` [PATCH 1/5] sched/fair: Fold the sched_avg update Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] <linux/compiler_types.h>: Add the __signed_scalar_typeof() helper tip-bot2 for Peter Zijlstra
2025-11-27 15:39 ` [PATCH 2/5] sched/fair: Avoid rq->lock bouncing in sched_balance_newidle() Peter Zijlstra
2025-11-29 18:59 ` Shrikanth Hegde
2025-12-14 7:46 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
2025-11-27 15:39 ` [PATCH 3/5] sched: Change rcu_dereference_check_sched_domain() to rcu-sched Peter Zijlstra
2025-11-28 10:57 ` Peter Zijlstra
2025-11-28 11:04 ` Peter Zijlstra
2025-11-28 11:21 ` Paul E. McKenney
2025-11-28 11:37 ` Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] sched/fair: Remove superfluous rcu_read_lock() tip-bot2 for Peter Zijlstra
2025-11-27 15:39 ` [PATCH 4/5] sched: Add assertions to QUEUE_CLASS Peter Zijlstra
2025-12-14 7:46 ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
2025-12-18 10:09 ` [PATCH 4/5] sched: " Marek Szyprowski
2025-12-18 10:12 ` Peter Zijlstra
2025-11-27 15:39 ` [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*() Peter Zijlstra
2025-11-28 13:26 ` Kuba Piecuch
2025-11-28 13:36 ` Peter Zijlstra
2025-11-28 13:44 ` Peter Zijlstra
2025-11-28 22:29 ` Andrea Righi
2025-11-29 18:08 ` Shrikanth Hegde
2025-11-30 11:32 ` Peter Zijlstra
2025-11-30 13:03 ` Shrikanth Hegde
2025-12-02 23:27 ` Tejun Heo
2025-12-14 7:46 ` [tip: sched/core] sched/core: " tip-bot2 for Peter Zijlstra
2025-12-15 6:07 ` error: implicit declaration of function ‘rq_modified_clear’ (was [PATCH 5/5] sched: Rework sched_class::wakeup_preempt() and rq_modified_*()) Thorsten Leemhuis
2025-12-15 7:12 ` Ingo Molnar
2025-12-15 11:51 ` Nathan Chancellor
2025-12-16 7:02 ` Thorsten Leemhuis
2025-12-16 18:40 ` Tejun Heo
2025-12-16 21:42 ` Peter Zijlstra
2025-12-17 9:58 ` Peter Zijlstra
2025-12-15 7:59 ` [tip: sched/core] sched/core: Rework sched_class::wakeup_preempt() and rq_modified_*() tip-bot2 for Peter Zijlstra
2025-12-17 10:02 ` tip-bot2 for Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox