[PATCH 0/2] sched: Remove sched

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH 0/2] sched: Remove sched_class::balance()
@ 2026-06-24 12:13 Peter Zijlstra
  2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-24 12:13 UTC (permalink / raw)
  To: mingo
  Cc: linux-kernel, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, ziqianlu, tj, williams, jkacur

Hi!

As mentioned [1], I was going to move sched_balance_newidle() into
balance_fair(). And while the patch was quickly done, something didn't feel
right.

The two things that bugged me were:

 - core-sched only calls prev_balance() on one sibling, not allowing the others
   to pull tasks;

 - both ::balance() and ::pick_task() take @rf and do balance operations.

That latter is because of ext; the DSQ move needs to be undone if it turns out
a higher class will get ran after all -- this is where ::pick_task() got the rf
from and why I moved its balance into pick_task_scx().

After a bit of puzzling and a few wrong turns [2], I think these patches do
away with sched_class::balance() instead, less is more and all that.

[1] https://patch.msgid.link/20260611113219.GG187714%40noisy.programming.kicks-ass.net
[2] I'm pretty sure I messed things up a few times, but rt-migration-test never
    threw a failure on me -- which suggests it isn't very good at finding fails.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] sched/core: Allow newidle for core-sched
  2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra
@ 2026-06-24 12:13 ` Peter Zijlstra
  2026-06-24 23:56   ` K Prateek Nayak
  2026-06-24 12:13 ` [PATCH 2/2] sched: Remove sched_class::balance() Peter Zijlstra
  2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu
  2 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-24 12:13 UTC (permalink / raw)
  To: mingo
  Cc: linux-kernel, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, ziqianlu, tj, williams, jkacur

As reported [1], it is possible for newidle, as called from pick_task_fair(),
to migrate a task that was already selected for the SMT sibling.

That is, in case of core scheduling (pick_next_task()'s restart_multi label),
pick_task() is called for each SMT sibling. Suppose SMT0 picks task A, and SMT1
has no tasks and does newidle, it must not be possible (but currently is) to
migrate A to SMT1 and also select A.

This re-enables newidle balancing for core scheduling.

[1] https://patch.msgid.link/20260603095108.GA1684319@bytedance.com

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c  |    6 ++----
 kernel/sched/sched.h |   15 ++++++++++++++-
 2 files changed, 16 insertions(+), 5 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9942,9 +9942,6 @@ struct task_struct *pick_task_fair(struc
 	return p;
 
 idle:
-	if (sched_core_enabled(rq))
-		return NULL;
-
 	new_tasks = sched_balance_newidle(rq, rf);
 	if (new_tasks < 0)
 		return RETRY_TASK;
@@ -10783,7 +10780,8 @@ int can_migrate_task(struct task_struct
 	env->flags &= ~LBF_ALL_PINNED;
 
 	if (task_on_cpu(env->src_rq, p) ||
-	    task_current_donor(env->src_rq, p)) {
+	    task_current_donor(env->src_rq, p) ||
+	    task_on_core(env->src_rq, p)) {
 		schedstat_inc(p->stats.nr_failed_migrations_running);
 		return 0;
 	}
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1570,6 +1570,14 @@ static inline bool task_has_sched_core(s
 	return !!p->core_cookie;
 }
 
+static inline bool task_on_core(struct rq *rq, struct task_struct *p)
+{
+	if (sched_core_disabled())
+		return false;
+
+	return rq->core_pick == p;
+}
+
 #else /* !CONFIG_SCHED_CORE: */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1615,6 +1623,11 @@ static inline bool task_has_sched_core(s
 	return false;
 }
 
+static inline bool task_on_core(struct rq *rq, struct task_struct *p)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -4131,7 +4144,7 @@ void move_queued_task_locked(struct rq *
 static inline
 bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
 {
-	if (!task_on_cpu(rq, p) &&
+	if (!task_on_cpu(rq, p) && !task_on_core(rq, p) &&
 	    cpumask_test_cpu(cpu, &p->cpus_mask))
 		return true;
 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/2] sched: Remove sched_class::balance()
  2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra
  2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra
@ 2026-06-24 12:13 ` Peter Zijlstra
  2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu
  2 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-24 12:13 UTC (permalink / raw)
  To: mingo
  Cc: linux-kernel, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, ziqianlu, tj, williams, jkacur

Ever since commit 50653216e4ff ("sched: Add support to pick functions to take
rf"), we have the unfortunate situation that both sched_class::balance() and
sched_class::pick_task() have overlapping functionality in that they drop
rq->lock and balance tasks.

Additionally, prev_balance() is only called for a single RQ in the core-sched
case, resulting in 'missed' balance opportunities in this case.

The only classes with a balance callback are dl and rt, prev_balance() will run
the callbacks from prev->class down, pick_next_task() runs the callbacks from
stop_class down.

Therefore, the only case where there is a difference is if prev->class ==
rt_sched_class and pick_next_task() stops at stop/dl. But in those cases the rt
pull would have been pointless, it would move a high priority task to a
runqueue that will not be able to run it.

A subsequent pick that does reach rt, must have a prev of stop/dl priority
(0,-1 resp.) and this will ensure need_pull_rt_task() is true and do the pull
then.

Therefore, move balance_{rt,dl}() into pick_task_{rt,dl}() and remove
sched_class::balance().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c      |   23 -----------------------
 kernel/sched/deadline.c  |   10 ++++++----
 kernel/sched/fair.c      |   38 ++++++++++----------------------------
 kernel/sched/idle.c      |    7 -------
 kernel/sched/rt.c        |   14 ++++++--------
 kernel/sched/sched.h     |    5 -----
 kernel/sched/stop_task.c |    7 -------
 7 files changed, 22 insertions(+), 82 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6092,25 +6092,6 @@ static inline void schedule_debug(struct
 	schedstat_inc(this_rq()->sched_count);
 }
 
-static void prev_balance(struct rq *rq, struct rq_flags *rf)
-{
-	const struct sched_class *start_class = rq->donor->sched_class;
-	const struct sched_class *class;
-
-	/*
-	 * We must do the balancing pass before put_prev_task(), such
-	 * that when we release the rq->lock the task is in the same
-	 * state as before we took rq->lock.
-	 *
-	 * We can terminate the balance pass as soon as we know there is
-	 * a runnable task of @class priority or higher.
-	 */
-	for_active_class_range(class, start_class, &idle_sched_class) {
-		if (class->balance && class->balance(rq, rf))
-			break;
-	}
-}
-
 /*
  * Pick up the highest-prio task:
  */
@@ -6148,8 +6129,6 @@ __pick_next_task(struct rq *rq, struct r
 	}
 
 restart:
-	prev_balance(rq, rf);
-
 	for_each_active_class(class) {
 		p = class->pick_task(rq, rf);
 		if (unlikely(p == RETRY_TASK))
@@ -6257,8 +6236,6 @@ pick_next_task(struct rq *rq, struct rq_
 		goto out_set_next;
 	}
 
-	prev_balance(rq, rf);
-
 	smt_mask = cpu_smt_mask(cpu);
 	need_sync = !!rq->core->core_cookie;
 
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2704,7 +2704,7 @@ static void check_preempt_equal_dl(struc
 	resched_curr(rq);
 }
 
-static int balance_dl(struct rq *rq, struct rq_flags *rf)
+static void balance_dl(struct rq *rq, struct rq_flags *rf)
 {
 	/*
 	 * Note, rq->donor may change during rq lock drops,
@@ -2723,8 +2723,6 @@ static int balance_dl(struct rq *rq, str
 		pull_dl_task(rq);
 		rq_repin_lock(rq, rf);
 	}
-
-	return sched_stop_runnable(rq) || sched_dl_runnable(rq);
 }
 
 /*
@@ -2817,6 +2815,11 @@ static struct task_struct *__pick_task_d
 	struct dl_rq *dl_rq = &rq->dl;
 	struct task_struct *p;
 
+	rq_modified_begin(rq, &dl_sched_class);
+	balance_dl(rq, rf);
+	if (rq_modified_above(rq, &dl_sched_class))
+		return RETRY_TASK;
+
 again:
 	if (!sched_dl_runnable(rq))
 		return NULL;
@@ -3652,7 +3655,6 @@ DEFINE_SCHED_CLASS(dl) = {
 	.put_prev_task		= put_prev_task_dl,
 	.set_next_task		= set_next_task_dl,
 
-	.balance		= balance_dl,
 	.select_task_rq		= select_task_rq_dl,
 	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5732,7 +5732,7 @@ static inline unsigned long cfs_rq_load_
 	return cfs_rq->avg.load_avg;
 }
 
-static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
+static void sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(this_rq));
 
 static inline unsigned long task_util(struct task_struct *p)
@@ -9916,7 +9916,6 @@ struct task_struct *pick_task_fair(struc
 	struct cfs_rq *cfs_rq;
 	struct task_struct *p;
 	bool throttled;
-	int new_tasks;
 
 again:
 	cfs_rq = &rq->cfs;
@@ -9942,11 +9941,14 @@ struct task_struct *pick_task_fair(struc
 	return p;
 
 idle:
-	new_tasks = sched_balance_newidle(rq, rf);
-	if (new_tasks < 0)
+	rq_modified_begin(rq, &fair_sched_class);
+	sched_balance_newidle(rq, rf);
+	if (rq_modified_above(rq, &fair_sched_class))
 		return RETRY_TASK;
-	if (new_tasks > 0)
+
+	if (cfs_rq->nr_queued)
 		goto again;
+
 	return NULL;
 }
 
@@ -14334,13 +14336,8 @@ static inline void nohz_newidle_balance(
 /*
  * sched_balance_newidle is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
- *
- * Returns:
- *   < 0 - we released the lock and there are !fair tasks present
- *     0 - failed, no new tasks
- *   > 0 - success, new (fair) tasks present
  */
-static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
+static void sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	__must_hold(__rq_lockp(this_rq))
 {
 	unsigned long next_balance = jiffies + HZ;
@@ -14357,7 +14354,7 @@ static int sched_balance_newidle(struct
 	 * Return 0; the task will be enqueued when switching to idle.
 	 */
 	if (this_rq->ttwu_pending)
-		return 0;
+		return;
 
 	/*
 	 * We must set idle_stamp _before_ calling sched_balance_rq()
@@ -14370,7 +14367,7 @@ static int sched_balance_newidle(struct
 	 * Do not pull tasks towards !active CPUs...
 	 */
 	if (!cpu_active(this_cpu))
-		return 0;
+		return;
 
 	/*
 	 * This is OK, because current is on_cpu, which avoids it being picked
@@ -14399,7 +14396,6 @@ static int sched_balance_newidle(struct
 	t0 = sched_clock_cpu(this_cpu);
 	__sched_balance_update_blocked_averages(this_rq);
 
-	rq_modified_begin(this_rq, &fair_sched_class);
 	raw_spin_rq_unlock(this_rq);
 
 	for_each_domain(this_cpu, sd) {
@@ -14457,18 +14453,6 @@ static int sched_balance_newidle(struct
 	if (curr_cost > this_rq->max_idle_balance_cost)
 		this_rq->max_idle_balance_cost = curr_cost;
 
-	/*
-	 * While browsing the domains, we released the rq lock, a task could
-	 * have been enqueued in the meantime. Since we're not going idle,
-	 * pretend we pulled a task.
-	 */
-	if (this_rq->cfs.h_nr_queued && !pulled_task)
-		pulled_task = 1;
-
-	/* If a higher prio class was modified, restart the pick */
-	if (rq_modified_above(this_rq, &fair_sched_class))
-		pulled_task = -1;
-
 out:
 	/* Move the next balance forward */
 	if (time_after(this_rq->next_balance, next_balance))
@@ -14480,8 +14464,6 @@ static int sched_balance_newidle(struct
 		nohz_newidle_balance(this_rq);
 
 	rq_repin_lock(this_rq, rf);
-
-	return pulled_task;
 }
 
 /*
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -464,12 +464,6 @@ select_task_rq_idle(struct task_struct *
 	return task_cpu(p); /* IDLE tasks as never migrated */
 }
 
-static int
-balance_idle(struct rq *rq, struct rq_flags *rf)
-{
-	return WARN_ON_ONCE(1);
-}
-
 /*
  * Idle tasks are unconditionally rescheduled:
  */
@@ -581,7 +575,6 @@ DEFINE_SCHED_CLASS(idle) = {
 	.put_prev_task		= put_prev_task_idle,
 	.set_next_task          = set_next_task_idle,
 
-	.balance		= balance_idle,
 	.select_task_rq		= select_task_rq_idle,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1596,7 +1596,7 @@ static void check_preempt_equal_prio(str
 	resched_curr(rq);
 }
 
-static int balance_rt(struct rq *rq, struct rq_flags *rf)
+static void balance_rt(struct rq *rq, struct rq_flags *rf)
 {
 	/*
 	 * Note, rq->donor may change during rq lock drops,
@@ -1615,8 +1615,6 @@ static int balance_rt(struct rq *rq, str
 		pull_rt_task(rq);
 		rq_repin_lock(rq, rf);
 	}
-
-	return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);
 }
 
 /*
@@ -1714,14 +1712,15 @@ static struct task_struct *_pick_next_ta
 
 static struct task_struct *pick_task_rt(struct rq *rq, struct rq_flags *rf)
 {
-	struct task_struct *p;
+	rq_modified_begin(rq, &rt_sched_class);
+	balance_rt(rq, rf);
+	if (rq_modified_above(rq, &rt_sched_class))
+		return RETRY_TASK;
 
 	if (!sched_rt_runnable(rq))
 		return NULL;
 
-	p = _pick_next_task_rt(rq);
-
-	return p;
+	return _pick_next_task_rt(rq);
 }
 
 static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *next)
@@ -2609,7 +2608,6 @@ DEFINE_SCHED_CLASS(rt) = {
 	.put_prev_task		= put_prev_task_rt,
 	.set_next_task          = set_next_task_rt,
 
-	.balance		= balance_rt,
 	.select_task_rq		= select_task_rq_rt,
 	.set_cpus_allowed       = set_cpus_allowed_common,
 	.rq_online              = rq_online_rt,
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2638,11 +2638,6 @@ struct sched_class {
 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);
 
 	/*
-	 * schedule/pick_next_task/prev_balance: rq->lock
-	 */
-	int (*balance)(struct rq *rq, struct rq_flags *rf);
-
-	/*
 	 * schedule/pick_next_task: rq->lock
 	 */
 	struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf);
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -15,12 +15,6 @@ select_task_rq_stop(struct task_struct *
 	return task_cpu(p); /* stop tasks as never migrate */
 }
 
-static int
-balance_stop(struct rq *rq, struct rq_flags *rf)
-{
-	return sched_stop_runnable(rq);
-}
-
 static void
 wakeup_preempt_stop(struct rq *rq, struct task_struct *p, int flags)
 {
@@ -107,7 +101,6 @@ DEFINE_SCHED_CLASS(stop) = {
 	.put_prev_task		= put_prev_task_stop,
 	.set_next_task          = set_next_task_stop,
 
-	.balance		= balance_stop,
 	.select_task_rq		= select_task_rq_stop,
 	.set_cpus_allowed	= set_cpus_allowed_common,
 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] sched/core: Allow newidle for core-sched
  2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra
@ 2026-06-24 23:56   ` K Prateek Nayak
  2026-06-25 12:41     ` Peter Zijlstra
  2026-06-25 12:42     ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: K Prateek Nayak @ 2026-06-24 23:56 UTC (permalink / raw)
  To: Peter Zijlstra, mingo
  Cc: linux-kernel, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, ziqianlu, tj, williams,
	jkacur

Hello Peter,

On 6/24/2026 5:43 PM, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9942,9 +9942,6 @@ struct task_struct *pick_task_fair(struc
>  	return p;
>  
>  idle:
> -	if (sched_core_enabled(rq))
> -		return NULL;
> -
>  	new_tasks = sched_balance_newidle(rq, rf);

I have a sneaky feeling this might still race with a
sched_setaffinity() like:

     CPU0 (CPU1 is its sibling)           CPU2 (CPU1 is idle in the meanwhile)
     ==========================           ====================================

  rq0->core_pick = p;

  /* Core pick on SMT - CPU1 */
  pick_next_task(rq1)                  __sched_setaffinity(p)
    pick_next_task_fair()                __set_cpus_allowed_ptr() /* p cannot run on CPU0 anymore */
      sched_balance_newidle()              task_rq_lock(p, &rf)
        raw_spin_rq_unlock()                 ...
        ... /* continues newidle */
                                             /* Gets lock */
                                             __set_cpus_allowed_ptr_locked()
                                               dest_cpu = 1; /* SMT of CPU0 */
                                               affine_move_task(rq0, p, 1 /* Move to CPU1 */)
                                                 /*
                                                  * p is not on_cpu
                                                  * p is TASK_ON_RQ_QUEUED
                                                  * p is not migration disabled
                                                  */
                                                 move_queued_task(rq0, p, 1)
                                                   /* Moves task to CPU1 */
                                                 task_rq_unlock();

        ... /* Sees new task on CPU1 */
        raw_spin_rq_lock(rq1);
        pulled_task = 1;
        
      /* Retry pick; Finds p on cPU1 */
      return p;

    rq1->core_pick = p; !!! Two CPUs have the same core pick. !!!


Do we have a guard against this that I might be missing?

Should we treat core_pick similar to task_on_cpu() and go down the
stopper route like:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f4530eb543f..e7f64f34aa4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3049,7 +3049,9 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
 		return -EINVAL;
 	}
 
-	if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
+	if (task_on_cpu(rq, p) ||
+	    task_on_core(rq, p) ||
+	    READ_ONCE(p->__state) == TASK_WAKING) {
 		/*
 		 * MIGRATE_ENABLE gets here because 'p == current', but for
 		 * anything else we cannot do is_migration_disabled(), punt
---

... or we can return a RETRY_TASK when newidle balance succeeds too to
force a core-wide pick when core scheduling is enabled like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d212bf04885..e393aed58bfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9946,8 +9946,16 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 	if (rq_modified_above(rq, &fair_sched_class))
 		return RETRY_TASK;
 
-	if (cfs_rq->nr_queued)
+	if (cfs_rq->nr_queued) {
+		/*
+		 * Force a core-wide pick if newidle
+		 * balance managed to pull a task since
+		 * the lock was dropped.
+		 */
+		if (sched_core_enabled(rq))
+			return RETRY_TASK;
 		goto again;
+	}
 
 	return NULL;
 }
---

Thoughts?

>  	if (new_tasks < 0)
>  		return RETRY_TASK;

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] sched/core: Allow newidle for core-sched
  2026-06-24 23:56   ` K Prateek Nayak
@ 2026-06-25 12:41     ` Peter Zijlstra
  2026-06-25 12:42     ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-25 12:41 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: mingo, linux-kernel, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, ziqianlu,
	tj, williams, jkacur

On Thu, Jun 25, 2026 at 05:26:03AM +0530, K Prateek Nayak wrote:
> Should we treat core_pick similar to task_on_cpu() and go down the
> stopper route like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2f4530eb543f..e7f64f34aa4f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3049,7 +3049,9 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag
>  		return -EINVAL;
>  	}
>  
> -	if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) {
> +	if (task_on_cpu(rq, p) ||
> +	    task_on_core(rq, p) ||
> +	    READ_ONCE(p->__state) == TASK_WAKING) {
>  		/*
>  		 * MIGRATE_ENABLE gets here because 'p == current', but for
>  		 * anything else we cannot do is_migration_disabled(), punt

Bah, yes. Let me go stare more at this.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] sched/core: Allow newidle for core-sched
  2026-06-24 23:56   ` K Prateek Nayak
  2026-06-25 12:41     ` Peter Zijlstra
@ 2026-06-25 12:42     ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-25 12:42 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: mingo, linux-kernel, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, ziqianlu,
	tj, williams, jkacur

On Thu, Jun 25, 2026 at 05:26:03AM +0530, K Prateek Nayak wrote:

> ... or we can return a RETRY_TASK when newidle balance succeeds too to
> force a core-wide pick when core scheduling is enabled like:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0d212bf04885..e393aed58bfa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9946,8 +9946,16 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>  	if (rq_modified_above(rq, &fair_sched_class))
>  		return RETRY_TASK;
>  
> -	if (cfs_rq->nr_queued)
> +	if (cfs_rq->nr_queued) {
> +		/*
> +		 * Force a core-wide pick if newidle
> +		 * balance managed to pull a task since
> +		 * the lock was dropped.
> +		 */
> +		if (sched_core_enabled(rq))
> +			return RETRY_TASK;
>  		goto again;
> +	}
>  
>  	return NULL;
>  }

This has forward progress issues. You can ping-pong he one task back and
forth and never actually pick anything.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/2] sched: Remove sched_class::balance()
  2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra
  2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra
  2026-06-24 12:13 ` [PATCH 2/2] sched: Remove sched_class::balance() Peter Zijlstra
@ 2026-07-02 11:49 ` Aaron Lu
  2026-07-03  3:31   ` K Prateek Nayak
  2 siblings, 1 reply; 8+ messages in thread
From: Aaron Lu @ 2026-07-02 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
	kprateek.nayak, tj, williams, jkacur

Hi Peter,

On Wed, Jun 24, 2026 at 02:13:27PM +0200, Peter Zijlstra wrote:
> Hi!
> 
> As mentioned [1], I was going to move sched_balance_newidle() into
> balance_fair(). And while the patch was quickly done, something didn't feel
> right.
>

Sorry for replying late, been busy with other stuffs.

The test caused a panic with below msg:

[  110.613929] BUG: kernel NULL pointer dereference, address: 0000000000000014
[  110.615988] #PF: supervisor read access in kernel mode
[  110.617444] #PF: error_code(0x0000) - not-present page
[  110.618929] PGD 1044c2067 P4D 0
[  110.619938] Oops: Oops: 0000 [#1] SMP NOPTI
[  110.621491] CPU: 34 UID: 1000 PID: 1394 Comm: sh Not tainted 7.1.0-rc2-00109-gb17be96b4d6c #4 PREEMPT(lazy)
[  110.624764] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  110.627559] RIP: 0010:pick_next_task+0x42b/0xa20
[  110.628970] Code: 28 60 00 49 89 c7 8b 05 b3 8d 73 05 41 39 c7 0f 83 c3 01 00 00 49 63 cf 48 8b 1c cd 80 7b 96 82 48 01 eb 48 8b b3 50 0f 00 00 <8b> 4e 14 48 8b 0c cd 80 7b 96 82 48 8b 4c 29 20 48 39 ce 74 32 48
[  110.634160] RSP: 0000:ffa000000435bc48 EFLAGS: 00010083
[  110.635261] RAX: 0000000000000040 RBX: ff11000ff85fe140 RCX: 0000000000000022
[  110.636753] RDX: 0000000000000022 RSI: 0000000000000000 RDI: ff110001008b54d8
[  110.638230] RBP: ffffffff87452140 R08: 0000000000000022 R09: 00000000aed806e6
[  110.639719] R10: 0000000000000001 R11: 0000000000000000 R12: ff11000ff85fe140
[  110.641185] R13: 0000000000000000 R14: ff11000102373e88 R15: 0000000000000022
[  110.642688] FS:  00007f698e381780(0000) GS:ff110010711ac000(0000) knlGS:0000000000000000
[  110.644359] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  110.645255] CR2: 0000000000000014 CR3: 00000001044c1001 CR4: 0000000000371ef0
[  110.646366] Call Trace:
[  110.646781]  <TASK>
[  110.647123]  ? sched_clock+0x10/0x30
[  110.647704]  __schedule+0x1a4/0x7c0
[  110.648249]  ? local_clock_noinstr+0xd/0xc0
[  110.648914]  preempt_schedule+0x37/0x50
[  110.649528]  preempt_schedule_thunk+0x16/0x30
[  110.650217]  _raw_spin_unlock+0x37/0x40
[  110.650833]  wp_page_copy+0x2b7/0x5f0
[  110.651408]  __handle_mm_fault+0x45f/0x6e0
[  110.652068]  handle_mm_fault+0x96/0x250
[  110.652685]  do_user_addr_fault+0x236/0x6a0
[  110.653333]  exc_page_fault+0x7a/0x210
[  110.653937]  asm_exc_page_fault+0x26/0x30
[  110.654577] RIP: 0033:0x7f698e3a1f84
[  110.655077] Code: 8d 3d e0 d4 1c 00 e9 0b 15 05 00 0f 1f 00 48 89 7d f8 48 8d 3d cd d4 1c 00 e8 38 14 05 00 48 8b 55 f8 eb 8c 66 90 f3 0f 1e fa <c7> 05 b2 d4 1c 00 00 00 00 00 c3 90 f3 0f 1e fa 31 c9 e9 65 f9 ff
[  110.657442] RSP: 002b:00007fff9a5ff9e8 EFLAGS: 00010246
[  110.658113] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00007f698e443f63
[  110.659037] RDX: 0000000000000000 RSI: 0000000000000018 RDI: 00007f698e381a60
[  110.659973] RBP: 00007fff9a5ffa10 R08: 0000000000000000 R09: 000000000000000f
[  110.660896] R10: 00007f698e381a50 R11: 0000000000000246 R12: 0000000000000000
[  110.661829] R13: 00000000ffffffff R14: 0000000000000000 R15: 00007fff9a5ffc20
[  110.662745]  </TASK>
[  110.663040] Modules linked in:
[  110.663451] CR2: 0000000000000014
[  110.663893] ---[ end trace 0000000000000000 ]---
[  110.664487] RIP: 0010:pick_next_task+0x42b/0xa20
[  110.665090] Code: 28 60 00 49 89 c7 8b 05 b3 8d 73 05 41 39 c7 0f 83 c3 01 00 00 49 63 cf 48 8b 1c cd 80 7b 96 82 48 01 eb 48 8b b3 50 0f 00 00 <8b> 4e 14 48 8b 0c cd 80 7b 96 82 48 8b 4c 29 20 48 39 ce 74 32 48
[  110.667457] RSP: 0000:ffa000000435bc48 EFLAGS: 00010083
[  110.668146] RAX: 0000000000000040 RBX: ff11000ff85fe140 RCX: 0000000000000022
[  110.669074] RDX: 0000000000000022 RSI: 0000000000000000 RDI: ff110001008b54d8
[  110.670009] RBP: ffffffff87452140 R08: 0000000000000022 R09: 00000000aed806e6
[  110.670942] R10: 0000000000000001 R11: 0000000000000000 R12: ff11000ff85fe140
[  110.671861] R13: 0000000000000000 R14: ff11000102373e88 R15: 0000000000000022
[  110.672787] FS:  00007f698e381780(0000) GS:ff110010711ac000(0000) knlGS:0000000000000000
[  110.673826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  110.674567] CR2: 0000000000000014 CR3: 00000001044c1001 CR4: 0000000000371ef0
[  110.675487] Kernel panic - not syncing: Fatal exception
[  111.737349] Shutting down cpus with NMI
[  111.738019] Kernel Offset: disabled
[  111.738373] Rebooting in 30 seconds..

$ ./scripts/faddr2line ../obj/guest/vmlinux pick_next_task+0x42b
pick_next_task+0x42b/0xa20:
task_cpu at include/linux/sched.h:2270
(inlined by) is_task_rq_idle at kernel/sched/core.c:6144
(inlined by) cookie_equals at kernel/sched/core.c:6149
(inlined by) pick_next_task at kernel/sched/core.c:6328
This is the cookie_equals(p, cookie) in pick_next_task().

Assume cpuX and cpuY are siblings, it appears the following happened:

     cpuX                                      cpuY

   pick_next_task()
   goto restart_multi

   rqX->core_pick = pick_task(rqX)

   pick_task(rqY)
     pick_task_fair(rqY)
       sched_balance_newidle(rqY)
         raw_spin_rq_unlock(rqY)  // drops core lock

                                           pick_next_task()
                                           goto restart_multi
                                           rqY->core_pick = pick_task(rqY)
                                           rqX->core_pick = pick_task(rqX)

                                           if (rqX->curr == rqX->core_pick)
                                             rqX->core_pick = NULL

					   UNLOCK rq_lockp(rqY)

         raw_spin_rq_lock(rqY)

   rqY->core_pick = pick_task(rqY)

   p = rqX->core_pick        // NULL
   cookie_equals(p, cookie)  // NULL deref

BTW, I applied this series on top of commit c095741713d1b("sched/fair:
Fix newidle vs core-sched") of the queue/sched/core branch.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/2] sched: Remove sched_class::balance()
  2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu
@ 2026-07-03  3:31   ` K Prateek Nayak
  0 siblings, 0 replies; 8+ messages in thread
From: K Prateek Nayak @ 2026-07-03  3:31 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra
  Cc: mingo, linux-kernel, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj,
	williams, jkacur

Hi Aaron,

On 7/2/2026 5:19 PM, Aaron Lu wrote:
> Assume cpuX and cpuY are siblings, it appears the following happened:
> 
>      cpuX                                      cpuY
> 
>    pick_next_task()
>    goto restart_multi
> 
>    rqX->core_pick = pick_task(rqX)
> 
>    pick_task(rqY)
>      pick_task_fair(rqY)
>        sched_balance_newidle(rqY)
>          raw_spin_rq_unlock(rqY)  // drops core lock
> 
>                                            pick_next_task()
>                                            goto restart_multi
>                                            rqY->core_pick = pick_task(rqY)
>                                            rqX->core_pick = pick_task(rqX)
> 
>                                            if (rqX->curr == rqX->core_pick)
>                                              rqX->core_pick = NULL
> 
> 					   UNLOCK rq_lockp(rqY)
> 
>          raw_spin_rq_lock(rqY)
> 
>    rqY->core_pick = pick_task(rqY)
> 
>    p = rqX->core_pick        // NULL
>    cookie_equals(p, cookie)  // NULL deref

I also feel doing a balance before the core_cookie is finalized can
move tasks wrongly to a particular core, only to make them wait later
because the pick converted on a different core cookie. Stealing via
balance callback still seems like a good option.

Then there is the whole issue of core_pick_seq possibly being updated
by another CPU on the core by the time rq_lock is dropped and grabbed
again. Afaict, anything that drops the core-wide lock should just do
a RETRY_TASK if a new task arrives so everything is re-done withing
a single core-wide lock critical-section.

Up until this cleanup, only fair and ext used the RETRY_TASK
mechanism and fair bypassed the newidle with core-sched enabled so
it never used the RETRY_TASK mechanism with core-sched so I'm still
skeptical on balance for core-sched being done as part of pick.

Maybe Peter will have a way to sort it out after he is back from
vacation if he hasn't got it all figured out already ;-)

-- 
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-07-03  3:31 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra
2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra
2026-06-24 23:56   ` K Prateek Nayak
2026-06-25 12:41     ` Peter Zijlstra
2026-06-25 12:42     ` Peter Zijlstra
2026-06-24 12:13 ` [PATCH 2/2] sched: Remove sched_class::balance() Peter Zijlstra
2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu
2026-07-03  3:31   ` K Prateek Nayak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox