* [PATCH 0/2] sched: Remove sched_class::balance()
@ 2026-06-24 12:13 Peter Zijlstra
2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Peter Zijlstra @ 2026-06-24 12:13 UTC (permalink / raw)
To: mingo
Cc: linux-kernel, peterz, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
kprateek.nayak, ziqianlu, tj, williams, jkacur
Hi!
As mentioned [1], I was going to move sched_balance_newidle() into
balance_fair(). And while the patch was quickly done, something didn't feel
right.
The two things that bugged me were:
- core-sched only calls prev_balance() on one sibling, not allowing the others
to pull tasks;
- both ::balance() and ::pick_task() take @rf and do balance operations.
That latter is because of ext; the DSQ move needs to be undone if it turns out
a higher class will get ran after all -- this is where ::pick_task() got the rf
from and why I moved its balance into pick_task_scx().
After a bit of puzzling and a few wrong turns [2], I think these patches do
away with sched_class::balance() instead, less is more and all that.
[1] https://patch.msgid.link/20260611113219.GG187714%40noisy.programming.kicks-ass.net
[2] I'm pretty sure I messed things up a few times, but rt-migration-test never
threw a failure on me -- which suggests it isn't very good at finding fails.
^ permalink raw reply [flat|nested] 8+ messages in thread* [PATCH 1/2] sched/core: Allow newidle for core-sched 2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra @ 2026-06-24 12:13 ` Peter Zijlstra 2026-06-24 23:56 ` K Prateek Nayak 2026-06-24 12:13 ` [PATCH 2/2] sched: Remove sched_class::balance() Peter Zijlstra 2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu 2 siblings, 1 reply; 8+ messages in thread From: Peter Zijlstra @ 2026-06-24 12:13 UTC (permalink / raw) To: mingo Cc: linux-kernel, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kprateek.nayak, ziqianlu, tj, williams, jkacur As reported [1], it is possible for newidle, as called from pick_task_fair(), to migrate a task that was already selected for the SMT sibling. That is, in case of core scheduling (pick_next_task()'s restart_multi label), pick_task() is called for each SMT sibling. Suppose SMT0 picks task A, and SMT1 has no tasks and does newidle, it must not be possible (but currently is) to migrate A to SMT1 and also select A. This re-enables newidle balancing for core scheduling. [1] https://patch.msgid.link/20260603095108.GA1684319@bytedance.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/fair.c | 6 ++---- kernel/sched/sched.h | 15 ++++++++++++++- 2 files changed, 16 insertions(+), 5 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9942,9 +9942,6 @@ struct task_struct *pick_task_fair(struc return p; idle: - if (sched_core_enabled(rq)) - return NULL; - new_tasks = sched_balance_newidle(rq, rf); if (new_tasks < 0) return RETRY_TASK; @@ -10783,7 +10780,8 @@ int can_migrate_task(struct task_struct env->flags &= ~LBF_ALL_PINNED; if (task_on_cpu(env->src_rq, p) || - task_current_donor(env->src_rq, p)) { + task_current_donor(env->src_rq, p) || + task_on_core(env->src_rq, p)) { schedstat_inc(p->stats.nr_failed_migrations_running); return 0; } --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1570,6 +1570,14 @@ static inline bool task_has_sched_core(s return !!p->core_cookie; } +static inline bool task_on_core(struct rq *rq, struct task_struct *p) +{ + if (sched_core_disabled()) + return false; + + return rq->core_pick == p; +} + #else /* !CONFIG_SCHED_CORE: */ static inline bool sched_core_enabled(struct rq *rq) @@ -1615,6 +1623,11 @@ static inline bool task_has_sched_core(s return false; } +static inline bool task_on_core(struct rq *rq, struct task_struct *p) +{ + return false; +} + #endif /* !CONFIG_SCHED_CORE */ #ifdef CONFIG_RT_GROUP_SCHED @@ -4131,7 +4144,7 @@ void move_queued_task_locked(struct rq * static inline bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu) { - if (!task_on_cpu(rq, p) && + if (!task_on_cpu(rq, p) && !task_on_core(rq, p) && cpumask_test_cpu(cpu, &p->cpus_mask)) return true; ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] sched/core: Allow newidle for core-sched 2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra @ 2026-06-24 23:56 ` K Prateek Nayak 2026-06-25 12:41 ` Peter Zijlstra 2026-06-25 12:42 ` Peter Zijlstra 0 siblings, 2 replies; 8+ messages in thread From: K Prateek Nayak @ 2026-06-24 23:56 UTC (permalink / raw) To: Peter Zijlstra, mingo Cc: linux-kernel, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, ziqianlu, tj, williams, jkacur Hello Peter, On 6/24/2026 5:43 PM, Peter Zijlstra wrote: > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -9942,9 +9942,6 @@ struct task_struct *pick_task_fair(struc > return p; > > idle: > - if (sched_core_enabled(rq)) > - return NULL; > - > new_tasks = sched_balance_newidle(rq, rf); I have a sneaky feeling this might still race with a sched_setaffinity() like: CPU0 (CPU1 is its sibling) CPU2 (CPU1 is idle in the meanwhile) ========================== ==================================== rq0->core_pick = p; /* Core pick on SMT - CPU1 */ pick_next_task(rq1) __sched_setaffinity(p) pick_next_task_fair() __set_cpus_allowed_ptr() /* p cannot run on CPU0 anymore */ sched_balance_newidle() task_rq_lock(p, &rf) raw_spin_rq_unlock() ... ... /* continues newidle */ /* Gets lock */ __set_cpus_allowed_ptr_locked() dest_cpu = 1; /* SMT of CPU0 */ affine_move_task(rq0, p, 1 /* Move to CPU1 */) /* * p is not on_cpu * p is TASK_ON_RQ_QUEUED * p is not migration disabled */ move_queued_task(rq0, p, 1) /* Moves task to CPU1 */ task_rq_unlock(); ... /* Sees new task on CPU1 */ raw_spin_rq_lock(rq1); pulled_task = 1; /* Retry pick; Finds p on cPU1 */ return p; rq1->core_pick = p; !!! Two CPUs have the same core pick. !!! Do we have a guard against this that I might be missing? Should we treat core_pick similar to task_on_cpu() and go down the stopper route like: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 2f4530eb543f..e7f64f34aa4f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3049,7 +3049,9 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag return -EINVAL; } - if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) { + if (task_on_cpu(rq, p) || + task_on_core(rq, p) || + READ_ONCE(p->__state) == TASK_WAKING) { /* * MIGRATE_ENABLE gets here because 'p == current', but for * anything else we cannot do is_migration_disabled(), punt --- ... or we can return a RETRY_TASK when newidle balance succeeds too to force a core-wide pick when core scheduling is enabled like: diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0d212bf04885..e393aed58bfa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9946,8 +9946,16 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf) if (rq_modified_above(rq, &fair_sched_class)) return RETRY_TASK; - if (cfs_rq->nr_queued) + if (cfs_rq->nr_queued) { + /* + * Force a core-wide pick if newidle + * balance managed to pull a task since + * the lock was dropped. + */ + if (sched_core_enabled(rq)) + return RETRY_TASK; goto again; + } return NULL; } --- Thoughts? > if (new_tasks < 0) > return RETRY_TASK; -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] sched/core: Allow newidle for core-sched 2026-06-24 23:56 ` K Prateek Nayak @ 2026-06-25 12:41 ` Peter Zijlstra 2026-06-25 12:42 ` Peter Zijlstra 1 sibling, 0 replies; 8+ messages in thread From: Peter Zijlstra @ 2026-06-25 12:41 UTC (permalink / raw) To: K Prateek Nayak Cc: mingo, linux-kernel, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, ziqianlu, tj, williams, jkacur On Thu, Jun 25, 2026 at 05:26:03AM +0530, K Prateek Nayak wrote: > Should we treat core_pick similar to task_on_cpu() and go down the > stopper route like: > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2f4530eb543f..e7f64f34aa4f 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3049,7 +3049,9 @@ static int affine_move_task(struct rq *rq, struct task_struct *p, struct rq_flag > return -EINVAL; > } > > - if (task_on_cpu(rq, p) || READ_ONCE(p->__state) == TASK_WAKING) { > + if (task_on_cpu(rq, p) || > + task_on_core(rq, p) || > + READ_ONCE(p->__state) == TASK_WAKING) { > /* > * MIGRATE_ENABLE gets here because 'p == current', but for > * anything else we cannot do is_migration_disabled(), punt Bah, yes. Let me go stare more at this. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 1/2] sched/core: Allow newidle for core-sched 2026-06-24 23:56 ` K Prateek Nayak 2026-06-25 12:41 ` Peter Zijlstra @ 2026-06-25 12:42 ` Peter Zijlstra 1 sibling, 0 replies; 8+ messages in thread From: Peter Zijlstra @ 2026-06-25 12:42 UTC (permalink / raw) To: K Prateek Nayak Cc: mingo, linux-kernel, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, ziqianlu, tj, williams, jkacur On Thu, Jun 25, 2026 at 05:26:03AM +0530, K Prateek Nayak wrote: > ... or we can return a RETRY_TASK when newidle balance succeeds too to > force a core-wide pick when core scheduling is enabled like: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0d212bf04885..e393aed58bfa 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -9946,8 +9946,16 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf) > if (rq_modified_above(rq, &fair_sched_class)) > return RETRY_TASK; > > - if (cfs_rq->nr_queued) > + if (cfs_rq->nr_queued) { > + /* > + * Force a core-wide pick if newidle > + * balance managed to pull a task since > + * the lock was dropped. > + */ > + if (sched_core_enabled(rq)) > + return RETRY_TASK; > goto again; > + } > > return NULL; > } This has forward progress issues. You can ping-pong he one task back and forth and never actually pick anything. ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 2/2] sched: Remove sched_class::balance() 2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra 2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra @ 2026-06-24 12:13 ` Peter Zijlstra 2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu 2 siblings, 0 replies; 8+ messages in thread From: Peter Zijlstra @ 2026-06-24 12:13 UTC (permalink / raw) To: mingo Cc: linux-kernel, peterz, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kprateek.nayak, ziqianlu, tj, williams, jkacur Ever since commit 50653216e4ff ("sched: Add support to pick functions to take rf"), we have the unfortunate situation that both sched_class::balance() and sched_class::pick_task() have overlapping functionality in that they drop rq->lock and balance tasks. Additionally, prev_balance() is only called for a single RQ in the core-sched case, resulting in 'missed' balance opportunities in this case. The only classes with a balance callback are dl and rt, prev_balance() will run the callbacks from prev->class down, pick_next_task() runs the callbacks from stop_class down. Therefore, the only case where there is a difference is if prev->class == rt_sched_class and pick_next_task() stops at stop/dl. But in those cases the rt pull would have been pointless, it would move a high priority task to a runqueue that will not be able to run it. A subsequent pick that does reach rt, must have a prev of stop/dl priority (0,-1 resp.) and this will ensure need_pull_rt_task() is true and do the pull then. Therefore, move balance_{rt,dl}() into pick_task_{rt,dl}() and remove sched_class::balance(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- kernel/sched/core.c | 23 ----------------------- kernel/sched/deadline.c | 10 ++++++---- kernel/sched/fair.c | 38 ++++++++++---------------------------- kernel/sched/idle.c | 7 ------- kernel/sched/rt.c | 14 ++++++-------- kernel/sched/sched.h | 5 ----- kernel/sched/stop_task.c | 7 ------- 7 files changed, 22 insertions(+), 82 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6092,25 +6092,6 @@ static inline void schedule_debug(struct schedstat_inc(this_rq()->sched_count); } -static void prev_balance(struct rq *rq, struct rq_flags *rf) -{ - const struct sched_class *start_class = rq->donor->sched_class; - const struct sched_class *class; - - /* - * We must do the balancing pass before put_prev_task(), such - * that when we release the rq->lock the task is in the same - * state as before we took rq->lock. - * - * We can terminate the balance pass as soon as we know there is - * a runnable task of @class priority or higher. - */ - for_active_class_range(class, start_class, &idle_sched_class) { - if (class->balance && class->balance(rq, rf)) - break; - } -} - /* * Pick up the highest-prio task: */ @@ -6148,8 +6129,6 @@ __pick_next_task(struct rq *rq, struct r } restart: - prev_balance(rq, rf); - for_each_active_class(class) { p = class->pick_task(rq, rf); if (unlikely(p == RETRY_TASK)) @@ -6257,8 +6236,6 @@ pick_next_task(struct rq *rq, struct rq_ goto out_set_next; } - prev_balance(rq, rf); - smt_mask = cpu_smt_mask(cpu); need_sync = !!rq->core->core_cookie; --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2704,7 +2704,7 @@ static void check_preempt_equal_dl(struc resched_curr(rq); } -static int balance_dl(struct rq *rq, struct rq_flags *rf) +static void balance_dl(struct rq *rq, struct rq_flags *rf) { /* * Note, rq->donor may change during rq lock drops, @@ -2723,8 +2723,6 @@ static int balance_dl(struct rq *rq, str pull_dl_task(rq); rq_repin_lock(rq, rf); } - - return sched_stop_runnable(rq) || sched_dl_runnable(rq); } /* @@ -2817,6 +2815,11 @@ static struct task_struct *__pick_task_d struct dl_rq *dl_rq = &rq->dl; struct task_struct *p; + rq_modified_begin(rq, &dl_sched_class); + balance_dl(rq, rf); + if (rq_modified_above(rq, &dl_sched_class)) + return RETRY_TASK; + again: if (!sched_dl_runnable(rq)) return NULL; @@ -3652,7 +3655,6 @@ DEFINE_SCHED_CLASS(dl) = { .put_prev_task = put_prev_task_dl, .set_next_task = set_next_task_dl, - .balance = balance_dl, .select_task_rq = select_task_rq_dl, .migrate_task_rq = migrate_task_rq_dl, .set_cpus_allowed = set_cpus_allowed_dl, --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5732,7 +5732,7 @@ static inline unsigned long cfs_rq_load_ return cfs_rq->avg.load_avg; } -static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf) +static void sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf) __must_hold(__rq_lockp(this_rq)); static inline unsigned long task_util(struct task_struct *p) @@ -9916,7 +9916,6 @@ struct task_struct *pick_task_fair(struc struct cfs_rq *cfs_rq; struct task_struct *p; bool throttled; - int new_tasks; again: cfs_rq = &rq->cfs; @@ -9942,11 +9941,14 @@ struct task_struct *pick_task_fair(struc return p; idle: - new_tasks = sched_balance_newidle(rq, rf); - if (new_tasks < 0) + rq_modified_begin(rq, &fair_sched_class); + sched_balance_newidle(rq, rf); + if (rq_modified_above(rq, &fair_sched_class)) return RETRY_TASK; - if (new_tasks > 0) + + if (cfs_rq->nr_queued) goto again; + return NULL; } @@ -14334,13 +14336,8 @@ static inline void nohz_newidle_balance( /* * sched_balance_newidle is called by schedule() if this_cpu is about to become * idle. Attempts to pull tasks from other CPUs. - * - * Returns: - * < 0 - we released the lock and there are !fair tasks present - * 0 - failed, no new tasks - * > 0 - success, new (fair) tasks present */ -static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf) +static void sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf) __must_hold(__rq_lockp(this_rq)) { unsigned long next_balance = jiffies + HZ; @@ -14357,7 +14354,7 @@ static int sched_balance_newidle(struct * Return 0; the task will be enqueued when switching to idle. */ if (this_rq->ttwu_pending) - return 0; + return; /* * We must set idle_stamp _before_ calling sched_balance_rq() @@ -14370,7 +14367,7 @@ static int sched_balance_newidle(struct * Do not pull tasks towards !active CPUs... */ if (!cpu_active(this_cpu)) - return 0; + return; /* * This is OK, because current is on_cpu, which avoids it being picked @@ -14399,7 +14396,6 @@ static int sched_balance_newidle(struct t0 = sched_clock_cpu(this_cpu); __sched_balance_update_blocked_averages(this_rq); - rq_modified_begin(this_rq, &fair_sched_class); raw_spin_rq_unlock(this_rq); for_each_domain(this_cpu, sd) { @@ -14457,18 +14453,6 @@ static int sched_balance_newidle(struct if (curr_cost > this_rq->max_idle_balance_cost) this_rq->max_idle_balance_cost = curr_cost; - /* - * While browsing the domains, we released the rq lock, a task could - * have been enqueued in the meantime. Since we're not going idle, - * pretend we pulled a task. - */ - if (this_rq->cfs.h_nr_queued && !pulled_task) - pulled_task = 1; - - /* If a higher prio class was modified, restart the pick */ - if (rq_modified_above(this_rq, &fair_sched_class)) - pulled_task = -1; - out: /* Move the next balance forward */ if (time_after(this_rq->next_balance, next_balance)) @@ -14480,8 +14464,6 @@ static int sched_balance_newidle(struct nohz_newidle_balance(this_rq); rq_repin_lock(this_rq, rf); - - return pulled_task; } /* --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -464,12 +464,6 @@ select_task_rq_idle(struct task_struct * return task_cpu(p); /* IDLE tasks as never migrated */ } -static int -balance_idle(struct rq *rq, struct rq_flags *rf) -{ - return WARN_ON_ONCE(1); -} - /* * Idle tasks are unconditionally rescheduled: */ @@ -581,7 +575,6 @@ DEFINE_SCHED_CLASS(idle) = { .put_prev_task = put_prev_task_idle, .set_next_task = set_next_task_idle, - .balance = balance_idle, .select_task_rq = select_task_rq_idle, .set_cpus_allowed = set_cpus_allowed_common, --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1596,7 +1596,7 @@ static void check_preempt_equal_prio(str resched_curr(rq); } -static int balance_rt(struct rq *rq, struct rq_flags *rf) +static void balance_rt(struct rq *rq, struct rq_flags *rf) { /* * Note, rq->donor may change during rq lock drops, @@ -1615,8 +1615,6 @@ static int balance_rt(struct rq *rq, str pull_rt_task(rq); rq_repin_lock(rq, rf); } - - return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq); } /* @@ -1714,14 +1712,15 @@ static struct task_struct *_pick_next_ta static struct task_struct *pick_task_rt(struct rq *rq, struct rq_flags *rf) { - struct task_struct *p; + rq_modified_begin(rq, &rt_sched_class); + balance_rt(rq, rf); + if (rq_modified_above(rq, &rt_sched_class)) + return RETRY_TASK; if (!sched_rt_runnable(rq)) return NULL; - p = _pick_next_task_rt(rq); - - return p; + return _pick_next_task_rt(rq); } static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *next) @@ -2609,7 +2608,6 @@ DEFINE_SCHED_CLASS(rt) = { .put_prev_task = put_prev_task_rt, .set_next_task = set_next_task_rt, - .balance = balance_rt, .select_task_rq = select_task_rq_rt, .set_cpus_allowed = set_cpus_allowed_common, .rq_online = rq_online_rt, --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2638,11 +2638,6 @@ struct sched_class { void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags); /* - * schedule/pick_next_task/prev_balance: rq->lock - */ - int (*balance)(struct rq *rq, struct rq_flags *rf); - - /* * schedule/pick_next_task: rq->lock */ struct task_struct *(*pick_task)(struct rq *rq, struct rq_flags *rf); --- a/kernel/sched/stop_task.c +++ b/kernel/sched/stop_task.c @@ -15,12 +15,6 @@ select_task_rq_stop(struct task_struct * return task_cpu(p); /* stop tasks as never migrate */ } -static int -balance_stop(struct rq *rq, struct rq_flags *rf) -{ - return sched_stop_runnable(rq); -} - static void wakeup_preempt_stop(struct rq *rq, struct task_struct *p, int flags) { @@ -107,7 +101,6 @@ DEFINE_SCHED_CLASS(stop) = { .put_prev_task = put_prev_task_stop, .set_next_task = set_next_task_stop, - .balance = balance_stop, .select_task_rq = select_task_rq_stop, .set_cpus_allowed = set_cpus_allowed_common, ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 0/2] sched: Remove sched_class::balance() 2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra 2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra 2026-06-24 12:13 ` [PATCH 2/2] sched: Remove sched_class::balance() Peter Zijlstra @ 2026-07-02 11:49 ` Aaron Lu 2026-07-03 3:31 ` K Prateek Nayak 2 siblings, 1 reply; 8+ messages in thread From: Aaron Lu @ 2026-07-02 11:49 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, linux-kernel, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, kprateek.nayak, tj, williams, jkacur Hi Peter, On Wed, Jun 24, 2026 at 02:13:27PM +0200, Peter Zijlstra wrote: > Hi! > > As mentioned [1], I was going to move sched_balance_newidle() into > balance_fair(). And while the patch was quickly done, something didn't feel > right. > Sorry for replying late, been busy with other stuffs. The test caused a panic with below msg: [ 110.613929] BUG: kernel NULL pointer dereference, address: 0000000000000014 [ 110.615988] #PF: supervisor read access in kernel mode [ 110.617444] #PF: error_code(0x0000) - not-present page [ 110.618929] PGD 1044c2067 P4D 0 [ 110.619938] Oops: Oops: 0000 [#1] SMP NOPTI [ 110.621491] CPU: 34 UID: 1000 PID: 1394 Comm: sh Not tainted 7.1.0-rc2-00109-gb17be96b4d6c #4 PREEMPT(lazy) [ 110.624764] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 110.627559] RIP: 0010:pick_next_task+0x42b/0xa20 [ 110.628970] Code: 28 60 00 49 89 c7 8b 05 b3 8d 73 05 41 39 c7 0f 83 c3 01 00 00 49 63 cf 48 8b 1c cd 80 7b 96 82 48 01 eb 48 8b b3 50 0f 00 00 <8b> 4e 14 48 8b 0c cd 80 7b 96 82 48 8b 4c 29 20 48 39 ce 74 32 48 [ 110.634160] RSP: 0000:ffa000000435bc48 EFLAGS: 00010083 [ 110.635261] RAX: 0000000000000040 RBX: ff11000ff85fe140 RCX: 0000000000000022 [ 110.636753] RDX: 0000000000000022 RSI: 0000000000000000 RDI: ff110001008b54d8 [ 110.638230] RBP: ffffffff87452140 R08: 0000000000000022 R09: 00000000aed806e6 [ 110.639719] R10: 0000000000000001 R11: 0000000000000000 R12: ff11000ff85fe140 [ 110.641185] R13: 0000000000000000 R14: ff11000102373e88 R15: 0000000000000022 [ 110.642688] FS: 00007f698e381780(0000) GS:ff110010711ac000(0000) knlGS:0000000000000000 [ 110.644359] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 110.645255] CR2: 0000000000000014 CR3: 00000001044c1001 CR4: 0000000000371ef0 [ 110.646366] Call Trace: [ 110.646781] <TASK> [ 110.647123] ? sched_clock+0x10/0x30 [ 110.647704] __schedule+0x1a4/0x7c0 [ 110.648249] ? local_clock_noinstr+0xd/0xc0 [ 110.648914] preempt_schedule+0x37/0x50 [ 110.649528] preempt_schedule_thunk+0x16/0x30 [ 110.650217] _raw_spin_unlock+0x37/0x40 [ 110.650833] wp_page_copy+0x2b7/0x5f0 [ 110.651408] __handle_mm_fault+0x45f/0x6e0 [ 110.652068] handle_mm_fault+0x96/0x250 [ 110.652685] do_user_addr_fault+0x236/0x6a0 [ 110.653333] exc_page_fault+0x7a/0x210 [ 110.653937] asm_exc_page_fault+0x26/0x30 [ 110.654577] RIP: 0033:0x7f698e3a1f84 [ 110.655077] Code: 8d 3d e0 d4 1c 00 e9 0b 15 05 00 0f 1f 00 48 89 7d f8 48 8d 3d cd d4 1c 00 e8 38 14 05 00 48 8b 55 f8 eb 8c 66 90 f3 0f 1e fa <c7> 05 b2 d4 1c 00 00 00 00 00 c3 90 f3 0f 1e fa 31 c9 e9 65 f9 ff [ 110.657442] RSP: 002b:00007fff9a5ff9e8 EFLAGS: 00010246 [ 110.658113] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00007f698e443f63 [ 110.659037] RDX: 0000000000000000 RSI: 0000000000000018 RDI: 00007f698e381a60 [ 110.659973] RBP: 00007fff9a5ffa10 R08: 0000000000000000 R09: 000000000000000f [ 110.660896] R10: 00007f698e381a50 R11: 0000000000000246 R12: 0000000000000000 [ 110.661829] R13: 00000000ffffffff R14: 0000000000000000 R15: 00007fff9a5ffc20 [ 110.662745] </TASK> [ 110.663040] Modules linked in: [ 110.663451] CR2: 0000000000000014 [ 110.663893] ---[ end trace 0000000000000000 ]--- [ 110.664487] RIP: 0010:pick_next_task+0x42b/0xa20 [ 110.665090] Code: 28 60 00 49 89 c7 8b 05 b3 8d 73 05 41 39 c7 0f 83 c3 01 00 00 49 63 cf 48 8b 1c cd 80 7b 96 82 48 01 eb 48 8b b3 50 0f 00 00 <8b> 4e 14 48 8b 0c cd 80 7b 96 82 48 8b 4c 29 20 48 39 ce 74 32 48 [ 110.667457] RSP: 0000:ffa000000435bc48 EFLAGS: 00010083 [ 110.668146] RAX: 0000000000000040 RBX: ff11000ff85fe140 RCX: 0000000000000022 [ 110.669074] RDX: 0000000000000022 RSI: 0000000000000000 RDI: ff110001008b54d8 [ 110.670009] RBP: ffffffff87452140 R08: 0000000000000022 R09: 00000000aed806e6 [ 110.670942] R10: 0000000000000001 R11: 0000000000000000 R12: ff11000ff85fe140 [ 110.671861] R13: 0000000000000000 R14: ff11000102373e88 R15: 0000000000000022 [ 110.672787] FS: 00007f698e381780(0000) GS:ff110010711ac000(0000) knlGS:0000000000000000 [ 110.673826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 110.674567] CR2: 0000000000000014 CR3: 00000001044c1001 CR4: 0000000000371ef0 [ 110.675487] Kernel panic - not syncing: Fatal exception [ 111.737349] Shutting down cpus with NMI [ 111.738019] Kernel Offset: disabled [ 111.738373] Rebooting in 30 seconds.. $ ./scripts/faddr2line ../obj/guest/vmlinux pick_next_task+0x42b pick_next_task+0x42b/0xa20: task_cpu at include/linux/sched.h:2270 (inlined by) is_task_rq_idle at kernel/sched/core.c:6144 (inlined by) cookie_equals at kernel/sched/core.c:6149 (inlined by) pick_next_task at kernel/sched/core.c:6328 This is the cookie_equals(p, cookie) in pick_next_task(). Assume cpuX and cpuY are siblings, it appears the following happened: cpuX cpuY pick_next_task() goto restart_multi rqX->core_pick = pick_task(rqX) pick_task(rqY) pick_task_fair(rqY) sched_balance_newidle(rqY) raw_spin_rq_unlock(rqY) // drops core lock pick_next_task() goto restart_multi rqY->core_pick = pick_task(rqY) rqX->core_pick = pick_task(rqX) if (rqX->curr == rqX->core_pick) rqX->core_pick = NULL UNLOCK rq_lockp(rqY) raw_spin_rq_lock(rqY) rqY->core_pick = pick_task(rqY) p = rqX->core_pick // NULL cookie_equals(p, cookie) // NULL deref BTW, I applied this series on top of commit c095741713d1b("sched/fair: Fix newidle vs core-sched") of the queue/sched/core branch. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 0/2] sched: Remove sched_class::balance() 2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu @ 2026-07-03 3:31 ` K Prateek Nayak 0 siblings, 0 replies; 8+ messages in thread From: K Prateek Nayak @ 2026-07-03 3:31 UTC (permalink / raw) To: Aaron Lu, Peter Zijlstra Cc: mingo, linux-kernel, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid, tj, williams, jkacur Hi Aaron, On 7/2/2026 5:19 PM, Aaron Lu wrote: > Assume cpuX and cpuY are siblings, it appears the following happened: > > cpuX cpuY > > pick_next_task() > goto restart_multi > > rqX->core_pick = pick_task(rqX) > > pick_task(rqY) > pick_task_fair(rqY) > sched_balance_newidle(rqY) > raw_spin_rq_unlock(rqY) // drops core lock > > pick_next_task() > goto restart_multi > rqY->core_pick = pick_task(rqY) > rqX->core_pick = pick_task(rqX) > > if (rqX->curr == rqX->core_pick) > rqX->core_pick = NULL > > UNLOCK rq_lockp(rqY) > > raw_spin_rq_lock(rqY) > > rqY->core_pick = pick_task(rqY) > > p = rqX->core_pick // NULL > cookie_equals(p, cookie) // NULL deref I also feel doing a balance before the core_cookie is finalized can move tasks wrongly to a particular core, only to make them wait later because the pick converted on a different core cookie. Stealing via balance callback still seems like a good option. Then there is the whole issue of core_pick_seq possibly being updated by another CPU on the core by the time rq_lock is dropped and grabbed again. Afaict, anything that drops the core-wide lock should just do a RETRY_TASK if a new task arrives so everything is re-done withing a single core-wide lock critical-section. Up until this cleanup, only fair and ext used the RETRY_TASK mechanism and fair bypassed the newidle with core-sched enabled so it never used the RETRY_TASK mechanism with core-sched so I'm still skeptical on balance for core-sched being done as part of pick. Maybe Peter will have a way to sort it out after he is back from vacation if he hasn't got it all figured out already ;-) -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2026-07-03 3:31 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-06-24 12:13 [PATCH 0/2] sched: Remove sched_class::balance() Peter Zijlstra 2026-06-24 12:13 ` [PATCH 1/2] sched/core: Allow newidle for core-sched Peter Zijlstra 2026-06-24 23:56 ` K Prateek Nayak 2026-06-25 12:41 ` Peter Zijlstra 2026-06-25 12:42 ` Peter Zijlstra 2026-06-24 12:13 ` [PATCH 2/2] sched: Remove sched_class::balance() Peter Zijlstra 2026-07-02 11:49 ` [PATCH 0/2] " Aaron Lu 2026-07-03 3:31 ` K Prateek Nayak
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox