* [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
2026-06-23 0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
@ 2026-06-23 0:05 ` Ricardo Neri
2026-06-23 7:13 ` Vincent Guittot
2026-06-23 0:05 ` [PATCH v5 2/6] sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY Ricardo Neri
` (4 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-23 0:05 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song
Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel, Ricardo Neri
When picking a busiest CPU with only one running task, the function
sched_balance_find_src_rq() skips candidate CPUs if the destination CPU has
less than ~5% extra capacity. This condition only holds if all the SMT
siblings of a CPU are idle.
SMT siblings share the computing resources of a physical core and this
results in reduced capacity if more than one sibling is busy.
Skipping a CPU as described would prevent the load balancer from pulling
tasks from a scheduling group previously and correctly identified as
group_smt_balance (i.e., one with more than one task running).
Do not skip a candidate CPU of similar capacity if it has busy SMT
siblings.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v5:
* Optimized logic to identify CPUs with busy SMT siblings only when
needed. (Prateek, Chen Yu)
* Added Reviewed-by tag from Prateek. Thanks!
* Christian also provided his Reviewed-by tag, but the patch changed
significantly since then. I did not think it was correct to keep it
without him reviewing the updated patch first.
Changes in v4:
* Introduced this patch.
Changes in v3:
* N/A
Changes in v2:
* N/A
---
kernel/sched/fair.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee1..892abd7fcc18 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12976,9 +12976,17 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
* average load.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
- nr_running == 1)
- continue;
+ nr_running == 1) {
+ bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
+
+ /*
+ * Busy SMT siblings reduce the capacity of CPU @i. Do
+ * not skip it in this case.
+ */
+ if (!smt_degraded_cap &&
+ !capacity_greater(capacity_of(env->dst_cpu), capacity))
+ continue;
+ }
/*
* Make sure we only pull tasks from a CPU of lower priority
--
2.43.0
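For readers following the logic, here is a minimal user-space sketch of the skip decision as it stands after this patch. It is a model under stated assumptions, not the kernel code: struct src_cpu_model, skip_candidate() and smt_skip_demo() are hypothetical names, the core_idle field stands in for is_core_idle(i), and capacity_greater() is written as a function that reproduces the ~5% margin (1024 vs. 1078) of the macro in kernel/sched/fair.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the ~5% margin of capacity_greater() in kernel/sched/fair.c. */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Hypothetical snapshot of candidate CPU @i; not a kernel structure. */
struct src_cpu_model {
	unsigned long capacity;   /* runtime capacity of the candidate CPU */
	unsigned int nr_running;  /* runnable tasks on its runqueue */
	bool core_idle;           /* true if all its SMT siblings are idle */
};

/* Returns true when this check alone would skip the candidate. */
static bool skip_candidate(bool asym_capacity, bool smt_active,
			   unsigned long dst_capacity,
			   const struct src_cpu_model *src)
{
	bool smt_degraded_cap;

	if (!asym_capacity || src->nr_running != 1)
		return false;

	/* Busy siblings reduce the capacity of the candidate: keep it. */
	smt_degraded_cap = smt_active && !src->core_idle;
	if (smt_degraded_cap)
		return false;

	/* Skip unless the destination is noticeably more capable. */
	return !capacity_greater(dst_capacity, src->capacity);
}

/* Usage: same nominal capacity, with and without a busy sibling. */
void smt_skip_demo(void)
{
	struct src_cpu_model idle_core = { 1024, 1, true };
	struct src_cpu_model busy_core = { 1024, 1, false };

	assert(skip_candidate(true, true, 1024, &idle_core));
	assert(!skip_candidate(true, true, 1024, &busy_core));
}
```

In this model a candidate of equal capacity is skipped only while its core is otherwise idle, which matches the intent stated in the commit message.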
* Re: [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
2026-06-23 0:05 ` [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings Ricardo Neri
@ 2026-06-23 7:13 ` Vincent Guittot
2026-06-24 5:25 ` Ricardo Neri
0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2026-06-23 7:13 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> When picking a busiest CPU with only one running task, the function
> sched_balance_find_src_rq() skips candidate CPUs if the destination CPU has
> less than ~5% extra capacity. This condition only holds if all the SMT
> siblings of a CPU are idle.
>
> SMT siblings share the computing resources of a physical core and this
> results in reduced capacity if more than one sibling is busy.
>
> Skipping a CPU as described would prevent the load balancer from pulling
> tasks from a scheduling group previously and correctly identified as
> group_smt_balance (i.e., one with more than one task running).
>
> Do not skip a candidate CPU of similar capacity if it has busy SMT
> siblings.
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> Changes in v5:
> * Optimized logic to identify CPUs with busy SMT siblings only when
> needed. (Prateek, Chen Yu)
> * Added Reviewed-by tag from Prateek. Thanks!
> * Christian also provided his Reviewed-by tag, but the patch changed
> significantly since then. I did not think it was correct to keep it
> without him reviewing the updated patch first.
>
> Changes in v4:
> * Introduced this patch.
>
> Changes in v3:
> * N/A
>
> Changes in v2:
> * N/A
> ---
> kernel/sched/fair.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d78467ec6ee1..892abd7fcc18 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12976,9 +12976,17 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> * average load.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> - nr_running == 1)
> - continue;
> + nr_running == 1) {
> + bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
> +
> + /*
> + * Busy SMT siblings reduce the capacity of CPU @i. Do
> + * not skip it in this case.
> + */
> + if (!smt_degraded_cap &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity))
> + continue;
> + }
>
> /*
> * Make sure we only pull tasks from a CPU of lower priority
>
> --
> 2.43.0
>
* Re: [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
2026-06-23 7:13 ` Vincent Guittot
@ 2026-06-24 5:25 ` Ricardo Neri
0 siblings, 0 replies; 22+ messages in thread
From: Ricardo Neri @ 2026-06-24 5:25 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, Jun 23, 2026 at 09:13:41AM +0200, Vincent Guittot wrote:
> On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > When picking a busiest CPU with only one running task, the function
> > sched_balance_find_src_rq() skips candidate CPUs if the destination CPU has
> > less than ~5% extra capacity. This condition only holds if all the SMT
> > siblings of a CPU are idle.
> >
> > SMT siblings share the computing resources of a physical core and this
> > results in reduced capacity if more than one sibling is busy.
> >
> > Skipping a CPU as described would prevent the load balancer from pulling
> > tasks from a scheduling group previously and correctly identified as
> > group_smt_balance (i.e., one with more than one task running).
> >
> > Do not skip a candidate CPU of similar capacity if it has busy SMT
> > siblings.
> >
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thank you!
* [PATCH v5 2/6] sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
2026-06-23 0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-06-23 0:05 ` [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings Ricardo Neri
@ 2026-06-23 0:05 ` Ricardo Neri
2026-06-23 7:14 ` Vincent Guittot
2026-06-23 0:05 ` [PATCH v5 3/6] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
` (3 subsequent siblings)
5 siblings, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-23 0:05 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song
Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel, Ricardo Neri
The argument sg_overloaded of update_sg_lb_stats() is only consumed when
balancing at the root domain. It only makes sense to update it in such a
case. Commit 3229adbe7875 ("sched/fair: Do not compute overloaded status
unnecessarily during lb") updated the logic accordingly but missed the case
in which the root domain has the SD_ASYM_CPUCAPACITY flag. Fix this.
Fixes: 3229adbe7875 ("sched/fair: Do not compute overloaded status unnecessarily during lb")
Tested-by: Christian Loehle <christian.loehle@arm.com>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v5:
* Added Tested-by tag from Christian. Thanks!
Changes in v4:
* Introduced this patch.
Changes in v3:
* N/A
Changes in v2:
* N/A
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 892abd7fcc18..31baa0000616 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11863,7 +11863,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
- *sg_overloaded = 1;
+
+ if (balancing_at_rd)
+ *sg_overloaded = 1;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
--
2.43.0
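A minimal sketch of the gating this patch adds, modelled in user space. The names sg_stats_model, update_misfit_stats() and gate_demo() are hypothetical, and balancing_at_rd is simply passed in as a flag because the patch context does not show how the kernel derives it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the group statistics to the one field used here. */
struct sg_stats_model {
	unsigned long group_misfit_task_load;
};

/*
 * Track the largest misfit load seen in the group, but only report the
 * overloaded status when balancing at the root domain, where it is consumed.
 */
static void update_misfit_stats(struct sg_stats_model *sgs,
				unsigned long rq_misfit_load,
				bool balancing_at_rd, bool *sg_overloaded)
{
	if (sgs->group_misfit_task_load < rq_misfit_load) {
		sgs->group_misfit_task_load = rq_misfit_load;

		if (balancing_at_rd)
			*sg_overloaded = true;
	}
}

/* Usage: the same misfit load seen below and at the root domain. */
void gate_demo(void)
{
	struct sg_stats_model sgs = { 0 };
	bool overloaded = false;

	update_misfit_stats(&sgs, 300, false, &overloaded);
	assert(sgs.group_misfit_task_load == 300 && !overloaded);

	sgs.group_misfit_task_load = 0;
	update_misfit_stats(&sgs, 300, true, &overloaded);
	assert(overloaded);
}
```

The misfit load is recorded in both calls; only the write to the overloaded flag depends on the domain level.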
* Re: [PATCH v5 2/6] sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
2026-06-23 0:05 ` [PATCH v5 2/6] sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY Ricardo Neri
@ 2026-06-23 7:14 ` Vincent Guittot
0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2026-06-23 7:14 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> The argument sg_overloaded of update_sg_lb_stats() is only consumed when
> balancing at the root domain. It only makes sense to update it in such a
> case. Commit 3229adbe7875 ("sched/fair: Do not compute overloaded status
> unnecessarily during lb") updated the logic accordingly but missed the case
> in which the root domain has the SD_ASYM_CPUCAPACITY flag. Fix this.
>
> Fixes: 3229adbe7875 ("sched/fair: Do not compute overloaded status unnecessarily during lb")
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Reported-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> Changes in v5:
> * Added Tested-by tag from Christian. Thanks!
>
> Changes in v4:
> * Introduced this patch.
>
> Changes in v3:
> * N/A
>
> Changes in v2:
> * N/A
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 892abd7fcc18..31baa0000616 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11863,7 +11863,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> /* Check for a misfit task on the cpu */
> if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> sgs->group_misfit_task_load = rq->misfit_task_load;
> - *sg_overloaded = 1;
> +
> + if (balancing_at_rd)
> + *sg_overloaded = 1;
> }
> } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> /* Check for a task running on a CPU with reduced capacity */
>
> --
> 2.43.0
>
* [PATCH v5 3/6] sched/fair: Check CPU capacity before comparing group types during load balance
2026-06-23 0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-06-23 0:05 ` [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings Ricardo Neri
2026-06-23 0:05 ` [PATCH v5 2/6] sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY Ricardo Neri
@ 2026-06-23 0:05 ` Ricardo Neri
2026-06-23 0:05 ` [PATCH v5 4/6] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
` (2 subsequent siblings)
5 siblings, 0 replies; 22+ messages in thread
From: Ricardo Neri @ 2026-06-23 0:05 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song
Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel, Vincent Guittot, Ricardo Neri
update_sd_pick_busiest() may incorrectly select a fully_busy group as the
busiest group when its per-CPU capacity exceeds that of the destination
CPU. This happens because the type of busiest group is initialized to
group_has_spare and allows the fully_busy group to win the type comparison.
update_sd_pick_busiest() should not choose a candidate scheduling group
with at most one runnable task if its per-CPU capacity is greater than that
of the destination CPU. Such a check already exists, but it is done too
late: after the type comparison, preventing a subsequent fully_busy group
of equal per-CPU capacity from being correctly selected.
Move this check to occur before comparing group types.
Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v5:
* Added Tested-by tag from Christian. Thanks!
Changes in v4:
* Dropped note on SMT not being affected since SMT + asym capacity is
now supported.
* Added Reviewed-by tags from Vincent, Tim, and Chen Yu. Thanks!
Changes in v3:
* Added a Fixes tag. (Christian)
* Added Reviewed-by tag from Christian. Thanks!
Changes in v2:
* Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
exclusive. (Tim)
* Kept parentheses around bitwise operators for clarity.
* Rewrote patch description for clarity.
---
kernel/sched/fair.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31baa0000616..030675e249b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11944,6 +11944,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
sds->local_stat.group_type != group_has_spare))
return false;
+ /*
+ * Candidate sg has no more than one task per CPU and has higher
+ * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+ * throughput. Maximize throughput, power/energy consequences are not
+ * considered.
+ */
+ if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ (sgs->group_type <= group_fully_busy) &&
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -12050,17 +12061,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;
}
- /*
- * Candidate sg has no more than one task per CPU and has higher
- * per-CPU capacity. Migrating tasks to less capable CPUs may harm
- * throughput. Maximize throughput, power/energy consequences are not
- * considered.
- */
- if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
- (sgs->group_type <= group_fully_busy) &&
- (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
- return false;
-
return true;
}
--
2.43.0
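The effect of moving the capacity check can be illustrated with a small user-space model. This is a sketch under several assumptions: SD_ASYM_CPUCAPACITY is taken as set, the group types are reduced to three, the same-type tie-break is simplified to an average-load comparison, and all names (group_model, pick_busiest(), find_busiest(), the check_first flag) and the load values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Models the ~5% margin of capacity_greater() in kernel/sched/fair.c. */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Reduced subset of the group types, in ascending order of "busyness". */
enum group_type_model {
	GROUP_HAS_SPARE,
	GROUP_FULLY_BUSY,
	GROUP_OVERLOADED,
};

struct group_model {
	enum group_type_model type;
	unsigned long min_capacity;  /* lowest per-CPU capacity in the group */
	unsigned long avg_load;
};

/* At most one task per CPU and noticeably more capable than the destination. */
static bool too_capable(const struct group_model *sg, unsigned long dst_capacity)
{
	return sg->type <= GROUP_FULLY_BUSY &&
	       capacity_greater(sg->min_capacity, dst_capacity);
}

/* check_first selects the order after the patch; false models the old order. */
static bool pick_busiest(const struct group_model *sg,
			 const struct group_model *busiest,
			 unsigned long dst_capacity, bool check_first)
{
	if (check_first && too_capable(sg, dst_capacity))
		return false;
	if (sg->type > busiest->type)
		return true;
	if (sg->type < busiest->type)
		return false;
	/* Same type: simplified tie-break on average load. */
	if (sg->avg_load < busiest->avg_load)
		return false;
	if (!check_first && too_capable(sg, dst_capacity))
		return false;
	return true;
}

/* Returns the index of the chosen group, or -1 if none qualifies. */
int find_busiest(const struct group_model *groups, size_t nr,
		 unsigned long dst_capacity, bool check_first)
{
	struct group_model busiest = { GROUP_HAS_SPARE, 0, 0 };
	int idx = -1;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (pick_busiest(&groups[i], &busiest, dst_capacity, check_first)) {
			busiest = groups[i];
			idx = (int)i;
		}
	}
	return idx;
}

/* Usage: a big fully_busy group followed by one matching the destination. */
void pick_order_demo(void)
{
	const struct group_model groups[] = {
		{ GROUP_FULLY_BUSY, 1024, 100 },
		{ GROUP_FULLY_BUSY,  512,  50 },
	};

	assert(find_busiest(groups, 2, 512, false) == 0);
	assert(find_busiest(groups, 2, 512, true) == 1);
}
```

With the old order the higher-capacity group wins the type comparison against the initial has_spare value and then blocks the second group; with the check moved up front, the group of equal per-CPU capacity is selected.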
* [PATCH v5 4/6] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-06-23 0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
` (2 preceding siblings ...)
2026-06-23 0:05 ` [PATCH v5 3/6] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-06-23 0:05 ` Ricardo Neri
2026-06-23 0:05 ` [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
2026-06-23 0:05 ` [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
5 siblings, 0 replies; 22+ messages in thread
From: Ricardo Neri @ 2026-06-23 0:05 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song
Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel, Ricardo Neri
In domains with asymmetric capacity, identifying misfit load in a
scheduling group is not useful when the destination CPU cannot help (i.e.,
its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
such cases, it also prevents load balance among clusters of equal capacity
when CONFIG_SCHED_CLUSTER is enabled. This happens because
update_sd_pick_busiest() skips candidate groups of type misfit_task if the
destination CPU has similar capacity.
Skipping misfit load accounting in this situation allows the group to be
classified as has_spare or fully_busy and lets load balancing proceed. Keep
marking scheduling groups as overloaded when misfit tasks are present. The
sg_overloaded flag propagates to the root domain and allows bigger CPUs in
it to help via newly idle balance.
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v5:
* Added Tested-by tag from Christian. Thanks!
Changes in v4:
* Added Reviewed-by tags from Vincent and Chen Yu. Thanks!
Changes in v3:
* Added Reviewed-by tag from Christian. Thanks!
Changes in v2:
* Moved the check of the destination CPU capacity inside the code block
used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
exclusion of the sched_reduced_capacity() path.
* Keep marking the root domain as overloaded to allow bigger CPUs to
help. (sashiko)
* Fixed patch description to clarify that the capacity_greater() looks
for differences of 5% or more. (Christian)
* Reworded the patch description for clarity.
* I did not include the Reviewed-by tag from Christian since the patch
changed functionally.
---
kernel/sched/fair.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 030675e249b5..e55eb019d2c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11860,12 +11860,25 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
if (sd_flags & SD_ASYM_CPUCAPACITY) {
- /* Check for a misfit task on the cpu */
- if (sgs->group_misfit_task_load < rq->misfit_task_load) {
- sgs->group_misfit_task_load = rq->misfit_task_load;
-
+ if (rq->misfit_task_load) {
+ /*
+ * Always mark the root domain overloaded so big
+ * CPUs can pick up misfit tasks via newly idle
+ * balance.
+ */
if (balancing_at_rd)
*sg_overloaded = 1;
+
+ /*
+ * Only account misfit load if @dst_cpu can
+ * help; otherwise, the group may be classified
+ * as misfit_task and update_sd_pick_busiest()
+ * will skip it.
+ */
+ if (capacity_greater(capacity_of(env->dst_cpu),
+ group->sgc->max_capacity) &&
+ (sgs->group_misfit_task_load < rq->misfit_task_load))
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
--
2.43.0
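A minimal user-space sketch of the accounting after this patch, assuming the balancer is already inside the SD_ASYM_CPUCAPACITY branch. The names sg_stats_model, account_misfit() and misfit_demo() are hypothetical; capacity_greater() is modelled as a function with the same ~5% margin as the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the ~5% margin of capacity_greater() in kernel/sched/fair.c. */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Hypothetical reduction of the group statistics to the one field used here. */
struct sg_stats_model {
	unsigned long group_misfit_task_load;
};

static void account_misfit(struct sg_stats_model *sgs,
			   unsigned long rq_misfit_load,
			   unsigned long dst_capacity,
			   unsigned long group_max_capacity,
			   bool balancing_at_rd, bool *sg_overloaded)
{
	if (!rq_misfit_load)
		return;

	/* Always flag the overload at the root domain so bigger CPUs can help. */
	if (balancing_at_rd)
		*sg_overloaded = true;

	/* Account misfit load only if the destination CPU can actually help. */
	if (capacity_greater(dst_capacity, group_max_capacity) &&
	    sgs->group_misfit_task_load < rq_misfit_load)
		sgs->group_misfit_task_load = rq_misfit_load;
}

/* Usage: a destination of equal capacity versus a bigger one. */
void misfit_demo(void)
{
	struct sg_stats_model sgs = { 0 };
	bool overloaded = false;

	account_misfit(&sgs, 300, 512, 512, true, &overloaded);
	assert(overloaded && sgs.group_misfit_task_load == 0);

	account_misfit(&sgs, 300, 1024, 512, true, &overloaded);
	assert(sgs.group_misfit_task_load == 300);
}
```

In the first call the group keeps a misfit load of zero, so in this model it would not be classified as misfit_task, while the overloaded flag is still raised.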
* [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-23 0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
` (3 preceding siblings ...)
2026-06-23 0:05 ` [PATCH v5 4/6] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-06-23 0:05 ` Ricardo Neri
2026-06-23 7:20 ` Vincent Guittot
2026-06-27 19:07 ` Andrea Righi
2026-06-23 0:05 ` [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
5 siblings, 2 replies; 22+ messages in thread
From: Ricardo Neri @ 2026-06-23 0:05 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song
Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel, Ricardo Neri
sched_balance_find_src_rq() avoids selecting a runqueue with a single
running task as busiest if doing so results in migrating the task to a
CPU with less than ~5% of extra capacity. It also unintentionally
prevents migrations between CPUs of identical capacity.
When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
clusters of CPUs with the same capacity. Allowing migration between CPUs
of identical capacity is necessary to meet this goal.
Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
runtime reductions due to side activity or thermal pressure. Guard this
check with the sched_cluster_active static key so that systems without
cluster topology are unaffected.
Tested-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v5:
* Optimized logic to identify same-arch clusters only when needed.
* Added Tested-by tag from Christian. Thanks!
Changes in v4:
* Implemented the check for cluster with a local variable for improved
readability.
Changes in v3:
* Reverted the inverted capacity check; the inverted form incorrectly
allows migrations to CPUs of slightly less capacity.
* Guarded the check for architectural capacity with the
sched_cluster_active static key.
Changes in v2:
* Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
runtime variability.
* Inverted the check for runtime capacity. (Christian)
* Reworded patch description for clarity.
---
kernel/sched/fair.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e55eb019d2c9..f4eb55cad54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12992,13 +12992,20 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
nr_running == 1) {
+ bool same_arch_cluster = static_branch_unlikely(&sched_cluster_active) &&
+ (arch_scale_cpu_capacity(env->dst_cpu) ==
+ arch_scale_cpu_capacity(i));
bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
/*
* Busy SMT siblings reduce the capacity of CPU @i. Do
* not skip it in this case.
+ *
+ * CONFIG_SCHED_CLUSTER requires balancing load across clusters
+ * of identical capacity. Use architectural capacity to ignore
+ * runtime variability.
*/
- if (!smt_degraded_cap &&
+ if (!smt_degraded_cap && !same_arch_cluster &&
!capacity_greater(capacity_of(env->dst_cpu), capacity))
continue;
}
--
2.43.0
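The combined condition after this patch can be sketched in user space as follows. The model assumes the outer test (SD_ASYM_CPUCAPACITY set and a single running task) has already passed. The names cpu_model, skip_single_task_src() and cluster_demo() are hypothetical, cluster_active stands in for the sched_cluster_active static key, and the capacity values are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the ~5% margin of capacity_greater() in kernel/sched/fair.c. */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Hypothetical per-CPU view: architectural versus runtime capacity. */
struct cpu_model {
	unsigned long arch_capacity;  /* stands in for arch_scale_cpu_capacity() */
	unsigned long capacity;       /* stands in for capacity_of() */
};

/* Returns true when the single-task candidate would be skipped. */
static bool skip_single_task_src(bool cluster_active, bool smt_degraded_cap,
				 const struct cpu_model *dst,
				 const struct cpu_model *src)
{
	bool same_arch_cluster = cluster_active &&
				 dst->arch_capacity == src->arch_capacity;

	if (smt_degraded_cap || same_arch_cluster)
		return false;

	return !capacity_greater(dst->capacity, src->capacity);
}

/* Usage: identical architectural capacity, slightly different runtime capacity. */
void cluster_demo(void)
{
	struct cpu_model dst = { 1024, 1000 };
	struct cpu_model src = { 1024, 1010 };

	assert(!skip_single_task_src(true, false, &dst, &src));
	assert(skip_single_task_src(false, false, &dst, &src));
}
```

Comparing the architectural values keeps the two CPUs eligible for balancing even though their runtime capacities differ slightly.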
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-23 0:05 ` [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-06-23 7:20 ` Vincent Guittot
2026-06-23 7:45 ` Christian Loehle
2026-06-27 19:07 ` Andrea Righi
1 sibling, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2026-06-23 7:20 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> running task as busiest if doing so results in migrating the task to a
> CPU with less than ~5% of extra capacity. It also unintentionally
> prevents migrations between CPUs of identical capacity.
>
> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> clusters of CPUs with the same capacity. Allowing migration between CPUs
> of identical capacity is necessary to meet this goal.
>
> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
capacity_of() reflects not only RT and irq pressure but also thermal
pressure or system frequency capping.
If dst cluster is under thermal mitigation but the source cluster is
not, we probably shouldn't spread tasks across both clusters.
Have you considered using get_actual_cpu_capacity() instead of
arch_scale_cpu_capacity() ?
> runtime reductions due to side activity or thermal pressure. Guard this
> check with the sched_cluster_active static key so that systems without
> cluster topology are unaffected.
>
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v5:
> * Optimized logic to identify same-arch clusters only when needed.
> * Added Tested-by tag from Christian. Thanks!
>
> Changes in v4:
> * Implemented the check for cluster with a local variable for improved
> readability.
>
> Changes in v3:
> * Reverted the inverted capacity check; the inverted form incorrectly
> allows migrations to CPUs of slightly less capacity.
> * Guarded the check for architectural capacity with the
> sched_cluster_active static key.
>
> Changes in v2:
> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
> runtime variability.
> * Inverted the check for runtime capacity. (Christian)
> * Reworded patch description for clarity.
> ---
> kernel/sched/fair.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e55eb019d2c9..f4eb55cad54d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12992,13 +12992,20 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> nr_running == 1) {
> + bool same_arch_cluster = static_branch_unlikely(&sched_cluster_active) &&
> + (arch_scale_cpu_capacity(env->dst_cpu) ==
> + arch_scale_cpu_capacity(i));
> bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
>
> /*
> * Busy SMT siblings reduce the capacity of CPU @i. Do
> * not skip it in this case.
> + *
> + * CONFIG_SCHED_CLUSTER requires balancing load across clusters
> + * of identical capacity. Use architectural capacity to ignore
> + * runtime variability.
> */
> - if (!smt_degraded_cap &&
> + if (!smt_degraded_cap && !same_arch_cluster &&
> !capacity_greater(capacity_of(env->dst_cpu), capacity))
> continue;
> }
>
> --
> 2.43.0
>
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-23 7:20 ` Vincent Guittot
@ 2026-06-23 7:45 ` Christian Loehle
2026-06-24 5:25 ` Ricardo Neri
2026-06-26 15:20 ` Vincent Guittot
0 siblings, 2 replies; 22+ messages in thread
From: Christian Loehle @ 2026-06-23 7:45 UTC (permalink / raw)
To: Vincent Guittot, Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, K Prateek Nayak, Barry Song,
Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel
On 6/23/26 08:20, Vincent Guittot wrote:
> On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
>>
>> sched_balance_find_src_rq() avoids selecting a runqueue with a single
>> running task as busiest if doing so results in migrating the task to a
>> CPU with less than ~5% of extra capacity. It also unintentionally
>> prevents migrations between CPUs of identical capacity.
>>
>> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
>> clusters of CPUs with the same capacity. Allowing migration between CPUs
>> of identical capacity is necessary to meet this goal.
>>
>> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
>
> capacity_of() reflects not only RT and irq pressure but also thermal
> pressure or system frequency capping.
> If dst cluster is under thermal mitigation but the source cluster is
> not, we probably shouldn't spread tasks across both clusters.
> Have you considered using get_actual_cpu_capacity() instead of
> arch_scale_cpu_capacity() ?
Replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
would make the == comparison below very unlikely to be true FWIW.
I think it's fine like that, I will prepare a follow-up anyway to make
it work for our "almost equal capacity" cluster systems and then also
consider switching to get_actual_cpu_capacity() since we include a margin
anyway.
>
>> runtime reductions due to side activity or thermal pressure. Guard this
>> check with the sched_cluster_active static key so that systems without
>> cluster topology are unaffected.
>>
>> Tested-by: Christian Loehle <christian.loehle@arm.com>
>> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
>> ---
>> Changes in v5:
>> * Optimized logic to identify same-arch clusters only when needed.
>> * Added Tested-by tag from Christian. Thanks!
>>
>> Changes in v4:
>> * Implemented the check for cluster with a local variable for improved
>> readability.
>>
>> Changes in v3:
>> * Reverted the inverted capacity check; the inverted form incorrectly
>> allows migrations to CPUs of slightly less capacity.
>> * Guarded the check for architectural capacity with the
>> sched_cluster_active static key.
>>
>> Changes in v2:
>> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
>> runtime variability.
>> * Inverted the check for runtime capacity. (Christian)
>> * Reworded patch description for clarity.
>> ---
>> kernel/sched/fair.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e55eb019d2c9..f4eb55cad54d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12992,13 +12992,20 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>> */
>> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
>> nr_running == 1) {
>> + bool same_arch_cluster = static_branch_unlikely(&sched_cluster_active) &&
>> + (arch_scale_cpu_capacity(env->dst_cpu) ==
>> + arch_scale_cpu_capacity(i));
>> bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
>>
>> /*
>> * Busy SMT siblings reduce the capacity of CPU @i. Do
>> * not skip it in this case.
>> + *
>> + * CONFIG_SCHED_CLUSTER requires balancing load across clusters
>> + * of identical capacity. Use architectural capacity to ignore
>> + * runtime variability.
>> */
>> - if (!smt_degraded_cap &&
>> + if (!smt_degraded_cap && !same_arch_cluster &&
>> !capacity_greater(capacity_of(env->dst_cpu), capacity))
>> continue;
>> }
>>
>> --
>> 2.43.0
>>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-23 7:45 ` Christian Loehle
@ 2026-06-24 5:25 ` Ricardo Neri
2026-06-26 0:11 ` Ricardo Neri
2026-06-26 15:20 ` Vincent Guittot
1 sibling, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-24 5:25 UTC (permalink / raw)
To: Christian Loehle, g
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, Jun 23, 2026 at 08:45:23AM +0100, Christian Loehle wrote:
> On 6/23/26 08:20, Vincent Guittot wrote:
> > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > <ricardo.neri-calderon@linux.intel.com> wrote:
> >>
> >> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> >> running task as busiest if doing so results in migrating the task to a
> >> CPU with less than ~5% of extra capacity. It also unintentionally
> >> prevents migrations between CPUs of identical capacity.
> >>
> >> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> >> clusters of CPUs with the same capacity. Allowing migration between CPUs
> >> of identical capacity is necessary to meet this goal.
> >>
> >> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> >
> > capacity_of() reflects not only RT and irq pressure but also thermal
> > pressure or system frequency capping.
> > If dst cluster is under thermal mitigation but the source cluster is
> > not, we probably shouldn't spread tasks across both clusters.
> > Have you considered using get_actual_cpu_capacity() instead of
> > arch_scale_cpu_capacity() ?
>
> Replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
> would make the == comparison below very unlikely to be true FWIW.
Yes, this is what I thought too. I did not try with get_actual_cpu_capacity(),
though. Perhaps on Intel processors it would work since rq->avg_hw.load_avg
is not used, IIUC. I am not sure about cpufreq_pressure. I need to check.
Still, it may work for Intel processors but not for ARM ones.
> I think it's fine like that, I will prepare a follow-up anyway to make
> it work for our "almost equal capacity" cluster systems and then also
> consider switching to get_actual_cpu_capacity() since we include a margin
> anyway.
Great!
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-24 5:25 ` Ricardo Neri
@ 2026-06-26 0:11 ` Ricardo Neri
2026-06-26 14:50 ` Vincent Guittot
0 siblings, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-26 0:11 UTC (permalink / raw)
To: Christian Loehle, g
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, Jun 23, 2026 at 10:25:14PM -0700, Ricardo Neri wrote:
> On Tue, Jun 23, 2026 at 08:45:23AM +0100, Christian Loehle wrote:
> > On 6/23/26 08:20, Vincent Guittot wrote:
> > > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > > <ricardo.neri-calderon@linux.intel.com> wrote:
> > >>
> > >> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> > >> running task as busiest if doing so results in migrating the task to a
> > >> CPU with less than ~5% of extra capacity. It also unintentionally
> > >> prevents migrations between CPUs of identical capacity.
> > >>
> > >> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> > >> clusters of CPUs with the same capacity. Allowing migration between CPUs
> > >> of identical capacity is necessary to meet this goal.
> > >>
> > >> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> > >
> > > capacity_of() reflects not only RT and irq pressure but also thermal
> > > pressure or system frequency capping.
> > > If dst cluster is under thermal mitigation but the source cluster is
> > > not, we probably shouldn't spread tasks across both clusters.
> > > Have you considered using get_actual_cpu_capacity() instead of
> > > arch_scale_cpu_capacity() ?
> >
> > Replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
> > would make the == comparison below very unlikely to be true FWIW.
>
> Yes, this is what I thought too. I did not try with get_actual_cpu_capacity(),
> though. Perhaps on Intel processors it would work since rq->avg_hw.load_avg
> is not used, IIUC. I am not sure about cpufreq_pressure. I need to check.
>
> Still, it may work for Intel processors but not for ARM ones.
>
> > I think it's fine like that, I will prepare a follow-up anyway to make
> > it work for our "almost equal capacity" cluster systems and then also
> > consider switching to get_actual_cpu_capacity() since we include a margin
> > anyway.
>
> Great!
I confirmed that x86 does not use rq->avg_hw.load_avg or cpufreq_pressure.
Hence, get_actual_cpu_capacity() worked for me on Intel hybrid processors,
but it would not on other architectures.
So perhaps we can stick with arch_scale_cpu_capacity() for now? The series
from Christian will add a margin, making the use of get_actual_cpu_capacity()
feasible.
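
To make the trade-off concrete, here is a minimal user-space sketch of the
two ways of deciding that two CPUs have "the same" capacity. It is only an
illustration: cap_equal_exact(), cap_equal_margin() and CAP_MARGIN_PCT are
hypothetical names, and the 5% tolerance is an assumption, not the margin
that the follow-up series will use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tolerance: capacities within 5% count as equal. */
#define CAP_MARGIN_PCT 5UL

/*
 * Exact match. Reasonable for architectural capacity, which is fixed,
 * but fragile for a runtime value that moves with system pressure.
 */
static bool cap_equal_exact(unsigned long a, unsigned long b)
{
	return a == b;
}

/* Hypothetical margin-based match that tolerates small runtime differences. */
static bool cap_equal_margin(unsigned long a, unsigned long b)
{
	unsigned long hi = a > b ? a : b;
	unsigned long lo = a > b ? b : a;

	/* Equivalent to (hi - lo) <= hi * CAP_MARGIN_PCT / 100, without division. */
	return (hi - lo) * 100 <= hi * CAP_MARGIN_PCT;
}
```

With capacities 1024 and 1010, the exact check fails while the margin check
still treats the two CPUs as equal. A larger gap, such as 1024 versus 900,
fails both.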
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-26 0:11 ` Ricardo Neri
@ 2026-06-26 14:50 ` Vincent Guittot
2026-06-27 2:02 ` Ricardo Neri
0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2026-06-26 14:50 UTC (permalink / raw)
To: Ricardo Neri
Cc: Christian Loehle, g, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Fri, 26 Jun 2026 at 02:02, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Tue, Jun 23, 2026 at 10:25:14PM -0700, Ricardo Neri wrote:
> > On Tue, Jun 23, 2026 at 08:45:23AM +0100, Christian Loehle wrote:
> > > On 6/23/26 08:20, Vincent Guittot wrote:
> > > > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > > > <ricardo.neri-calderon@linux.intel.com> wrote:
> > > >>
> > > >> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> > > >> running task as busiest if doing so results in migrating the task to a
> > > >> CPU with less than ~5% of extra capacity. It also unintentionally
> > > >> prevents migrations between CPUs of identical capacity.
> > > >>
> > > >> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> > > >> clusters of CPUs with the same capacity. Allowing migration between CPUs
> > > >> of identical capacity is necessary to meet this goal.
> > > >>
> > > >> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> > > >
> > > > capacity_of() reflects not only RT and irq pressure but also thermal
> > > > pressure or system frequency capping.
> > > > If dst cluster is under thermal mitigation but the source cluster is
> > > > not, we probably shouldn't spread tasks across both clusters.
> > > > Have you considered using get_actual_cpu_capacity() instead of
> > > > arch_scale_cpu_capacity() ?
> > >
> > > Replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
> > > would make the == comparison below very unlikely to be true FWIW.
> >
> > Yes, this is what I thought too. I did not try with get_actual_cpu_capacity(),
> > though. Perhaps on Intel processors it would work since rq->avg_hw.load_avg
> > is not used, IIUC. I am not sure about cpufreq_pressure. I need to check.
> >
> > Still, it may work for Intel processors but not for ARM ones.
> >
> > > I think it's fine like that, I will prepare a follow-up anyway to make
> > > it work for our "almost equal capacity" cluster systems and then also
> > > consider switching to get_actual_cpu_capacity() since we include a margin
> > > anyway.
> >
> > Great!
>
> I confirmed that x86 does not use rq->avg_hw.load_avg or cpufreq_pressure.
I'm not surprised that Intel doesn't use rq->avg_hw.load_avg, but I'm
pretty sure that you use cpufreq_pressure, because any call to
freq_qos_add_request(..., FREQ_QOS_MAX), like scaling_max_freq, will
update cpufreq_pressure.
If one cluster has its max freq capped, you will spread tasks between
the uncapped and the capped clusters which no longer have the same
compute capacity.
> Hence, get_actual_cpu_capacity() worked for me on Intel hybrid processors,
> but it would not on other architectures.
>
> So perhaps we can stick with arch_scale_cpu_capacity() for now? The series
> from Christian will add a margin, making the use of get_actual_cpu_capacity()
> feasible.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-26 14:50 ` Vincent Guittot
@ 2026-06-27 2:02 ` Ricardo Neri
0 siblings, 0 replies; 22+ messages in thread
From: Ricardo Neri @ 2026-06-27 2:02 UTC (permalink / raw)
To: Vincent Guittot
Cc: Christian Loehle, g, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Fri, Jun 26, 2026 at 04:50:12PM +0200, Vincent Guittot wrote:
> On Fri, 26 Jun 2026 at 02:02, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > On Tue, Jun 23, 2026 at 10:25:14PM -0700, Ricardo Neri wrote:
> > > On Tue, Jun 23, 2026 at 08:45:23AM +0100, Christian Loehle wrote:
> > > > On 6/23/26 08:20, Vincent Guittot wrote:
> > > > > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > > > > <ricardo.neri-calderon@linux.intel.com> wrote:
> > > > >>
> > > > >> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> > > > >> running task as busiest if doing so results in migrating the task to a
> > > > >> CPU with less than ~5% of extra capacity. It also unintentionally
> > > > >> prevents migrations between CPUs of identical capacity.
> > > > >>
> > > > >> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> > > > >> clusters of CPUs with the same capacity. Allowing migration between CPUs
> > > > >> of identical capacity is necessary to meet this goal.
> > > > >>
> > > > >> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> > > > >
> > > > > capacity_of() reflects not only RT and irq pressure but also thermal
> > > > > pressure or system frequency capping.
> > > > > If dst cluster is under thermal mitigation but the source cluster is
> > > > > not, we probably shouldn't spread tasks across both clusters.
> > > > > Have you considered using get_actual_cpu_capacity() instead of
> > > > > arch_scale_cpu_capacity() ?
> > > >
> > > > Replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
> > > > would make the == comparison below very unlikely to be true FWIW.
> > >
> > > Yes, this is what I thought too. I did not try with get_actual_cpu_capacity(),
> > > though. Perhaps on Intel processors it would work since rq->avg_hw.load_avg
> > > is not used, IIUC. I am not sure about cpufreq_pressure. I need to check.
> > >
> > > Still, it may work for Intel processors but not for ARM ones.
> > >
> > > > I think it's fine like that, I will prepare a follow-up anyway to make
> > > > it work for our "almost equal capacity" cluster systems and then also
> > > > consider switching to get_actual_cpu_capacity() since we include a margin
> > > > anyway.
> > >
> > > Great!
> >
> > I confirmed that x86 does not use rq->avg_hw.load_avg or cpufreq_pressure.
>
> I'm not surprised that Intel doesn't use rq->avg_hw.load_avg, but I'm
> pretty sure that you use cpufreq_pressure, because any call to
> freq_qos_add_request(..., FREQ_QOS_MAX), like scaling_max_freq, will
> update cpufreq_pressure.
But in cpufreq_update_pressure() a non-zero pressure can only be computed
if arch_scale_freq_ref() returns non-zero. x86 does not implement this
function.
The check max_freq <= capped_freq is always true because the default
arch_scale_freq_ref() returns 0. The computed pressure is always 0. Am I
missing something?
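
The argument above can be sketched as a simplified user-space model. This
is not the kernel function: model_cpufreq_pressure() and its parameters are
hypothetical, and it assumes the logic is as described in this message (the
pressure is zero whenever the reference frequency does not exceed the
capped frequency).

```c
#include <assert.h>

/*
 * Simplified model of the pressure computation described above.
 *
 * @max_capacity: architectural capacity of the CPU
 * @freq_ref:     reference frequency; 0 models an architecture that uses
 *                the default arch_scale_freq_ref()
 * @capped_freq:  maximum frequency currently allowed by policy
 */
static unsigned long model_cpufreq_pressure(unsigned long max_capacity,
					    unsigned long freq_ref,
					    unsigned long capped_freq)
{
	/* With freq_ref == 0 this branch is always taken. */
	if (freq_ref <= capped_freq)
		return 0;

	/* Capacity lost in proportion to the frequency cap. */
	return max_capacity -
	       (unsigned long)((unsigned long long)max_capacity *
			       capped_freq / freq_ref);
}
```

In this model a reference frequency of 0 yields zero pressure for any cap,
whereas a reference of 2000000 with a cap of 1000000 removes half of a
1024 capacity.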
>
> If one cluster has its max freq capped, you will spread tasks between
> the uncapped and the capped clusters which no longer have the same
> compute capacity.
On this I agree.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-23 7:45 ` Christian Loehle
2026-06-24 5:25 ` Ricardo Neri
@ 2026-06-26 15:20 ` Vincent Guittot
1 sibling, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2026-06-26 15:20 UTC (permalink / raw)
To: Christian Loehle
Cc: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, 23 Jun 2026 at 09:45, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 6/23/26 08:20, Vincent Guittot wrote:
> > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > <ricardo.neri-calderon@linux.intel.com> wrote:
> >>
> >> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> >> running task as busiest if doing so results in migrating the task to a
> >> CPU with less than ~5% of extra capacity. It also unintentionally
> >> prevents migrations between CPUs of identical capacity.
> >>
> >> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> >> clusters of CPUs with the same capacity. Allowing migration between CPUs
> >> of identical capacity is necessary to meet this goal.
> >>
> >> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> >
> > capacity_of() reflects not only RT and irq pressure but also thermal
> > pressure or system frequency capping.
> > If dst cluster is under thermal mitigation but the source cluster is
> > not, we probably shouldn't spread tasks across both clusters.
> > Have you considered using get_actual_cpu_capacity() instead of
> > arch_scale_cpu_capacity() ?
>
> Replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
> would make the == comparison below very unlikely to be true FWIW.
Do you have cpufreq_pressure or hw load_avg in mind?
> I think it's fine like that, I will prepare a follow-up anyway to make
> it work for our "almost equal capacity" cluster systems and then also
> consider switching to get_actual_cpu_capacity() since we include a margin
> anyway.
I would prefer the other way: keep the current behavior correct (keep
accounting for system pressure) before adding a new feature.
>
> >
> >> runtime reductions due to side activity or thermal pressure. Guard this
> >> check with the sched_cluster_active static key so that systems without
> >> cluster topology are unaffected.
> >>
> >> Tested-by: Christian Loehle <christian.loehle@arm.com>
> >> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> >> ---
> >> Changes in v5:
> >> * Optimized logic to identify same-arch clusters only when needed.
> >> * Added Tested-by tag from Christian. Thanks!
> >>
> >> Changes in v4:
> >> * Implemented the check for cluster with a local variable for improved
> >> readability.
> >>
> >> Changes in v3:
> >> * Reverted the inverted capacity check; the inverted form incorrectly
> >> allows migrations to CPUs of slightly less capacity.
> >> * Guarded the check for architectural capacity with the
> >> sched_cluster_active static key.
> >>
> >> Changes in v2:
> >> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
> >> runtime variability.
> >> * Inverted the check for runtime capacity. (Christian)
> >> * Reworded patch description for clarity.
> >> ---
> >> kernel/sched/fair.c | 9 ++++++++-
> >> 1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index e55eb019d2c9..f4eb55cad54d 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -12992,13 +12992,20 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> >> */
> >> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> >> nr_running == 1) {
> >> + bool same_arch_cluster = static_branch_unlikely(&sched_cluster_active) &&
> >> + (arch_scale_cpu_capacity(env->dst_cpu) ==
> >> + arch_scale_cpu_capacity(i));
> >> bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
> >>
> >> /*
> >> * Busy SMT siblings reduce the capacity of CPU @i. Do
> >> * not skip it in this case.
> >> + *
> >> + * CONFIG_SCHED_CLUSTER requires balancing load across clusters
> >> + * of identical capacity. Use architectural capacity to ignore
> >> + * runtime variability.
> >> */
> >> - if (!smt_degraded_cap &&
> >> + if (!smt_degraded_cap && !same_arch_cluster &&
> >> !capacity_greater(capacity_of(env->dst_cpu), capacity))
> >> continue;
> >> }
> >>
> >> --
> >> 2.43.0
> >>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity
2026-06-23 0:05 ` [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
2026-06-23 7:20 ` Vincent Guittot
@ 2026-06-27 19:07 ` Andrea Righi
1 sibling, 0 replies; 22+ messages in thread
From: Andrea Righi @ 2026-06-27 19:07 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song, Rafael J. Wysocki, Len Brown,
ricardo.neri, linux-kernel
Hi Ricardo,
On Mon, Jun 22, 2026 at 05:05:55PM -0700, Ricardo Neri wrote:
> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> running task as busiest if doing so results in migrating the task to a
> CPU with less than ~5% of extra capacity. It also unintentionally
> prevents migrations between CPUs of identical capacity.
>
> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> clusters of CPUs with the same capacity. Allowing migration between CPUs
> of identical capacity is necessary to meet this goal.
>
> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> runtime reductions due to side activity or thermal pressure. Guard this
> check with the sched_cluster_active static key so that systems without
> cluster topology are unaffected.
>
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v5:
> * Optimized logic to identify same-arch clusters only when needed.
> * Added Tested-by tag from Christian. Thanks!
>
> Changes in v4:
> * Implemented the check for cluster with a local variable for improved
> readability.
>
> Changes in v3:
> * Reverted the inverted capacity check; the inverted form incorrectly
> allows migrations to CPUs of slightly less capacity.
> * Guarded the check for architectural capacity with the
> sched_cluster_active static key.
>
> Changes in v2:
> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
> runtime variability.
> * Inverted the check for runtime capacity. (Christian)
> * Reworded patch description for clarity.
> ---
> kernel/sched/fair.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e55eb019d2c9..f4eb55cad54d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12992,13 +12992,20 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> nr_running == 1) {
> + bool same_arch_cluster = static_branch_unlikely(&sched_cluster_active) &&
> + (arch_scale_cpu_capacity(env->dst_cpu) ==
> + arch_scale_cpu_capacity(i));
I find same_arch_cluster a bit misleading. It sounds like "these two CPUs belong
to the same cluster", while what it actually checks is whether a cluster
topology exists somewhere in the root domain and the two CPUs have exactly the
same architectural capacity. Am I understanding it correctly?
If so, would something like same_arch_capacity or cluster_equal_capacity be a
better name? I think either would make the intent of the code a bit clearer.
Thanks,
-Andrea
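
For reference, the condition under discussion can be modelled in user space
as below, using same_arch_capacity, one of the names suggested above. This
is a sketch under assumptions: struct cand and skip_candidate() are
hypothetical, the cluster_active field stands in for the
sched_cluster_active static key, and capacity_greater() reproduces the ~5%
margin as I understand the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Is cap1 noticeably (~5%) greater than cap2? Modelled on the kernel macro. */
static bool capacity_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Hypothetical snapshot of the inputs for one candidate source CPU. */
struct cand {
	bool cluster_active;		/* a cluster topology exists */
	bool smt_degraded_cap;		/* candidate has busy SMT siblings */
	unsigned long dst_arch_cap;	/* architectural capacity, destination */
	unsigned long src_arch_cap;	/* architectural capacity, candidate */
	unsigned long dst_cap;		/* runtime capacity, destination */
	unsigned long src_cap;		/* runtime capacity, candidate */
};

/* True if the candidate (running a single task) should be skipped. */
static bool skip_candidate(const struct cand *c)
{
	/* Equal architectural capacity, checked only when clusters exist. */
	bool same_arch_capacity = c->cluster_active &&
				  c->dst_arch_cap == c->src_arch_cap;

	return !c->smt_degraded_cap && !same_arch_capacity &&
	       !capacity_greater(c->dst_cap, c->src_cap);
}
```

The model shows that the flag says nothing about cluster membership of the
two CPUs: it only combines "clusters exist" with "architectural capacities
match".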
> bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
>
> /*
> * Busy SMT siblings reduce the capacity of CPU @i. Do
> * not skip it in this case.
> + *
> + * CONFIG_SCHED_CLUSTER requires balancing load across clusters
> + * of identical capacity. Use architectural capacity to ignore
> + * runtime variability.
> */
> - if (!smt_degraded_cap &&
> + if (!smt_degraded_cap && !same_arch_cluster &&
> !capacity_greater(capacity_of(env->dst_cpu), capacity))
> continue;
> }
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-06-23 0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
` (4 preceding siblings ...)
2026-06-23 0:05 ` [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-06-23 0:05 ` Ricardo Neri
2026-06-23 7:26 ` Vincent Guittot
5 siblings, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-23 0:05 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
K Prateek Nayak, Barry Song
Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
linux-kernel, Ricardo Neri
Some topologies have scheduling domains that contain CPUs of asymmetric
capacity, grouped into two or more clusters of equal-capacity CPUs
sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
balanced across these clusters.
Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
load balancer that it should spread load among cluster siblings.
Checks for capacity in update_sd_pick_busiest(),
sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
migrations from high- to low-capacity CPUs if the busiest group is not
overloaded.
CPUs with spare capacity, big or small, have always helped overloaded
groups. Once the overloading condition disappears, misfit load will still
be used to move high-utilization tasks to bigger CPUs if they have spare
capacity.
Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
domains from equalizing the number of idle CPUs to equalizing the number
of running tasks. This also enables migrations among clusters from newly-
idle load balance, where the outgoing task is already dequeued but the CPU
has not yet transitioned to idle.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v5:
* Improved inline comments for accuracy.
* Added Tested-by tag from Christian. Thanks!
Changes in v4:
* Added Reviewed-by tag from Tim. Thanks!
Changes in v3:
* Updated documentation of SD_PREFER_SIBLING.
* Expanded the patch description to explain the behavior when overloaded
groups are involved.
Changes in v2:
* Reworded the patch description for clarity.
* Kept parentheses around bitwise operators for clarity.
---
include/linux/sched/sd_flags.h | 3 ++-
kernel/sched/topology.c | 14 ++++++++++++--
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 42839cfa2778..f9a46fb8cacf 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
* Prefer to place tasks in a sibling domain
*
* Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
- * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
+ * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
+ * domains have clusters of CPUs sharing cache.
*
* NEEDS_GROUPS: Load balancing flag.
*/
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 622e2e01974c..261b407d0936 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
/*
* Convert topological properties into behaviour.
*/
- /* Don't attempt to spread across CPUs of different capacities. */
- if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+ /*
+ * Don't attempt to spread across CPUs of different capacities.
+ *
+ * If the child domain has clusters of CPUs sharing L2 cache, keep the
+ * flag to spread tasks across clusters of identical capacity. Checks in
+ * the load balancer prevent task migrations from high- to low-capacity
+ * CPUs unless the source group is overloaded. Migrations to a lower-
+ * capacity CPU can happen if a higher-capacity group is overloaded and
+ * a lower-capacity CPU has spare capacity.
+ */
+ if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
+ !(sd->child->flags & SD_CLUSTER))
sd->child->flags &= ~SD_PREFER_SIBLING;
if (sd->flags & SD_SHARE_CPUCAPACITY) {
--
2.43.0
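
The effect of the topology.c change can be sketched with a small user-space
model. The flag names match the kernel, but the bit values, struct sd_model
and fixup_prefer_sibling() are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values; the kernel generates the real ones. */
#define SD_ASYM_CPUCAPACITY	(1U << 0)
#define SD_PREFER_SIBLING	(1U << 1)
#define SD_CLUSTER		(1U << 2)

/* Hypothetical stand-in for struct sched_domain. */
struct sd_model {
	unsigned int flags;
	struct sd_model *child;
};

/*
 * Clear SD_PREFER_SIBLING in the child of an asymmetric-capacity domain,
 * unless that child is a cluster domain.
 */
static void fixup_prefer_sibling(struct sd_model *sd)
{
	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
	    !(sd->child->flags & SD_CLUSTER))
		sd->child->flags &= ~SD_PREFER_SIBLING;
}
```

In this model a child flagged SD_CLUSTER keeps SD_PREFER_SIBLING, while any
other child of an asymmetric-capacity domain loses it, as before the patch.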
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-06-23 0:05 ` [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
@ 2026-06-23 7:26 ` Vincent Guittot
2026-06-24 5:14 ` Ricardo Neri
0 siblings, 1 reply; 22+ messages in thread
From: Vincent Guittot @ 2026-06-23 7:26 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> Some topologies have scheduling domains that contain CPUs of asymmetric
> capacity, grouped into two or more clusters of equal-capacity CPUs
> sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> balanced across these clusters.
>
> Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> load balancer that it should spread load among cluster siblings.
>
> Checks for capacity in update_sd_pick_busiest(),
> sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
> migrations from high- to low-capacity CPUs if the busiest group is not
> overloaded.
>
> CPUs with spare capacity, big or small, have always helped overloaded
> groups. Once the overloading condition disappears, misfit load will still
> be used to move high-utilization tasks to bigger CPUs if they have spare
> capacity.
>
> Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
> domains from equalizing the number of idle CPUs to equalizing the number
> of running tasks. This also enables migrations among clusters from newly-
> idle load balance, where the outgoing task is already dequeued but the CPU
> has not yet transitioned to idle.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Tested-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v5:
> * Improved inline comments for accuracy.
> * Added Tested-by tag from Christian. Thanks!
>
> Changes in v4:
> * Added Reviewed-by tag from Tim. Thanks!
>
> Changes in v3:
> * Updated documentation of SD_PREFER_SIBLING.
> * Expanded the patch description to explain the behavior when overloaded
> groups are involved.
>
> Changes in v2:
> * Reworded the patch description for clarity.
> * Kept parentheses around bitwise operators for clarity.
> ---
> include/linux/sched/sd_flags.h | 3 ++-
> kernel/sched/topology.c | 14 ++++++++++++--
> 2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> index 42839cfa2778..f9a46fb8cacf 100644
> --- a/include/linux/sched/sd_flags.h
> +++ b/include/linux/sched/sd_flags.h
> @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> * Prefer to place tasks in a sibling domain
> *
> * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
> + * domains have clusters of CPUs sharing cache.
> *
> * NEEDS_GROUPS: Load balancing flag.
> */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 622e2e01974c..261b407d0936 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
> /*
> * Convert topological properties into behaviour.
> */
> - /* Don't attempt to spread across CPUs of different capacities. */
> - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> + /*
> + * Don't attempt to spread across CPUs of different capacities.
> + *
> + * If the child domain has clusters of CPUs sharing L2 cache, keep the
> + * flag to spread tasks across clusters of identical capacity. Checks in
> + * the load balancer prevent task migrations from high- to low-capacity
> + * CPUs unless the source group is overloaded. Migrations to a lower-
> + * capacity CPU can happen if a higher-capacity group is overloaded and
> + * a lower-capacity CPU has spare capacity.
> + */
> + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> + !(sd->child->flags & SD_CLUSTER))
> sd->child->flags &= ~SD_PREFER_SIBLING;
Last time I looked at this patch I was torn between your proposal
above and simply keeping SD_PREFER_SIBLING for all HMP topologies. As
added in the comment:
" Checks in
* the load balancer prevent task migrations from high- to low-capacity
* CPUs unless the source group is overloaded.
"
So, why should we bother with the (SD_ASYM_CPUCAPACITY && !SD_CLUSTER) topology?
>
> if (sd->flags & SD_SHARE_CPUCAPACITY) {
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-06-23 7:26 ` Vincent Guittot
@ 2026-06-24 5:14 ` Ricardo Neri
2026-06-26 0:19 ` Ricardo Neri
0 siblings, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-24 5:14 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, Jun 23, 2026 at 09:26:57AM +0200, Vincent Guittot wrote:
> On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > Some topologies have scheduling domains that contain CPUs of asymmetric
> > capacity, grouped into two or more clusters of equal-capacity CPUs
> > sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> > balanced across these clusters.
> >
> > Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> > load balancer that it should spread load among cluster siblings.
> >
> > Checks for capacity in update_sd_pick_busiest(),
> > sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
> > migrations from high- to low-capacity CPUs if the busiest group is not
> > overloaded.
> >
> > CPUs with spare capacity, big or small, have always helped overloaded
> > groups. Once the overloading condition disappears, misfit load will still
> > be used to move high-utilization tasks to bigger CPUs if they have spare
> > capacity.
> >
> > Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
> > domains from equalizing the number of idle CPUs to equalizing the number
> > of running tasks. This also enables migrations among clusters from newly-
> > idle load balance, where the outgoing task is already dequeued but the CPU
> > has not yet transitioned to idle.
> >
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes in v5:
> > * Improved inline comments for accuracy.
> > * Added Tested-by tag from Christian. Thanks!
> >
> > Changes in v4:
> > * Added Reviewed-by tag from Tim. Thanks!
> >
> > Changes in v3:
> > * Updated documentation of SD_PREFER_SIBLING.
> > * Expanded the patch description to explain the behavior when overloaded
> > groups are involved.
> >
> > Changes in v2:
> > * Reworded the patch description for clarity.
> > * Kept parentheses around bitwise operators for clarity.
> > ---
> > include/linux/sched/sd_flags.h | 3 ++-
> > kernel/sched/topology.c | 14 ++++++++++++--
> > 2 files changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > index 42839cfa2778..f9a46fb8cacf 100644
> > --- a/include/linux/sched/sd_flags.h
> > +++ b/include/linux/sched/sd_flags.h
> > @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> > * Prefer to place tasks in a sibling domain
> > *
> > * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> > - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> > + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
> > + * domains have clusters of CPUs sharing cache.
> > *
> > * NEEDS_GROUPS: Load balancing flag.
> > */
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 622e2e01974c..261b407d0936 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
> > /*
> > * Convert topological properties into behaviour.
> > */
> > - /* Don't attempt to spread across CPUs of different capacities. */
> > - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > + /*
> > + * Don't attempt to spread across CPUs of different capacities.
> > + *
> > + * If the child domain has clusters of CPUs sharing L2 cache, keep the
> > + * flag to spread tasks across clusters of identical capacity. Checks in
> > + * the load balancer prevent task migrations from high- to low-capacity
> > + * CPUs unless the source group is overloaded. Migrations to a lower-
> > + * capacity CPU can happen if a higher-capacity group is overloaded and
> > + * a lower-capacity CPU has spare capacity.
> > + */
> > + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> > + !(sd->child->flags & SD_CLUSTER))
> > sd->child->flags &= ~SD_PREFER_SIBLING;
>
> Last time I looked at this patch I was balanced between your proposal
> above and simply keeping SD_PREFER_SIBLING for all HMP topologies. As
> added in the comment:
> " Checks in
> * the load balancer prevent task migrations from high- to low-capacity
> * CPUs unless the source group is overloaded.
> "
> So, why should we bother for (SD_ASYM_CPUCAPACITY && !SD_CLUSTER) topology ?
No reason, AFAICS. I just wanted to restrict the change to the target
topology of this patchset.
But you raise a good point: given the checks in place in the load balancer,
it should be OK to keep SD_PREFER_SIBLING in all asymmetric topologies. I
will run a few experiments to confirm.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-06-24 5:14 ` Ricardo Neri
@ 2026-06-26 0:19 ` Ricardo Neri
2026-06-26 14:54 ` Vincent Guittot
0 siblings, 1 reply; 22+ messages in thread
From: Ricardo Neri @ 2026-06-26 0:19 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Tue, Jun 23, 2026 at 10:14:57PM -0700, Ricardo Neri wrote:
> On Tue, Jun 23, 2026 at 09:26:57AM +0200, Vincent Guittot wrote:
> > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > <ricardo.neri-calderon@linux.intel.com> wrote:
> > >
> > > Some topologies have scheduling domains that contain CPUs of asymmetric
> > > capacity, grouped into two or more clusters of equal-capacity CPUs
> > > sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> > > balanced across these clusters.
> > >
> > > Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> > > load balancer that it should spread load among cluster siblings.
> > >
> > > Checks for capacity in update_sd_pick_busiest(),
> > > sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
> > > migrations from high- to low-capacity CPUs if the busiest group is not
> > > overloaded.
> > >
> > > CPUs with spare capacity, big or small, have always helped overloaded
> > > groups. Once the overloading condition disappears, misfit load will still
> > > be used to move high-utilization tasks to bigger CPUs if they have spare
> > > capacity.
> > >
> > > Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
> > > domains from equalizing the number of idle CPUs to equalizing the number
> > > of running tasks. This also enables migrations among clusters from newly-
> > > idle load balance, where the outgoing task is already dequeued but the CPU
> > > has not yet transitioned to idle.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > ---
> > > Changes in v5:
> > > * Improved inline comments for accuracy.
> > > * Added Tested-by tag from Christian. Thanks!
> > >
> > > Changes in v4:
> > > * Added Reviewed-by tag from Tim. Thanks!
> > >
> > > Changes in v3:
> > > * Updated documentation of SD_PREFER_SIBLING.
> > > * Expanded the patch description to explain the behavior when overloaded
> > > groups are involved.
> > >
> > > Changes in v2:
> > > * Reworded the patch description for clarity.
> > > * Kept parentheses around bitwise operators for clarity.
> > > ---
> > > include/linux/sched/sd_flags.h | 3 ++-
> > > kernel/sched/topology.c | 14 ++++++++++++--
> > > 2 files changed, 14 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > > index 42839cfa2778..f9a46fb8cacf 100644
> > > --- a/include/linux/sched/sd_flags.h
> > > +++ b/include/linux/sched/sd_flags.h
> > > @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> > > * Prefer to place tasks in a sibling domain
> > > *
> > > * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> > > - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> > > + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
> > > + * domains have clusters of CPUs sharing cache.
> > > *
> > > * NEEDS_GROUPS: Load balancing flag.
> > > */
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index 622e2e01974c..261b407d0936 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
> > > /*
> > > * Convert topological properties into behaviour.
> > > */
> > > - /* Don't attempt to spread across CPUs of different capacities. */
> > > - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > > + /*
> > > + * Don't attempt to spread across CPUs of different capacities.
> > > + *
> > > + * If the child domain has clusters of CPUs sharing L2 cache, keep the
> > > + * flag to spread tasks across clusters of identical capacity. Checks in
> > > + * the load balancer prevent task migrations from high- to low-capacity
> > > + * CPUs unless the source group is overloaded. Migrations to a lower-
> > > + * capacity CPU can happen if a higher-capacity group is overloaded and
> > > + * a lower-capacity CPU has spare capacity.
> > > + */
> > > + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> > > + !(sd->child->flags & SD_CLUSTER))
> > > sd->child->flags &= ~SD_PREFER_SIBLING;
> >
> > Last time I looked at this patch I was balanced between your proposal
> > above and simply keeping SD_PREFER_SIBLING for all HMP topologies. As
> > added in the comment:
> > " Checks in
> > * the load balancer prevent task migrations from high- to low-capacity
> > * CPUs unless the source group is overloaded.
> > "
> > So, why should we bother for (SD_ASYM_CPUCAPACITY && !SD_CLUSTER) topology ?
>
> No reason, AFAICS. I just wanted to restrict the change to the target
> topology of this patchset.
>
> But you raise a good point: given the checks in place in the load balancer,
> it should be OK to keep SD_PREFER_SIBLING in all asymmetric topologies. I
> will run a few experiments to confirm.
I ran a few experiments with and without CONFIG_SCHED_CLUSTER. I ran N
threads where N < nproc to ensure that sched groups were classified as
has_spare or fully_busy. The threads saturated the CPUs to minimize task
placement decisions at wake up.
I observed that these threads remained on the CPUs with the highest
capacity, with no spreading.
I repeated the experiment with EAS enabled and threads ramping up
utilization. EAS kept them on small CPUs and later duly moved them to CPUs
of higher capacity as they became misfits.
I will update my patch to keep SD_PREFER_SIBLING regardless of asymmetric
capacity.
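For readers following the thread, the three policies discussed here can be compared with a small userspace model. This is only a sketch, not the kernel code: the flag values, the `enum sd_policy` names, and the helper `child_flags_after_init()` are all hypothetical, and the third policy is just one reading of "keep SD_PREFER_SIBLING regardless of asymmetric capacity". The updated patch may look different.

```c
#include <stdbool.h>

/*
 * Hypothetical bit values, for illustration only. The kernel generates
 * the real values from include/linux/sched/sd_flags.h.
 */
#define SD_ASYM_CPUCAPACITY	0x1u
#define SD_CLUSTER		0x2u
#define SD_PREFER_SIBLING	0x4u

/* Hypothetical names for the three behaviours discussed in the thread. */
enum sd_policy {
	SD_POLICY_BASELINE,	/* before this patch: always clear below asym domains */
	SD_POLICY_V5,		/* this patch: clear unless the child has clusters */
	SD_POLICY_KEEP_ALWAYS,	/* assumed follow-up: never clear because of asymmetry */
};

/*
 * Model of the SD_PREFER_SIBLING handling in sd_init(): given the flags of
 * a parent domain and of its child, return the child's flags after the
 * selected policy has been applied.
 */
static unsigned int child_flags_after_init(enum sd_policy policy,
					   unsigned int parent_flags,
					   unsigned int child_flags)
{
	bool asym = (parent_flags & SD_ASYM_CPUCAPACITY) != 0;
	bool clear = false;

	switch (policy) {
	case SD_POLICY_BASELINE:
		clear = asym;
		break;
	case SD_POLICY_V5:
		clear = asym && !(child_flags & SD_CLUSTER);
		break;
	case SD_POLICY_KEEP_ALWAYS:
		clear = false;
		break;
	}

	return clear ? (child_flags & ~SD_PREFER_SIBLING) : child_flags;
}
```

With an asymmetric parent and a child that has only SD_PREFER_SIBLING set, the first two policies return 0 and the third returns SD_PREFER_SIBLING. The first two differ only when the child also has SD_CLUSTER set.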
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-06-26 0:19 ` Ricardo Neri
@ 2026-06-26 14:54 ` Vincent Guittot
0 siblings, 0 replies; 22+ messages in thread
From: Vincent Guittot @ 2026-06-26 14:54 UTC (permalink / raw)
To: Ricardo Neri
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Tim C Chen, Chen Yu, Christian Loehle, K Prateek Nayak,
Barry Song, Rafael J. Wysocki, Andrea Righi, Len Brown,
ricardo.neri, linux-kernel
On Fri, 26 Jun 2026 at 02:10, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> On Tue, Jun 23, 2026 at 10:14:57PM -0700, Ricardo Neri wrote:
> > On Tue, Jun 23, 2026 at 09:26:57AM +0200, Vincent Guittot wrote:
> > > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > > <ricardo.neri-calderon@linux.intel.com> wrote:
> > > >
> > > > Some topologies have scheduling domains that contain CPUs of asymmetric
> > > > capacity, grouped into two or more clusters of equal-capacity CPUs
> > > > sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> > > > balanced across these clusters.
> > > >
> > > > Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> > > > load balancer that it should spread load among cluster siblings.
> > > >
> > > > Checks for capacity in update_sd_pick_busiest(),
> > > > sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
> > > > migrations from high- to low-capacity CPUs if the busiest group is not
> > > > overloaded.
> > > >
> > > > CPUs with spare capacity, big or small, have always helped overloaded
> > > > groups. Once the overloading condition disappears, misfit load will still
> > > > be used to move high-utilization tasks to bigger CPUs if they have spare
> > > > capacity.
> > > >
> > > > Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
> > > > domains from equalizing the number of idle CPUs to equalizing the number
> > > > of running tasks. This also enables migrations among clusters from newly-
> > > > idle load balance, where the outgoing task is already dequeued but the CPU
> > > > has not yet transitioned to idle.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Tested-by: Christian Loehle <christian.loehle@arm.com>
> > > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > > > ---
> > > > Changes in v5:
> > > > * Improved inline comments for accuracy.
> > > > * Added Tested-by tag from Christian. Thanks!
> > > >
> > > > Changes in v4:
> > > > * Added Reviewed-by tag from Tim. Thanks!
> > > >
> > > > Changes in v3:
> > > > * Updated documentation of SD_PREFER_SIBLING.
> > > > * Expanded the patch description to explain the behavior when overloaded
> > > > groups are involved.
> > > >
> > > > Changes in v2:
> > > > * Reworded the patch description for clarity.
> > > > * Kept parentheses around bitwise operators for clarity.
> > > > ---
> > > > include/linux/sched/sd_flags.h | 3 ++-
> > > > kernel/sched/topology.c | 14 ++++++++++++--
> > > > 2 files changed, 14 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > > > index 42839cfa2778..f9a46fb8cacf 100644
> > > > --- a/include/linux/sched/sd_flags.h
> > > > +++ b/include/linux/sched/sd_flags.h
> > > > @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> > > > * Prefer to place tasks in a sibling domain
> > > > *
> > > > * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> > > > - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> > > > + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
> > > > + * domains have clusters of CPUs sharing cache.
> > > > *
> > > > * NEEDS_GROUPS: Load balancing flag.
> > > > */
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index 622e2e01974c..261b407d0936 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
> > > > /*
> > > > * Convert topological properties into behaviour.
> > > > */
> > > > - /* Don't attempt to spread across CPUs of different capacities. */
> > > > - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > > > + /*
> > > > + * Don't attempt to spread across CPUs of different capacities.
> > > > + *
> > > > + * If the child domain has clusters of CPUs sharing L2 cache, keep the
> > > > + * flag to spread tasks across clusters of identical capacity. Checks in
> > > > + * the load balancer prevent task migrations from high- to low-capacity
> > > > + * CPUs unless the source group is overloaded. Migrations to a lower-
> > > > + * capacity CPU can happen if a higher-capacity group is overloaded and
> > > > + * a lower-capacity CPU has spare capacity.
> > > > + */
> > > > + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> > > > + !(sd->child->flags & SD_CLUSTER))
> > > > sd->child->flags &= ~SD_PREFER_SIBLING;
> > >
> > > Last time I looked at this patch I was balanced between your proposal
> > > above and simply keeping SD_PREFER_SIBLING for all HMP topologies. As
> > > added in the comment:
> > > " Checks in
> > > * the load balancer prevent task migrations from high- to low-capacity
> > > * CPUs unless the source group is overloaded.
> > > "
> > > So, why should we bother for (SD_ASYM_CPUCAPACITY && !SD_CLUSTER) topology ?
> >
> > No reason, AFAICS. I just wanted to restrict the change to the target
> > topology of this patchset.
> >
> > But you raise a good point: given the checks in place in the load balancer,
> > it should be OK to keep SD_PREFER_SIBLING in all asymmetric topologies. I
> > will run a few experiments to confirm.
>
> I ran a few experiments with and without CONFIG_SCHED_CLUSTER. I ran N
> threads where N < nproc to ensure that sched groups were classified as
> has_spare or fully_busy. The threads saturated the CPUs to minimize task
> placement decisions at wake up.
>
> I observed these threads to remain on the CPUs with highest capacity; no
> spreading.
>
> I repeated the experiment with EAS enabled and threads ramping up
> utilization. EAS kept them on small CPUs and later duly moved to CPUs of
> higher capacity as they became misfits.
>
> I will update my patch to keep SD_PREFER_SIBLING regardless of asymmetric
> capacity.
Thanks
^ permalink raw reply [flat|nested] 22+ messages in thread