* [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
update_sd_pick_busiest() may incorrectly select a fully_busy group as the
busiest group when its per-CPU capacity exceeds that of the destination
CPU. This happens because the type of the busiest group is initialized to
group_has_spare, which allows the fully_busy group to win the type
comparison.

update_sd_pick_busiest() should not choose a candidate scheduling group
with at most one runnable task per CPU if its per-CPU capacity is greater
than that of the destination CPU. Such a check already exists, but it is
done too late, after the type comparison, which prevents a subsequent
fully_busy group of equal per-CPU capacity from being correctly selected.

Move this check before the group type comparison.
Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
* Added a Fixes tag. (Christian)
* Added Reviewed-by tag from Christian. Thanks!
Changes in v2:
* Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
exclusive. (Tim)
* Kept parentheses around bitwise operators for clarity.
* Rewrote patch description for clarity.
---
kernel/sched/fair.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f982..e06e74d9ce0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10818,6 +10818,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
sds->local_stat.group_type != group_has_spare))
return false;
+ /*
+ * Candidate sg has no more than one task per CPU and has higher
+ * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+ * throughput. Maximize throughput, power/energy consequences are not
+ * considered.
+ *
+ * Systems with SMT are unaffected, as asymmetric capacity is not set
+ * in such cases.
+ */
+ if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ (sgs->group_type <= group_fully_busy) &&
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -10920,17 +10934,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;
}
- /*
- * Candidate sg has no more than one task per CPU and has higher
- * per-CPU capacity. Migrating tasks to less capable CPUs may harm
- * throughput. Maximize throughput, power/energy consequences are not
- * considered.
- */
- if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
- (sgs->group_type <= group_fully_busy) &&
- (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
- return false;
-
return true;
}
--
2.43.0
* Re: [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
From: Chen, Yu C @ 2026-05-15 12:29 UTC (permalink / raw)
To: Ricardo Neri
Cc: Rafael J. Wysocki, Len Brown, Dietmar Eggemann, Juri Lelli,
Vincent Guittot, ricardo.neri, linux-kernel, Steven Rostedt,
Ben Segall, Valentin Schneider, Mel Gorman, Tim C Chen,
Christian Loehle, Peter Zijlstra, Ingo Molnar, Barry Song
On 5/15/2026 2:34 AM, Ricardo Neri wrote:
[ ... ]
> @@ -10818,6 +10818,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> + /*
> + * Candidate sg has no more than one task per CPU and has higher
> + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> + * throughput. Maximize throughput, power/energy consequences are not
> + * considered.
> + *
> + * Systems with SMT are unaffected, as asymmetric capacity is not set
> + * in such cases.
> + */
Does "SMT" here imply that group_smt_balance is unaffected?
Regardless of whether we move the check earlier, this seems to
already be guaranteed by the fact that the check only applies
to sgs->group_type <= group_fully_busy, which does not include
group_smt_balance. In other words, SD_ASYM_CPUCAPACITY is not
the only gatekeeper.
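(For reference, a rough sketch of the group_type ordering in current
kernel/sched/fair.c; exact members may vary by kernel version.
group_smt_balance sorts above group_fully_busy, so the
"<= group_fully_busy" filter already excludes it:)

	enum group_type {
		group_has_spare = 0,	/* spare capacity available */
		group_fully_busy,	/* no more than one task per CPU */
		group_misfit_task,	/* task too big for its CPU */
		group_smt_balance,	/* balance fully busy SMT groups */
		group_asym_packing,	/* asym_packing priority applies */
		group_imbalanced,	/* affinity-induced imbalance */
		group_overloaded	/* more tasks than available capacity */
	};
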
Other than that, the change looks good to me.

Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
> + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> + (sgs->group_type <= group_fully_busy) &&
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> + return false;
> +
* Re: [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
From: Tim Chen @ 2026-05-15 19:26 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On Thu, 2026-05-14 at 11:34 -0700, Ricardo Neri wrote:
> update_sd_pick_busiest() may incorrectly select a fully_busy group as the
> busiest group when its per-CPU capacity exceeds that of the destination
> CPU. This happens because the type of busiest group is initialized to
> group_has_spare and allows the fully_busy group to win the type comparison.
>
> update_sd_pick_busiest() should not choose a candidate scheduling group
> with at most one runnable task if its per-CPU capacity is greater than that
> of the destination CPU. Such a check already exists, but it is done too
> late: after the type comparison, preventing a subsequent fully_busy group
> of equal per-CPU capacity from being correctly selected.
>
> Move this check to occur before comparing group types.
Looks good to me.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>
> Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
> * Added a Fixes tag. (Christian)
> * Added Reviewed-by tag from Christian. Thanks!
>
> Changes in v2:
> * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
> exclusive. (Tim)
> * Kept parentheses around bitwise operators for clarity.
> * Rewrote patch description for clarity.
> ---
> kernel/sched/fair.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f982..e06e74d9ce0e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10818,6 +10818,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> + /*
> + * Candidate sg has no more than one task per CPU and has higher
> + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> + * throughput. Maximize throughput, power/energy consequences are not
> + * considered.
> + *
> + * Systems with SMT are unaffected, as asymmetric capacity is not set
> + * in such cases.
> + */
> + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> + (sgs->group_type <= group_fully_busy) &&
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> + return false;
> +
> if (sgs->group_type > busiest->group_type)
> return true;
>
> @@ -10920,17 +10934,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
> }
>
> - /*
> - * Candidate sg has no more than one task per CPU and has higher
> - * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> - * throughput. Maximize throughput, power/energy consequences are not
> - * considered.
> - */
> - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> - (sgs->group_type <= group_fully_busy) &&
> - (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> - return false;
> -
> return true;
> }
>
* [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
In domains with asymmetric capacity, identifying misfit load in a
scheduling group is not useful when the destination CPU cannot help, i.e.,
when its capacity does not exceed the group's maximum CPU capacity by at
least ~5%. In such cases, it also prevents load balancing among clusters of
equal capacity when CONFIG_SCHED_CLUSTER is enabled. This happens because
update_sd_pick_busiest() skips candidate groups of type group_misfit_task
if the destination CPU has similar capacity.
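For reference, a rough sketch of that check in update_sd_pick_busiest()
(paraphrased; its tail is visible as context in the previous patch):

	/* Don't try to pull misfit tasks we can't help */
	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
	    (sgs->group_type == group_misfit_task) &&
	    (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
	     sds->local_stat.group_type != group_has_spare))
		return false;
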
Skipping misfit load accounting in this situation allows the group to be
classified as group_has_spare or group_fully_busy and lets load balancing
proceed. Keep marking scheduling groups as overloaded when misfit tasks are
present: the sg_overloaded flag propagates to the root domain and allows
bigger CPUs in it to help via newly idle balance.
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
* Added Reviewed-by tag from Christian. Thanks!
Changes in v2:
* Moved the check of the destination CPU capacity inside the code block
used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
exclusion of the sched_reduced_capacity() path.
* Keep marking the root domain as overloaded to allow bigger CPUs to
help. (sashiko)
* Fixed patch description to clarify that capacity_greater() looks for
  a difference of at least ~5%. (Christian)
* Reworded the patch description for clarity.
* I did not include the Reviewed-by tag from Christian since the patch
changed functionally.
---
kernel/sched/fair.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e06e74d9ce0e..dcc02ceb44b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10749,10 +10749,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
if (sd_flags & SD_ASYM_CPUCAPACITY) {
- /* Check for a misfit task on the cpu */
- if (sgs->group_misfit_task_load < rq->misfit_task_load) {
- sgs->group_misfit_task_load = rq->misfit_task_load;
+ if (rq->misfit_task_load) {
+ /*
+ * Always mark the domain overloaded so big CPUs
+ * can pick up misfit tasks via newly idle
+ * balance.
+ */
*sg_overloaded = 1;
+
+ /*
+ * Only account misfit load if @dst_cpu can
+ * help; otherwise, the group may be classified
+ * as misfit_task and update_sd_pick_busiest()
+ * will skip it.
+ */
+ if (capacity_greater(capacity_of(env->dst_cpu),
+ group->sgc->max_capacity) &&
+ (sgs->group_misfit_task_load < rq->misfit_task_load))
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
--
2.43.0
* Re: [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
From: Chen, Yu C @ 2026-05-15 12:49 UTC (permalink / raw)
To: Ricardo Neri
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, Mel Gorman,
Valentin Schneider, linux-kernel, Christian Loehle, Ben Segall,
Steven Rostedt, Juri Lelli, Dietmar Eggemann, Tim C Chen,
Vincent Guittot, Barry Song, Peter Zijlstra, Ingo Molnar
On 5/15/2026 2:34 AM, Ricardo Neri wrote:
> + if (rq->misfit_task_load) {
> + /*
> + * Always mark the domain overloaded so big CPUs
> + * can pick up misfit tasks via newly idle
> + * balance.
> + */
> *sg_overloaded = 1;
	if (balancing_at_rd)
		*sg_overloaded = 1;

to avoid confusing non-root domains (although in the current code only the
root domain checks this). But since the original logic does not have this
check either, it should be OK.
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
thanks,
Chenyu
* Re: [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
From: Tim Chen @ 2026-05-15 20:12 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On Thu, 2026-05-14 at 11:34 -0700, Ricardo Neri wrote:
> In domains with asymmetric capacity, identifying misfit load in a
> scheduling group is not useful when the destination CPU cannot help (i.e.,
> its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
> such cases, it also prevents load balance among clusters of equal capacity
> when CONFIG_SCHED_CLUSTER is enabled. This happens because
> update_sd_pick_busiest() skips candidate groups of type misfit_task if the
> destination CPU has similar capacity.
>
> Skipping misfit load accounting in this situation allows the group to be
> classified as has_spare or fully_busy and lets load balancing proceed. Keep
> marking scheduling groups as overloaded when misfit tasks are present. The
> sg_overloaded flag propagates to the root domain and allows bigger CPUs in
> it to help via newly idle balance.
>
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
> * Added Reviewed-by tag from Christian. Thanks!
>
> Changes in v2:
> * Moved the check of the destination CPU capacity inside the code block
> used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
> exclusion of the sched_reduced_capacity() path.
> * Keep marking the root domain as overloaded to allow bigger CPUs to
> help. (sashiko)
> * Fixed patch description to clarify that the capacity_greater() looks
> for differences of 5% or more. (Christian)
> * Reworded the patch description for clarity.
> * I did not include the Reviewed-by tag from Christian since the patch
> changed functionally.
> ---
> kernel/sched/fair.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e06e74d9ce0e..dcc02ceb44b5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10749,10 +10749,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
>
> if (sd_flags & SD_ASYM_CPUCAPACITY) {
> - /* Check for a misfit task on the cpu */
> - if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> - sgs->group_misfit_task_load = rq->misfit_task_load;
> + if (rq->misfit_task_load) {
> + /*
> + * Always mark the domain overloaded so big CPUs
> + * can pick up misfit tasks via newly idle
> + * balance.
> + */
> *sg_overloaded = 1;
> +
> + /*
> + * Only account misfit load if @dst_cpu can
> + * help; otherwise, the group may be classified
> + * as misfit_task and update_sd_pick_busiest()
> + * will skip it.
Do you mean "update_sd_pick_busiest() will pick it" rather than "skip it"
in the comment above, for misfit-task load balancing?
Tim
> + */
> + if (capacity_greater(capacity_of(env->dst_cpu),
> + group->sgc->max_capacity) &&
> + (sgs->group_misfit_task_load < rq->misfit_task_load))
> + sgs->group_misfit_task_load = rq->misfit_task_load;
> }
> } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> /* Check for a task running on a CPU with reduced capacity */
* [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
sched_balance_find_src_rq() avoids selecting a runqueue with a single
running task as busiest if doing so would migrate that task to a CPU with
less than ~5% extra capacity. It also unintentionally prevents migrations
between CPUs of identical capacity.

When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
clusters of CPUs of the same capacity. Allowing migrations between CPUs of
identical capacity is necessary to meet this goal.

Use arch_scale_cpu_capacity() to compare architectural capacity, excluding
runtime reductions due to side activity or thermal pressure. Guard this
check with the sched_cluster_active static key so that systems without
cluster topology are unaffected.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
* Reverted the inverted capacity check; the inverted form incorrectly
allows migrations to CPUs of slightly less capacity.
* Guarded the check for architectural capacity with the
sched_cluster_active static key.
Changes in v2:
* Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
runtime variability.
* Inverted the check for runtime capacity. (Christian)
* Reworded patch description for clarity.
---
kernel/sched/fair.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcc02ceb44b5..d2a4c529f67f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11846,8 +11846,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
* eventually lead to active_balancing high->low capacity.
* Higher per-CPU capacity is considered better than balancing
* average load.
+ *
+ * CONFIG_SCHED_CLUSTER requires balancing load across clusters
+ * of identical capacity. Use architectural capacity to ignore
+ * runtime variability.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ (!static_branch_unlikely(&sched_cluster_active) ||
+ arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i)) &&
!capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;
--
2.43.0
* Re: [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
From: Chen, Yu C @ 2026-05-15 15:16 UTC (permalink / raw)
To: Ricardo Neri
Cc: Rafael J. Wysocki, Len Brown, Tim C Chen, ricardo.neri,
linux-kernel, Mel Gorman, Christian Loehle, Barry Song,
Dietmar Eggemann, Vincent Guittot, Valentin Schneider, Ben Segall,
Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt
On 5/15/2026 2:34 AM, Ricardo Neri wrote:
> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> running task as busiest if doing so results in migrating the task to a
> CPU with less than ~5% of extra capacity. It also unintentionally
> prevents migrations between CPUs of identical capacity.
>
> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> clusters of CPUs with the same capacity. Allowing migration between CPUs
> of identical capacity is necessary to meet this goal.
>
> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> runtime reductions due to side activity or thermal pressure. Guard this
> check with the sched_cluster_active static key so that systems without
> cluster topology are unaffected.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
> * Reverted the inverted capacity check; the inverted form incorrectly
> allows migrations to CPUs of slightly less capacity.
> * Guarded the check for architectural capacity with the
> sched_cluster_active static key.
>
> Changes in v2:
> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
> runtime variability.
> * Inverted the check for runtime capacity. (Christian)
> * Reworded patch description for clarity.
> ---
> kernel/sched/fair.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dcc02ceb44b5..d2a4c529f67f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11846,8 +11846,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> * eventually lead to active_balancing high->low capacity.
> * Higher per-CPU capacity is considered better than balancing
> * average load.
> + *
> + * CONFIG_SCHED_CLUSTER requires balancing load across clusters
> + * of identical capacity. Use architectural capacity to ignore
> + * runtime variability.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> + (!static_branch_unlikely(&sched_cluster_active) ||
> + arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i)) &&
> !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
As stated in the commit log, the existing logic blocks task migrations
between CPUs of identical capacity, and that logic is based on a
capacity_of() comparison rather than on arch_scale_cpu_capacity(). Could I
kindly ask why replacing

	!capacity_greater(capacity_of(env->dst_cpu), capacity)

with

	capacity_greater(capacity, capacity_of(env->dst_cpu))

does not achieve the expected effect? This would theoretically enable
migration among equal-capacity CPUs, and in most cases capacity_greater()
for e-cores in different clusters should return 0, so load balancing is
allowed.
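A sketch of the margin involved, assuming capacity_greater() is still
defined in fair.c as below. The inverted form does allow equal-capacity
migrations, but it also allows pulling onto a destination whose runtime
capacity is up to ~5% lower:

	#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

	/*
	 * With identical architectural capacity C on source and destination,
	 * but the destination's runtime capacity reduced to, say, 0.97 * C:
	 *
	 *   !capacity_greater(capacity_of(dst_cpu), C)  -> true  -> rq skipped
	 *    capacity_greater(C, capacity_of(dst_cpu))  -> false -> rq allowed
	 *
	 * The inverted form therefore also permits migrating the single task
	 * to a slightly smaller dst_cpu, which is what the v3 changelog flags
	 * as incorrect.
	 */
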
thanks,
Chenyu
* [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
Some topologies have scheduling domains that contain CPUs of asymmetric
capacity, grouped into two or more clusters of equal-capacity CPUs sharing
an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be balanced
across these resource-sharing clusters.

Do not clear SD_PREFER_SIBLING in the child domains to indicate to the load
balancer that it should spread load among cluster siblings.

Checks for capacity in update_sd_pick_busiest() prevent migrations from
high- to low-capacity CPUs if a candidate group is not overloaded.

An effect of keeping SD_PREFER_SIBLING in domains with asymmetric capacity
is that low-capacity clusters with spare capacity can now help overloaded
higher-capacity groups. This was already the case for single-CPU groups
(see calculate_imbalance() for domains with SD_SHARE_LLC).

Once the overload condition disappears, misfit load will still be used to
move high-utilization tasks to bigger CPUs if they have spare capacity.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
* Updated documentation of SD_PREFER_SIBLING.
* Expanded the patch description to explain the behavior when overloaded
groups are involved.
Changes in v2:
* Reworded the patch description for clarity.
* Kept parentheses around bitwise operators for clarity.
---
include/linux/sched/sd_flags.h | 3 ++-
kernel/sched/topology.c | 14 ++++++++++++--
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 42839cfa2778..42f74af83b8c 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
* Prefer to place tasks in a sibling domain
*
* Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
- * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
+ * flag, but cleared below domains with SD_ASYM_CPUCAPACITY if the domain does
+ * not have clusters of CPUs sharing cache.
*
* NEEDS_GROUPS: Load balancing flag.
*/
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..a1d048344ea1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1723,8 +1723,18 @@ sd_init(struct sched_domain_topology_level *tl,
/*
* Convert topological properties into behaviour.
*/
- /* Don't attempt to spread across CPUs of different capacities. */
- if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+ /*
+ * Don't attempt to spread across CPUs of different capacities.
+ *
+ * If the domain has clusters of CPUs sharing L2 cache, keep the flag to
+ * spread tasks across clusters of identical capacity. Checks in
+ * update_sd_pick_busiest() prevent task migrations from high- to low-
+ * capacity CPUs for non-overloaded groups. Migrations to a lower-
+ * capacity CPU can happen if a higher-capacity group is overloaded and
+ * a low-capacity cluster has spare capacity.
+ */
+ if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
+ !(sd->child->flags & SD_CLUSTER))
sd->child->flags &= ~SD_PREFER_SIBLING;
if (sd->flags & SD_SHARE_CPUCAPACITY) {
--
2.43.0
* Re: [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
From: Tim Chen @ 2026-05-15 20:21 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On Thu, 2026-05-14 at 11:34 -0700, Ricardo Neri wrote:
> Some topologies have scheduling domains that contain CPUs of asymmetric
> capacity, grouped into two or more clusters of equal-capacity CPUs
> sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> balanced across these resource-sharing clusters.
>
> Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> load balancer that it should spread load among cluster siblings.
>
> Checks for capacity in update_sd_pick_busiest() prevent migrations from
> high- to low-capacity CPUs if a candidate group is not overloaded.
>
> An effect of keeping the SD_PREFER_SIBLING in domains with asymmetric
> capacity is that low-capacity clusters with spare capacity can now help
> overloaded higher-capacity groups. This was already the case for single-CPU
> groups (see calculate_imbalance() for domains with SD_SHARE_LLC).
>
> Once the overloading condition disappears, misfit load will still be used
> to move high-utilization tasks to bigger CPUs if they have spare capacity.
Looks good to me.
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
> * Updated documentation of SD_PREFER_SIBLING.
> * Expanded the patch description to explain the behavior when overloaded
> groups are involved.
>
> Changes in v2:
> * Reworded the patch description for clarity.
> * Kept parentheses around bitwise operators for clarity.
> ---
> include/linux/sched/sd_flags.h | 3 ++-
> kernel/sched/topology.c | 14 ++++++++++++--
> 2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> index 42839cfa2778..42f74af83b8c 100644
> --- a/include/linux/sched/sd_flags.h
> +++ b/include/linux/sched/sd_flags.h
> @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> * Prefer to place tasks in a sibling domain
> *
> * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY if the domain does
> + * not have clusters of CPUs sharing cache.
> *
> * NEEDS_GROUPS: Load balancing flag.
> */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d55..a1d048344ea1 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1723,8 +1723,18 @@ sd_init(struct sched_domain_topology_level *tl,
> /*
> * Convert topological properties into behaviour.
> */
> - /* Don't attempt to spread across CPUs of different capacities. */
> - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> + /*
> + * Don't attempt to spread across CPUs of different capacities.
> + *
> + * If the domain has clusters of CPUs sharing L2 cache, keep the flag to
> + * spread tasks across clusters of identical capacity. Checks in
> + * update_sd_pick_busiest() prevent task migrations from high- to low-
> + * capacity CPUs for non-overloaded groups. Migrations to a lower-
> + * capacity CPU can happen if a higher-capacity group is overloaded and
> + * a low-capacity cluster has spare capacity.
> + */
> + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> + !(sd->child->flags & SD_CLUSTER))
> sd->child->flags &= ~SD_PREFER_SIBLING;
>
> if (sd->flags & SD_SHARE_CPUCAPACITY) {