* [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
2026-05-06 10:38 ` Christian Loehle
2026-04-29 21:19 ` [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
update_sd_pick_busiest() may incorrectly select a fully_busy group as the
busiest group when its per-CPU capacity exceeds that of the destination
CPU. This happens because the type of busiest group is initialized to
group_has_spare and allows the fully_busy group to win the type comparison.
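For reference, the busiest-type initialization mentioned above happens in init_sd_lb_stats(); a simplified sketch (structure as in kernel/sched/fair.c at the time of writing, unrelated details elided):
static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
{
	/*
	 * busiest_stat starts out as group_has_spare, the lowest group type,
	 * so any candidate group can win the type comparison against it.
	 */
	*sds = (struct sd_lb_stats){
		.busiest	= NULL,
		.local		= NULL,
		.total_load	= 0UL,
		.total_capacity	= 0UL,
		.busiest_stat = {
			.idle_cpus	= UINT_MAX,
			.group_type	= group_has_spare,
		},
	};
}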
update_sd_pick_busiest() should not choose a candidate scheduling group
with at most one runnable task if its per-CPU capacity is greater than that
of the destination CPU. Such a check already exists, but it is done too
late: after the type comparison, preventing a subsequent fully_busy group
of equal per-CPU capacity from being correctly selected.
Move this check to occur before comparing group types.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
exclusive. (Tim)
* Kept parentheses around bitwise operators for clarity.
* Rewrote patch description for clarity.
---
kernel/sched/fair.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..0dbed82aa63f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10788,6 +10788,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
sds->local_stat.group_type != group_has_spare))
return false;
+ /*
+ * Candidate sg has no more than one task per CPU and has higher
+ * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+ * throughput. Maximize throughput, power/energy consequences are not
+ * considered.
+ *
+ * Systems with SMT are unaffected, as asymmetric capacity is not set
+ * in that case.
+ */
+ if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ (sgs->group_type <= group_fully_busy) &&
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -10890,17 +10904,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;
}
- /*
- * Candidate sg has no more than one task per CPU and has higher
- * per-CPU capacity. Migrating tasks to less capable CPUs may harm
- * throughput. Maximize throughput, power/energy consequences are not
- * considered.
- */
- if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
- (sgs->group_type <= group_fully_busy) &&
- (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
- return false;
-
return true;
}
--
2.43.0
* Re: [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
2026-04-29 21:19 ` [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-05-06 10:38 ` Christian Loehle
2026-05-06 23:45 ` Ricardo Neri
0 siblings, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2026-05-06 10:38 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Tim C Chen, Chen Yu, Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Andrea Righi
On 4/29/26 22:19, Ricardo Neri wrote:
> update_sd_pick_busiest() may incorrectly select a fully_busy group as the
> busiest group when its per-CPU capacity exceeds that of the destination
> CPU. This happens because the type of busiest group is initialized to
> group_has_spare and allows the fully_busy group to win the type comparison.
>
> update_sd_pick_busiest() should not choose a candidate scheduling group
> with at most one runnable task if its per-CPU capacity is greater than that
> of the destination CPU. Such a check already exists, but it is done too
> late: after the type comparison, preventing a subsequent fully_busy group
> of equal per-CPU capacity from being correctly selected.
>
> Move this check to occur before comparing group types.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v1:
> * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
> exclusive. (Tim)
> * Kept parentheses around bitwise operators for clarity.
> * Rewrote patch description for clarity.
> ---
> kernel/sched/fair.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..0dbed82aa63f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10788,6 +10788,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> + /*
> + * Candidate sg has no more than one task per CPU and has higher
> + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> + * throughput. Maximize throughput, power/energy consequences are not
> + * considered.
> + *
> + * Systems with SMT are unaffected, as asymmetric capacity is not set
> + * in that case.
> + */
> + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> + (sgs->group_type <= group_fully_busy) &&
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> + return false;
> +
> if (sgs->group_type > busiest->group_type)
> return true;
>
> @@ -10890,17 +10904,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
> }
>
> - /*
> - * Candidate sg has no more than one task per CPU and has higher
> - * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> - * throughput. Maximize throughput, power/energy consequences are not
> - * considered.
> - */
> - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> - (sgs->group_type <= group_fully_busy) &&
> - (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> - return false;
> -
> return true;
> }
>
>
I think it deserves a Fixes, but nonetheless:
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
I've CCed Andrea, just because of this SMT -> !SD_ASYM_CPUCAPACITY currently
being up for debate...
* Re: [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
2026-05-06 10:38 ` Christian Loehle
@ 2026-05-06 23:45 ` Ricardo Neri
0 siblings, 0 replies; 10+ messages in thread
From: Ricardo Neri @ 2026-05-06 23:45 UTC (permalink / raw)
To: Christian Loehle
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Barry Song,
Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Andrea Righi
On Wed, May 06, 2026 at 11:38:31AM +0100, Christian Loehle wrote:
> On 4/29/26 22:19, Ricardo Neri wrote:
> > update_sd_pick_busiest() may incorrectly select a fully_busy group as the
> > busiest group when its per-CPU capacity exceeds that of the destination
> > CPU. This happens because the type of busiest group is initialized to
> > group_has_spare and allows the fully_busy group to win the type comparison.
> >
> > update_sd_pick_busiest() should not choose a candidate scheduling group
> > with at most one runnable task if its per-CPU capacity is greater than that
> > of the destination CPU. Such a check already exists, but it is done too
> > late: after the type comparison, preventing a subsequent fully_busy group
> > of equal per-CPU capacity from being correctly selected.
> >
> > Move this check to occur before comparing group types.
> >
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v1:
> > * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
> > exclusive. (Tim)
> > * Kept parentheses around bitwise operators for clarity.
> > * Rewrote patch description for clarity.
> > ---
> > kernel/sched/fair.c | 25 ++++++++++++++-----------
> > 1 file changed, 14 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 728965851842..0dbed82aa63f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10788,6 +10788,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > sds->local_stat.group_type != group_has_spare))
> > return false;
> >
> > + /*
> > + * Candidate sg has no more than one task per CPU and has higher
> > + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> > + * throughput. Maximize throughput, power/energy consequences are not
> > + * considered.
> > + *
> > + * Systems with SMT are unaffected, as asymmetric capacity is not set
> > + * in that case.
> > + */
> > + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> > + (sgs->group_type <= group_fully_busy) &&
> > + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> > + return false;
> > +
> > if (sgs->group_type > busiest->group_type)
> > return true;
> >
> > @@ -10890,17 +10904,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > break;
> > }
> >
> > - /*
> > - * Candidate sg has no more than one task per CPU and has higher
> > - * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> > - * throughput. Maximize throughput, power/energy consequences are not
> > - * considered.
> > - */
> > - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> > - (sgs->group_type <= group_fully_busy) &&
> > - (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> > - return false;
> > -
> > return true;
> > }
> >
> >
>
> I think it deserves a Fixes, but nonetheless:
I will add this tag.
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Thank you!
>
> I've CCed Andrea, just because of this SMT -> !SD_ASYM_CPUCAPACITY currently
> being up for debate...
Ah, I missed that patchset. I will take a look. For now I will leave the
comment. It can always be updated later.
* [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
2026-05-06 11:39 ` Christian Loehle
2026-04-29 21:19 ` [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
3 siblings, 1 reply; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
In domains with asymmetric capacity, identifying misfit load in a
scheduling group is not useful when the destination CPU cannot help (i.e.,
its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
such cases, it also prevents load balance among clusters of equal capacity
when CONFIG_SCHED_CLUSTER is enabled. This happens because
update_sd_pick_busiest() skips candidate groups of type misfit_task if the
destination CPU has similar capacity.
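(For reference, the ~5% margin above comes from capacity_greater(), which at the time of writing is defined in kernel/sched/fair.c roughly as:)
/*
 * cap1 is considered noticeably greater than cap2 only when it exceeds it
 * by more than ~5% (1078/1024).
 */
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)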
Skipping misfit load accounting in this situation allows the group to be
classified as has_spare or fully_busy and lets load balancing proceed. Keep
marking scheduling groups as overloaded when misfit tasks are present. This
flag propagates to the root domain and allows bigger CPUs in it to help
via newly idle balance.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Moved the check of the destination CPU capacity inside the code block
used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
exclusion of the sched_reduced_capacity() path.
* Keep marking the root domain as overloaded to allow bigger CPUs to
help. (sashiko)
* Fixed patch description to clarify that capacity_greater() looks for
differences of 5% or more. (Christian)
* Reworded the patch description for clarity.
* I did not include the Reviewed-by tag from Christian since the patch
changed functionally.
---
kernel/sched/fair.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dbed82aa63f..166a5b109e0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10719,10 +10719,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
if (sd_flags & SD_ASYM_CPUCAPACITY) {
- /* Check for a misfit task on the cpu */
- if (sgs->group_misfit_task_load < rq->misfit_task_load) {
- sgs->group_misfit_task_load = rq->misfit_task_load;
+ if (rq->misfit_task_load) {
+ /*
+ * Always mark the domain overloaded so big CPUs
+ * can pick up misfit tasks via newly idle
+ * balance.
+ */
*sg_overloaded = 1;
+
+ /*
+ * Only account misfit load if @dst_cpu can
+ * help, otherwise the group may be classified
+ * as misfit_task and update_sd_pick_busiest()
+ * will skip it.
+ */
+ if (capacity_greater(capacity_of(env->dst_cpu),
+ group->sgc->max_capacity) &&
+ (sgs->group_misfit_task_load < rq->misfit_task_load))
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
--
2.43.0
* Re: [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-04-29 21:19 ` [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-05-06 11:39 ` Christian Loehle
2026-05-06 23:47 ` Ricardo Neri
0 siblings, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2026-05-06 11:39 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Tim C Chen, Chen Yu, Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On 4/29/26 22:19, Ricardo Neri wrote:
> In domains with asymmetric capacity, identifying misfit load in a
> scheduling group is not useful when the destination CPU cannot help (i.e.,
> its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
> such cases, it also prevents load balance among clusters of equal capacity
> when CONFIG_SCHED_CLUSTER is enabled. This happens because
> update_sd_pick_busiest() skips candidate groups of type misfit_task if the
> destination CPU has similar capacity.
>
> Skipping misfit load accounting in this situation allows the group to be
> classified as has_spare or fully_busy and lets load balancing proceed. Keep
> marking scheduling groups as overloaded when misfit tasks are present. This
> flag propagates to the root domain and allows bigger CPUs in it to help
> via newly idle balance.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v1:
> * Moved the check of the destination CPU capacity inside the code block
> used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
> exclusion of the sched_reduced_capacity() path.
> * Keep marking the root domain as overloaded to allow bigger CPUs to
> help. (sashiko)
> * Fixed patch description to clarify that capacity_greater() looks for
> differences of 5% or more. (Christian)
> * Reworded the patch description for clarity.
> * I did not include the Reviewed-by tag from Christian since the patch
> changed functionally.
> ---
> kernel/sched/fair.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0dbed82aa63f..166a5b109e0e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10719,10 +10719,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
>
> if (sd_flags & SD_ASYM_CPUCAPACITY) {
> - /* Check for a misfit task on the cpu */
> - if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> - sgs->group_misfit_task_load = rq->misfit_task_load;
> + if (rq->misfit_task_load) {
> + /*
> + * Always mark the domain overloaded so big CPUs
> + * can pick up misfit tasks via newly idle
> + * balance.
> + */
> *sg_overloaded = 1;
> +
> + /*
> + * Only account misfit load if @dst_cpu can
> + * help, otherwise the group may be classified
> + * as misfit_task and update_sd_pick_busiest()
> + * will skip it.
> + */
> + if (capacity_greater(capacity_of(env->dst_cpu),
> + group->sgc->max_capacity) &&
> + (sgs->group_misfit_task_load < rq->misfit_task_load))
> + sgs->group_misfit_task_load = rq->misfit_task_load;
> }
> } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> /* Check for a task running on a CPU with reduced capacity */
>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
* Re: [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-05-06 11:39 ` Christian Loehle
@ 2026-05-06 23:47 ` Ricardo Neri
0 siblings, 0 replies; 10+ messages in thread
From: Ricardo Neri @ 2026-05-06 23:47 UTC (permalink / raw)
To: Christian Loehle
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Barry Song,
Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On Wed, May 06, 2026 at 12:39:23PM +0100, Christian Loehle wrote:
> On 4/29/26 22:19, Ricardo Neri wrote:
> > In domains with asymmetric capacity, identifying misfit load in a
> > scheduling group is not useful when the destination CPU cannot help (i.e.,
> > its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
> > such cases, it also prevents load balance among clusters of equal capacity
> > when CONFIG_SCHED_CLUSTER is enabled. This happens because
> > update_sd_pick_busiest() skips candidate groups of type misfit_task if the
> > destination CPU has similar capacity.
> >
> > Skipping misfit load accounting in this situation allows the group to be
> > classified as has_spare or fully_busy and lets load balancing proceed. Keep
> > marking scheduling groups as overloaded when misfit tasks are present. This
> > flag propagates to the root domain and allows bigger CPUs in it to help
> > via newly idle balance.
> >
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v1:
> > * Moved the check of the destination CPU capacity inside the code block
> > used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
> > exclusion of the sched_reduced_capacity() path.
> > * Keep marking the root domain as overloaded to allow bigger CPUs to
> > help. (sashiko)
> > * Fixed patch description to clarify that capacity_greater() looks for
> > differences of 5% or more. (Christian)
> > * Reworded the patch description for clarity.
> > * I did not include the Reviewed-by tag from Christian since the patch
> > changed functionally.
> > ---
> > kernel/sched/fair.c | 20 +++++++++++++++++---
> > 1 file changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0dbed82aa63f..166a5b109e0e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10719,10 +10719,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > continue;
> >
> > if (sd_flags & SD_ASYM_CPUCAPACITY) {
> > - /* Check for a misfit task on the cpu */
> > - if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> > - sgs->group_misfit_task_load = rq->misfit_task_load;
> > + if (rq->misfit_task_load) {
> > + /*
> > + * Always mark the domain overloaded so big CPUs
> > + * can pick up misfit tasks via newly idle
> > + * balance.
> > + */
> > *sg_overloaded = 1;
> > +
> > + /*
> > + * Only account misfit load if @dst_cpu can
> > + * help, otherwise the group may be classified
> > + * as misfit_task and update_sd_pick_busiest()
> > + * will skip it.
> > + */
> > + if (capacity_greater(capacity_of(env->dst_cpu),
> > + group->sgc->max_capacity) &&
> > + (sgs->group_misfit_task_load < rq->misfit_task_load))
> > + sgs->group_misfit_task_load = rq->misfit_task_load;
> > }
> > } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> > /* Check for a task running on a CPU with reduced capacity */
> >
>
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Thank you for your review!
* [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
2026-05-06 13:10 ` Christian Loehle
2026-04-29 21:19 ` [PATCH v2 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
3 siblings, 1 reply; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
sched_balance_find_src_rq() avoids selecting a runqueue with a single
running task as busiest if doing so results in migrating the task to a
CPU with less than ~5% of extra capacity. It also unintentionally
prevents migrations between CPUs of identical capacity.
When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
clusters of CPUs with the same capacity. Allowing migration between CPUs
of identical capacity is necessary to meet this goal.
We are interested in the architectural capacity of the involved CPUs,
excluding any reductions due to side activity or thermal pressure. Use
arch_scale_cpu_capacity().
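A sketch of the two capacity views contrasted here (simplified; capacity_of() as in kernel/sched/fair.c at the time of writing):
/*
 * arch_scale_cpu_capacity(cpu): the architectural capacity of the CPU, a
 * per-CPU constant unaffected by runtime pressure.
 *
 * capacity_of(cpu): the capacity currently left for CFS tasks, i.e. the
 * architectural capacity reduced by RT/DL/IRQ and thermal pressure.
 */
static unsigned long capacity_of(int cpu)
{
	return cpu_rq(cpu)->cpu_capacity;
}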
While here, invert the check for runtime capacity for clarity.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
runtime variability.
* Inverted the check for runtime capacity. (Christian)
* Reworded patch description for clarity.
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 166a5b109e0e..4105717e64fe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11816,9 +11816,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
* eventually lead to active_balancing high->low capacity.
* Higher per-CPU capacity is considered better than balancing
* average load.
+ *
+ * Cluster scheduling requires balancing load across clusters
+ * of identical capacity. Use architectural capacity to ignore
+ * runtime variability.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
+ arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i) &&
+ capacity_greater(capacity, capacity_of(env->dst_cpu)) &&
nr_running == 1)
continue;
--
2.43.0
* Re: [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
2026-04-29 21:19 ` [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-05-06 13:10 ` Christian Loehle
0 siblings, 0 replies; 10+ messages in thread
From: Christian Loehle @ 2026-05-06 13:10 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Tim C Chen, Chen Yu, Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On 4/29/26 22:19, Ricardo Neri wrote:
> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> running task as busiest if doing so results in migrating the task to a
> CPU with less than ~5% of extra capacity. It also unintentionally
> prevents migrations between CPUs of identical capacity.
>
> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> clusters of CPUs with the same capacity. Allowing migration between CPUs
> of identical capacity is necessary to meet this goal.
>
> We are interested in the architectural capacity of the involved CPUs,
> excluding any reductions due to side activity or thermal pressure. Use
> arch_scale_cpu_capacity().
>
> While here, invert the check for runtime capacity for clarity.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v1:
> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
> runtime variability.
> * Inverted the check for runtime capacity. (Christian)
> * Reworded patch description for clarity.
> ---
> kernel/sched/fair.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 166a5b109e0e..4105717e64fe 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11816,9 +11816,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> * eventually lead to active_balancing high->low capacity.
> * Higher per-CPU capacity is considered better than balancing
> * average load.
> + *
> + * Cluster scheduling requires balancing load across clusters
> + * of identical capacity. Use architectural capacity to ignore
> + * runtime variability.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> + arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i) &&
> + capacity_greater(capacity, capacity_of(env->dst_cpu)) &&
> nr_running == 1)
> continue;
>
>
I wonder if we shouldn't use the capacity_greater() margin for both, i.e.
capacity_greater(arch_scale_cpu_capacity(i), arch_scale_cpu_capacity(env->dst_cpu)) &&
For example, the Orion O6 has a cluster with 1024 and one with 984. If we allow balancing
984->984, I think it's only consistent to also allow 984->1024.
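FWIW, with the current capacity_greater() definition neither of those values clears the ~5% margin against the other; a quick standalone check (sketch, macro copied from kernel/sched/fair.c as of this writing):
#include <stdio.h>
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
int main(void)
{
	unsigned long big = 1024, little = 984;	/* Orion O6-style capacities */
	printf("%d\n", capacity_greater(big, little));	/* 0: 1024 is only ~4% above 984 */
	printf("%d\n", capacity_greater(little, big));	/* 0 */
	return 0;
}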
* [PATCH v2 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
` (2 preceding siblings ...)
2026-04-29 21:19 ` [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
3 siblings, 0 replies; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
Some topologies have scheduling domains that contain CPUs of asymmetric
capacity, grouped into two or more clusters of equal-capacity CPUs
sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
balanced across these resource-sharing clusters.
Do not clear the SD_PREFER_SIBLING flag in the child domains, to indicate to
the load balancer that it should spread load among cluster siblings.
Checks for capacity in the load balancer will prevent migrations from
high- to low-capacity CPUs. Likewise, misfit load will still be used to
move high-utilization tasks to bigger CPUs.
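As an illustration (hypothetical topology, not taken from the patch), consider a package where two little clusters share an L2 and a big CPU sits alongside them:
/*
 *   CLS: {E0 E1 E2 E3} {E4 E5 E6 E7}	- little clusters, SD_CLUSTER set
 *   PKG: {E0..E7, B0}			- SD_ASYM_CPUCAPACITY set
 *
 * With this change, sd_init() keeps SD_PREFER_SIBLING in the CLS children,
 * so the PKG-level balancer still spreads load across the equal-capacity
 * little clusters, while the capacity checks mentioned above keep tasks
 * from being pushed from B0 down to the little CPUs:
 */
if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
    !(sd->child->flags & SD_CLUSTER))
	sd->child->flags &= ~SD_PREFER_SIBLING;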
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Reworded the patch description for clarity.
* Kept parentheses around bitwise operators for clarity.
---
kernel/sched/topology.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..78ffc1b8eaff 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1723,8 +1723,15 @@ sd_init(struct sched_domain_topology_level *tl,
/*
* Convert topological properties into behaviour.
*/
- /* Don't attempt to spread across CPUs of different capacities. */
- if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+ /*
+ * Don't attempt to spread across CPUs of different capacities. An
+ * exception to this rule is a domain in which there are clusters of
+ * CPUs sharing a resource. Keep the flag in that case to balance load
+ * among them. The load balancer will prevent task migrations from
+ * high- to low-capacity CPUs.
+ */
+ if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
+ !(sd->child->flags & SD_CLUSTER))
sd->child->flags &= ~SD_PREFER_SIBLING;
if (sd->flags & SD_SHARE_CPUCAPACITY) {
--
2.43.0