* [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
2026-05-06 10:38 ` Christian Loehle
2026-04-29 21:19 ` [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
update_sd_pick_busiest() may incorrectly select a fully_busy group as the
busiest group when its per-CPU capacity exceeds that of the destination
CPU. This happens because the type of busiest group is initialized to
group_has_spare and allows the fully_busy group to win the type comparison.
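For reference, the busiest-type initialization mentioned above happens in init_sd_lb_stats(); a simplified sketch (structure as in kernel/sched/fair.c at the time of writing, unrelated details elided):
static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
{
	/*
	 * busiest_stat starts out as group_has_spare, the lowest group type,
	 * so any candidate group can win the type comparison against it.
	 */
	*sds = (struct sd_lb_stats){
		.busiest	= NULL,
		.local		= NULL,
		.total_load	= 0UL,
		.total_capacity	= 0UL,
		.busiest_stat = {
			.idle_cpus	= UINT_MAX,
			.group_type	= group_has_spare,
		},
	};
}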
update_sd_pick_busiest() should not choose a candidate scheduling group
with at most one runnable task if its per-CPU capacity is greater than that
of the destination CPU. Such a check already exists, but it is done too
late: after the type comparison, preventing a subsequent fully_busy group
of equal per-CPU capacity from being correctly selected.
Move this check to occur before comparing group types.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
exclusive. (Tim)
* Kept parentheses around bitwise operators for clarity.
* Rewrote patch description for clarity.
---
kernel/sched/fair.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 728965851842..0dbed82aa63f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10788,6 +10788,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
sds->local_stat.group_type != group_has_spare))
return false;
+ /*
+ * Candidate sg has no more than one task per CPU and has higher
+ * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+ * throughput. Maximize throughput, power/energy consequences are not
+ * considered.
+ *
+ * Systems with SMT are unaffected, as asymmetric capacity is not set
+ * in that case.
+ */
+ if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+ (sgs->group_type <= group_fully_busy) &&
+ (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -10890,17 +10904,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
break;
}
- /*
- * Candidate sg has no more than one task per CPU and has higher
- * per-CPU capacity. Migrating tasks to less capable CPUs may harm
- * throughput. Maximize throughput, power/energy consequences are not
- * considered.
- */
- if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
- (sgs->group_type <= group_fully_busy) &&
- (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
- return false;
-
return true;
}
--
2.43.0
* Re: [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
2026-04-29 21:19 ` [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-05-06 10:38 ` Christian Loehle
2026-05-06 23:45 ` Ricardo Neri
0 siblings, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2026-05-06 10:38 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Tim C Chen, Chen Yu, Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Andrea Righi
On 4/29/26 22:19, Ricardo Neri wrote:
> update_sd_pick_busiest() may incorrectly select a fully_busy group as the
> busiest group when its per-CPU capacity exceeds that of the destination
> CPU. This happens because the type of busiest group is initialized to
> group_has_spare and allows the fully_busy group to win the type comparison.
>
> update_sd_pick_busiest() should not choose a candidate scheduling group
> with at most one runnable task if its per-CPU capacity is greater than that
> of the destination CPU. Such a check already exists, but it is done too
> late: after the type comparison, preventing a subsequent fully_busy group
> of equal per-CPU capacity from being correctly selected.
>
> Move this check to occur before comparing group types.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v1:
> * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
> exclusive. (Tim)
> * Kept parentheses around bitwise operators for clarity.
> * Rewrote patch description for clarity.
> ---
> kernel/sched/fair.c | 25 ++++++++++++++-----------
> 1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 728965851842..0dbed82aa63f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10788,6 +10788,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> sds->local_stat.group_type != group_has_spare))
> return false;
>
> + /*
> + * Candidate sg has no more than one task per CPU and has higher
> + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> + * throughput. Maximize throughput, power/energy consequences are not
> + * considered.
> + *
> + * Systems with SMT are unaffected, as asymmetric capacity is not set
> + * in that case.
> + */
> + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> + (sgs->group_type <= group_fully_busy) &&
> + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> + return false;
> +
> if (sgs->group_type > busiest->group_type)
> return true;
>
> @@ -10890,17 +10904,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
> }
>
> - /*
> - * Candidate sg has no more than one task per CPU and has higher
> - * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> - * throughput. Maximize throughput, power/energy consequences are not
> - * considered.
> - */
> - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> - (sgs->group_type <= group_fully_busy) &&
> - (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> - return false;
> -
> return true;
> }
>
>
I think it deserves a Fixes, but nonetheless:
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
I've CCed Andrea, just because of this SMT -> !SD_ASYM_CPUCAPACITY currently
being up for debate...
* Re: [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
2026-05-06 10:38 ` Christian Loehle
@ 2026-05-06 23:45 ` Ricardo Neri
0 siblings, 0 replies; 10+ messages in thread
From: Ricardo Neri @ 2026-05-06 23:45 UTC (permalink / raw)
To: Christian Loehle
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Barry Song,
Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Andrea Righi
On Wed, May 06, 2026 at 11:38:31AM +0100, Christian Loehle wrote:
> On 4/29/26 22:19, Ricardo Neri wrote:
> > update_sd_pick_busiest() may incorrectly select a fully_busy group as the
> > busiest group when its per-CPU capacity exceeds that of the destination
> > CPU. This happens because the type of busiest group is initialized to
> > group_has_spare and allows the fully_busy group to win the type comparison.
> >
> > update_sd_pick_busiest() should not choose a candidate scheduling group
> > with at most one runnable task if its per-CPU capacity is greater than that
> > of the destination CPU. Such a check already exists, but it is done too
> > late: after the type comparison, preventing a subsequent fully_busy group
> > of equal per-CPU capacity from being correctly selected.
> >
> > Move this check to occur before comparing group types.
> >
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v1:
> > * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
> > exclusive. (Tim)
> > * Kept parentheses around bitwise operators for clarity.
> > * Rewrote patch description for clarity.
> > ---
> > kernel/sched/fair.c | 25 ++++++++++++++-----------
> > 1 file changed, 14 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 728965851842..0dbed82aa63f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10788,6 +10788,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > sds->local_stat.group_type != group_has_spare))
> > return false;
> >
> > + /*
> > + * Candidate sg has no more than one task per CPU and has higher
> > + * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> > + * throughput. Maximize throughput, power/energy consequences are not
> > + * considered.
> > + *
> > + * Systems with SMT are unaffected, as asymmetric capacity is not set
> > + * in that case.
> > + */
> > + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> > + (sgs->group_type <= group_fully_busy) &&
> > + (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> > + return false;
> > +
> > if (sgs->group_type > busiest->group_type)
> > return true;
> >
> > @@ -10890,17 +10904,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> > break;
> > }
> >
> > - /*
> > - * Candidate sg has no more than one task per CPU and has higher
> > - * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> > - * throughput. Maximize throughput, power/energy consequences are not
> > - * considered.
> > - */
> > - if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> > - (sgs->group_type <= group_fully_busy) &&
> > - (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> > - return false;
> > -
> > return true;
> > }
> >
> >
>
> I think it deserves a Fixes, but nonetheless:
I will add this tag.
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Thank you!
>
> I've CCed Andrea, just because of this SMT -> !SD_ASYM_CPUCAPACITY currently
> being up for debate...
Ah, I missed that patchset. I will take a look. For now I will leave the
comment. It can always be updated later.
* [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
2026-05-06 11:39 ` Christian Loehle
2026-04-29 21:19 ` [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
3 siblings, 1 reply; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
In domains with asymmetric capacity, identifying misfit load in a
scheduling group is not useful when the destination CPU cannot help (i.e.,
its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
such cases, it also prevents load balance among clusters of equal capacity
when CONFIG_SCHED_CLUSTER is enabled. This happens because
update_sd_pick_busiest() skips candidate groups of type misfit_task if the
destination CPU has similar capacity.
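(For reference, the ~5% margin above comes from capacity_greater(), which at the time of writing is defined in kernel/sched/fair.c roughly as:)
/*
 * cap1 is considered noticeably greater than cap2 only when it exceeds it
 * by more than ~5% (1078/1024).
 */
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)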
Skipping misfit load accounting in this situation allows the group to be
classified as has_spare or fully_busy and lets load balancing proceed. Keep
marking scheduling groups as overloaded when misfit tasks are present. This
flag propagates to the root domain and allows bigger CPUs in it to help
via newly idle balance.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Moved the check of the destination CPU capacity inside the code block
used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
exclusion of the sched_reduced_capacity() path.
* Keep marking the root domain as overloaded to allow bigger CPUs to
help. (sashiko)
* Fixed patch description to clarify that capacity_greater() looks for
differences of 5% or more. (Christian)
* Reworded the patch description for clarity.
* I did not include the Reviewed-by tag from Christian since the patch
changed functionally.
---
kernel/sched/fair.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dbed82aa63f..166a5b109e0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10719,10 +10719,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
continue;
if (sd_flags & SD_ASYM_CPUCAPACITY) {
- /* Check for a misfit task on the cpu */
- if (sgs->group_misfit_task_load < rq->misfit_task_load) {
- sgs->group_misfit_task_load = rq->misfit_task_load;
+ if (rq->misfit_task_load) {
+ /*
+ * Always mark the domain overloaded so big CPUs
+ * can pick up misfit tasks via newly idle
+ * balance.
+ */
*sg_overloaded = 1;
+
+ /*
+ * Only account misfit load if @dst_cpu can
+ * help, otherwise the group may be classified
+ * as misfit_task and update_sd_pick_busiest()
+ * will skip it.
+ */
+ if (capacity_greater(capacity_of(env->dst_cpu),
+ group->sgc->max_capacity) &&
+ (sgs->group_misfit_task_load < rq->misfit_task_load))
+ sgs->group_misfit_task_load = rq->misfit_task_load;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
--
2.43.0
* Re: [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-04-29 21:19 ` [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-05-06 11:39 ` Christian Loehle
2026-05-06 23:47 ` Ricardo Neri
0 siblings, 1 reply; 10+ messages in thread
From: Christian Loehle @ 2026-05-06 11:39 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Tim C Chen, Chen Yu, Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On 4/29/26 22:19, Ricardo Neri wrote:
> In domains with asymmetric capacity, identifying misfit load in a
> scheduling group is not useful when the destination CPU cannot help (i.e.,
> its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
> such cases, it also prevents load balance among clusters of equal capacity
> when CONFIG_SCHED_CLUSTER is enabled. This happens because
> update_sd_pick_busiest() skips candidate groups of type misfit_task if the
> destination CPU has similar capacity.
>
> Skipping misfit load accounting in this situation allows the group to be
> classified as has_spare or fully_busy and lets load balancing proceed. Keep
> marking scheduling groups as overloaded when misfit tasks are present. This
> flag propagates to the root domain and allows bigger CPUs in it to help
> via newly idle balance.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v1:
> * Moved the check of the destination CPU capacity inside the code block
> used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
> exclusion of the sched_reduced_capacity() path.
> * Keep marking the root domain as overloaded to allow bigger CPUs to
> help. (sashiko)
> * Fixed patch description to clarify that capacity_greater() looks for
> differences of 5% or more. (Christian)
> * Reworded the patch description for clarity.
> * I did not include the Reviewed-by tag from Christian since the patch
> changed functionally.
> ---
> kernel/sched/fair.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0dbed82aa63f..166a5b109e0e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10719,10 +10719,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> continue;
>
> if (sd_flags & SD_ASYM_CPUCAPACITY) {
> - /* Check for a misfit task on the cpu */
> - if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> - sgs->group_misfit_task_load = rq->misfit_task_load;
> + if (rq->misfit_task_load) {
> + /*
> + * Always mark the domain overloaded so big CPUs
> + * can pick up misfit tasks via newly idle
> + * balance.
> + */
> *sg_overloaded = 1;
> +
> + /*
> + * Only account misfit load if @dst_cpu can
> + * help, otherwise the group may be classified
> + * as misfit_task and update_sd_pick_busiest()
> + * will skip it.
> + */
> + if (capacity_greater(capacity_of(env->dst_cpu),
> + group->sgc->max_capacity) &&
> + (sgs->group_misfit_task_load < rq->misfit_task_load))
> + sgs->group_misfit_task_load = rq->misfit_task_load;
> }
> } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> /* Check for a task running on a CPU with reduced capacity */
>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
* Re: [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
2026-05-06 11:39 ` Christian Loehle
@ 2026-05-06 23:47 ` Ricardo Neri
0 siblings, 0 replies; 10+ messages in thread
From: Ricardo Neri @ 2026-05-06 23:47 UTC (permalink / raw)
To: Christian Loehle
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Barry Song,
Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On Wed, May 06, 2026 at 12:39:23PM +0100, Christian Loehle wrote:
> On 4/29/26 22:19, Ricardo Neri wrote:
> > In domains with asymmetric capacity, identifying misfit load in a
> > scheduling group is not useful when the destination CPU cannot help (i.e.,
> > its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
> > such cases, it also prevents load balance among clusters of equal capacity
> > when CONFIG_SCHED_CLUSTER is enabled. This happens because
> > update_sd_pick_busiest() skips candidate groups of type misfit_task if the
> > destination CPU has similar capacity.
> >
> > Skipping misfit load accounting in this situation allows the group to be
> > classified as has_spare or fully_busy and lets load balancing proceed. Keep
> > marking scheduling groups as overloaded when misfit tasks are present. This
> > flag propagates to the root domain and allows bigger CPUs in it to help
> > via newly idle balance.
> >
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> > Changes since v1:
> > * Moved the check of the destination CPU capacity inside the code block
> > used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
> > exclusion of the sched_reduced_capacity() path.
> > * Keep marking the root domain as overloaded to allow bigger CPUs to
> > help. (sashiko)
> > * Fixed patch description to clarify that capacity_greater() looks for
> > differences of 5% or more. (Christian)
> > * Reworded the patch description for clarity.
> > * I did not include the Reviewed-by tag from Christian since the patch
> > changed functionally.
> > ---
> > kernel/sched/fair.c | 20 +++++++++++++++++---
> > 1 file changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0dbed82aa63f..166a5b109e0e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10719,10 +10719,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > continue;
> >
> > if (sd_flags & SD_ASYM_CPUCAPACITY) {
> > - /* Check for a misfit task on the cpu */
> > - if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> > - sgs->group_misfit_task_load = rq->misfit_task_load;
> > + if (rq->misfit_task_load) {
> > + /*
> > + * Always mark the domain overloaded so big CPUs
> > + * can pick up misfit tasks via newly idle
> > + * balance.
> > + */
> > *sg_overloaded = 1;
> > +
> > + /*
> > + * Only account misfit load if @dst_cpu can
> > + * help, otherwise the group may be classified
> > + * as misfit_task and update_sd_pick_busiest()
> > + * will skip it.
> > + */
> > + if (capacity_greater(capacity_of(env->dst_cpu),
> > + group->sgc->max_capacity) &&
> > + (sgs->group_misfit_task_load < rq->misfit_task_load))
> > + sgs->group_misfit_task_load = rq->misfit_task_load;
> > }
> > } else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
> > /* Check for a task running on a CPU with reduced capacity */
> >
>
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Thank you for your review!
* [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
2026-04-29 21:19 ` [PATCH v2 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
2026-05-06 13:10 ` Christian Loehle
2026-04-29 21:19 ` [PATCH v2 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
3 siblings, 1 reply; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
sched_balance_find_src_rq() avoids selecting a runqueue with a single
running task as busiest if doing so results in migrating the task to a
CPU with less than ~5% of extra capacity. It also unintentionally
prevents migrations between CPUs of identical capacity.
When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
clusters of CPUs with the same capacity. Allowing migration between CPUs
of identical capacity is necessary to meet this goal.
We are interested in the architectural capacity of the involved CPUs,
excluding any reductions due to side activity or thermal pressure. Use
arch_scale_cpu_capacity().
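A sketch of the two capacity views contrasted here (simplified; capacity_of() as in kernel/sched/fair.c at the time of writing):
/*
 * arch_scale_cpu_capacity(cpu): the architectural capacity of the CPU, a
 * per-CPU constant unaffected by runtime pressure.
 *
 * capacity_of(cpu): the capacity currently left for CFS tasks, i.e. the
 * architectural capacity reduced by RT/DL/IRQ and thermal pressure.
 */
static unsigned long capacity_of(int cpu)
{
	return cpu_rq(cpu)->cpu_capacity;
}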
While here, invert the check for runtime capacity for clarity.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
runtime variability.
* Inverted the check for runtime capacity. (Christian)
* Reworded patch description for clarity.
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 166a5b109e0e..4105717e64fe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11816,9 +11816,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
* eventually lead to active_balancing high->low capacity.
* Higher per-CPU capacity is considered better than balancing
* average load.
+ *
+ * Cluster scheduling requires balancing load across clusters
+ * of identical capacity. Use architectural capacity to ignore
+ * runtime variability.
*/
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
- !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
+ arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i) &&
+ capacity_greater(capacity, capacity_of(env->dst_cpu)) &&
nr_running == 1)
continue;
--
2.43.0
* Re: [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
2026-04-29 21:19 ` [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-05-06 13:10 ` Christian Loehle
0 siblings, 0 replies; 10+ messages in thread
From: Christian Loehle @ 2026-05-06 13:10 UTC (permalink / raw)
To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Tim C Chen, Chen Yu, Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel
On 4/29/26 22:19, Ricardo Neri wrote:
> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> running task as busiest if doing so results in migrating the task to a
> CPU with less than ~5% of extra capacity. It also unintentionally
> prevents migrations between CPUs of identical capacity.
>
> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> clusters of CPUs with the same capacity. Allowing migration between CPUs
> of identical capacity is necessary to meet this goal.
>
> We are interested in the architectural capacity of the involved CPUs,
> excluding any reductions due to side activity or thermal pressure. Use
> arch_scale_cpu_capacity().
>
> While here, invert the check for runtime capacity for clarity.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v1:
> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
> runtime variability.
> * Inverted the check for runtime capacity. (Christian)
> * Reworded patch description for clarity.
> ---
> kernel/sched/fair.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 166a5b109e0e..4105717e64fe 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11816,9 +11816,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> * eventually lead to active_balancing high->low capacity.
> * Higher per-CPU capacity is considered better than balancing
> * average load.
> + *
> + * Cluster scheduling requires balancing load across clusters
> + * of identical capacity. Use architectural capacity to ignore
> + * runtime variability.
> */
> if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> - !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
> + arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i) &&
> + capacity_greater(capacity, capacity_of(env->dst_cpu)) &&
> nr_running == 1)
> continue;
>
>
I wonder if we shouldn't use the capacity_greater() margin for both, i.e.
capacity_greater(arch_scale_cpu_capacity(i), arch_scale_cpu_capacity(env->dst_cpu)) &&
For example, the Orion O6 has a cluster with 1024 and one with 984. If we allow balancing
984->984, I think it's only consistent to also allow 984->1024.
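FWIW, with the current capacity_greater() definition neither of those values clears the ~5% margin against the other; a quick standalone check (sketch, macro copied from kernel/sched/fair.c as of this writing):
#include <stdio.h>
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
int main(void)
{
	unsigned long big = 1024, little = 984;	/* Orion O6-style capacities */
	printf("%d\n", capacity_greater(big, little));	/* 0: 1024 is only ~4% above 984 */
	printf("%d\n", capacity_greater(little, big));	/* 0 */
	return 0;
}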
* [PATCH v2 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
2026-04-29 21:19 [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
` (2 preceding siblings ...)
2026-04-29 21:19 ` [PATCH v2 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-04-29 21:19 ` Ricardo Neri
3 siblings, 0 replies; 10+ messages in thread
From: Ricardo Neri @ 2026-04-29 21:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
Barry Song
Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
Ricardo Neri
Some topologies have scheduling domains that contain CPUs of asymmetric
capacity, grouped into two or more clusters of equal-capacity CPUs
sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
balanced across these resource-sharing clusters.
Do not clear the SD_PREFER_SIBLING flag in the child domains, to indicate to
the load balancer that it should spread load among cluster siblings.
Checks for capacity in the load balancer will prevent migrations from
high- to low-capacity CPUs. Likewise, misfit load will still be used to
move high-utilization tasks to bigger CPUs.
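As an illustration (hypothetical topology, not taken from the patch), consider a package where two little clusters share an L2 and a big CPU sits alongside them:
/*
 *   CLS: {E0 E1 E2 E3} {E4 E5 E6 E7}	- little clusters, SD_CLUSTER set
 *   PKG: {E0..E7, B0}			- SD_ASYM_CPUCAPACITY set
 *
 * With this change, sd_init() keeps SD_PREFER_SIBLING in the CLS children,
 * so the PKG-level balancer still spreads load across the equal-capacity
 * little clusters, while the capacity checks mentioned above keep tasks
 * from being pushed from B0 down to the little CPUs:
 */
if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
    !(sd->child->flags & SD_CLUSTER))
	sd->child->flags &= ~SD_PREFER_SIBLING;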
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v1:
* Reworded the patch description for clarity.
* Kept parentheses around bitwise operators for clarity.
---
kernel/sched/topology.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..78ffc1b8eaff 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1723,8 +1723,15 @@ sd_init(struct sched_domain_topology_level *tl,
/*
* Convert topological properties into behaviour.
*/
- /* Don't attempt to spread across CPUs of different capacities. */
- if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+ /*
+ * Don't attempt to spread across CPUs of different capacities. An
+ * exception to this rule is a domain in which there are clusters of
+ * CPUs sharing a resource. Keep the flag in that case to balance load
+ * among them. The load balancer will prevent task migrations from
+ * high- to low-capacity CPUs.
+ */
+ if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
+ !(sd->child->flags & SD_CLUSTER))
sd->child->flags &= ~SD_PREFER_SIBLING;
if (sd->flags & SD_SHARE_CPUCAPACITY) {
--
2.43.0