[PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity
@ 2026-05-14 18:34 Ricardo Neri
  2026-05-14 18:34 ` [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
	Ricardo Neri

Hi,

This is v3 of the series. It has a few but important changes. Please
refer to the change log.

Cluster scheduling aims to maximize performance by spreading load across
clusters of CPUs that share mid-level resources [1]. It works well on
uniform systems, but it breaks down on topologies with big and small
cores arranged in clusters. As a result, it fails on several generations
of Intel processors already shipped and upcoming.

Consider the topology below of big (B) cores and clusters of small (s)
cores.
         ------   ------
         | B  |   | B  |   -----------------   -----------------
         |    |   |    |   | s | s | s | s |   | s | s | s | s |
         ------   ------   -----------------   -----------------
         | L2 |   | L2 |   |      L2       |   |       L2      |
         -------------------------------------------------------
         |                          L3                         |
         -------------------------------------------------------

On a partially busy system (one with idle CPUs; busy CPUs have one task
each), scheduling for asymmetric capacity ensures that misfit tasks land on
the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
be evenly spread among the small-CPU clusters. Today, this does not
happen.

Several issues in the load balancer prevent a small CPU in one cluster
from pulling tasks from another:

 a) update_sd_pick_busiest() may select a fully_busy group with higher
    per-CPU capacity as the busiest, preventing a subsequent fully_busy
    group of equal capacity from being correctly selected.
 b) Misfit-load statistics are used to identify tasks that would benefit
    from migrating to bigger CPUs. Accounting misfit load is pointless if
    the destination CPU is equally small, and it also blocks balancing
    between clusters.
 c) Due to b), groups that are truly has_spare or fully_busy get
    misclassified as misfit_task. update_sd_pick_busiest() then skips
    them, since a small destination CPU cannot help with misfit tasks.
 d) Once a busiest group has been identified, sched_balance_find_src_rq()
    will refuse to migrate tasks to CPUs of equal capacity, even when
    doing so is precisely what is required to balance small-CPU clusters.
 e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
    asymmetric capacity, preventing the balancer from equalizing load
    across sibling small-core clusters.

Together, these issues prevent cluster-level balancing on systems with
asymmetric CPU capacity.

This series addresses each problem and restores the intended behavior.
Details, rationale, and code changes are explained in each patch.

I tested these patches on Alder Lake (with Hyper-Threading disabled),
Lunar Lake and Panther Lake. I also tested configurations with only one
CPU online per cluster to ensure that systems without cluster topology
continue to behave as expected.

Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@gmail.com/ [1]

Changes in v3:
 - Patch 3: Reverted the inverted runtime capacity check. The inverted
   form resulted in migrations to CPUs of slightly lower capacity. Guarded
   the check for architectural capacity with the sched_cluster_active
   static key.
 - Patch 4: Expanded the patch description to explain the behavior of
   overloaded groups and low-capacity clusters with spare capacity.
 - Added Reviewed-by tags from Christian. Thanks!
 - Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@linux.intel.com

Changes in v2:
 - Patch 1: Rewrote patch description for clarity. Added a note
   clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
   exclusive. (Tim)
 - Patch 2: Fixed a bug where the capacity check inadvertently broke
   the mutual exclusion of the sched_reduced_capacity() path. Keep
   marking the root domain as overloaded when misfit tasks are present
   to allow bigger CPUs to help via newly idle balance. (sashiko)
   Fixed the description to state that capacity_greater() looks for
   differences of ~5% or more, not 20%. (Christian)
 - Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
   ignore runtime capacity variability. Inverted the capacity check.
   (Christian)
 - Patch 4: Reworded the patch description for clarity.
 - Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@linux.intel.com/

---
Ricardo Neri (4):
      sched/fair: Check CPU capacity before comparing group types during load balance
      sched/fair: Skip misfit load accounting when the destination CPU cannot help
      sched/fair: Allow load balancing between CPUs of identical capacity
      sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

 include/linux/sched/sd_flags.h |  3 ++-
 kernel/sched/fair.c            | 51 ++++++++++++++++++++++++++++++------------
 kernel/sched/topology.c        | 14 ++++++++++--
 3 files changed, 51 insertions(+), 17 deletions(-)
---
base-commit: 4450349dc665603eb4fab0fb31b7df5b55d6af9b
change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152

Best regards,
-- 
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
  2026-05-14 18:34 [PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
@ 2026-05-14 18:34 ` Ricardo Neri
  2026-05-15 12:29   ` Chen, Yu C
  2026-05-15 19:26   ` Tim Chen
  2026-05-14 18:34 ` [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 11+ messages in thread
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
	Ricardo Neri

update_sd_pick_busiest() may incorrectly select a fully_busy group as the
busiest group when its per-CPU capacity exceeds that of the destination
CPU. This happens because the type of busiest group is initialized to
group_has_spare and allows the fully_busy group to win the type comparison.

update_sd_pick_busiest() should not choose a candidate scheduling group
with at most one runnable task if its per-CPU capacity is greater than that
of the destination CPU. Such a check already exists, but it is done too
late: after the type comparison, preventing a subsequent fully_busy group
of equal per-CPU capacity from being correctly selected.

Move this check to occur before comparing group types.

Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
 * Added a Fixes tag. (Christian)
 * Added Reviewed-by tag from Christian. Thanks!

Changes in v2:
 * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
   exclusive. (Tim)
 * Kept parentheses around bitwise operators for clarity.
 * Rewrote patch description for clarity.
---
 kernel/sched/fair.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ebec186f982..e06e74d9ce0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10818,6 +10818,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
+	/*
+	 * Candidate sg has no more than one task per CPU and has higher
+	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
+	 * throughput. Maximize throughput, power/energy consequences are not
+	 * considered.
+	 *
+	 * Systems with SMT are unaffected, as asymmetric capacity is not set
+	 * in such cases.
+	 */
+	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
+	    (sgs->group_type <= group_fully_busy) &&
+	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -10920,17 +10934,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 	}
 
-	/*
-	 * Candidate sg has no more than one task per CPU and has higher
-	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
-	 * throughput. Maximize throughput, power/energy consequences are not
-	 * considered.
-	 */
-	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
-	    (sgs->group_type <= group_fully_busy) &&
-	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
-		return false;
-
 	return true;
 }
 

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
  2026-05-14 18:34 ` [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-05-15 12:29   ` Chen, Yu C
  2026-05-15 19:26   ` Tim Chen
  1 sibling, 0 replies; 11+ messages in thread
From: Chen, Yu C @ 2026-05-15 12:29 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, Len Brown, Dietmar Eggemann, Juri Lelli,
	Vincent Guittot, ricardo.neri, linux-kernel, Steven Rostedt,
	Ben Segall, Valentin Schneider, Mel Gorman, Tim C Chen,
	Christian Loehle, Peter Zijlstra, Ingo Molnar, Barry Song

On 5/15/2026 2:34 AM, Ricardo Neri wrote:

[ ... ]

> @@ -10818,6 +10818,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   	     sds->local_stat.group_type != group_has_spare))
>   		return false;
>   
> +	/*
> +	 * Candidate sg has no more than one task per CPU and has higher
> +	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> +	 * throughput. Maximize throughput, power/energy consequences are not
> +	 * considered.
> +	 *
> +	 * Systems with SMT are unaffected, as asymmetric capacity is not set
> +	 * in such cases.
> +	 */

Does "SMT" here imply that group_smt_balance is unaffected?
Regardless of whether we move the check earlier, this seems to
already be guaranteed by the fact that the check only applies
to sgs->group_type <= group_fully_busy, which does not include
group_smt_balance. In other words, SD_ASYM_CPUCAPACITY is not
the only gatekeeper.

Other than that, the change looks good to me,Reviewed-by: Chen Yu 
<yu.c.chen@intel.com>

thanks,
Chenyu
> +	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +	    (sgs->group_type <= group_fully_busy) &&
> +	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> +		return false;
> +


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance
  2026-05-14 18:34 ` [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
  2026-05-15 12:29   ` Chen, Yu C
@ 2026-05-15 19:26   ` Tim Chen
  1 sibling, 0 replies; 11+ messages in thread
From: Tim Chen @ 2026-05-15 19:26 UTC (permalink / raw)
  To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel

On Thu, 2026-05-14 at 11:34 -0700, Ricardo Neri wrote:
> update_sd_pick_busiest() may incorrectly select a fully_busy group as the
> busiest group when its per-CPU capacity exceeds that of the destination
> CPU. This happens because the type of busiest group is initialized to
> group_has_spare and allows the fully_busy group to win the type comparison.
> 
> update_sd_pick_busiest() should not choose a candidate scheduling group
> with at most one runnable task if its per-CPU capacity is greater than that
> of the destination CPU. Such a check already exists, but it is done too
> late: after the type comparison, preventing a subsequent fully_busy group
> of equal per-CPU capacity from being correctly selected.
> 
> Move this check to occur before comparing group types.

Looks good to me.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

> 
> Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
>  * Added a Fixes tag. (Christian)
>  * Added Reviewed-by tag from Christian. Thanks!
> 
> Changes in v2:
>  * Added a note clarifying that SMT and SD_ASYM_CPUCAPACITY are mutually
>    exclusive. (Tim)
>  * Kept parentheses around bitwise operators for clarity.
>  * Rewrote patch description for clarity.
> ---
>  kernel/sched/fair.c | 25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f982..e06e74d9ce0e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10818,6 +10818,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	     sds->local_stat.group_type != group_has_spare))
>  		return false;
>  
> +	/*
> +	 * Candidate sg has no more than one task per CPU and has higher
> +	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> +	 * throughput. Maximize throughput, power/energy consequences are not
> +	 * considered.
> +	 *
> +	 * Systems with SMT are unaffected, as asymmetric capacity is not set
> +	 * in such cases.
> +	 */
> +	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> +	    (sgs->group_type <= group_fully_busy) &&
> +	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> +		return false;
> +
>  	if (sgs->group_type > busiest->group_type)
>  		return true;
>  
> @@ -10920,17 +10934,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  		break;
>  	}
>  
> -	/*
> -	 * Candidate sg has no more than one task per CPU and has higher
> -	 * per-CPU capacity. Migrating tasks to less capable CPUs may harm
> -	 * throughput. Maximize throughput, power/energy consequences are not
> -	 * considered.
> -	 */
> -	if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
> -	    (sgs->group_type <= group_fully_busy) &&
> -	    (capacity_greater(sg->sgc->min_capacity, capacity_of(env->dst_cpu))))
> -		return false;
> -
>  	return true;
>  }
>  

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
  2026-05-14 18:34 [PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
  2026-05-14 18:34 ` [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
@ 2026-05-14 18:34 ` Ricardo Neri
  2026-05-15 12:49   ` Chen, Yu C
  2026-05-15 20:12   ` Tim Chen
  2026-05-14 18:34 ` [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
  2026-05-14 18:34 ` [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
  3 siblings, 2 replies; 11+ messages in thread
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
	Ricardo Neri

In domains with asymmetric capacity, identifying misfit load in a
scheduling group is not useful when the destination CPU cannot help (i.e.,
its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
such cases, it also prevents load balance among clusters of equal capacity
when CONFIG_SCHED_CLUSTER is enabled. This happens because
update_sd_pick_busiest() skips candidate groups of type misfit_task if the
destination CPU has similar capacity.

Skipping misfit load accounting in this situation allows the group to be
classified as has_spare or fully_busy and lets load balancing proceed. Keep
marking scheduling groups as overloaded when misfit tasks are present. The
sg_overloaded flag propagates to the root domain and allows bigger CPUs in
it to help via newly idle balance.

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
 * Added Reviewed-by tag from Christian. Thanks!

Changes in v2:
 * Moved the check of the destination CPU capacity inside the code block
   used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
   exclusion of the sched_reduced_capacity() path.
 * Keep marking the root domain as overloaded to allow bigger CPUs to
   help. (sashiko)
 * Fixed patch description to clarify that the capacity_greater() looks
   for differences of 5% or more. (Christian)
 * Reworded the patch description for clarity.
 * I did not include the Reviewed-by tag from Christian since the patch
   changed functionally.
---
 kernel/sched/fair.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e06e74d9ce0e..dcc02ceb44b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10749,10 +10749,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			continue;
 
 		if (sd_flags & SD_ASYM_CPUCAPACITY) {
-			/* Check for a misfit task on the cpu */
-			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
-				sgs->group_misfit_task_load = rq->misfit_task_load;
+			if (rq->misfit_task_load) {
+				/*
+				 * Always mark the domain overloaded so big CPUs
+				 * can pick up misfit tasks via newly idle
+				 * balance.
+				 */
 				*sg_overloaded = 1;
+
+				/*
+				 * Only account misfit load if @dst_cpu can
+				 * help; otherwise, the group may be classified
+				 * as misfit_task and update_sd_pick_busiest()
+				 * will skip it.
+				 */
+				if (capacity_greater(capacity_of(env->dst_cpu),
+						     group->sgc->max_capacity) &&
+				    (sgs->group_misfit_task_load < rq->misfit_task_load))
+					sgs->group_misfit_task_load = rq->misfit_task_load;
 			}
 		} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
 			/* Check for a task running on a CPU with reduced capacity */

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
  2026-05-14 18:34 ` [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-05-15 12:49   ` Chen, Yu C
  2026-05-15 20:12   ` Tim Chen
  1 sibling, 0 replies; 11+ messages in thread
From: Chen, Yu C @ 2026-05-15 12:49 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, Mel Gorman,
	Valentin Schneider, linux-kernel, Christian Loehle, Ben Segall,
	Steven Rostedt, Juri Lelli, Dietmar Eggemann, Tim C Chen,
	Vincent Guittot, Barry Song, Peter Zijlstra, Ingo Molnar

On 5/15/2026 2:34 AM, Ricardo Neri wrote:

> +			if (rq->misfit_task_load) {
> +				/*
> +				 * Always mark the domain overloaded so big CPUs
> +				 * can pick up misfit tasks via newly idle
> +				 * balance.
> +				 */
>   				*sg_overloaded = 1;

if (balancing_at_rd)
	*sg_overloaded = 1
to avoid confusing non-root domain(although in current code only
root domain checks this)

But the original logic does not have this check, it should be OK.

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help
  2026-05-14 18:34 ` [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
  2026-05-15 12:49   ` Chen, Yu C
@ 2026-05-15 20:12   ` Tim Chen
  1 sibling, 0 replies; 11+ messages in thread
From: Tim Chen @ 2026-05-15 20:12 UTC (permalink / raw)
  To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel

On Thu, 2026-05-14 at 11:34 -0700, Ricardo Neri wrote:
> In domains with asymmetric capacity, identifying misfit load in a
> scheduling group is not useful when the destination CPU cannot help (i.e.,
> its capacity exceeds the group's maximum CPU capacity by less than ~5%). In
> such cases, it also prevents load balance among clusters of equal capacity
> when CONFIG_SCHED_CLUSTER is enabled. This happens because
> update_sd_pick_busiest() skips candidate groups of type misfit_task if the
> destination CPU has similar capacity.
> 
> Skipping misfit load accounting in this situation allows the group to be
> classified as has_spare or fully_busy and lets load balancing proceed. Keep
> marking scheduling groups as overloaded when misfit tasks are present. The
> sg_overloaded flag propagates to the root domain and allows bigger CPUs in
> it to help via newly idle balance.
> 
> Reviewed-by: Christian Loehle <christian.loehle@arm.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
>  * Added Reviewed-by tag from Christian. Thanks!
> 
> Changes in v2:
>  * Moved the check of the destination CPU capacity inside the code block
>    used for SD_ASYM_CPUCAPACITY. v1 inadvertently broke the mutual
>    exclusion of the sched_reduced_capacity() path.
>  * Keep marking the root domain as overloaded to allow bigger CPUs to
>    help. (sashiko)
>  * Fixed patch description to clarify that the capacity_greater() looks
>    for differences of 5% or more. (Christian)
>  * Reworded the patch description for clarity.
>  * I did not include the Reviewed-by tag from Christian since the patch
>    changed functionally.
> ---
>  kernel/sched/fair.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e06e74d9ce0e..dcc02ceb44b5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10749,10 +10749,24 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			continue;
>  
>  		if (sd_flags & SD_ASYM_CPUCAPACITY) {
> -			/* Check for a misfit task on the cpu */
> -			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
> -				sgs->group_misfit_task_load = rq->misfit_task_load;
> +			if (rq->misfit_task_load) {
> +				/*
> +				 * Always mark the domain overloaded so big CPUs
> +				 * can pick up misfit tasks via newly idle
> +				 * balance.
> +				 */
>  				*sg_overloaded = 1;
> +
> +				/*
> +				 * Only account misfit load if @dst_cpu can
> +				 * help; otherwise, the group may be classified
> +				 * as misfit_task and update_sd_pick_busiest()
> +				 * will skip it.

You mean "sd_pick_busiest() will pick it" instead of "skip it" for misfit task
load balancing in the above comment?

Tim

> +				 */
> +				if (capacity_greater(capacity_of(env->dst_cpu),
> +						     group->sgc->max_capacity) &&
> +				    (sgs->group_misfit_task_load < rq->misfit_task_load))
> +					sgs->group_misfit_task_load = rq->misfit_task_load;
>  			}
>  		} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
>  			/* Check for a task running on a CPU with reduced capacity */

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
  2026-05-14 18:34 [PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
  2026-05-14 18:34 ` [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
  2026-05-14 18:34 ` [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
@ 2026-05-14 18:34 ` Ricardo Neri
  2026-05-15 15:16   ` Chen, Yu C
  2026-05-14 18:34 ` [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
  3 siblings, 1 reply; 11+ messages in thread
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
	Ricardo Neri

sched_balance_find_src_rq() avoids selecting a runqueue with a single
running task as busiest if doing so results in migrating the task to a
CPU with less than ~5% of extra capacity. It also unintentionally
prevents migrations between CPUs of identical capacity.

When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
clusters of CPUs with the same capacity. Allowing migration between CPUs
of identical capacity is necessary to meet this goal.

Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
runtime reductions due to side activity or thermal pressure. Guard this
check with the sched_cluster_active static key so that systems without
cluster topology are unaffected.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
 * Reverted the inverted capacity check; the inverted form incorrectly
   allows migrations to CPUs of slightly less capacity.
 * Guarded the check for architectural capacity with the
   sched_cluster_active static key.

Changes in v2:
 * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
   runtime variability.
 * Inverted the check for runtime capacity. (Christian)
 * Reworded patch description for clarity.
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcc02ceb44b5..d2a4c529f67f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11846,8 +11846,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 		 * eventually lead to active_balancing high->low capacity.
 		 * Higher per-CPU capacity is considered better than balancing
 		 * average load.
+		 *
+		 * CONFIG_SCHED_CLUSTER requires balancing load across clusters
+		 * of identical capacity. Use architectural capacity to ignore
+		 * runtime variability.
 		 */
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		    (!static_branch_unlikely(&sched_cluster_active) ||
+		     arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i)) &&
 		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
 		    nr_running == 1)
 			continue;

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity
  2026-05-14 18:34 ` [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-05-15 15:16   ` Chen, Yu C
  0 siblings, 0 replies; 11+ messages in thread
From: Chen, Yu C @ 2026-05-15 15:16 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, Len Brown, Tim C Chen, ricardo.neri,
	linux-kernel, Mel Gorman, Christian Loehle, Barry Song,
	Dietmar Eggemann, Vincent Guittot, Valentin Schneider, Ben Segall,
	Ingo Molnar, Juri Lelli, Peter Zijlstra, Steven Rostedt

On 5/15/2026 2:34 AM, Ricardo Neri wrote:
> sched_balance_find_src_rq() avoids selecting a runqueue with a single
> running task as busiest if doing so results in migrating the task to a
> CPU with less than ~5% of extra capacity. It also unintentionally
> prevents migrations between CPUs of identical capacity.
> 
> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
> clusters of CPUs with the same capacity. Allowing migration between CPUs
> of identical capacity is necessary to meet this goal.
> 
> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
> runtime reductions due to side activity or thermal pressure. Guard this
> check with the sched_cluster_active static key so that systems without
> cluster topology are unaffected.
> 
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
>   * Reverted the inverted capacity check; the inverted form incorrectly
>     allows migrations to CPUs of slightly less capacity.
>   * Guarded the check for architectural capacity with the
>     sched_cluster_active static key.
> 
> Changes in v2:
>   * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
>     runtime variability.
>   * Inverted the check for runtime capacity. (Christian)
>   * Reworded patch description for clarity.
> ---
>   kernel/sched/fair.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dcc02ceb44b5..d2a4c529f67f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11846,8 +11846,14 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>   		 * eventually lead to active_balancing high->low capacity.
>   		 * Higher per-CPU capacity is considered better than balancing
>   		 * average load.
> +		 *
> +		 * CONFIG_SCHED_CLUSTER requires balancing load across clusters
> +		 * of identical capacity. Use architectural capacity to ignore
> +		 * runtime variability.
>   		 */
>   		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
> +		    (!static_branch_unlikely(&sched_cluster_active) ||
> +		     arch_scale_cpu_capacity(env->dst_cpu) != arch_scale_cpu_capacity(i)) &&
>   		    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&

As stated in the commit log, the existing logic blocks task migrations 
between CPUs
with identical capacity, which is based on capacity_of() comparison 
rather than
arch_scale_cpu_capacity. Could I kindly ask why replacing
!capacity_greater(capacity_of(env->dst_cpu), capacity)
with
capacity_greater(capacity, capacity_of(env->dst_cpu))
does not achieve the expected effect?
This would theoretically enable migration among equal-capacity CPUs, and
in most cases e-cores in different clusters should return 0 thus
load balance is allowed.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
  2026-05-14 18:34 [PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
                   ` (2 preceding siblings ...)
  2026-05-14 18:34 ` [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
@ 2026-05-14 18:34 ` Ricardo Neri
  2026-05-15 20:21   ` Tim Chen
  3 siblings, 1 reply; 11+ messages in thread
From: Ricardo Neri @ 2026-05-14 18:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel,
	Ricardo Neri

Some topologies have scheduling domains that contain CPUs of asymmetric
capacity, grouped into two or more clusters of equal-capacity CPUs
sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
balanced across these resource-sharing clusters.

Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
load balancer that it should spread load among cluster siblings.

Checks for capacity in update_sd_pick_busiest() prevent migrations from
high- to low-capacity CPUs if a candidate group is not overloaded.

An effect of keeping the SD_PREFER_SIBLING in domains with asymmetric
capacity is that low-capacity clusters with spare capacity can now help
overloaded higher-capacity groups. This was already the case for single-CPU
groups (see calculate_imbalance() for domains with SD_SHARE_LLC).

Once the overloading condition disappears, misfit load will still be used
to move high-utilization tasks to bigger CPUs if they have spare capacity.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes in v3:
 * Updated documentation of SD_PREFER_SIBLING.
 * Expanded the patch description to explain the behavior when overloaded
   groups are involved.

Changes in v2:
 * Reworded the patch description for clarity.
 * Kept parentheses around bitwise operators for clarity.
---
 include/linux/sched/sd_flags.h |  3 ++-
 kernel/sched/topology.c        | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 42839cfa2778..42f74af83b8c 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
  * Prefer to place tasks in a sibling domain
  *
  * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
- * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
+ * flag, but cleared below domains with SD_ASYM_CPUCAPACITY if the domain does
+ * not have clusters of CPUs sharing cache.
  *
  * NEEDS_GROUPS: Load balancing flag.
  */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..a1d048344ea1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1723,8 +1723,18 @@ sd_init(struct sched_domain_topology_level *tl,
 	/*
 	 * Convert topological properties into behaviour.
 	 */
-	/* Don't attempt to spread across CPUs of different capacities. */
-	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
+	/*
+	 * Don't attempt to spread across CPUs of different capacities.
+	 *
+	 * If the domain has clusters of CPUs sharing L2 cache, keep the flag to
+	 * spread tasks across clusters of identical capacity. Checks in
+	 * update_sd_pick_busiest() prevent task migrations from high- to low-
+	 * capacity CPUs for non-overloaded groups. Migrations to a lower-
+	 * capacity CPU can happen if a higher-capacity group is overloaded and
+	 * a low-capacity cluster has spare capacity.
+	 */
+	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
+	    !(sd->child->flags & SD_CLUSTER))
 		sd->child->flags &= ~SD_PREFER_SIBLING;
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {

-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters
  2026-05-14 18:34 ` [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
@ 2026-05-15 20:21   ` Tim Chen
  0 siblings, 0 replies; 11+ messages in thread
From: Tim Chen @ 2026-05-15 20:21 UTC (permalink / raw)
  To: Ricardo Neri, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Chen Yu, Christian Loehle,
	Barry Song
  Cc: Rafael J. Wysocki, Len Brown, ricardo.neri, linux-kernel

On Thu, 2026-05-14 at 11:34 -0700, Ricardo Neri wrote:
> Some topologies have scheduling domains that contain CPUs of asymmetric
> capacity, grouped into two or more clusters of equal-capacity CPUs
> sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> balanced across these resource-sharing clusters.
> 
> Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> load balancer that it should spread load among cluster siblings.
> 
> Checks for capacity in update_sd_pick_busiest() prevent migrations from
> high- to low-capacity CPUs if a candidate group is not overloaded.
> 
> An effect of keeping the SD_PREFER_SIBLING in domains with asymmetric
> capacity is that low-capacity clusters with spare capacity can now help
> overloaded higher-capacity groups. This was already the case for single-CPU
> groups (see calculate_imbalance() for domains with SD_SHARE_LLC).
> 
> Once the overloading condition disappears, misfit load will still be used
> to move high-utilization tasks to bigger CPUs if they have spare capacity.

Looks good to me.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>

> 
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes in v3:
>  * Updated documentation of SD_PREFER_SIBLING.
>  * Expanded the patch description to explain the behavior when overloaded
>    groups are involved.
> 
> Changes in v2:
>  * Reworded the patch description for clarity.
>  * Kept parentheses around bitwise operators for clarity.
> ---
>  include/linux/sched/sd_flags.h |  3 ++-
>  kernel/sched/topology.c        | 14 ++++++++++++--
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> index 42839cfa2778..42f74af83b8c 100644
> --- a/include/linux/sched/sd_flags.h
> +++ b/include/linux/sched/sd_flags.h
> @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
>   * Prefer to place tasks in a sibling domain
>   *
>   * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY if the domain does
> + * not have clusters of CPUs sharing cache.
>   *
>   * NEEDS_GROUPS: Load balancing flag.
>   */
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d55..a1d048344ea1 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1723,8 +1723,18 @@ sd_init(struct sched_domain_topology_level *tl,
>  	/*
>  	 * Convert topological properties into behaviour.
>  	 */
> -	/* Don't attempt to spread across CPUs of different capacities. */
> -	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> +	/*
> +	 * Don't attempt to spread across CPUs of different capacities.
> +	 *
> +	 * If the domain has clusters of CPUs sharing L2 cache, keep the flag to
> +	 * spread tasks across clusters of identical capacity. Checks in
> +	 * update_sd_pick_busiest() prevent task migrations from high- to low-
> +	 * capacity CPUs for non-overloaded groups. Migrations to a lower-
> +	 * capacity CPU can happen if a higher-capacity group is overloaded and
> +	 * a low-capacity cluster has spare capacity.
> +	 */
> +	if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> +	    !(sd->child->flags & SD_CLUSTER))
>  		sd->child->flags &= ~SD_PREFER_SIBLING;
>  
>  	if (sd->flags & SD_SHARE_CPUCAPACITY) {

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-05-15 20:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-14 18:34 [PATCH v3 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-05-14 18:34 ` [PATCH v3 1/4] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
2026-05-15 12:29   ` Chen, Yu C
2026-05-15 19:26   ` Tim Chen
2026-05-14 18:34 ` [PATCH v3 2/4] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
2026-05-15 12:49   ` Chen, Yu C
2026-05-15 20:12   ` Tim Chen
2026-05-14 18:34 ` [PATCH v3 3/4] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
2026-05-15 15:16   ` Chen, Yu C
2026-05-14 18:34 ` [PATCH v3 4/4] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
2026-05-15 20:21   ` Tim Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox