* [PATCH v2 0/2] sched/fair: SMT-aware asymmetric CPU capacity
@ 2026-04-03 5:31 Andrea Righi
2026-04-03 5:31 ` [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-04-03 5:31 ` [PATCH 2/2] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
0 siblings, 2 replies; 14+ messages in thread
From: Andrea Righi @ 2026-04-03 5:31 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Shrikanth Hegde, linux-kernel
This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by introducing
SMT awareness.
= Problem =
Nominal per-logical-CPU capacity can overstate usable compute when an SMT
sibling is busy, because the physical core doesn't deliver its full nominal
capacity. As a result, several asym-capacity paths may pick high-capacity idle
CPUs that are not actually good destinations.
= Solution =
This patch set aligns those paths with a simple rule already used elsewhere:
when SMT is active, prefer fully idle cores and avoid treating partially idle
SMT siblings as full-capacity targets where that would mislead load balance.
Patch set summary:
- Prefer fully-idle SMT cores in asym-capacity idle selection: in the wakeup
fast path, extend select_idle_capacity() / asym_fits_cpu() so idle
selection can prefer CPUs on fully idle cores.
- Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
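For illustration only, the selection rule the series implements can be sketched
as a small standalone model (all toy_* names are hypothetical, this is not the
kernel code): among idle CPUs, one on a fully idle core wins over a nominally
higher-capacity CPU whose sibling is busy.

```c
#include <stdbool.h>

/*
 * Toy model: each logical CPU has a nominal capacity, an idle flag,
 * and the index of its SMT sibling.
 */
struct toy_cpu {
	unsigned long capacity;
	bool idle;
	int sibling;
};

/* A core is "fully idle" when the CPU and its SMT sibling are both idle. */
static bool toy_core_idle(const struct toy_cpu *cpus, int cpu)
{
	return cpus[cpu].idle && cpus[cpus[cpu].sibling].idle;
}

/*
 * Prefer an idle CPU on a fully idle core; fall back to an idle CPU with
 * a busy sibling only if no fully idle core exists. Within each class,
 * pick the highest nominal capacity.
 */
static int toy_pick_cpu(const struct toy_cpu *cpus, int nr)
{
	int best = -1, best_partial = -1;

	for (int i = 0; i < nr; i++) {
		if (!cpus[i].idle)
			continue;
		if (toy_core_idle(cpus, i)) {
			if (best < 0 || cpus[i].capacity > cpus[best].capacity)
				best = i;
		} else {
			if (best_partial < 0 ||
			    cpus[i].capacity > cpus[best_partial].capacity)
				best_partial = i;
		}
	}
	return best >= 0 ? best : best_partial;
}
```

Note how a fully idle core at capacity 980 is chosen over an idle sibling of a
busy core at capacity 1024, which is exactly the behavior the nominal-capacity
ordering alone gets wrong.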
This patch set has been tested on the new Vera Rubin platform, where SMT is
enabled and the firmware exposes small frequency variations (+/-~5%) as
differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
Without these patches, performance can drop by up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.
Alternative approaches have been evaluated, such as equalizing CPU capacities,
either by exposing uniform values via firmware or normalizing them in the kernel
by grouping CPUs within a small capacity window (+/-5%).
However, the SMT-aware SD_ASYM_CPUCAPACITY approach has shown better results so
far. Improving this policy also seems worthwhile in general, as future platforms
may enable SMT with asymmetric CPU topologies.
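For reference, the rejected normalization alternative amounts to something like
the following standalone sketch (illustrative only; toy_normalize_capacity() is
a hypothetical name and this is not what the series implements): capacities
within 5% of the maximum are collapsed to a single uniform value, so the
asymmetry disappears.

```c
/*
 * Illustrative only: collapse per-CPU capacities that fall within a 5%
 * window of the maximum down to the maximum, as in the rejected
 * "equalize capacities" alternative.
 */
static void toy_normalize_capacity(unsigned long *cap, int nr)
{
	unsigned long max = 0;

	for (int i = 0; i < nr; i++)
		if (cap[i] > max)
			max = cap[i];

	/* cap[i] >= 95% of max: treat as equal to max. */
	for (int i = 0; i < nr; i++)
		if (cap[i] * 100 >= max * 95)
			cap[i] = max;
}
```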
Performance results on Vera Rubin with SD_ASYM_CPUCAPACITY (mainline) vs
SD_ASYM_CPUCAPACITY + SMT:
- NVBLAS benchblas (one task / SMT core):
+---------------------------------+--------+
| Configuration | gflops |
+---------------------------------+--------+
| ASYM (mainline) + SIS_UTIL | 5478 |
| ASYM (mainline) + NO_SIS_UTIL | 5491 |
| | |
| NO ASYM + SIS_UTIL | 8912 |
| NO ASYM + NO_SIS_UTIL | 8978 |
| | |
| ASYM + SMT + SIS_UTIL | 9259 |
| ASYM + SMT + NO_SIS_UTIL | 9291 |
+---------------------------------+--------+
- DCPerf MediaWiki (all CPUs):
+---------------------------------+--------+--------+--------+--------+
| Configuration | rps | p50 | p95 | p99 |
+---------------------------------+--------+--------+--------+--------+
| ASYM (mainline) + SIS_UTIL | 7994 | 0.052 | 0.223 | 0.246 |
| ASYM (mainline) + NO_SIS_UTIL | 7993 | 0.052 | 0.221 | 0.245 |
| | | | | |
| NO ASYM + SIS_UTIL | 8113 | 0.067 | 0.184 | 0.225 |
| NO ASYM + NO_SIS_UTIL | 8093 | 0.068 | 0.184 | 0.223 |
| | | | | |
| ASYM + SMT + SIS_UTIL | 8129 | 0.076 | 0.149 | 0.188 |
| ASYM + SMT + NO_SIS_UTIL | 8138 | 0.076 | 0.148 | 0.186 |
+---------------------------------+--------+--------+--------+--------+
In the MediaWiki case SMT awareness is less impactful (compared to equalizing
CPU capacities), because all CPUs are in use for the majority of the run, but
it still seems to provide some benefit in reducing tail latency.
See also:
- https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
- https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
Changes in v2:
- Rework SMT awareness logic in select_idle_capacity() (K Prateek Nayak)
- Drop EAS and find_new_ilb() changes for now
- Link to v1: https://lore.kernel.org/all/20260326151211.1862600-1-arighi@nvidia.com
Andrea Righi (2):
sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 39 insertions(+), 5 deletions(-)
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-03 5:31 [PATCH v2 0/2] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-04-03 5:31 ` Andrea Righi
2026-04-07 11:21 ` Dietmar Eggemann
2026-04-17 9:39 ` Vincent Guittot
2026-04-03 5:31 ` [PATCH 2/2] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
1 sibling, 2 replies; 14+ messages in thread
From: Andrea Righi @ 2026-04-03 5:31 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Shrikanth Hegde, linux-kernel
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement.
However, when those CPUs belong to SMT cores, their effective capacity
can be much lower than the nominal capacity when the sibling thread is
busy: SMT siblings compete for shared resources, so a "high capacity"
CPU that is idle but whose sibling is busy does not deliver its full
capacity. This effective capacity reduction cannot be modeled by the
static capacity value alone.
When SMT is active, teach asym-capacity idle selection to treat a
logical CPU as a weaker target if its physical core is only partially
idle: select_idle_capacity() no longer returns the first idle CPU whose
static capacity fits the task when that CPU still has a busy sibling;
instead, it keeps scanning for an idle CPU on a fully-idle core and
falls back to partially-idle cores only if none qualify, using shifted
fit scores so that fully-idle cores win ties. asym_fits_cpu() applies
the same fully-idle core requirement when asym capacity and SMT are
both active.
This improves task placement, since partially-idle SMT siblings deliver
less than their nominal capacity. Favoring fully idle cores, when
available, can significantly enhance both throughput and wakeup latency
on systems with both SMT and CPU asymmetry.
No functional changes on systems with only asymmetric CPUs or only SMT.
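For illustration, the score shifting can be modeled as a standalone helper
(toy_shift_fits() is a hypothetical name; in the patch the logic lives inline
in select_idle_capacity()). 'fits' comes in from util_fits_cpu() as >0 (fits
fully), 0 (does not fit), or -1 (fits except for UCLAMP_MIN); lower output is
more preferred:

```c
#include <stdbool.h>
#include <limits.h>

/*
 * Toy model of the fit-score shifting: map (fits, preferred_core) onto a
 * single ordering where any preferred-core score beats any busy-core
 * score, and within each class the original fit ordering is preserved.
 */
static int toy_shift_fits(int fits, bool preferred_core)
{
	/* A fully fitting CPU on a preferred core is taken immediately. */
	if (fits > 0 && preferred_core)
		return INT_MIN;

	/*
	 * Fits, but the core has a busy sibling: rank it at -2, below
	 * "fits except UCLAMP_MIN" (-1) and "does not fit" (0).
	 */
	if (fits > 0)
		fits = -2;

	/*
	 * Shift preferred-core scores from [-1, 0] to [-4, -3], below
	 * every possible busy-core score.
	 */
	if (preferred_core)
		fits -= 3;

	return fits;
}
```

The resulting total order, best to worst, is: full fit on an idle core
(immediate return), then -4, -3, -2, -1, 0.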
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed1..7f09191014d18 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
@@ -7787,6 +7788,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
@@ -7795,7 +7797,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -7803,9 +7805,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
*/
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * fits > 0 implies we are not on a preferred core
+ * but the util fits CPU capacity. Set fits to -2 so
+ * the effective range becomes [-2, 0] where:
+ * 0 - does not fit
+ * -1 - fits with the exception of UCLAMP_MIN
+ * -2 - fits with the exception of preferred_core
+ */
+ else if (fits > 0)
+ fits = -2;
+
+ /*
+ * If we are on a preferred core, translate the range of fits
+ * of [-1, 0] to [-4, -3]. This ensures that an idle core
+ * is always given priority over (partially) busy core.
+ *
+ * A fully fitting idle core would have returned early and hence
+ * fits > 0 for preferred_core need not be dealt with.
+ */
+ if (preferred_core)
+ fits -= 3;
/*
- * First, select CPU which fits better (-1 being better than 0).
+ * First, select CPU which fits better (lower is more preferred).
* Then, select the one with best capacity at same level.
*/
if ((fits < best_fits) ||
@@ -7824,12 +7847,17 @@ static inline bool asym_fits_cpu(unsigned long util,
unsigned long util_max,
int cpu)
{
- if (sched_asym_cpucap_active())
+ if (sched_asym_cpucap_active()) {
/*
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
+ *
+ * When SMT is active, also require that the core has no busy
+ * siblings.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (!sched_smt_active() || is_core_idle(cpu)) &&
+ (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ }
return true;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/2] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-04-03 5:31 [PATCH v2 0/2] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-04-03 5:31 ` [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-04-03 5:31 ` Andrea Righi
1 sibling, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-04-03 5:31 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Christian Loehle, Koba Ko,
Felix Abecassis, Balbir Singh, Shrikanth Hegde, linux-kernel
When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling
is busy: the core does not deliver its full nominal capacity.
If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit expecting a capacity upgrade
we cannot actually provide.
No functional changes on systems with only asymmetric CPUs or only SMT.
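As a rough standalone model of the new guard (toy_can_pull_misfit() is a
hypothetical name, and a plain '>' stands in where the kernel applies the
capacity_greater() margin):

```c
#include <stdbool.h>

/*
 * Toy predicate mirroring the patch: refuse a misfit pull when SMT is
 * active and the destination CPU's core is not fully idle, or when the
 * destination does not offer more capacity than the source group's max.
 */
static bool toy_can_pull_misfit(bool smt_active, bool dst_core_idle,
				unsigned long dst_cap,
				unsigned long src_max_cap)
{
	/* A busy sibling reduces the core's effective capacity. */
	if (smt_active && !dst_core_idle)
		return false;

	return dst_cap > src_max_cap;
}
```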
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f09191014d18..7bebceb5ed9df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10607,10 +10607,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* We can use max_capacity here as reduction in capacity on some
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
+ *
+ * When SMT is active, only pull a misfit to dst_cpu if it is on a
+ * fully idle core; otherwise the effective capacity of the core is
+ * reduced and we may not actually provide more capacity than the
+ * source.
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type == group_misfit_task) &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ ((sched_smt_active() && !is_core_idle(env->dst_cpu)) ||
+ !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
--
2.53.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-03 5:31 ` [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-04-07 11:21 ` Dietmar Eggemann
2026-04-18 8:24 ` Andrea Righi
2026-04-17 9:39 ` Vincent Guittot
1 sibling, 1 reply; 14+ messages in thread
From: Dietmar Eggemann @ 2026-04-07 11:21 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
On 03.04.26 07:31, Andrea Righi wrote:
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement.
>
> However, when those CPUs belong to SMT cores, their effective capacity
> can be much lower than the nominal capacity when the sibling thread is
> busy: SMT siblings compete for shared resources, so a "high capacity"
> CPU that is idle but whose sibling is busy does not deliver its full
> capacity. This effective capacity reduction cannot be modeled by the
> static capacity value alone.
>
> When SMT is active, teach asym-capacity idle selection to treat a
> logical CPU as a weaker target if its physical core is only partially
> idle: select_idle_capacity() no longer returns on the first idle CPU
> whose static capacity fits the task when that CPU still has a busy
> sibling, it keeps scanning for an idle CPU on a fully-idle core and only
> if none qualify does it fall back to partially-idle cores, using shifted
> fit scores so fully-idle cores win ties; asym_fits_cpu() applies the
> same fully-idle core requirement when asym capacity and SMT are both
> active.
>
> This improves task placement, since partially-idle SMT siblings deliver
> less than their nominal capacity. Favoring fully idle cores, when
> available, can significantly enhance both throughput and wakeup latency
> on systems with both SMT and CPU asymmetry.
>
> No functional changes on systems with only asymmetric CPUs or only SMT.
>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf948db905ed1..7f09191014d18 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
Somehow I miss a:
if (prefers_idle_core)
set_idle_cores(target, false)
The one in select_idle_sibling() -> select_idle_cpu() isn't executed
anymore with ASYM_CPUCAPACITY.
Another thing is that sic() iterates over the CPUs of sd_asym_cpucapacity
whereas the idle core thing lives in sd_llc/sd_llc_shared. Both sd's are
probably the same on your system.
> unsigned long task_util, util_min, util_max, best_cap = 0;
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
[...]
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-03 5:31 ` [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-04-07 11:21 ` Dietmar Eggemann
@ 2026-04-17 9:39 ` Vincent Guittot
2026-04-18 6:02 ` Andrea Righi
1 sibling, 1 reply; 14+ messages in thread
From: Vincent Guittot @ 2026-04-17 9:39 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
On Fri, 3 Apr 2026 at 07:37, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement.
>
> However, when those CPUs belong to SMT cores, their effective capacity
> can be much lower than the nominal capacity when the sibling thread is
> busy: SMT siblings compete for shared resources, so a "high capacity"
> CPU that is idle but whose sibling is busy does not deliver its full
> capacity. This effective capacity reduction cannot be modeled by the
> static capacity value alone.
>
> When SMT is active, teach asym-capacity idle selection to treat a
> logical CPU as a weaker target if its physical core is only partially
> idle: select_idle_capacity() no longer returns on the first idle CPU
> whose static capacity fits the task when that CPU still has a busy
> sibling, it keeps scanning for an idle CPU on a fully-idle core and only
> if none qualify does it fall back to partially-idle cores, using shifted
> fit scores so fully-idle cores win ties; asym_fits_cpu() applies the
> same fully-idle core requirement when asym capacity and SMT are both
> active.
>
> This improves task placement, since partially-idle SMT siblings deliver
> less than their nominal capacity. Favoring fully idle cores, when
> available, can significantly enhance both throughput and wakeup latency
> on systems with both SMT and CPU asymmetry.
>
> No functional changes on systems with only asymmetric CPUs or only SMT.
>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++----
> 1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf948db905ed1..7f09191014d18 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> @@ -7787,6 +7788,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> @@ -7795,7 +7797,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -7803,9 +7805,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> */
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
> + /*
> + * fits > 0 implies we are not on a preferred core
> + * but the util fits CPU capacity. Set fits to -2 so
> + * the effective range becomes [-2, 0] where:
> + * 0 - does not fit
> + * -1 - fits with the exception of UCLAMP_MIN
> + * -2 - fits with the exception of preferred_core
> + */
> + else if (fits > 0)
> + fits = -2;
> +
> + /*
> + * If we are on a preferred core, translate the range of fits
> + * of [-1, 0] to [-4, -3]. This ensures that an idle core
> + * is always given priority over (partially) busy core.
> + *
> + * A fully fitting idle core would have returned early and hence
> + * fits > 0 for preferred_core need not be dealt with.
> + */
> + if (preferred_core)
> + fits -= 3;
>
> /*
> - * First, select CPU which fits better (-1 being better than 0).
> + * First, select CPU which fits better (lower is more preferred).
> * Then, select the one with best capacity at same level.
> */
> if ((fits < best_fits) ||
You have to clear idle_core if you were looking for an idle core but
didn't find one while looping over the CPUs.
You need the following to clear idle core:
@@ -7739,6 +7739,11 @@ select_idle_capacity(struct task_struct *p,
struct sched_domain *sd, int target)
}
}
+ /* The range [-4, -3] implies at least one idle core; the values above
+ * imply that we didn't find one while looping over the CPUs */
+ if (prefers_idle_core && fits > -3)
+ set_idle_cores(target, false);
+
return best_cpu;
}
> @@ -7824,12 +7847,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> unsigned long util_max,
> int cpu)
> {
> - if (sched_asym_cpucap_active())
> + if (sched_asym_cpucap_active()) {
> /*
> * Return true only if the cpu fully fits the task requirements
> * which include the utilization and the performance hints.
> + *
> + * When SMT is active, also require that the core has no busy
> + * siblings.
> */
> - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + return (!sched_smt_active() || is_core_idle(cpu)) &&
> + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + }
>
> return true;
> }
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-17 9:39 ` Vincent Guittot
@ 2026-04-18 6:02 ` Andrea Righi
2026-04-19 10:20 ` Vincent Guittot
0 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2026-04-18 6:02 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
Hi Vincent,
On Fri, Apr 17, 2026 at 11:39:21AM +0200, Vincent Guittot wrote:
> On Fri, 3 Apr 2026 at 07:37, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement.
> >
> > However, when those CPUs belong to SMT cores, their effective capacity
> > can be much lower than the nominal capacity when the sibling thread is
> > busy: SMT siblings compete for shared resources, so a "high capacity"
> > CPU that is idle but whose sibling is busy does not deliver its full
> > capacity. This effective capacity reduction cannot be modeled by the
> > static capacity value alone.
> >
> > When SMT is active, teach asym-capacity idle selection to treat a
> > logical CPU as a weaker target if its physical core is only partially
> > idle: select_idle_capacity() no longer returns on the first idle CPU
> > whose static capacity fits the task when that CPU still has a busy
> > sibling, it keeps scanning for an idle CPU on a fully-idle core and only
> > if none qualify does it fall back to partially-idle cores, using shifted
> > fit scores so fully-idle cores win ties; asym_fits_cpu() applies the
> > same fully-idle core requirement when asym capacity and SMT are both
> > active.
> >
> > This improves task placement, since partially-idle SMT siblings deliver
> > less than their nominal capacity. Favoring fully idle cores, when
> > available, can significantly enhance both throughput and wakeup latency
> > on systems with both SMT and CPU asymmetry.
> >
> > No functional changes on systems with only asymmetric CPUs or only SMT.
> >
> > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bf948db905ed1..7f09191014d18 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> > unsigned long task_util, util_min, util_max, best_cap = 0;
> > int fits, best_fits = 0;
> > int cpu, best_cpu = -1;
> > @@ -7787,6 +7788,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > unsigned long cpu_cap = capacity_of(cpu);
> >
> > if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > @@ -7795,7 +7797,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > + if (fits > 0 && preferred_core)
> > return cpu;
> > /*
> > * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > @@ -7803,9 +7805,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > */
> > else if (fits < 0)
> > cpu_cap = get_actual_cpu_capacity(cpu);
> > + /*
> > + * fits > 0 implies we are not on a preferred core
> > + * but the util fits CPU capacity. Set fits to -2 so
> > + * the effective range becomes [-2, 0] where:
> > + * 0 - does not fit
> > + * -1 - fits with the exception of UCLAMP_MIN
> > + * -2 - fits with the exception of preferred_core
> > + */
> > + else if (fits > 0)
> > + fits = -2;
> > +
> > + /*
> > + * If we are on a preferred core, translate the range of fits
> > + * of [-1, 0] to [-4, -3]. This ensures that an idle core
> > + * is always given priority over (partially) busy core.
> > + *
> > + * A fully fitting idle core would have returned early and hence
> > + * fits > 0 for preferred_core need not be dealt with.
> > + */
> > + if (preferred_core)
> > + fits -= 3;
> >
> > /*
> > - * First, select CPU which fits better (-1 being better than 0).
> > + * First, select CPU which fits better (lower is more preferred).
> > * Then, select the one with best capacity at same level.
> > */
> > if ((fits < best_fits) ||
>
> > You have to clear idle_core if you were looking for an idle core but
> > didn't find one while looping over the CPUs.
>
> You need the following to clear idle core:
>
> @@ -7739,6 +7739,11 @@ select_idle_capacity(struct task_struct *p,
> struct sched_domain *sd, int target)
> }
> }
>
> > + /* The range [-4, -3] implies at least one idle core; the values above
> > + * imply that we didn't find one while looping over the CPUs */
> + if (prefers_idle_core && fits > -3)
> + set_idle_cores(target, false);
> +
> return best_cpu;
> }
That makes sense! But it should be best_fits instead of fits, right?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-07 11:21 ` Dietmar Eggemann
@ 2026-04-18 8:24 ` Andrea Righi
2026-04-20 5:49 ` K Prateek Nayak
0 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2026-04-18 8:24 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
Hi Dietmar,
On Tue, Apr 07, 2026 at 01:21:16PM +0200, Dietmar Eggemann wrote:
>
>
> On 03.04.26 07:31, Andrea Righi wrote:
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement.
> >
> > However, when those CPUs belong to SMT cores, their effective capacity
> > can be much lower than the nominal capacity when the sibling thread is
> > busy: SMT siblings compete for shared resources, so a "high capacity"
> > CPU that is idle but whose sibling is busy does not deliver its full
> > capacity. This effective capacity reduction cannot be modeled by the
> > static capacity value alone.
> >
> > When SMT is active, teach asym-capacity idle selection to treat a
> > logical CPU as a weaker target if its physical core is only partially
> > idle: select_idle_capacity() no longer returns on the first idle CPU
> > whose static capacity fits the task when that CPU still has a busy
> > sibling, it keeps scanning for an idle CPU on a fully-idle core and only
> > if none qualify does it fall back to partially-idle cores, using shifted
> > fit scores so fully-idle cores win ties; asym_fits_cpu() applies the
> > same fully-idle core requirement when asym capacity and SMT are both
> > active.
> >
> > This improves task placement, since partially-idle SMT siblings deliver
> > less than their nominal capacity. Favoring fully idle cores, when
> > available, can significantly enhance both throughput and wakeup latency
> > on systems with both SMT and CPU asymmetry.
> >
> > No functional changes on systems with only asymmetric CPUs or only SMT.
> >
> > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++----
> > 1 file changed, 32 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bf948db905ed1..7f09191014d18 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
>
> Somehow I miss a:
>
> if (prefers_idle_core)
> set_idle_cores(target, false)
>
> The one in select_idle_sibling() -> select_idle_cpu() isn't executed
> anymore with ASYM_CPUCAPACITY.
>
Right, we need to add this, as also pointed out by Vincent.
>
> Another thing is that sic() iterates over the CPUs of sd_asym_cpucapacity
> whereas the idle core thing lives in sd_llc/sd_llc_shared. Both sd's are
> probably the same on your system.
Hm... they're the same on my machine, but if they're different, clearing
has_idle_cores here is not right and it might lead to false positives. We should
only clear it when both domains span the same CPUs (or just check whether
sd_asym_cpucapacity and sd_llc are the same).
However, if they're not the same, I'm not sure exactly what we should do...
maybe ignore has_idle_cores and always do the scan for now?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-18 6:02 ` Andrea Righi
@ 2026-04-19 10:20 ` Vincent Guittot
0 siblings, 0 replies; 14+ messages in thread
From: Vincent Guittot @ 2026-04-19 10:20 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
K Prateek Nayak, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
On Sat, 18 Apr 2026 at 08:02, Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Vincent,
>
> On Fri, Apr 17, 2026 at 11:39:21AM +0200, Vincent Guittot wrote:
> > On Fri, 3 Apr 2026 at 07:37, Andrea Righi <arighi@nvidia.com> wrote:
> > >
> > > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > > different per-core frequencies), the wakeup path uses
> > > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > > for better task placement.
> > >
> > > However, when those CPUs belong to SMT cores, their effective capacity
> > > can be much lower than the nominal capacity when the sibling thread is
> > > busy: SMT siblings compete for shared resources, so a "high capacity"
> > > CPU that is idle but whose sibling is busy does not deliver its full
> > > capacity. This effective capacity reduction cannot be modeled by the
> > > static capacity value alone.
> > >
> > > When SMT is active, teach asym-capacity idle selection to treat a
> > > logical CPU as a weaker target if its physical core is only partially
> > > idle: select_idle_capacity() no longer returns on the first idle CPU
> > > whose static capacity fits the task when that CPU still has a busy
> > > sibling; instead, it keeps scanning for an idle CPU on a fully-idle
> > > core, and only if none qualify does it fall back to partially-idle
> > > cores, using shifted fit scores so fully-idle cores win ties.
> > > asym_fits_cpu() applies the same fully-idle core requirement when asym
> > > capacity and SMT are both active.
> > >
> > > This improves task placement, since partially-idle SMT siblings deliver
> > > less than their nominal capacity. Favoring fully idle cores, when
> > > available, can significantly enhance both throughput and wakeup latency
> > > on systems with both SMT and CPU asymmetry.
> > >
> > > No functional changes on systems with only asymmetric CPUs or only SMT.
> > >
> > > Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> > > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Cc: Christian Loehle <christian.loehle@arm.com>
> > > Cc: Koba Ko <kobak@nvidia.com>
> > > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > > ---
> > > kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++----
> > > 1 file changed, 32 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index bf948db905ed1..7f09191014d18 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > static int
> > > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > {
> > > + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> > > unsigned long task_util, util_min, util_max, best_cap = 0;
> > > int fits, best_fits = 0;
> > > int cpu, best_cpu = -1;
> > > @@ -7787,6 +7788,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > util_max = uclamp_eff_value(p, UCLAMP_MAX);
> > >
> > > for_each_cpu_wrap(cpu, cpus, target) {
> > > + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > > unsigned long cpu_cap = capacity_of(cpu);
> > >
> > > if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > > @@ -7795,7 +7797,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > >
> > > /* This CPU fits with all requirements */
> > > - if (fits > 0)
> > > + if (fits > 0 && preferred_core)
> > > return cpu;
> > > /*
> > > * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > > @@ -7803,9 +7805,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > */
> > > else if (fits < 0)
> > > cpu_cap = get_actual_cpu_capacity(cpu);
> > > + /*
> > > + * fits > 0 implies we are not on a preferred core
> > > + * but the util fits CPU capacity. Set fits to -2 so
> > > + * the effective range becomes [-2, 0] where:
> > > + * 0 - does not fit
> > > + * -1 - fits with the exception of UCLAMP_MIN
> > > + * -2 - fits with the exception of preferred_core
> > > + */
> > > + else if (fits > 0)
> > > + fits = -2;
> > > +
> > > + /*
> > > + * If we are on a preferred core, translate the range of fits
> > > + * of [-1, 0] to [-4, -3]. This ensures that an idle core
> > > + * is always given priority over a (partially) busy core.
> > > + *
> > > + * A fully fitting idle core would have returned early and hence
> > > + * fits > 0 for preferred_core need not be dealt with.
> > > + */
> > > + if (preferred_core)
> > > + fits -= 3;
> > >
> > > /*
> > > - * First, select CPU which fits better (-1 being better than 0).
> > > + * First, select CPU which fits better (lower is more preferred).
> > > * Then, select the one with best capacity at same level.
> > > */
> > > if ((fits < best_fits) ||
> >
> > You have to clear idle_core if you were looking of an idle core but
> > didn't find one while looping on CPUs.
> >
> > You need the following to clear idle core:
> >
> > @@ -7739,6 +7739,11 @@ select_idle_capacity(struct task_struct *p,
> > struct sched_domain *sd, int target)
> > }
> > }
> >
> > + /* The range [-4, -3] implies at least one idle core, the values above
> > + * imply that we didn't find anyone while looping CPUs */
> > + if (prefers_idle_core && fits > -3)
> > + set_idle_cores(target, false);
> > +
> > return best_cpu;
> > }
>
> That makes sense! But it should be best_fits instead of fits, right?
yes, of course
>
> Thanks,
> -Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-18 8:24 ` Andrea Righi
@ 2026-04-20 5:49 ` K Prateek Nayak
2026-04-20 8:36 ` Andrea Righi
0 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2026-04-20 5:49 UTC (permalink / raw)
To: Andrea Righi, Dietmar Eggemann
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/18/2026 1:54 PM, Andrea Righi wrote:
>>> @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> static int
>>> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>>> {
>>> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
>>
>> Somehow I miss a:
>>
>> if (prefers_idle_core)
>> set_idle_cores(target, false)
>>
>> The one in select_idle_sibling() -> select_idle_cpu() isn't executed
>> anymore with ASYM_CPUCAPACITY.
>>
>
> Right, we need to add this, as also pointed out by Vincent.
>
>>
>> Another thing is that sic() iterates over the CPUs of sd_asym_cpucapacity
>> whereas the idle core thing lives in sd_llc/sd_llc_shared. Both sd's are
>> probably the same on your system.
>
> > Hm... they're the same on my machine, but if they're different, clearing
> > has_idle_cores here is not right and it might lead to false positives. We should
> > only clear it when both domains span the same CPUs (or just check whether
> > sd_asym_cpucapacity and sd_llc are the same).
>
> However, if they're not the same, I'm not sure exactly what we should do...
> maybe ignore has_idle_cores and always do the scan for now?
With your changes, only two places actually care about test_idle_cores():
- select_idle_capacity()
- select_idle_cpu()
If we go into select_idle_capacity(), we don't do select_idle_cpu() so
the two paths are mutually exclusive.
In nohz_balancer_kick(), if we find sd_asym_cpucapacity, we simply
don't care about the sd_llc_shared->nr_busy_cpus during balancing, so
that begs the question whether we can simply track idle_cores at
sd_asym_cpucapacity for these systems.
The following is only build-tested for now, but I'll try to spoof asym
cpucapacity on my system and check whether it holds up:
(On top of tip:sched/core at sched-core-2026-04-13 + this series)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78f2d2c4e24f..509146c486ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7913,7 +7913,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
- if (sched_feat(SIS_UTIL)) {
+ if (sched_feat(SIS_UTIL) && sd->shared) {
/*
* Increment because !--nr is the condition to stop scan.
*
@@ -12856,7 +12856,8 @@ static void set_cpu_sd_state_busy(int cpu)
goto unlock;
sd->nohz_idle = 0;
- atomic_inc(&sd->shared->nr_busy_cpus);
+ if (sd->shared)
+ atomic_inc(&sd->shared->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
@@ -12885,7 +12886,8 @@ static void set_cpu_sd_state_idle(int cpu)
goto unlock;
sd->nohz_idle = 1;
- atomic_dec(&sd->shared->nr_busy_cpus);
+ if (sd->shared)
+ atomic_dec(&sd->shared->nr_busy_cpus);
unlock:
rcu_read_unlock();
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5847b83d9d55..45b919b39c7d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -680,19 +680,38 @@ static void update_top_cache_domain(int cpu)
int id = cpu;
int size = 1;
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
+ if (sd) {
+ /*
+ * If sd_asym_cpucapacity exists,
+ * the shared object should exist too.
+ */
+ WARN_ON_ONCE(!sd->shared);
+ sds = sd->shared;
+ }
+
+ rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
+
sd = highest_flag_domain(cpu, SD_SHARE_LLC);
if (sd) {
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
- /* If sd_llc exists, sd_llc_shared should exist too. */
- WARN_ON_ONCE(!sd->shared);
- sds = sd->shared;
+ /*
+ * If sd_asym_cpucapacity doesn't exist,
+ * sd_llc_shared must have a sd->shared linked.
+ */
+ if (!sds) {
+ WARN_ON_ONCE(!sd->shared);
+ sds = sd->shared;
+ }
}
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;
per_cpu(sd_llc_id, cpu) = id;
+
+ /* TODO: Rename sd_llc_shared to fit the new role. */
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -711,9 +730,6 @@ static void update_top_cache_domain(int cpu)
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
-
- sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
- rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
/*
@@ -2650,6 +2666,15 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
}
}
+static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
+{
+ int sd_id = cpumask_first(sched_domain_span(sd));
+
+ sd->shared = *per_cpu_ptr(d->sds, sd_id);
+ atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
+ atomic_inc(&sd->shared->ref);
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2712,16 +2737,33 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (!sd)
continue;
+ /*
+ * In case of ASYM_CPUCAPACITY, attach sd->shared to
+ * sd_asym_cpucapacity for wakeup stat tracking.
+ *
+ * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
+ * always has more than one group else it is prone to
+ * degeneration.
+ */
+ if (has_asym) {
+ while (sd && !(sd->flags & SD_ASYM_CPUCAPACITY_FULL))
+ sd = sd->parent;
+
+ init_sched_domain_shared(&d, sd);
+ }
+
/* First, find the topmost SD_SHARE_LLC domain */
+ sd = *per_cpu_ptr(d.sd, i);
while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
sd = sd->parent;
if (sd->flags & SD_SHARE_LLC) {
- int sd_id = cpumask_first(sched_domain_span(sd));
-
- sd->shared = *per_cpu_ptr(d.sds, sd_id);
- atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
- atomic_inc(&sd->shared->ref);
+ /*
+ * Initialize the sd->shared for SD_SHARE_LLC if
+ * SD_ASYM_CPUCAPACITY_FULL hasn't claimed it already.
+ */
+ if (!has_asym)
+ init_sched_domain_shared(&d, sd);
/*
* In presence of higher domains, adjust the
---
I still have one question: can the first SD_ASYM_CPUCAPACITY_FULL be set
at an SD_NUMA level?
We'll need to deal with overlapping domains then, but it seems like it
could be possible with weird cpusets :-(
But in that case, do we even want to search CPUs outside the NUMA node in
select_idle_capacity()? I don't think anything currently stops this, but
I might be wrong.
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-20 5:49 ` K Prateek Nayak
@ 2026-04-20 8:36 ` Andrea Righi
2026-04-20 9:39 ` K Prateek Nayak
0 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2026-04-20 8:36 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Mon, Apr 20, 2026 at 11:19:27AM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 4/18/2026 1:54 PM, Andrea Righi wrote:
> >>> @@ -7774,6 +7774,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>> static int
> >>> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >>> {
> >>> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> >>
> >> Somehow I miss a:
> >>
> >> if (prefers_idle_core)
> >> set_idle_cores(target, false)
> >>
> >> The one in select_idle_sibling() -> select_idle_cpu() isn't executed
> >> anymore with ASYM_CPUCAPACITY.
> >>
> >
> > Right, we need to add this, as also pointed out by Vincent.
> >
> >>
> >> Another thing is that sic() iterates over the CPUs of sd_asym_cpucapacity
> >> whereas the idle core thing lives in sd_llc/sd_llc_shared. Both sd's are
> >> probably the same on your system.
> >
> > Hm... they're the same on my machine, but if they're different, clearing
> > has_idle_cores here is not right and it might lead to false positives. We should
> > only clear it when both domains span the same CPUs (or just check whether
> > sd_asym_cpucapacity and sd_llc are the same).
> >
> > However, if they're not the same, I'm not sure exactly what we should do...
> > maybe ignore has_idle_cores and always do the scan for now?
>
> With your changes, only two places actually care about test_idle_cores():
>
> - select_idle_capacity()
> - select_idle_cpu()
>
> If we go into select_idle_capacity(), we don't do select_idle_cpu() so
> the two paths are mutually exclusive.
>
> In nohz_balancer_kick(), if we find sd_asym_cpucapacity, we simply
> don't care about the sd_llc_shared->nr_busy_cpus during balancing, so
> that begs the question whether we can simply track idle_cores at
> sd_asym_cpucapacity for these systems.
Yeah, makes sense to me. I was planning to test something similar, so thanks for
sharing this patch. :) I'll give it a try and report back.
>
> The following is only build-tested for now, but I'll try to spoof asym
> cpucapacity on my system and check whether it holds up:
>
> (On top of tip:sched/core at sched-core-2026-04-13 + this series)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 78f2d2c4e24f..509146c486ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7913,7 +7913,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
>
> - if (sched_feat(SIS_UTIL)) {
> + if (sched_feat(SIS_UTIL) && sd->shared) {
> /*
> * Increment because !--nr is the condition to stop scan.
> *
> @@ -12856,7 +12856,8 @@ static void set_cpu_sd_state_busy(int cpu)
> goto unlock;
> sd->nohz_idle = 0;
>
> - atomic_inc(&sd->shared->nr_busy_cpus);
> + if (sd->shared)
> + atomic_inc(&sd->shared->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> @@ -12885,7 +12886,8 @@ static void set_cpu_sd_state_idle(int cpu)
> goto unlock;
> sd->nohz_idle = 1;
>
> - atomic_dec(&sd->shared->nr_busy_cpus);
> + if (sd->shared)
> + atomic_dec(&sd->shared->nr_busy_cpus);
> unlock:
> rcu_read_unlock();
> }
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5847b83d9d55..45b919b39c7d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -680,19 +680,38 @@ static void update_top_cache_domain(int cpu)
> int id = cpu;
> int size = 1;
>
> + sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> + if (sd) {
> + /*
> + * If sd_asym_cpucapacity exists,
> + * the shared object should exist too.
> + */
> + WARN_ON_ONCE(!sd->shared);
> + sds = sd->shared;
> + }
> +
> + rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> +
> sd = highest_flag_domain(cpu, SD_SHARE_LLC);
> if (sd) {
> id = cpumask_first(sched_domain_span(sd));
> size = cpumask_weight(sched_domain_span(sd));
>
> - /* If sd_llc exists, sd_llc_shared should exist too. */
> - WARN_ON_ONCE(!sd->shared);
> - sds = sd->shared;
> + /*
> + * If sd_asym_cpucapacity doesn't exist,
> + * sd_llc_shared must have a sd->shared linked.
> + */
> + if (!sds) {
> + WARN_ON_ONCE(!sd->shared);
> + sds = sd->shared;
> + }
> }
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
> per_cpu(sd_llc_id, cpu) = id;
> +
> + /* TODO: Rename sd_llc_shared to fit the new role. */
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> sd = lowest_flag_domain(cpu, SD_CLUSTER);
> @@ -711,9 +730,6 @@ static void update_top_cache_domain(int cpu)
>
> sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
> rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
> -
> - sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY_FULL);
> - rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
> }
>
> /*
> @@ -2650,6 +2666,15 @@ static void adjust_numa_imbalance(struct sched_domain *sd_llc)
> }
> }
>
> +static void init_sched_domain_shared(struct s_data *d, struct sched_domain *sd)
> +{
> + int sd_id = cpumask_first(sched_domain_span(sd));
> +
> + sd->shared = *per_cpu_ptr(d->sds, sd_id);
> + atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> + atomic_inc(&sd->shared->ref);
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2712,16 +2737,33 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (!sd)
> continue;
>
> + /*
> + * In case of ASYM_CPUCAPACITY, attach sd->shared to
> + * sd_asym_cpucapacity for wakeup stat tracking.
> + *
> + * XXX: This assumes SD_ASYM_CPUCAPACITY_FULL domain
> + * always has more than one group else it is prone to
> + * degeneration.
> + */
> + if (has_asym) {
> + while (sd && !(sd->flags & SD_ASYM_CPUCAPACITY_FULL))
> + sd = sd->parent;
> +
> + init_sched_domain_shared(&d, sd);
> + }
> +
> /* First, find the topmost SD_SHARE_LLC domain */
> + sd = *per_cpu_ptr(d.sd, i);
> while (sd->parent && (sd->parent->flags & SD_SHARE_LLC))
> sd = sd->parent;
>
> if (sd->flags & SD_SHARE_LLC) {
> - int sd_id = cpumask_first(sched_domain_span(sd));
> -
> - sd->shared = *per_cpu_ptr(d.sds, sd_id);
> - atomic_set(&sd->shared->nr_busy_cpus, sd->span_weight);
> - atomic_inc(&sd->shared->ref);
> + /*
> + * Initialize the sd->shared for SD_SHARE_LLC if
> + * SD_ASYM_CPUCAPACITY_FULL hasn't claimed it already.
> + */
> + if (!has_asym)
> + init_sched_domain_shared(&d, sd);
>
> /*
> * In presence of higher domains, adjust the
> ---
>
> I still have one question: can the first SD_ASYM_CPUCAPACITY_FULL be set
> at an SD_NUMA level?
>
> We'll need to deal with overlapping domains then, but it seems like it
> could be possible with weird cpusets :-(
>
> But in that case, do we even want to search CPUs outside the NUMA node in
> select_idle_capacity()? I don't think anything currently stops this, but
> I might be wrong.
My $0.02 on this.
In theory it could happen with unusual topologies or constrained cpusets,
although it should be quite rare. That said, select_idle_capacity() already
operates on the span of sd_asym_cpucapacity, so if that domain crosses NUMA
boundaries, we're already scanning across NUMA today. This patch doesn't
fundamentally alter this behavior.
If we think cross-NUMA scanning is undesirable, that's probably a more general
issue in select_idle_capacity() rather than something specific to this change,
and we can address it later.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-20 8:36 ` Andrea Righi
@ 2026-04-20 9:39 ` K Prateek Nayak
2026-04-20 21:42 ` Andrea Righi
0 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2026-04-20 9:39 UTC (permalink / raw)
To: Andrea Righi
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
Hello Andrea,
On 4/20/2026 2:06 PM, Andrea Righi wrote:
>> With your changes, only two places actually care about test_idle_cores():
>>
>> - select_idle_capacity()
>> - select_idle_cpu()
>>
>> If we go into select_idle_capacity(), we don't do select_idle_cpu() so
>> the two paths are mutually exclusive.
>>
>> In nohz_balancer_kick(), if we find sd_asym_cpucapacity, we simply
>> don't care about the sd_llc_shared->nr_busy_cpus during balancing, so
>> that begs the question whether we can simply track idle_cores at
>> sd_asym_cpucapacity for these systems.
>
> Yeah, makes sense to me. I was planning to test something similar, so thanks for
> sharing this patch. :) I'll give it a try and report back.
Thank you for taking it for a spin!
[..snip..]
>> I still have one question: can the first SD_ASYM_CPUCAPACITY_FULL be set
>> at an SD_NUMA level?
>>
>> We'll need to deal with overlapping domains then, but it seems like it
>> could be possible with weird cpusets :-(
>>
>> But in that case, do we even want to search CPUs outside the NUMA node in
>> select_idle_capacity()? I don't think anything currently stops this, but
>> I might be wrong.
>
> My $0.02 on this.
>
> In theory it could happen with unusual topologies or constrained cpusets,
> although it should be quite rare. That said, select_idle_capacity() already
> operates on the span of sd_asym_cpucapacity, so if that domain crosses NUMA
> boundaries, we're already scanning across NUMA today. This patch doesn't
> fundamentally alter this behavior.
Ack! I was just thinking out loud from the topology standpoint, since
sd->shared is not designed to handle overlapping domains the way
sg->sgc does, but we can probably figure out some way to make it work.
Using the ring topology example from topology.c:
0 ----- 1
| |
| |
| |
3 ----- 2
Consider NUMA-1 below gets the SD_ASYM_CPUCAPACITY_FULL flag:
NUMA-2 0-3 0-3 0-3 0-3
groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2}
NUMA-1 0-1,3 0-2 1-3 0,2-3
groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3}
NUMA-0 0 1 2 3
The "sd->shared" assignments at NUMA-1 will put the first, second, and
last domains under the same "shared" object by today's logic, since the
first CPU in their span is the same although their spans are slightly
different.
The third will be standalone since the first CPU of the domain span
will be different.
> If we think cross-NUMA scanning is undesirable, that's probably a more general
> issue in select_idle_capacity() rather than something specific to this change,
> and we can address it later.
Ack! That is a tangential problem, but it may require some looking into
if we decide to extend the sd->shared object to SD_NUMA domains. I guess
if anyone is running such a setup, this bit will be the least of their
worries.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-20 9:39 ` K Prateek Nayak
@ 2026-04-20 21:42 ` Andrea Righi
2026-04-21 9:01 ` Andrea Righi
0 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2026-04-20 21:42 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Mon, Apr 20, 2026 at 03:09:17PM +0530, K Prateek Nayak wrote:
> On 4/20/2026 2:06 PM, Andrea Righi wrote:
> >> With your changes, only two places actually care about test_idle_cores():
> >>
> >> - select_idle_capacity()
> >> - select_idle_cpu()
> >>
> >> If we go into select_idle_capacity(), we don't do select_idle_cpu() so
> >> the two paths are mutually exclusive.
> >>
> >> In nohz_balancer_kick(), if we find sd_asym_cpucapacity, we simply
> >> don't care about the sd_llc_shared->nr_busy_cpus during balancing, so
> >> that begs the question whether we can simply track idle_cores at
> >> sd_asym_cpucapacity for these systems.
> >
> > Yeah, makes sense to me. I was planning to test something similar, so thanks for
> > sharing this patch. :) I'll give it a try and report back.
>
> Thank you for taking it for a spin!
I've tested this extensively on Vera and haven't encountered any issues.
Performance-wise I get similar results (with vs. without), which was expected, as
sd_llc matches sd_asym_cpucapacity in my case.
>
> >> I still have one question: can the first SD_ASYM_CPUCAPACITY_FULL be set
> >> at an SD_NUMA level?
> >>
> >> We'll need to deal with overlapping domains then, but it seems like it
> >> could be possible with weird cpusets :-(
> >>
> >> But in that case, do we even want to search CPUs outside the NUMA node in
> >> select_idle_capacity()? I don't think anything currently stops this, but
> >> I might be wrong.
> >
> > My $0.02 on this.
> >
> > In theory it could happen with unusual topologies or constrained cpusets,
> > although it should be quite rare. That said, select_idle_capacity() already
> > operates on the span of sd_asym_cpucapacity, so if that domain crosses NUMA
> > boundaries, we're already scanning across NUMA today. This patch doesn't
> > fundamentally alter this behavior.
>
> Ack! I was just thinking out loud from the topology standpoint, since
> sd->shared is not designed to handle overlapping domains the way
> sg->sgc does, but we can probably figure out some way to make it work.
>
> Using the ring topology example from topology.c:
>
> 0 ----- 1
> | |
> | |
> | |
> 3 ----- 2
>
> Consider NUMA-1 below gets the SD_ASYM_CPUCAPACITY_FULL flag:
>
> NUMA-2 0-3 0-3 0-3 0-3
> groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2}
>
> NUMA-1 0-1,3 0-2 1-3 0,2-3
> groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3}
>
> NUMA-0 0 1 2 3
>
>
> The "sd->shared" assignments at NUMA-1 will put the first, second, and
> last domains under the same "shared" object by today's logic, since the
> first CPU in their span is the same although their spans are slightly
> different.
>
> The third will be standalone since the first CPU of the domain span
> will be different.
Yeah, makes sense. I'm wondering if we should attach the shared blob to
sd_asym_cpucapacity only when asym is a non-overlapping domain, and
otherwise fall back to sd_llc and, in that case, ignore has_idle_cores in
select_idle_capacity(). This might not be the best in terms of efficiency on
those exotic topologies, but it'd eliminate the overlap/aliasing risk while
still being correct. What do you think?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-20 21:42 ` Andrea Righi
@ 2026-04-21 9:01 ` Andrea Righi
2026-04-21 9:35 ` Andrea Righi
0 siblings, 1 reply; 14+ messages in thread
From: Andrea Righi @ 2026-04-21 9:01 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
Hi Prateek,
On Mon, Apr 20, 2026 at 11:42:23PM +0200, Andrea Righi wrote:
...
> > >> I still have one question: can the first SD_ASYM_CPUCAPACITY_FULL be set
> > >> at an SD_NUMA level?
> > >>
> > >> We'll need to deal with overlapping domains then, but it seems like it
> > >> could be possible with weird cpusets :-(
> > >>
> > >> But in that case, do we even want to search CPUs outside the NUMA node in
> > >> select_idle_capacity()? I don't think anything currently stops this, but
> > >> I might be wrong.
> > >
> > > My $0.02 on this.
> > >
> > > In theory it could happen with unusual topologies or constrained cpusets,
> > > although it should be quite rare. That said, select_idle_capacity() already
> > > operates on the span of sd_asym_cpucapacity, so if that domain crosses NUMA
> > > boundaries, we're already scanning across NUMA today. This patch doesn't
> > > fundamentally alter this behavior.
> >
> > Ack! I was just thinking out loud from the topology standpoint, since
> > sd->shared is not designed to handle overlapping domains the way
> > sg->sgc does, but we can probably figure out some way to make it work.
> >
> > Using the ring topology example from topology.c:
> >
> > 0 ----- 1
> > | |
> > | |
> > | |
> > 3 ----- 2
> >
> > Consider NUMA-1 below gets the SD_ASYM_CPUCAPACITY_FULL flag:
> >
> > NUMA-2 0-3 0-3 0-3 0-3
> > groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2}
> >
> > NUMA-1 0-1,3 0-2 1-3 0,2-3
> > groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3}
> >
> > NUMA-0 0 1 2 3
> >
> >
> > The "sd->shared" assignments at NUMA-1 will put the first, second, and
> > last domains under the same "shared" object by today's logic, since the
> > first CPU in their span is the same although their spans are slightly
> > different.
> >
> > The third will be standalone since the first CPU of the domain span
> > will be different.
>
> Yeah, makes sense. I'm wondering if we should attach the shared blob to
> sd_asym_cpucapacity only when asym is a non-overlapping domain, and
> otherwise fall back to sd_llc and, in that case, ignore has_idle_cores in
> select_idle_capacity(). This might not be the best in terms of efficiency on
> those exotic topologies, but it'd eliminate the overlap/aliasing risk while
> still being correct. What do you think?
I slightly changed your patch, adding this logic on top. I'll send an updated
patch series so it's easier to review and comment on.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-04-21 9:01 ` Andrea Righi
@ 2026-04-21 9:35 ` Andrea Righi
0 siblings, 0 replies; 14+ messages in thread
From: Andrea Righi @ 2026-04-21 9:35 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, Shrikanth Hegde, linux-kernel
On Tue, Apr 21, 2026 at 11:01:41AM +0200, Andrea Righi wrote:
> Hi Prateek,
>
> On Mon, Apr 20, 2026 at 11:42:23PM +0200, Andrea Righi wrote:
> ...
> > > >> I still have one question: Can first SD_ASYM_CPUCAPACITY_FULL be set at
> > > >> a SD_NUMA?
> > > >>
> > > >> We'll need to deal with overlapping domains then but seems like it could
> > > >> be possible with weird cpusets :-(
> > > >>
> > > >> But in that case, do we even want to search CPUs outside the NUMA in
> > > >> select_idle_capacity()? I don't think anything stops this currently but
> > > >> I might be wrong.
> > > >
> > > > My $0.02 on this.
> > > >
> > > > In theory it could happen with unusual topologies or constrained cpusets,
> > > > although it should be quite rare. That said, select_idle_capacity() already
> > > > operates on the span of sd_asym_cpucapacity, so if that domain crosses NUMA
> > > > boundaries, we're already scanning across NUMA today. This patch doesn't
> > > > fundamentally alter this behavior.
> > >
> > > Ack! I was just thinking aloud from the topology standpoint, since
> > > sd->shared is not designed to handle overlapping domains the way
> > > sg->sgc does, but we can probably figure out some way to make it work.
> > >
> > > Using the ring topology example from topology.c:
> > >
> > > 0 ----- 1
> > > | |
> > > | |
> > > | |
> > > 3 ----- 2
> > >
> > > Consider NUMA-1 below gets the SD_ASYM_CPUCAPACITY_FULL flag:
> > >
> > > NUMA-2 0-3 0-3 0-3 0-3
> > > groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2}
> > >
> > > NUMA-1 0-1,3 0-2 1-3 0,2-3
> > > groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3}
> > >
> > > NUMA-0 0 1 2 3
> > >
> > >
> > > The "sd->shared" assignments at NUMA-1 will put first, second, and the
> > > last domain in the same "shared" range by today's logic since the first
> > > CPU in their span is the same although their spans are slightly
> > > different.
> > >
> > > The third will be standalone since the first CPU of the domain span
> > > will be different.
> >
> > Yeah, makes sense. I'm wondering if we should attach the shared blob to
> > sd_asym_cpucapacity only when asym is a non-overlapping domain, and
> > otherwise fall back to sd_llc and, in that case, ignore has_idle_cores in
> > select_idle_capacity(). This might not be the best in terms of efficiency on
> > those exotic topologies, but it'd eliminate the overlap/aliasing risk while
> > still being correct. What do you think?
>
> I slightly modified your patch, adding this logic on top; I'll send an updated
> patch series so it's easier to review/comment.
Actually... while preparing the series I realized that in select_idle_capacity()
we may end up clearing the has_idle_cores hint even when the failure is due to
affinity constraints (no fitting CPU in the allowed cpumask), not only when no
fully idle core is found in the system, and this can lead to false
has_idle_cores hints.
At this point I'm wondering if it's better to just ignore the has_idle_cores
hint completely in the SMT + asym-cpu-capacity scenario (which would also
simplify the exotic topology cases).
I did some quick tests with this on Vera and I'm getting pretty much the same
performance results. Opinions? Am I missing something?
Thanks,
-Andrea
End of thread [~2026-04-21 9:35 UTC | newest]
Thread overview: 14+ messages
2026-04-03 5:31 [PATCH v2 0/2] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-04-03 5:31 ` [PATCH 1/2] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-04-07 11:21 ` Dietmar Eggemann
2026-04-18 8:24 ` Andrea Righi
2026-04-20 5:49 ` K Prateek Nayak
2026-04-20 8:36 ` Andrea Righi
2026-04-20 9:39 ` K Prateek Nayak
2026-04-20 21:42 ` Andrea Righi
2026-04-21 9:01 ` Andrea Righi
2026-04-21 9:35 ` Andrea Righi
2026-04-17 9:39 ` Vincent Guittot
2026-04-18 6:02 ` Andrea Righi
2026-04-19 10:20 ` Vincent Guittot
2026-04-03 5:31 ` [PATCH 2/2] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi