* [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
@ 2026-03-26 15:02 Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
` (7 more replies)
0 siblings, 8 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
introducing SMT awareness.
= Problem =
Nominal per-logical-CPU capacity can overstate usable compute when an SMT
sibling is busy, because the physical core doesn't deliver its full nominal
capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
that are not actually good destinations.
= Proposed Solution =
This patch set aligns those paths with a simple rule already used
elsewhere: when SMT is active, prefer fully idle cores and avoid treating
partially idle SMT siblings as full-capacity targets where that would
mislead load balance.
Patch set summary:
- [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
Prefer fully-idle SMT cores in asym-capacity idle selection. In the
wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
idle selection can prefer CPUs on fully idle cores, with a safe fallback.
- [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
Provided for consistency with PATCH 1/4.
- [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
consistency with PATCH 1/4. I've also tested with/without
/proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
noticed any regression.
- [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
When choosing the housekeeping CPU that runs the idle load balancer,
prefer an idle CPU on a fully idle core so migrated work lands where
effective capacity is available.
The change is consistent with the same "avoid CPUs with a busy
sibling" logic and shows some benefit on Vera, but it could have a
negative impact on other systems. I'm including it for completeness
(feedback is appreciated).
This patch set has been tested on the new NVIDIA Vera Rubin platform, where
SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
Without these patches, performance can drop by up to ~2x with CPU-intensive
workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
account for busy SMT siblings.
Alternative approaches have been evaluated, such as equalizing CPU
capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
normalizing them in the kernel by grouping CPUs within a small capacity
window (+/-5%) [1][2], or enabling asym packing [3].
However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
results so far. Improving this policy also seems worthwhile in general, as
other platforms in the future may enable SMT with asymmetric CPU
topologies.
[1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
[2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
[3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
Andrea Righi (4):
sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
kernel/sched/fair.c | 163 +++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/topology.c | 9 ---
2 files changed, 147 insertions(+), 25 deletions(-)
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-27 10:44 ` K Prateek Nayak
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
` (6 subsequent siblings)
7 siblings, 2 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
different per-core frequencies), the wakeup path uses
select_idle_capacity() and prioritizes idle CPUs with higher capacity
for better task placement. However, when those CPUs belong to SMT cores,
their effective capacity can be much lower than the nominal capacity
when the sibling thread is busy: SMT siblings compete for shared
resources, so a "high capacity" CPU that is idle but whose sibling is
busy does not deliver its full capacity. This effective capacity
reduction cannot be modeled by the static capacity value alone.
Introduce SMT awareness in the asym-capacity idle selection policy: when
SMT is active prefer fully-idle SMT cores over partially-idle ones. A
two-phase selection first tries only CPUs on fully idle cores, then
falls back to any idle CPU if none fit.
Prioritizing fully-idle SMT cores yields better task placement because
the effective capacity of partially-idle SMT cores is reduced; always
preferring fully-idle cores when available leads to more accurate capacity
usage on task wakeup.
On an SMT system with asymmetric CPU capacities, SMT-aware idle
selection has been shown to improve throughput by around 15-18% for
CPU-bound workloads running a number of tasks equal to the number of
SMT cores.
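The two-phase policy can be sketched as a small userspace model (all names
here are illustrative, not the kernel implementation; "fitting" is reduced
to a plain capacity-vs-utilization check in place of util_fits_cpu()):

```c
#include <stdbool.h>

/*
 * Simplified model of the two-phase selection described above.
 * idle[i]: logical CPU i is idle; core_idle[i]: all SMT siblings of
 * CPU i are idle; cap[i]: nominal capacity. Hypothetical helpers,
 * not kernel code.
 */
static int pick_cpu(int nr, const bool *idle, const bool *core_idle,
		    const unsigned long *cap, unsigned long task_util)
{
	int best = -1, best_core = -1;
	unsigned long best_cap = 0, best_cap_core = 0;

	for (int cpu = 0; cpu < nr; cpu++) {
		if (!idle[cpu])
			continue;
		/* Phase 1: a fitting CPU on a fully idle core wins outright. */
		if (core_idle[cpu] && cap[cpu] >= task_util)
			return cpu;
		/* Otherwise track the best candidate in each class. */
		if (core_idle[cpu] && cap[cpu] > best_cap_core) {
			best_cap_core = cap[cpu];
			best_core = cpu;
		} else if (!core_idle[cpu] && cap[cpu] > best_cap) {
			best_cap = cap[cpu];
			best = cpu;
		}
	}
	/* Phase 2: prefer any idle-core candidate, else any idle CPU. */
	return best_core >= 0 ? best_core : best;
}
```

A CPU that fits but sits next to a busy sibling is only used when no
fully idle core is available, which mirrors the fallback described above.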
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 75 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d57c02e82f3a1..9a95628669851 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7940,14 +7940,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
+ *
+ * When @prefer_idle_cores is true (asym + SMT and idle cores exist), prefer
+ * CPUs on fully-idle cores over partially-idle ones in a single pass: track
+ * the best candidate among idle-core CPUs and the best among any idle CPU,
+ * then return the idle-core candidate if found, else the best any-idle.
*/
static int
-select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
+ bool prefer_idle_cores)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
- int fits, best_fits = 0;
- int cpu, best_cpu = -1;
+ unsigned long task_util, util_min, util_max, best_cap = 0, best_cap_core = 0;
+ int fits, best_fits = 0, best_fits_core = 0;
+ int cpu, best_cpu = -1, best_cpu_core = -1;
struct cpumask *cpus;
+ bool on_idle_core;
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
@@ -7962,16 +7969,58 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!choose_idle_cpu(cpu, p))
continue;
+ on_idle_core = is_core_idle(cpu);
+ if (prefer_idle_cores && !on_idle_core) {
+ /* Track best among any idle CPU for fallback */
+ fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+ if (fits > 0) {
+ /*
+ * Full fit: strictly better than fits 0 / -1;
+ * among several, prefer higher capacity.
+ */
+ if (best_cpu < 0 || best_fits <= 0 ||
+ (best_fits > 0 && cpu_cap > best_cap)) {
+ best_cap = cpu_cap;
+ best_cpu = cpu;
+ best_fits = fits;
+ }
+ continue;
+ }
+ if (best_fits > 0)
+ continue;
+ if (fits < 0)
+ cpu_cap = get_actual_cpu_capacity(cpu);
+ if ((fits < best_fits) ||
+ ((fits == best_fits) && (cpu_cap > best_cap))) {
+ best_cap = cpu_cap;
+ best_cpu = cpu;
+ best_fits = fits;
+ }
+ continue;
+ }
+
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
- return cpu;
+ if (fits > 0) {
+ if (prefer_idle_cores && on_idle_core)
+ return cpu;
+ if (!prefer_idle_cores)
+ return cpu;
+ /*
+ * Prefer idle cores: record and keep looking for
+ * idle-core fit.
+ */
+ best_cap = cpu_cap;
+ best_cpu = cpu;
+ best_fits = fits;
+ continue;
+ }
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
* Look for the CPU with best capacity.
*/
- else if (fits < 0)
+ if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
/*
@@ -7984,8 +8033,17 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
best_cpu = cpu;
best_fits = fits;
}
+ if (prefer_idle_cores && on_idle_core &&
+ ((fits < best_fits_core) ||
+ ((fits == best_fits_core) && (cpu_cap > best_cap_core)))) {
+ best_cap_core = cpu_cap;
+ best_cpu_core = cpu;
+ best_fits_core = fits;
+ }
}
+ if (prefer_idle_cores && best_cpu_core >= 0)
+ return best_cpu_core;
return best_cpu;
}
@@ -7994,12 +8052,17 @@ static inline bool asym_fits_cpu(unsigned long util,
unsigned long util_max,
int cpu)
{
- if (sched_asym_cpucap_active())
+ if (sched_asym_cpucap_active()) {
/*
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
+ *
+ * When SMT is active, also require that the core has no busy
+ * siblings.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (!sched_smt_active() || is_core_idle(cpu)) &&
+ (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ }
return true;
}
@@ -8097,8 +8160,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* capacity path.
*/
if (sd) {
- i = select_idle_capacity(p, sd, target);
- return ((unsigned)i < nr_cpumask_bits) ? i : target;
+ i = select_idle_capacity(p, sd, target,
+ sched_smt_active() && test_idle_cores(target));
+ return ((unsigned int)i < nr_cpumask_bits) ? i : target;
}
}
--
2.53.0
* [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
` (5 subsequent siblings)
7 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task,
capacity_of(dst_cpu) can overstate available compute if the SMT sibling is
busy: the core does not deliver its full nominal capacity.
If SMT is active and dst_cpu is not on a fully idle core, skip this
destination so we do not migrate a misfit task expecting a capacity
upgrade that we cannot actually provide.
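The guard can be modeled in isolation as a predicate (illustrative only;
the real check compares capacity_of(dst_cpu) against sg->sgc->max_capacity
via capacity_greater(), which applies a margin):

```c
#include <stdbool.h>

/*
 * Simplified model of the misfit-pull guard described above; the
 * helper and its arguments are illustrative, not kernel APIs.
 */
static bool may_pull_misfit(bool smt_active, bool dst_core_idle,
			    unsigned long dst_cap, unsigned long src_max_cap)
{
	/* A busy SMT sibling reduces dst's effective capacity: reject. */
	if (smt_active && !dst_core_idle)
		return false;
	/*
	 * Otherwise require a genuine capacity upgrade over the source
	 * (the kernel uses capacity_greater(), which adds a margin).
	 */
	return dst_cap > src_max_cap;
}
```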
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a95628669851..f8deaaa5bfc85 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10819,10 +10819,16 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* We can use max_capacity here as reduction in capacity on some
* CPUs in the group should either be possible to resolve
* internally or be covered by avg_load imbalance (eventually).
+ *
+ * When SMT is active, only pull a misfit to dst_cpu if it is on a
+ * fully idle core; otherwise the effective capacity of the core is
+ * reduced and we may not actually provide more capacity than the
+ * source.
*/
if ((env->sd->flags & SD_ASYM_CPUCAPACITY) &&
(sgs->group_type == group_misfit_task) &&
- (!capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
+ ((sched_smt_active() && !is_core_idle(env->dst_cpu)) ||
+ !capacity_greater(capacity_of(env->dst_cpu), sg->sgc->max_capacity) ||
sds->local_stat.group_type != group_has_spare))
return false;
--
2.53.0
* [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
` (4 subsequent siblings)
7 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Drop the sched_is_eas_possible() guard that rejects EAS whenever SMT is
active. This allows EAS to be enabled and perf-domain setup to succeed on
SD_ASYM_CPUCAPACITY topologies with SMT enabled.
Moreover, apply to find_energy_efficient_cpu() the same SMT-aware
preference as the non-EAS wakeup path: when SMT is active and there is a
fully-idle core in the relevant domain, prefer max-spare-capacity
candidates on fully-idle cores. Otherwise, fall back to the prior
behavior, which also considers partially-idle SMT siblings.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/topology.c | 9 --------
2 files changed, 48 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8deaaa5bfc85..593a89f688679 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8658,13 +8658,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv_task_busy_time(&eenv, p, prev_cpu);
for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_actual_cap, util;
long prev_spare_cap = -1, max_spare_cap = -1;
+ long max_spare_cap_fallback = -1;
unsigned long rq_util_min, rq_util_max;
unsigned long cur_delta, base_energy;
- int max_spare_cap_cpu = -1;
+ int max_spare_cap_cpu = -1, max_spare_cap_cpu_fallback = -1;
int fits, max_fits = -1;
+ int max_fits_fallback = -1;
+ bool prefer_idle_cores;
if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask))
continue;
@@ -8676,6 +8678,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.cpu_cap = cpu_actual_cap;
eenv.pd_cap = 0;
+ prefer_idle_cores = sched_smt_active() && test_idle_cores(prev_cpu);
+
for_each_cpu(cpu, cpus) {
struct rq *rq = cpu_rq(cpu);
@@ -8687,6 +8691,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
+ if (prefer_idle_cores && cpu != prev_cpu && !is_core_idle(cpu))
+ goto fallback;
+
+ unsigned long util_min = p_util_min, util_max = p_util_max;
+
util = cpu_util(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);
@@ -8733,6 +8742,43 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
max_spare_cap_cpu = cpu;
max_fits = fits;
}
+
+fallback:
+ if (!prefer_idle_cores || cpu == prev_cpu || is_core_idle(cpu))
+ continue;
+
+ util_min = p_util_min;
+ util_max = p_util_max;
+ util = cpu_util(cpu, p, cpu, 0);
+ cpu_cap = capacity_of(cpu);
+
+ if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
+ rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
+ rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+
+ util_min = max(rq_util_min, p_util_min);
+ util_max = max(rq_util_max, p_util_max);
+ }
+
+ fits = util_fits_cpu(util, util_min, util_max, cpu);
+ if (!fits)
+ continue;
+
+ lsub_positive(&cpu_cap, util);
+
+ if ((fits > max_fits_fallback) ||
+ ((fits == max_fits_fallback) &&
+ ((long)cpu_cap > max_spare_cap_fallback))) {
+ max_spare_cap_fallback = cpu_cap;
+ max_spare_cap_cpu_fallback = cpu;
+ max_fits_fallback = fits;
+ }
+ }
+
+ if (max_spare_cap_cpu < 0 && max_spare_cap_cpu_fallback >= 0) {
+ max_spare_cap = max_spare_cap_fallback;
+ max_spare_cap_cpu = max_spare_cap_cpu_fallback;
+ max_fits = max_fits_fallback;
}
if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c85f5552..cb060fe56aec1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -232,15 +232,6 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
return false;
}
- /* EAS definitely does *not* handle SMT */
- if (sched_smt_active()) {
- if (sched_debug()) {
- pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
- cpumask_pr_args(cpu_mask));
- }
- return false;
- }
-
if (!arch_scale_freq_invariant()) {
if (sched_debug()) {
pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
--
2.53.0
* [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (2 preceding siblings ...)
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
@ 2026-03-26 15:02 ` Andrea Righi
2026-03-27 8:45 ` Vincent Guittot
2026-03-27 13:44 ` Shrikanth Hegde
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
` (3 subsequent siblings)
7 siblings, 2 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-26 15:02 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
When choosing which idle housekeeping CPU runs the idle load balancer,
prefer one on a fully idle core if SMT is active, so balance can migrate
work onto a CPU that still offers full effective capacity. Fall back to
any idle candidate if none qualify.
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/fair.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 593a89f688679..a1ee21f7b32f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
* - When one of the busy CPUs notices that there may be an idle rebalancing
* needed, they will kick the idle load balancer, which then does idle
* load balancing for all the idle CPUs.
+ *
+ * - When SMT is active, prefer a CPU on a fully idle core as the ILB
+ * target, so that when it runs balance it becomes the destination CPU
+ * and can accept migrated tasks with full effective capacity.
*/
static inline int find_new_ilb(void)
{
const struct cpumask *hk_mask;
- int ilb_cpu;
+ int ilb_cpu, fallback = -1;
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
@@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
if (ilb_cpu == smp_processor_id())
continue;
+#ifdef CONFIG_SCHED_SMT
+ if (!idle_cpu(ilb_cpu))
+ continue;
+
+ if (fallback < 0)
+ fallback = ilb_cpu;
+
+ if (!sched_smt_active() || is_core_idle(ilb_cpu))
+ return ilb_cpu;
+#else
if (idle_cpu(ilb_cpu))
return ilb_cpu;
+#endif
}
- return -1;
+ return fallback;
}
/*
--
2.53.0
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (3 preceding siblings ...)
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
@ 2026-03-26 16:33 ` Christian Loehle
2026-03-27 6:52 ` Andrea Righi
2026-03-27 16:31 ` Shrikanth Hegde
` (2 subsequent siblings)
7 siblings, 1 reply; 42+ messages in thread
From: Christian Loehle @ 2026-03-26 16:33 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On 3/26/26 15:02, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
>
> = Problem =
>
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
>
> = Proposed Solution =
>
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
>
> Patch set summary:
>
> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>
> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>
> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>
> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> Provided for consistency with PATCH 1/4.
>
> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>
> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> consistency with PATCH 1/4. I've also tested with/without
> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> noticed any regression.
There's a lot more to unpack, but just to confirm, Vera doesn't have an EM, right?
There's no EAS with it?
(To be more precise, CPPC should bail out of building an artifical EM if there's no
or only one efficiency class:
drivers/cpufreq/cppc_cpufreq.c:
if (bitmap_weight(used_classes, 256) <= 1) {
pr_debug("Efficiency classes are all equal (=%d). "
"No EM registered", class);
return;
}
This is the case, right?
> [snip]
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
@ 2026-03-27 6:52 ` Andrea Righi
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 6:52 UTC (permalink / raw)
To: Christian Loehle
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, Mar 26, 2026 at 04:33:08PM +0000, Christian Loehle wrote:
> On 3/26/26 15:02, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> >
> > = Problem =
> >
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> >
> > = Proposed Solution =
> >
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> >
> > Patch set summary:
> >
> > - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> >
> > Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> > wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> > idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> >
> > - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> >
> > Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> > Provided for consistency with PATCH 1/4.
> >
> > - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> >
> > Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> > consistency with PATCH 1/4. I've also tested with/without
> > /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> > noticed any regression.
>
>
> There's a lot more to unpack, but just to confirm, Vera doesn't have an EM, right?
> There's no EAS with it?
> (To be more precise, CPPC should bail out of building an artificial EM if there's no
> or only one efficiency class:
> drivers/cpufreq/cppc_cpufreq.c:
>
> if (bitmap_weight(used_classes, 256) <= 1) {
> pr_debug("Efficiency classes are all equal (=%d). "
> "No EM registered", class);
> return;
> }
>
> This is the case, right?
Yes, that's correct, so my testing on Vera with EAS isn't that meaningful.
Thanks,
-Andrea
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
@ 2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:46 ` Andrea Righi
2026-03-27 10:44 ` K Prateek Nayak
1 sibling, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2026-03-27 8:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> two-phase selection first tries only CPUs on fully idle cores, then
> falls back to any idle CPU if none fit.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring fully-idle cores when available leads to more accurate capacity
> usage on task wakeup.
>
> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads running a number of tasks equal to the number of
> SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 75 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d57c02e82f3a1..9a95628669851 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7940,14 +7940,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> * maximize capacity.
> + *
> + * When @prefer_idle_cores is true (asym + SMT and idle cores exist), prefer
> + * CPUs on fully-idle cores over partially-idle ones in a single pass: track
> + * the best candidate among idle-core CPUs and the best among any idle CPU,
> + * then return the idle-core candidate if found, else the best any-idle.
> */
> static int
> -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> + bool prefer_idle_cores)
> {
> - unsigned long task_util, util_min, util_max, best_cap = 0;
> - int fits, best_fits = 0;
> - int cpu, best_cpu = -1;
> + unsigned long task_util, util_min, util_max, best_cap = 0, best_cap_core = 0;
> + int fits, best_fits = 0, best_fits_core = 0;
> + int cpu, best_cpu = -1, best_cpu_core = -1;
> struct cpumask *cpus;
> + bool on_idle_core;
>
> cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> @@ -7962,16 +7969,58 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> if (!choose_idle_cpu(cpu, p))
> continue;
>
> + on_idle_core = is_core_idle(cpu);
> + if (prefer_idle_cores && !on_idle_core) {
> + /* Track best among any idle CPU for fallback */
> + fits = util_fits_cpu(task_util, util_min, util_max, cpu);
fits = util_fits_cpu(task_util, util_min, util_max, cpu); is always
called, so call it once above this if condition. This will help
factorize the selection of best_cpu and best_cpu_core.
> + if (fits > 0) {
> + /*
> + * Full fit: strictly better than fits 0 / -1;
> + * among several, prefer higher capacity.
> + */
> + if (best_cpu < 0 || best_fits <= 0 ||
> + (best_fits > 0 && cpu_cap > best_cap)) {
> + best_cap = cpu_cap;
> + best_cpu = cpu;
> + best_fits = fits;
> + }
> + continue;
> + }
> + if (best_fits > 0)
> + continue;
> + if (fits < 0)
> + cpu_cap = get_actual_cpu_capacity(cpu);
> + if ((fits < best_fits) ||
> + ((fits == best_fits) && (cpu_cap > best_cap))) {
> + best_cap = cpu_cap;
> + best_cpu = cpu;
> + best_fits = fits;
> + }
> + continue;
> + }
> +
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> - return cpu;
> + if (fits > 0) {
> + if (prefer_idle_cores && on_idle_core)
> + return cpu;
> + if (!prefer_idle_cores)
> + return cpu;
> + /*
> + * Prefer idle cores: record and keep looking for
> + * idle-core fit.
> + */
> + best_cap = cpu_cap;
> + best_cpu = cpu;
> + best_fits = fits;
> + continue;
> + }
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> * Look for the CPU with best capacity.
> */
> - else if (fits < 0)
> + if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
>
> /*
> @@ -7984,8 +8033,17 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> best_cpu = cpu;
> best_fits = fits;
> }
> + if (prefer_idle_cores && on_idle_core &&
> + ((fits < best_fits_core) ||
> + ((fits == best_fits_core) && (cpu_cap > best_cap_core)))) {
> + best_cap_core = cpu_cap;
> + best_cpu_core = cpu;
> + best_fits_core = fits;
> + }
> }
>
> + if (prefer_idle_cores && best_cpu_core >= 0)
> + return best_cpu_core;
> return best_cpu;
> }
>
> @@ -7994,12 +8052,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> unsigned long util_max,
> int cpu)
> {
> - if (sched_asym_cpucap_active())
> + if (sched_asym_cpucap_active()) {
> /*
> * Return true only if the cpu fully fits the task requirements
> * which include the utilization and the performance hints.
> + *
> + * When SMT is active, also require that the core has no busy
> + * siblings.
> */
> - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + return (!sched_smt_active() || is_core_idle(cpu)) &&
> + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + }
>
> return true;
> }
> @@ -8097,8 +8160,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> * capacity path.
> */
> if (sd) {
> - i = select_idle_capacity(p, sd, target);
> - return ((unsigned)i < nr_cpumask_bits) ? i : target;
> + i = select_idle_capacity(p, sd, target,
> + sched_smt_active() && test_idle_cores(target));
Move "sched_smt_active() && test_idle_cores(target)" inside
select_idle_capacity(). I don't see the benefit of making it a
parameter. Or use has_idle_core for the parameter, like other
SMT-related functions.
> + return ((unsigned int)i < nr_cpumask_bits) ? i : target;
> }
> }
>
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
@ 2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:45 ` Andrea Righi
0 siblings, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2026-03-27 8:09 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> Drop the sched_is_eas_possible() guard that rejects EAS whenever SMT is
> active. This allows EAS to be enabled and perf-domain setup to succeed on
> SD_ASYM_CPUCAPACITY topologies with SMT enabled.
I don't think that we want to enable EAS with SMT. So keep EAS and SMT
exclusive, at least for now
>
> Moreover, apply to find_energy_efficient_cpu() the same SMT-aware
> preference as the non-EAS wakeup path: when SMT is active and there is a
> fully-idle core in the relevant domain, prefer max-spare-capacity
> candidates on fully-idle cores. Otherwise, fall back to the prior
> behavior, also including partially-idle SMT siblings.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++--
> kernel/sched/topology.c | 9 --------
> 2 files changed, 48 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f8deaaa5bfc85..593a89f688679 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8658,13 +8658,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> eenv_task_busy_time(&eenv, p, prev_cpu);
>
> for (; pd; pd = pd->next) {
> - unsigned long util_min = p_util_min, util_max = p_util_max;
> unsigned long cpu_cap, cpu_actual_cap, util;
> long prev_spare_cap = -1, max_spare_cap = -1;
> + long max_spare_cap_fallback = -1;
> unsigned long rq_util_min, rq_util_max;
> unsigned long cur_delta, base_energy;
> - int max_spare_cap_cpu = -1;
> + int max_spare_cap_cpu = -1, max_spare_cap_cpu_fallback = -1;
> int fits, max_fits = -1;
> + int max_fits_fallback = -1;
> + bool prefer_idle_cores;
>
> if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask))
> continue;
> @@ -8676,6 +8678,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> eenv.cpu_cap = cpu_actual_cap;
> eenv.pd_cap = 0;
>
> + prefer_idle_cores = sched_smt_active() && test_idle_cores(prev_cpu);
> +
> for_each_cpu(cpu, cpus) {
> struct rq *rq = cpu_rq(cpu);
>
> @@ -8687,6 +8691,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> continue;
>
> + if (prefer_idle_cores && cpu != prev_cpu && !is_core_idle(cpu))
> + goto fallback;
> +
> + unsigned long util_min = p_util_min, util_max = p_util_max;
> +
> util = cpu_util(cpu, p, cpu, 0);
> cpu_cap = capacity_of(cpu);
>
> @@ -8733,6 +8742,43 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> max_spare_cap_cpu = cpu;
> max_fits = fits;
> }
> +
> +fallback:
> + if (!prefer_idle_cores || cpu == prev_cpu || is_core_idle(cpu))
> + continue;
> +
> + util_min = p_util_min;
> + util_max = p_util_max;
> + util = cpu_util(cpu, p, cpu, 0);
> + cpu_cap = capacity_of(cpu);
> +
> + if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> + rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> + rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> +
> + util_min = max(rq_util_min, p_util_min);
> + util_max = max(rq_util_max, p_util_max);
> + }
> +
> + fits = util_fits_cpu(util, util_min, util_max, cpu);
> + if (!fits)
> + continue;
> +
> + lsub_positive(&cpu_cap, util);
> +
> + if ((fits > max_fits_fallback) ||
> + ((fits == max_fits_fallback) &&
> + ((long)cpu_cap > max_spare_cap_fallback))) {
> + max_spare_cap_fallback = cpu_cap;
> + max_spare_cap_cpu_fallback = cpu;
> + max_fits_fallback = fits;
> + }
> + }
> +
> + if (max_spare_cap_cpu < 0 && max_spare_cap_cpu_fallback >= 0) {
> + max_spare_cap = max_spare_cap_fallback;
> + max_spare_cap_cpu = max_spare_cap_cpu_fallback;
> + max_fits = max_fits_fallback;
> }
>
> if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 061f8c85f5552..cb060fe56aec1 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -232,15 +232,6 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> return false;
> }
>
> - /* EAS definitely does *not* handle SMT */
> - if (sched_smt_active()) {
> - if (sched_debug()) {
> - pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
> - cpumask_pr_args(cpu_mask));
> - }
> - return false;
> - }
> -
> if (!arch_scale_freq_invariant()) {
> if (sched_debug()) {
> pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
@ 2026-03-27 8:45 ` Vincent Guittot
2026-03-27 9:44 ` Andrea Righi
2026-03-27 13:44 ` Shrikanth Hegde
1 sibling, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2026-03-27 8:45 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>
> When choosing which idle housekeeping CPU runs the idle load balancer,
> prefer one on a fully idle core if SMT is active, so balance can migrate
> work onto a CPU that still offers full effective capacity. Fall back to
> any idle candidate if none qualify.
This one isn't straightforward for me. The ilb cpu will check all
other idle CPUs 1st and finish with itself, so unless the next CPU in
the idle_cpus_mask is a sibling, this should not make a difference.
Did you see any perf diff?
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 593a89f688679..a1ee21f7b32f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> * - When one of the busy CPUs notices that there may be an idle rebalancing
> * needed, they will kick the idle load balancer, which then does idle
> * load balancing for all the idle CPUs.
> + *
> + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> + * target, so that when it runs balance it becomes the destination CPU
> + * and can accept migrated tasks with full effective capacity.
> */
> static inline int find_new_ilb(void)
> {
> const struct cpumask *hk_mask;
> - int ilb_cpu;
> + int ilb_cpu, fallback = -1;
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> if (ilb_cpu == smp_processor_id())
> continue;
>
> +#ifdef CONFIG_SCHED_SMT
you can probably get rid of the CONFIG and put this special case below
sched_smt_active()
> + if (!idle_cpu(ilb_cpu))
> + continue;
> +
> + if (fallback < 0)
> + fallback = ilb_cpu;
> +
> + if (!sched_smt_active() || is_core_idle(ilb_cpu))
> + return ilb_cpu;
> +#else
> if (idle_cpu(ilb_cpu))
> return ilb_cpu;
> +#endif
> }
>
> - return -1;
> + return fallback;
> }
>
> /*
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 8:45 ` Vincent Guittot
@ 2026-03-27 9:44 ` Andrea Righi
2026-03-27 11:34 ` K Prateek Nayak
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 9:44 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Vincent,
On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > When choosing which idle housekeeping CPU runs the idle load balancer,
> > prefer one on a fully idle core if SMT is active, so balance can migrate
> > work onto a CPU that still offers full effective capacity. Fall back to
> > any idle candidate if none qualify.
>
> This one isn't straightforward for me. The ilb cpu will check all
> other idle CPUs 1st and finish with itself, so unless the next CPU in
> the idle_cpus_mask is a sibling, this should not make a difference.
>
> Did you see any perf diff?
I actually see a benefit: with the first patch applied I see a ~1.76x
speedup, and if I add this on top I get a ~1.9x speedup vs baseline, which
seems pretty consistent across runs (definitely not within error range).
The intention with this change was to minimize SMT noise by running the ILB
code on a fully-idle core when possible, but I also didn't expect to see
such a big difference.
I'll investigate more to better understand what's happening.
>
>
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 19 +++++++++++++++++--
> > 1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 593a89f688679..a1ee21f7b32f6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> > * - When one of the busy CPUs notices that there may be an idle rebalancing
> > * needed, they will kick the idle load balancer, which then does idle
> > * load balancing for all the idle CPUs.
> > + *
> > + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> > + * target, so that when it runs balance it becomes the destination CPU
> > + * and can accept migrated tasks with full effective capacity.
> > */
> > static inline int find_new_ilb(void)
> > {
> > const struct cpumask *hk_mask;
> > - int ilb_cpu;
> > + int ilb_cpu, fallback = -1;
> >
> > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> >
> > @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> > if (ilb_cpu == smp_processor_id())
> > continue;
> >
> > +#ifdef CONFIG_SCHED_SMT
>
> you can probably get rid of the CONFIG and put this special case below
> sched_smt_active()
Ah good point, will change this.
>
>
> > + if (!idle_cpu(ilb_cpu))
> > + continue;
> > +
> > + if (fallback < 0)
> > + fallback = ilb_cpu;
> > +
> > + if (!sched_smt_active() || is_core_idle(ilb_cpu))
> > + return ilb_cpu;
> > +#else
> > if (idle_cpu(ilb_cpu))
> > return ilb_cpu;
> > +#endif
> > }
> >
> > - return -1;
> > + return fallback;
> > }
> >
> > /*
> > --
> > 2.53.0
> >
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
2026-03-27 8:09 ` Vincent Guittot
@ 2026-03-27 9:45 ` Andrea Righi
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 9:45 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Fri, Mar 27, 2026 at 09:09:35AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > Drop the sched_is_eas_possible() guard that rejects EAS whenever SMT is
> > active. This allows EAS to be enabled and perf-domain setup to succeed on
> > SD_ASYM_CPUCAPACITY topologies with SMT enabled.
>
> I don't think that we want to enable EAS with SMT. So keep EAS and SMT
> exclusive, at least for now
Ack.
Thanks,
-Andrea
>
>
> >
> > Moreover, apply to find_energy_efficient_cpu() the same SMT-aware
> > preference as the non-EAS wakeup path: when SMT is active and there is a
> > fully-idle core in the relevant domain, prefer max-spare-capacity
> > candidates on fully-idle cores. Otherwise, fall back to the prior
> > behavior, also including partially-idle SMT siblings.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++--
> > kernel/sched/topology.c | 9 --------
> > 2 files changed, 48 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f8deaaa5bfc85..593a89f688679 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8658,13 +8658,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > eenv_task_busy_time(&eenv, p, prev_cpu);
> >
> > for (; pd; pd = pd->next) {
> > - unsigned long util_min = p_util_min, util_max = p_util_max;
> > unsigned long cpu_cap, cpu_actual_cap, util;
> > long prev_spare_cap = -1, max_spare_cap = -1;
> > + long max_spare_cap_fallback = -1;
> > unsigned long rq_util_min, rq_util_max;
> > unsigned long cur_delta, base_energy;
> > - int max_spare_cap_cpu = -1;
> > + int max_spare_cap_cpu = -1, max_spare_cap_cpu_fallback = -1;
> > int fits, max_fits = -1;
> > + int max_fits_fallback = -1;
> > + bool prefer_idle_cores;
> >
> > if (!cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask))
> > continue;
> > @@ -8676,6 +8678,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > eenv.cpu_cap = cpu_actual_cap;
> > eenv.pd_cap = 0;
> >
> > + prefer_idle_cores = sched_smt_active() && test_idle_cores(prev_cpu);
> > +
> > for_each_cpu(cpu, cpus) {
> > struct rq *rq = cpu_rq(cpu);
> >
> > @@ -8687,6 +8691,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > if (!cpumask_test_cpu(cpu, p->cpus_ptr))
> > continue;
> >
> > + if (prefer_idle_cores && cpu != prev_cpu && !is_core_idle(cpu))
> > + goto fallback;
> > +
> > + unsigned long util_min = p_util_min, util_max = p_util_max;
> > +
> > util = cpu_util(cpu, p, cpu, 0);
> > cpu_cap = capacity_of(cpu);
> >
> > @@ -8733,6 +8742,43 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > max_spare_cap_cpu = cpu;
> > max_fits = fits;
> > }
> > +
> > +fallback:
> > + if (!prefer_idle_cores || cpu == prev_cpu || is_core_idle(cpu))
> > + continue;
> > +
> > + util_min = p_util_min;
> > + util_max = p_util_max;
> > + util = cpu_util(cpu, p, cpu, 0);
> > + cpu_cap = capacity_of(cpu);
> > +
> > + if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
> > + rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
> > + rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
> > +
> > + util_min = max(rq_util_min, p_util_min);
> > + util_max = max(rq_util_max, p_util_max);
> > + }
> > +
> > + fits = util_fits_cpu(util, util_min, util_max, cpu);
> > + if (!fits)
> > + continue;
> > +
> > + lsub_positive(&cpu_cap, util);
> > +
> > + if ((fits > max_fits_fallback) ||
> > + ((fits == max_fits_fallback) &&
> > + ((long)cpu_cap > max_spare_cap_fallback))) {
> > + max_spare_cap_fallback = cpu_cap;
> > + max_spare_cap_cpu_fallback = cpu;
> > + max_fits_fallback = fits;
> > + }
> > + }
> > +
> > + if (max_spare_cap_cpu < 0 && max_spare_cap_cpu_fallback >= 0) {
> > + max_spare_cap = max_spare_cap_fallback;
> > + max_spare_cap_cpu = max_spare_cap_cpu_fallback;
> > + max_fits = max_fits_fallback;
> > }
> >
> > if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 061f8c85f5552..cb060fe56aec1 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -232,15 +232,6 @@ static bool sched_is_eas_possible(const struct cpumask *cpu_mask)
> > return false;
> > }
> >
> > - /* EAS definitely does *not* handle SMT */
> > - if (sched_smt_active()) {
> > - if (sched_debug()) {
> > - pr_info("rd %*pbl: Checking EAS, SMT is not supported\n",
> > - cpumask_pr_args(cpu_mask));
> > - }
> > - return false;
> > - }
> > -
> > if (!arch_scale_freq_invariant()) {
> > if (sched_debug()) {
> > pr_info("rd %*pbl: Checking EAS: frequency-invariant load tracking not yet supported",
> > --
> > 2.53.0
> >
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 8:09 ` Vincent Guittot
@ 2026-03-27 9:46 ` Andrea Righi
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 9:46 UTC (permalink / raw)
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Vincent,
On Fri, Mar 27, 2026 at 09:09:24AM +0100, Vincent Guittot wrote:
> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal capacity
> > when the sibling thread is busy: SMT siblings compete for shared
> > resources, so a "high capacity" CPU that is idle but whose sibling is
> > busy does not deliver its full capacity. This effective capacity
> > reduction cannot be modeled by the static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active prefer fully-idle SMT cores over partially-idle ones. A
> > two-phase selection first tries only CPUs on fully idle cores, then
> > falls back to any idle CPU if none fit.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring them when available leads to more accurate capacity usage on
> > task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads, running a number of tasks equal to the number of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Christian Loehle <christian.loehle@arm.com>
> > Cc: Koba Ko <kobak@nvidia.com>
> > Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> > Signed-off-by: Andrea Righi <arighi@nvidia.com>
> > ---
> > kernel/sched/fair.c | 86 +++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 75 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d57c02e82f3a1..9a95628669851 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7940,14 +7940,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > * maximize capacity.
> > + *
> > + * When @prefer_idle_cores is true (asym + SMT and idle cores exist), prefer
> > + * CPUs on fully-idle cores over partially-idle ones in a single pass: track
> > + * the best candidate among idle-core CPUs and the best among any idle CPU,
> > + * then return the idle-core candidate if found, else the best any-idle.
> > */
> > static int
> > -select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target,
> > + bool prefer_idle_cores)
> > {
> > - unsigned long task_util, util_min, util_max, best_cap = 0;
> > - int fits, best_fits = 0;
> > - int cpu, best_cpu = -1;
> > + unsigned long task_util, util_min, util_max, best_cap = 0, best_cap_core = 0;
> > + int fits, best_fits = 0, best_fits_core = 0;
> > + int cpu, best_cpu = -1, best_cpu_core = -1;
> > struct cpumask *cpus;
> > + bool on_idle_core;
> >
> > cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > @@ -7962,16 +7969,58 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > if (!choose_idle_cpu(cpu, p))
> > continue;
> >
> > + on_idle_core = is_core_idle(cpu);
> > + if (prefer_idle_cores && !on_idle_core) {
> > + /* Track best among any idle CPU for fallback */
> > + fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> fits = util_fits_cpu(task_util, util_min, util_max, cpu); is always
> called, so call it once above this if condition.
>
> This will help factorize the selection of best_cpu and best_cpu_core.
Makes sense.
>
> > + if (fits > 0) {
> > + /*
> > + * Full fit: strictly better than fits 0 / -1;
> > + * among several, prefer higher capacity.
> > + */
> > + if (best_cpu < 0 || best_fits <= 0 ||
> > + (best_fits > 0 && cpu_cap > best_cap)) {
> > + best_cap = cpu_cap;
> > + best_cpu = cpu;
> > + best_fits = fits;
> > + }
> > + continue;
> > + }
> > + if (best_fits > 0)
> > + continue;
> > + if (fits < 0)
> > + cpu_cap = get_actual_cpu_capacity(cpu);
> > + if ((fits < best_fits) ||
> > + ((fits == best_fits) && (cpu_cap > best_cap))) {
> > + best_cap = cpu_cap;
> > + best_cpu = cpu;
> > + best_fits = fits;
> > + }
> > + continue;
> > + }
> > +
> > fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > - return cpu;
> > + if (fits > 0) {
> > + if (prefer_idle_cores && on_idle_core)
> > + return cpu;
> > + if (!prefer_idle_cores)
> > + return cpu;
> > + /*
> > + * Prefer idle cores: record and keep looking for
> > + * idle-core fit.
> > + */
> > + best_cap = cpu_cap;
> > + best_cpu = cpu;
> > + best_fits = fits;
> > + continue;
> > + }
> > /*
> > * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > * Look for the CPU with best capacity.
> > */
> > - else if (fits < 0)
> > + if (fits < 0)
> > cpu_cap = get_actual_cpu_capacity(cpu);
> >
> > /*
> > @@ -7984,8 +8033,17 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > best_cpu = cpu;
> > best_fits = fits;
> > }
> > + if (prefer_idle_cores && on_idle_core &&
> > + ((fits < best_fits_core) ||
> > + ((fits == best_fits_core) && (cpu_cap > best_cap_core)))) {
> > + best_cap_core = cpu_cap;
> > + best_cpu_core = cpu;
> > + best_fits_core = fits;
> > + }
> > }
> >
> > + if (prefer_idle_cores && best_cpu_core >= 0)
> > + return best_cpu_core;
> > return best_cpu;
> > }
> >
> > @@ -7994,12 +8052,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> > unsigned long util_max,
> > int cpu)
> > {
> > - if (sched_asym_cpucap_active())
> > + if (sched_asym_cpucap_active()) {
> > /*
> > * Return true only if the cpu fully fits the task requirements
> > * which include the utilization and the performance hints.
> > + *
> > + * When SMT is active, also require that the core has no busy
> > + * siblings.
> > */
> > - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + return (!sched_smt_active() || is_core_idle(cpu)) &&
> > + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + }
> >
> > return true;
> > }
> > @@ -8097,8 +8160,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > * capacity path.
> > */
> > if (sd) {
> > - i = select_idle_capacity(p, sd, target);
> > - return ((unsigned)i < nr_cpumask_bits) ? i : target;
> > + i = select_idle_capacity(p, sd, target,
> > + sched_smt_active() && test_idle_cores(target));
>
> Move "sched_smt_active() && test_idle_cores(target)" inside
> select_idle_capacity(). I don't see the benefit of making it a
> parameter. Or use has_idle_core for the parameter, like other
> SMT-related functions.
And also makes sense.
>
>
> > + return ((unsigned int)i < nr_cpumask_bits) ? i : target;
> > }
> > }
> >
> > --
> > 2.53.0
> >
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
@ 2026-03-27 10:44 ` K Prateek Nayak
2026-03-27 10:58 ` Andrea Righi
1 sibling, 1 reply; 42+ messages in thread
From: K Prateek Nayak @ 2026-03-27 10:44 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hello Andrea,
On 3/26/2026 8:32 PM, Andrea Righi wrote:
> /* This CPU fits with all requirements */
> - if (fits > 0)
> - return cpu;
> + if (fits > 0) {
> + if (prefer_idle_cores && on_idle_core)
> + return cpu;
> + if (!prefer_idle_cores)
> + return cpu;
nit.
Can the above two be re-written as:
if (!prefer_idle_cores || on_idle_core)
return cpu;
since they are equivalent.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 10:44 ` K Prateek Nayak
@ 2026-03-27 10:58 ` Andrea Righi
2026-03-27 11:14 ` K Prateek Nayak
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 10:58 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Prateek,
On Fri, Mar 27, 2026 at 04:14:57PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/26/2026 8:32 PM, Andrea Righi wrote:
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > - return cpu;
> > + if (fits > 0) {
> > + if (prefer_idle_cores && on_idle_core)
> > + return cpu;
> > + if (!prefer_idle_cores)
> > + return cpu;
>
> nit.
>
> Can the above two be re-written as:
>
> if (!prefer_idle_cores || on_idle_core)
> return cpu;
>
> since they are equivalent.
Oh yes, indeed.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 10:58 ` Andrea Righi
@ 2026-03-27 11:14 ` K Prateek Nayak
2026-03-27 16:39 ` Andrea Righi
0 siblings, 1 reply; 42+ messages in thread
From: K Prateek Nayak @ 2026-03-27 11:14 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hello Andrea,
On 3/27/2026 4:28 PM, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 04:14:57PM +0530, K Prateek Nayak wrote:
>> Hello Andrea,
>>
>> On 3/26/2026 8:32 PM, Andrea Righi wrote:
>>> /* This CPU fits with all requirements */
>>> - if (fits > 0)
>>> - return cpu;
>>> + if (fits > 0) {
>>> + if (prefer_idle_cores && on_idle_core)
>>> + return cpu;
>>> + if (!prefer_idle_cores)
>>> + return cpu;
>>
>> nit.
>>
>> Can the above two be re-written as:
>>
>> if (!prefer_idle_cores || on_idle_core)
>> return cpu;
>>
>> since they are equivalent.
>
> Oh yes, indeed.
Also, can we just rewrite this Patch as:
(Includes feedback from Vincent; Only build tested)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 700d0f145ca6..cffd5649b54e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7946,6 +7946,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
@@ -7959,6 +7960,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!choose_idle_cpu(cpu, p))
@@ -7967,7 +7969,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -7976,6 +7978,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * If we are on a preferred core, translate the range of fits
+ * from [-1, 1] to [-4, -2]. This ensures that an idle core
+ * is always given priority over a (partially) busy core.
+ */
+ if (preferred_core)
+ fits -= 3;
+
/*
* First, select CPU which fits better (-1 being better than 0).
* Then, select the one with best capacity at same level.
---
My naive eyes say it should be equivalent to what you have but maybe
I'm wrong?
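For what it's worth, the effect of the "fits -= 3" translation can be sanity-checked in isolation (a standalone sketch, not kernel code; it assumes the selection loop prefers the lower fits value, as the comment describes):

```c
/*
 * Standalone sanity check of the "fits -= 3" trick (not kernel code):
 * util_fits_cpu() returns -1, 0 or 1, and the selection loop prefers
 * the lower fits value. Translating preferred-core candidates into
 * [-4, -2] must therefore rank every preferred core ahead of every
 * non-preferred one.
 */
static int translated_fits(int fits, int preferred_core)
{
	return preferred_core ? fits - 3 : fits;
}

/* Returns 1 if every preferred-core fits value beats (is lower than)
 * every non-preferred one, for all fits values in [-1, 1]. */
static int preferred_always_wins(void)
{
	for (int pref = -1; pref <= 1; pref++)
		for (int nonpref = -1; nonpref <= 1; nonpref++)
			if (!(translated_fits(pref, 1) < translated_fits(nonpref, 0)))
				return 0;
	return 1;
}
```

The worst preferred value (-2) still sorts ahead of the best non-preferred one (-1).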
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 9:44 ` Andrea Righi
@ 2026-03-27 11:34 ` K Prateek Nayak
2026-03-27 20:36 ` Andrea Righi
2026-03-30 17:29 ` Andrea Righi
0 siblings, 2 replies; 42+ messages in thread
From: K Prateek Nayak @ 2026-03-27 11:34 UTC (permalink / raw)
To: Andrea Righi, Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hello Andrea,
On 3/27/2026 3:14 PM, Andrea Righi wrote:
> Hi Vincent,
>
> On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
>> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
>>>
>>> When choosing which idle housekeeping CPU runs the idle load balancer,
>>> prefer one on a fully idle core if SMT is active, so balance can migrate
>>> work onto a CPU that still offers full effective capacity. Fall back to
>>> any idle candidate if none qualify.
>>
>> This one isn't straightforward for me. The ilb cpu will check all
>> other idle CPUs 1st and finish with itself so unless the next CPU in
>> the idle_cpus_mask is a sibling, this should not make a difference
>>
>> Did you see any perf diff ?
>
> I actually see a benefit, in particular, with the first patch applied I see
> a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> which seems pretty consistent across runs (definitely not in error range).
>
> The intention with this change was to minimize SMT noise running the ILB
> code on a fully-idle core when possible, but I also didn't expect to see
such a big difference.
>
> I'll investigate more to better understand what's happening.
Interesting! Either this "CPU-intensive workload" hates the SMT sibling
turning busy (but to an extent where performance drops visibly?) or the ILB
keeps getting interrupted on an SMT sibling that is burdened by
interrupts, leading to slower balance (or IRQs driving the workload
being delayed by rq_lock disabling them).
Would it be possible to share the total SCHED_SOFTIRQ time, load
balancing attempts, and utilization with and without the patch? I too
will go queue up some runs to see if this makes a difference.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
2026-03-27 8:45 ` Vincent Guittot
@ 2026-03-27 13:44 ` Shrikanth Hegde
1 sibling, 0 replies; 42+ messages in thread
From: Shrikanth Hegde @ 2026-03-27 13:44 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On 3/26/26 8:32 PM, Andrea Righi wrote:
> When choosing which idle housekeeping CPU runs the idle load balancer,
> prefer one on a fully idle core if SMT is active, so balance can migrate
> work onto a CPU that still offers full effective capacity. Fall back to
> any idle candidate if none qualify.
>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Christian Loehle <christian.loehle@arm.com>
> Cc: Koba Ko <kobak@nvidia.com>
> Reported-by: Felix Abecassis <fabecassis@nvidia.com>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 593a89f688679..a1ee21f7b32f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12733,11 +12733,15 @@ static inline int on_null_domain(struct rq *rq)
> * - When one of the busy CPUs notices that there may be an idle rebalancing
> * needed, they will kick the idle load balancer, which then does idle
> * load balancing for all the idle CPUs.
> + *
> + * - When SMT is active, prefer a CPU on a fully idle core as the ILB
> + * target, so that when it runs balance it becomes the destination CPU
> + * and can accept migrated tasks with full effective capacity.
> */
> static inline int find_new_ilb(void)
> {
> const struct cpumask *hk_mask;
> - int ilb_cpu;
> + int ilb_cpu, fallback = -1;
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> @@ -12746,11 +12750,22 @@ static inline int find_new_ilb(void)
> if (ilb_cpu == smp_processor_id())
> continue;
>
> +#ifdef CONFIG_SCHED_SMT
> + if (!idle_cpu(ilb_cpu))
> + continue;
> +
> + if (fallback < 0)
> + fallback = ilb_cpu;
> +
> + if (!sched_smt_active() || is_core_idle(ilb_cpu))
is_core_idle() does loop over all siblings, and nohz.idle_cpus_mask
will likely have all siblings set.
So that might turn out to be a bit expensive on a large SMT system such as SMT=4.
Also, this runs with interrupts disabled.
Will try to run this on a powerpc system and see if simple benchmarks show anything.
> + return ilb_cpu;
> +#else
> if (idle_cpu(ilb_cpu))
> return ilb_cpu;
> +#endif
> }
>
> - return -1;
> + return fallback;
> }
>
> /*
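To make the cost concern concrete, here is a simplified standalone sketch (the real is_core_idle() walks cpu_smt_mask; contiguous sibling numbering here is an assumption for brevity):

```c
#include <stdbool.h>

/*
 * Simplified standalone sketch of the cost discussed above: deciding
 * whether a core is fully idle visits every SMT sibling, so each
 * candidate in the ILB search costs O(smt_width), paid with interrupts
 * disabled. With SMT=4 and most siblings present in nohz.idle_cpus_mask,
 * the search does roughly smt_width times more work per candidate.
 */
static bool core_is_idle(const bool *cpu_idle, int first_sibling, int smt_width)
{
	for (int i = 0; i < smt_width; i++) {
		/* A single busy sibling disqualifies the whole core. */
		if (!cpu_idle[first_sibling + i])
			return false;
	}
	return true;
}
```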
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (4 preceding siblings ...)
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
@ 2026-03-27 16:31 ` Shrikanth Hegde
2026-03-27 17:08 ` Andrea Righi
2026-03-28 13:03 ` Balbir Singh
2026-03-30 22:30 ` Dietmar Eggemann
7 siblings, 1 reply; 42+ messages in thread
From: Shrikanth Hegde @ 2026-03-27 16:31 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Andrea.
On 3/26/26 8:32 PM, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
>
> = Problem =
>
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
>
How does the energy model define the OPPs for SMT?
SMT systems have multiple different functional blocks, e.g. a few ALUs
(arithmetic), LSUs (load store units), etc. If the same/similar workload runs
on the sibling, it would affect performance, but if the sibling is using
different functional blocks, it would not.
So the underlying actual CPU capacity of each thread depends on what each
sibling is running. I don't understand how the firmware/energy models define this.
> = Proposed Solution =
>
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
>
> Patch set summary:
>
> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>
> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>
> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>
> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> Provided for consistency with PATCH 1/4.
>
> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>
> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> consistency with PATCH 1/4. I've also tested with/without
> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> noticed any regression.
>
> - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>
> When choosing the housekeeping CPU that runs the idle load balancer,
> prefer an idle CPU on a fully idle core so migrated work lands where
> effective capacity is available.
>
> The change is still consistent with the same "avoid CPUs with busy
> sibling" logic and it shows some benefits on Vera, but could have
> negative impact on other systems, I'm including it for completeness
> (feedback is appreciated).
>
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>
I assume the CPU capacity values are fixed?
The first sibling has max, while the other has less?
> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
>
How is the performance measured here? Which benchmark?
By any chance are you running number_running_task <= (nr_cpus / smt_threads_per_core),
so it is all fitting nicely?
If you increase those numbers, how do the performance numbers compare?
Also, what's the system like? SMT level?
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asympacking [3].
>
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
>
> [1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
> [2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
> [3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
>
> Andrea Righi (4):
> sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>
> kernel/sched/fair.c | 163 +++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/topology.c | 9 ---
> 2 files changed, 147 insertions(+), 25 deletions(-)
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 11:14 ` K Prateek Nayak
@ 2026-03-27 16:39 ` Andrea Righi
2026-03-30 10:17 ` K Prateek Nayak
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 16:39 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Prateek,
On Fri, Mar 27, 2026 at 04:44:01PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 4:28 PM, Andrea Righi wrote:
> > On Fri, Mar 27, 2026 at 04:14:57PM +0530, K Prateek Nayak wrote:
> >> Hello Andrea,
> >>
> >> On 3/26/2026 8:32 PM, Andrea Righi wrote:
> >>> /* This CPU fits with all requirements */
> >>> - if (fits > 0)
> >>> - return cpu;
> >>> + if (fits > 0) {
> >>> + if (prefer_idle_cores && on_idle_core)
> >>> + return cpu;
> >>> + if (!prefer_idle_cores)
> >>> + return cpu;
> >>
> >> nit.
> >>
> >> Can the above two be re-written as:
> >>
> >> if (!prefer_idle_cores || on_idle_core)
> >> return cpu;
> >>
> >> since they are equivalent.
> >
> > Oh yes, indeed.
>
> Also, can we just rewrite this Patch as:
>
> (Includes feedback from Vincent; Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 700d0f145ca6..cffd5649b54e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7946,6 +7946,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> @@ -7959,6 +7960,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -7967,7 +7969,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -7976,6 +7978,14 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
>
> + /*
> + * If we are on a preferred core, translate the range of fits
> + * from [-1, 1] to [-4, -2]. This ensures that an idle core
> + * is always given priority over a (partially) busy core.
> + */
> + if (preferred_core)
> + fits -= 3;
> +
Ah, I like this trick. Yes, this definitely makes the patch more compact.
> /*
> * First, select CPU which fits better (-1 being better than 0).
> * Then, select the one with best capacity at same level.
> ---
>
> My naive eyes say it should be equivalent to what you have but maybe
> I'm wrong?
It seems correct to my naive eyes as well. Will test this out to make sure.
Unfortunately I just lost access to my system (bummer). I found another
Vera machine, but this one has a version of the firmware that exposes all
CPUs with the same highest_perf... so I can still do some testing, but not
the same one with SD_ASYM_CPUCAPACITY + SMT. I should get access to the
previous system with the different highest_perf values on Monday.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-27 16:31 ` Shrikanth Hegde
@ 2026-03-27 17:08 ` Andrea Righi
2026-03-28 6:51 ` Shrikanth Hegde
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 17:08 UTC (permalink / raw)
To: Shrikanth Hegde
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 10:01:03PM +0530, Shrikanth Hegde wrote:
> Hi Andrea.
>
> On 3/26/26 8:32 PM, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> >
> > = Problem =
> >
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> >
>
> How does the energy model define the OPPs for SMT?
For now, as suggested by Vincent, we should probably ignore EAS / energy
model and keep it as it is (not compatible with SMT). I'll drop PATCH 3/4
and focus only on SD_ASYM_CPUCAPACITY + SMT.
>
> SMT systems have multiple different functional blocks, e.g. a few ALUs
> (arithmetic), LSUs (load store units), etc. If the same/similar workload runs
> on the sibling, it would affect performance, but if the sibling is using
> different functional blocks, it would not.
>
> So the underlying actual CPU capacity of each thread depends on what each
> sibling is running. I don't understand how the firmware/energy models define this.
They don't and they probably shouldn't. I don't think it's possible to
model CPU capacity with a static nominal value when SMT is enabled, since
the effective capacity changes depending on whether the corresponding sibling is busy.
It should be up to the scheduler to figure out a reasonable way to estimate
the actual capacity, considering the status of the other sibling (e.g.,
prioritizing the fully-idle SMT cores over the partially-idle SMT cores,
like we do in other parts of the scheduler code).
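Purely as an illustration of that idea (my own sketch: the helper name and the 50% discount are invented placeholders, not numbers from this series or from Vera), a sibling-aware estimate could look like:

```c
#include <stdbool.h>

/*
 * Illustrative sketch only, not kernel code: discount a CPU's nominal
 * capacity when its SMT sibling is busy, instead of trusting a static
 * firmware-provided value. The 50% penalty is an arbitrary placeholder;
 * a real estimate would depend on the microarchitecture and workload mix.
 */
#define SMT_BUSY_DISCOUNT_PCT	50	/* hypothetical penalty */

static unsigned long effective_capacity(unsigned long nominal_cap,
					bool sibling_busy)
{
	if (!sibling_busy)
		return nominal_cap;	/* fully idle core: full capacity */

	return nominal_cap * SMT_BUSY_DISCOUNT_PCT / 100;
}
```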
>
> > = Proposed Solution =
> >
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
> >
> > Patch set summary:
> >
> > - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> >
> > Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> > wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> > idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> >
> > - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> >
> > Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> > Provided for consistency with PATCH 1/4.
> >
> > - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> >
> > Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> > consistency with PATCH 1/4. I've also tested with/without
> > /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> > noticed any regression.
> >
> > - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> >
> > When choosing the housekeeping CPU that runs the idle load balancer,
> > prefer an idle CPU on a fully idle core so migrated work lands where
> > effective capacity is available.
> >
> > The change is still consistent with the same "avoid CPUs with busy
> > sibling" logic and it shows some benefits on Vera, but could have
> > negative impact on other systems, I'm including it for completeness
> > (feedback is appreciated).
> >
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> >
>
> I assume the CPU capacity values are fixed?
> The first sibling has max, while the other has less?
The firmware is exposing the same capacity for both siblings. SMT cores may
have different capacity, but siblings within the same SMT core have the
same capacity.
There was an idea to expose a higher capacity for all the 1st siblings and
a lower capacity for all the 2nd siblings, but I don't think it's a good
idea, since that would just confuse the scheduler (and the 2nd sibling
doesn't really have a lower nominal capacity if it's running alone).
>
> > Without these patches, performance can drop up to ~2x with CPU-intensive
> > workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> > account for busy SMT siblings.
> >
>
> How is the performance measured here? Which benchmark?
I've used an internal NVIDIA suite (based on NVBLAS), I also tried Linpack
and got similar results. I'm planning to repeat the tests using public
benchmarks and share the results as soon as I can.
> By any chance are you running number_running_task <= (nr_cpus / smt_threads_per_core),
> so it is all fitting nicely?
That's the case that gives me the optimal results.
>
> If you increase those numbers, how do the performance numbers compare?
I tried different numbers of tasks. The closer I get to system saturation,
the smaller the benefits are. When I completely saturate the system I don't
see any benefit with these changes, nor any regressions, but I guess that's
expected.
>
> Also, what's the system like? SMT level?
2 siblings for each SMT core.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 11:34 ` K Prateek Nayak
@ 2026-03-27 20:36 ` Andrea Righi
2026-03-27 22:45 ` Andrea Righi
2026-03-30 17:29 ` Andrea Righi
1 sibling, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 20:36 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> >>> work onto a CPU that still offers full effective capacity. Fall back to
> >>> any idle candidate if none qualify.
> >>
> >> This one isn't straightforward for me. The ilb cpu will check all
> >> other idle CPUs 1st and finish with itself so unless the next CPU in
> >> the idle_cpus_mask is a sibling, this should not make a difference
> >>
> >> Did you see any perf diff ?
> >
> > I actually see a benefit, in particular, with the first patch applied I see
> > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > which seems pretty consistent across runs (definitely not in error range).
> >
> > The intention with this change was to minimize SMT noise running the ILB
> > code on a fully-idle core when possible, but I also didn't expect to see
> > such a big difference.
> >
> > I'll investigate more to better understand what's happening.
>
> Interesting! Either this "CPU-intensive workload" hates the SMT sibling
> turning busy (but to an extent where performance drops visibly?) or the ILB
> keeps getting interrupted on an SMT sibling that is burdened by
> interrupts, leading to slower balance (or IRQs driving the workload
> being delayed by rq_lock disabling them).
>
> Would it be possible to share the total SCHED_SOFTIRQ time, load
> balancing attempts, and utilization with and without the patch? I too
> will go queue up some runs to see if this makes a difference.
Quick update: I also tried this on a Vera machine with a firmware that
exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
disabled and SMT still on of course) and I see similar performance
benefits.
Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
differences, all within error range (results produced using a vibe-coded
python script):
- baseline (stats/sec):
SCHED softirq count : 2,625
LB attempts (total) : 69,832
Per-domain breakdown:
domain0 (SMT):
lb_count (total) : 68,482 [balanced=68,472 failed=9]
CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
domain1 (MC):
lb_count (total) : 902 [balanced=900 failed=2]
CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
domain2 (NUMA):
lb_count (total) : 448 [balanced=441 failed=7]
CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
- with ilb-smt (stats/sec):
SCHED softirq count : 2,671
LB attempts (total) : 68,572
Per-domain breakdown:
domain0 (SMT):
lb_count (total) : 67,239 [balanced=67,197 failed=41]
CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
domain1 (MC):
lb_count (total) : 833 [balanced=833 failed=0]
CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
domain2 (NUMA):
lb_count (total) : 500 [balanced=488 failed=12]
CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
I'll add more direct instrumentation to check what ILB is doing
differently...
And I'll also repeat the test and collect the same metrics on the Vera
machine with the firmware that exposes different CPU capacities as soon as
I get access again.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 20:36 ` Andrea Righi
@ 2026-03-27 22:45 ` Andrea Righi
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-27 22:45 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 09:36:15PM +0100, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > > Hi Vincent,
> > >
> > > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> > >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> > >>>
> > >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> > >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> > >>> work onto a CPU that still offers full effective capacity. Fall back to
> > >>> any idle candidate if none qualify.
> > >>
> > >> This one isn't straightforward for me. The ilb cpu will check all
> > >> other idle CPUs 1st and finish with itself so unless the next CPU in
> > >> the idle_cpus_mask is a sibling, this should not make a difference
> > >>
> > >> Did you see any perf diff ?
> > >
> > > I actually see a benefit, in particular, with the first patch applied I see
> > > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > > which seems pretty consistent across runs (definitely not in error range).
> > >
> > > The intention with this change was to minimize SMT noise running the ILB
> > > code on a fully-idle core when possible, but I also didn't expect to see
> > > such a big difference.
> > >
> > > I'll investigate more to better understand what's happening.
> >
> > Interesting! Either this "CPU-intensive workload" hates the SMT sibling
> > turning busy (but to an extent where performance drops visibly?) or the ILB
> > keeps getting interrupted on an SMT sibling that is burdened by
> > interrupts, leading to slower balance (or IRQs driving the workload
> > being delayed by rq_lock disabling them).
> >
> > Would it be possible to share the total SCHED_SOFTIRQ time, load
> > balancing attempts, and utilization with and without the patch? I too
> > will go queue up some runs to see if this makes a difference.
>
> Quick update: I also tried this on a Vera machine with a firmware that
> exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
> disabled and SMT still on of course) and I see similar performance
> benefits.
>
> Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
> differences, all within error range (results produced using a vibe-coded
> python script):
>
> - baseline (stats/sec):
>
> SCHED softirq count : 2,625
> LB attempts (total) : 69,832
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 68,482 [balanced=68,472 failed=9]
> CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 902 [balanced=900 failed=2]
> CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 448 [balanced=441 failed=7]
> CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
>
> - with ilb-smt (stats/sec):
>
> SCHED softirq count : 2,671
> LB attempts (total) : 68,572
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 67,239 [balanced=67,197 failed=41]
> CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 833 [balanced=833 failed=0]
> CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 500 [balanced=488 failed=12]
> CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
>
> I'll add more direct instrumentation to check what ILB is doing
> differently...
More data.
== SMT contention ==
tracepoint:sched:sched_switch
{
if (args->next_pid != 0) {
@busy[cpu] = 1;
} else {
delete(@busy[cpu]);
}
}
tracepoint:sched:sched_switch
/ args->prev_pid == 0 && args->next_pid != 0 /
{
$sib = (cpu + 176) % 352;
if (@busy[$sib]) {
@smt_contention++;
} else {
@smt_no_contention++;
}
}
END
{
printf("smt_contention %lld\n", (int64)@smt_contention);
printf("smt_no_contention %lld\n", (int64)@smt_no_contention);
}
- baseline:
@smt_contention: 1103
@smt_no_contention: 3815
- ilb-smt:
@smt_contention: 937
@smt_no_contention: 4459
== ILB duration ==
- baseline:
@ilb_duration_us:
[0] 147 | |
[1] 354 |@ |
[2, 4) 739 |@@@ |
[4, 8) 3040 |@@@@@@@@@@@@@@@@ |
[8, 16) 9825 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 8142 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 1267 |@@@@@@ |
[64, 128) 1607 |@@@@@@@@ |
[128, 256) 2222 |@@@@@@@@@@@ |
[256, 512) 2326 |@@@@@@@@@@@@ |
[512, 1K) 141 | |
[1K, 2K) 37 | |
[2K, 4K) 7 | |
- ilb-smt:
@ilb_duration_us:
[0] 79 | |
[1] 137 | |
[2, 4) 1440 |@@@@@@@@@@ |
[4, 8) 2897 |@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 7433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4993 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 2390 |@@@@@@@@@@@@@@@@ |
[64, 128) 2254 |@@@@@@@@@@@@@@@ |
[128, 256) 2731 |@@@@@@@@@@@@@@@@@@@ |
[256, 512) 1083 |@@@@@@@ |
[512, 1K) 265 |@ |
[1K, 2K) 29 | |
[2K, 4K) 5 | |
== rq_lock hold ==
- baseline:
@lb_rqlock_hold_us:
[0] 664396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 77446 |@@@@@@ |
[2, 4) 25044 |@ |
[4, 8) 19847 |@ |
[8, 16) 2434 | |
[16, 32) 605 | |
[32, 64) 308 | |
[64, 128) 38 | |
[128, 256) 2 | |
- ilb-smt:
@lb_rqlock_hold_us:
[0] 229152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 135060 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 26989 |@@@@@@ |
[4, 8) 48034 |@@@@@@@@@@ |
[8, 16) 1919 | |
[16, 32) 2236 | |
[32, 64) 595 | |
[64, 128) 135 | |
[128, 256) 27 | |
From what I can see, ILB runs are more expensive, but I still don't see why
I'm getting the speedup with this ilb-smt patch. I'll keep investigating...
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-27 17:08 ` Andrea Righi
@ 2026-03-28 6:51 ` Shrikanth Hegde
0 siblings, 0 replies; 42+ messages in thread
From: Shrikanth Hegde @ 2026-03-28 6:51 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
>> How is the performance measured here? Which benchmark?
>
> I've used an internal NVIDIA suite (based on NVBLAS), I also tried Linpack
> and got similar results. I'm planning to repeat the tests using public
> benchmarks and share the results as soon as I can.
>
>> By any chance you are running number_running_task <= (nr_cpus / smt_threads_per_core),
>> so it is all fitting nicely?
>
> That's the case that gives me the optimal results.
>
>>
>> If you increase those numbers, how does the performance numbers compare?
>
> I tried different numbers of tasks. The closer I get to system saturation,
> the smaller the benefits are. When I completely saturate the system I see
> neither benefits nor regressions with these changes, but I guess that's
> expected.
>
Ok. That's good.
I ran hackbench on powerpc with SMT=4; I didn't observe any regressions or improvements.
Only PATCH 4/4 applies in this case, as there is no asym_cpu_capacity.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (5 preceding siblings ...)
2026-03-27 16:31 ` Shrikanth Hegde
@ 2026-03-28 13:03 ` Balbir Singh
2026-03-28 22:50 ` Andrea Righi
2026-03-30 22:30 ` Dietmar Eggemann
7 siblings, 1 reply; 42+ messages in thread
From: Balbir Singh @ 2026-03-28 13:03 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
linux-kernel
On 3/27/26 02:02, Andrea Righi wrote:
> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> introducing SMT awareness.
>
> = Problem =
>
> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> sibling is busy, because the physical core doesn't deliver its full nominal
> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> that are not actually good destinations.
>
> = Proposed Solution =
>
> This patch set aligns those paths with a simple rule already used
> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> partially idle SMT siblings as full-capacity targets where that would
> mislead load balance.
In kernel/sched/topology.c
/* Don't attempt to spread across CPUs of different capacities. */
if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
sd->child->flags &= ~SD_PREFER_SIBLING;
Should handle the selection, but I guess this does not work for SMT level sd's?
>
> Patch set summary:
>
> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>
> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>
> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>
> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> Provided for consistency with PATCH 1/4.
>
> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>
> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> consistency with PATCH 1/4. I've also tested with/without
> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> noticed any regression.
>
> - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>
> When choosing the housekeeping CPU that runs the idle load balancer,
> prefer an idle CPU on a fully idle core so migrated work lands where
> effective capacity is available.
>
> The change is still consistent with the same "avoid CPUs with busy
> sibling" logic and it shows some benefits on Vera, but could have
> negative impact on other systems, I'm including it for completeness
> (feedback is appreciated).
>
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>
Are you referring to nominal_freq?
> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
>
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+/-5%) [1][2], or enabling asym packing [3].
>
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
>
> [1] https://lore.kernel.org/lkml/20260324005509.1134981-1-arighi@nvidia.com
> [2] https://lore.kernel.org/lkml/20260318092214.130908-1-arighi@nvidia.com
> [3] https://lore.kernel.org/all/20260325181314.3875909-1-christian.loehle@arm.com/
>
> Andrea Righi (4):
> sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>
> kernel/sched/fair.c | 163 +++++++++++++++++++++++++++++++++++++++++++-----
> kernel/sched/topology.c | 9 ---
> 2 files changed, 147 insertions(+), 25 deletions(-)
Thanks,
Balbir
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-28 13:03 ` Balbir Singh
@ 2026-03-28 22:50 ` Andrea Righi
2026-03-29 21:36 ` Balbir Singh
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-28 22:50 UTC (permalink / raw)
To: Balbir Singh
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
linux-kernel
Hi Balbir,
On Sun, Mar 29, 2026 at 12:03:19AM +1100, Balbir Singh wrote:
> On 3/27/26 02:02, Andrea Righi wrote:
> > This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
> > introducing SMT awareness.
> >
> > = Problem =
> >
> > Nominal per-logical-CPU capacity can overstate usable compute when an SMT
> > sibling is busy, because the physical core doesn't deliver its full nominal
> > capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
> > that are not actually good destinations.
> >
> > = Proposed Solution =
> >
> > This patch set aligns those paths with a simple rule already used
> > elsewhere: when SMT is active, prefer fully idle cores and avoid treating
> > partially idle SMT siblings as full-capacity targets where that would
> > mislead load balance.
>
> In kernel/sched/topology.c
>
> /* Don't attempt to spread across CPUs of different capacities. */
> if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> sd->child->flags &= ~SD_PREFER_SIBLING;
>
> Should handle the selection, but I guess this does not work for SMT level sd's?
IIUC, SD_PREFER_SIBLING steers load balance toward sibling_imbalance()
(spreading runnables across child/sibling domains); it doesn't encode the
fully-idle-core-first logic. In practice it doesn't give us an SMT-aware
destination choice when a sibling is busy, and this series is trying to
cover that gap in the placement path.
BTW, on Vera the hierarchy is SMT -> MC -> NUMA:
root@localhost:~# grep . /sys/kernel/debug/sched/domains/cpu0/domain*/flags
/sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
/sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_SHARE_LLC
/sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SERIALIZE SD_NUMA
And domain1/groups_flags (child / SMT flags on the sched groups used at the
MC level) still has SD_PREFER_SIBLING together with SD_SHARE_CPUCAPACITY.
root@localhost:~# cat /sys/kernel/debug/sched/domains/cpu0/domain1/groups_flags
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
So, prefer-sibling is still in play for SMT (including via the MC
groups_flags). On machines where asymmetry attaches immediately above SMT,
topology may strip that flag and suppress this behavior, but explicit
SMT-aware placement still matters.
> >
> > Patch set summary:
> >
> > - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
> >
> > Prefer fully-idle SMT cores in asym-capacity idle selection. In the
> > wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
> > idle selection can prefer CPUs on fully idle cores, with a safe fallback.
> >
> > - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
> >
> > Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
> > Provided for consistency with PATCH 1/4.
> >
> > - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
> >
> > Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
> > consistency with PATCH 1/4. I've also tested with/without
> > /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
> > noticed any regression.
> >
> > - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
> >
> > When choosing the housekeeping CPU that runs the idle load balancer,
> > prefer an idle CPU on a fully idle core so migrated work lands where
> > effective capacity is available.
> >
> > The change is still consistent with the same "avoid CPUs with busy
> > sibling" logic and it shows some benefits on Vera, but could have
> > negative impact on other systems, I'm including it for completeness
> > (feedback is appreciated).
> >
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> >
>
> Are you referring to nominal_freq?
>
Correct.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-28 22:50 ` Andrea Righi
@ 2026-03-29 21:36 ` Balbir Singh
0 siblings, 0 replies; 42+ messages in thread
From: Balbir Singh @ 2026-03-29 21:36 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
linux-kernel
On 3/29/26 09:50, Andrea Righi wrote:
> Hi Balbir,
>
> On Sun, Mar 29, 2026 at 12:03:19AM +1100, Balbir Singh wrote:
>> On 3/27/26 02:02, Andrea Righi wrote:
>>> This series attempts to improve SD_ASYM_CPUCAPACITY scheduling by
>>> introducing SMT awareness.
>>>
>>> = Problem =
>>>
>>> Nominal per-logical-CPU capacity can overstate usable compute when an SMT
>>> sibling is busy, because the physical core doesn't deliver its full nominal
>>> capacity. So, several SD_ASYM_CPUCAPACITY paths may pick high capacity CPUs
>>> that are not actually good destinations.
>>>
>>> = Proposed Solution =
>>>
>>> This patch set aligns those paths with a simple rule already used
>>> elsewhere: when SMT is active, prefer fully idle cores and avoid treating
>>> partially idle SMT siblings as full-capacity targets where that would
>>> mislead load balance.
>>
>> In kernel/sched/topology.c
>>
>> /* Don't attempt to spread across CPUs of different capacities. */
>> if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
>> sd->child->flags &= ~SD_PREFER_SIBLING;
>>
>> Should handle the selection, but I guess this does not work for SMT level sd's?
>
> IIUC, SD_PREFER_SIBLING steers load balance toward sibling_imbalance()
> (spreading runnables across child/sibling domains); it doesn't encode the
> fully-idle-core-first logic. In practice it doesn't give us an SMT-aware
> destination choice when a sibling is busy, and this series is trying to
> cover that gap in the placement path.
>
Thanks, so we care about idle selection, not necessarily balancing, and yes, I did
see that sd->child needs to be set for SD_PREFER_SIBLING to be cleared.
> BTW, on Vera the hierarchy is SMT -> MC -> NUMA:
>
> root@localhost:~# grep . /sys/kernel/debug/sched/domains/cpu0/domain*/flags
> /sys/kernel/debug/sched/domains/cpu0/domain0/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
> /sys/kernel/debug/sched/domains/cpu0/domain1/flags:SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_SHARE_LLC
> /sys/kernel/debug/sched/domains/cpu0/domain2/flags:SD_BALANCE_NEWIDLE SD_ASYM_CPUCAPACITY SD_ASYM_CPUCAPACITY_FULL SD_SERIALIZE SD_NUMA
>
> And domain1/groups_flags (child / SMT flags on the sched groups used at the
> MC level) still has SD_PREFER_SIBLING together with SD_SHARE_CPUCAPACITY.
>
> root@localhost:~# cat /sys/kernel/debug/sched/domains/cpu0/domain1/groups_flags
> SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
>
> So, prefer-sibling is still in play for SMT (including via the MC
> groups_flags). On machines where asymmetry attaches immediately above SMT,
> topology may strip that flag and suppress this behavior, but explicit
> SMT-aware placement still matters.
>
>>>
>>> Patch set summary:
>>>
>>> - [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
>>>
>>> Prefer fully-idle SMT cores in asym-capacity idle selection. In the
>>> wakeup fast path, extend select_idle_capacity() / asym_fits_cpu() so
>>> idle selection can prefer CPUs on fully idle cores, with a safe fallback.
>>>
>>> - [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
>>>
>>> Reject misfit pulls onto busy SMT siblings on SD_ASYM_CPUCAPACITY.
>>> Provided for consistency with PATCH 1/4.
>>>
>>> - [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems
>>>
>>> Enable EAS with SD_ASYM_CPUCAPACITY and SMT. Also provided for
>>> consistency with PATCH 1/4. I've also tested with/without
>>> /proc/sys/kernel/sched_energy_aware enabled (same platform) and haven't
>>> noticed any regression.
>>>
>>> - [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
>>>
>>> When choosing the housekeeping CPU that runs the idle load balancer,
>>> prefer an idle CPU on a fully idle core so migrated work lands where
>>> effective capacity is available.
>>>
>>> The change is still consistent with the same "avoid CPUs with busy
>>> sibling" logic and it shows some benefits on Vera, but could have
>>> negative impact on other systems, I'm including it for completeness
>>> (feedback is appreciated).
>>>
>>> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
>>> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
>>> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>>>
>>
>> Are you referring to nominal_freq?
>>
>
> Correct.
>
Thanks,
Balbir
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-27 16:39 ` Andrea Righi
@ 2026-03-30 10:17 ` K Prateek Nayak
2026-03-30 13:07 ` Vincent Guittot
2026-03-30 13:22 ` Andrea Righi
0 siblings, 2 replies; 42+ messages in thread
From: K Prateek Nayak @ 2026-03-30 10:17 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hello Andrea,
On 3/27/2026 10:09 PM, Andrea Righi wrote:
>> My naive eyes say it should be equivalent of what you have but maybe
>> I'm wrong?
>
> It seems correct to my naive eyes as well. Will test this out to make sure.
So I found one small problem with fits > 0 && !preferred_core where even
though it is an ideal target, we don't end up preferring it because of
the larger "fits" value.
Here is an updated diff:
(Only build tested)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..580218656865 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7949,6 +7949,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
+ bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
unsigned long task_util, util_min, util_max, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
@@ -7962,6 +7963,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
util_max = uclamp_eff_value(p, UCLAMP_MAX);
for_each_cpu_wrap(cpu, cpus, target) {
+ bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
unsigned long cpu_cap = capacity_of(cpu);
if (!choose_idle_cpu(cpu, p))
@@ -7970,7 +7972,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
/* This CPU fits with all requirements */
- if (fits > 0)
+ if (fits > 0 && preferred_core)
return cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
@@ -7978,9 +7980,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
*/
else if (fits < 0)
cpu_cap = get_actual_cpu_capacity(cpu);
+ /*
+ * fits > 0 implies we are not on a preferred core
+ * but the util fits CPU capacity. Set fits to -2 so
+ * the effective range becomes [-2, 0] where:
+ * 0 - does not fit
+ * -1 - fits with the exception of UCLAMP_MIN
+ * -2 - fits with the exception of preferred_core
+ */
+ else if (fits > 0)
+ fits = -2;
+
+ /*
+ * If we are on an preferred core, translate the range of fits
+ * of [-1, 0] to [-4, -3]. This ensures that an idle core
+ * is always given priority over (partially) busy core.
+ *
+ * A fully fitting idle core would have returned early and hence
+ * fits > 0 for preferred_core need not be dealt with.
+ */
+ if (preferred_core)
+ fits -= 3;
/*
- * First, select CPU which fits better (-1 being better than 0).
+ * First, select CPU which fits better (lower is more preferred).
* Then, select the one with best capacity at same level.
*/
if ((fits < best_fits) ||
---
Sorry for the oversight but this should now be equivalent to your
Patch 1. I'll let Vincent comment if he prefers this to the original
or not :-)
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-30 10:17 ` K Prateek Nayak
@ 2026-03-30 13:07 ` Vincent Guittot
2026-03-30 13:22 ` Andrea Righi
1 sibling, 0 replies; 42+ messages in thread
From: Vincent Guittot @ 2026-03-30 13:07 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Mon, 30 Mar 2026 at 12:17, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Andrea,
>
> On 3/27/2026 10:09 PM, Andrea Righi wrote:
> >> My naive eyes say it should be equivalent of what you have but maybe
> >> I'm wrong?
> >
> > It seems correct to my naive eyes as well. Will test this out to make sure.
>
> So I found one small problem with fits > 0 && !preferred_core where even
> though it is an ideal target, we don't end up preferring it because of
> the larger "fits" value.
>
> Here is an updated diff:
>
> (Only build tested)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..580218656865 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7949,6 +7949,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> @@ -7962,6 +7963,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -7970,7 +7972,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -7978,9 +7980,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> */
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
> + /*
> + * fits > 0 implies we are not on a preferred core
> + * but the util fits CPU capacity. Set fits to -2 so
> + * the effective range becomes [-2, 0] where:
> + * 0 - does not fit
> + * -1 - fits with the exception of UCLAMP_MIN
> + * -2 - fits with the exception of preferred_core
> + */
> + else if (fits > 0)
> + fits = -2;
> +
> + /*
> + * If we are on a preferred core, translate the range of fits
> + * of [-1, 0] to [-4, -3]. This ensures that an idle core
> + * is always given priority over (partially) busy core.
> + *
> + * A fully fitting idle core would have returned early and hence
> + * fits > 0 for preferred_core need not be dealt with.
> + */
> + if (preferred_core)
> + fits -= 3;
>
> /*
> - * First, select CPU which fits better (-1 being better than 0).
> + * First, select CPU which fits better (lower is more preferred).
> * Then, select the one with best capacity at same level.
> */
> if ((fits < best_fits) ||
> ---
>
> Sorry for the oversight but this should now be equivalent to your
> Patch 1. I'll let Vincent comment if he prefers this to the original
> or not :-)
Yes, I prefer this version, which keeps the same logic for selecting the best CPU.
Thanks
Vincent
>
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-30 10:17 ` K Prateek Nayak
2026-03-30 13:07 ` Vincent Guittot
@ 2026-03-30 13:22 ` Andrea Righi
2026-03-30 13:46 ` Andrea Righi
1 sibling, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-30 13:22 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
Hi Prateek,
On Mon, Mar 30, 2026 at 03:47:07PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 10:09 PM, Andrea Righi wrote:
> >> My naive eyes say it should be equivalent of what you have but maybe
> >> I'm wrong?
> >
> > It seems correct to my naive eyes as well. Will test this out to make sure.
>
> So I found one small problem with fits > 0 && !preferred_core where even
> though it is an ideal target, we don't end up preferring it because of
> the larger "fits" value.
>
> Here is an updated diff:
>
> (Only build tested)
I'm getting worse performance with this one (but better than mainline).
I'm trying to understand why.
BTW, we also need to fix asym_fits_cpu() to do something like this:
	return (!sched_smt_active() || is_core_idle(cpu)) &&
	       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
...or we'd return early from select_idle_sibling() with busy SMT cores.
Thanks,
-Andrea
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..580218656865 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7949,6 +7949,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> int fits, best_fits = 0;
> int cpu, best_cpu = -1;
> @@ -7962,6 +7963,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -7970,7 +7972,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -7978,9 +7980,30 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> */
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
> + /*
> + * fits > 0 implies we are not on a preferred core
> + * but the util fits CPU capacity. Set fits to -2 so
> + * the effective range becomes [-2, 0] where:
> + * 0 - does not fit
> + * -1 - fits with the exception of UCLAMP_MIN
> + * -2 - fits with the exception of preferred_core
> + */
> + else if (fits > 0)
> + fits = -2;
> +
> + /*
> > + * If we are on a preferred core, translate the range of fits
> + * of [-1, 0] to [-4, -3]. This ensures that an idle core
> + * is always given priority over (partially) busy core.
> + *
> + * A fully fitting idle core would have returned early and hence
> + * fits > 0 for preferred_core need not be dealt with.
> + */
> + if (preferred_core)
> + fits -= 3;
>
> /*
> - * First, select CPU which fits better (-1 being better than 0).
> + * First, select CPU which fits better (lower is more preferred).
> * Then, select the one with best capacity at same level.
> */
> if ((fits < best_fits) ||
> ---
>
> Sorry for the oversight but this should now be equivalent to your
> Patch 1. I'll let Vincent comment if he prefers this to the original
> or not :-)
>
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
2026-03-30 13:22 ` Andrea Righi
@ 2026-03-30 13:46 ` Andrea Righi
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-30 13:46 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Mon, Mar 30, 2026 at 03:22:27PM +0200, Andrea Righi wrote:
> Hi Prateek,
>
> On Mon, Mar 30, 2026 at 03:47:07PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 3/27/2026 10:09 PM, Andrea Righi wrote:
> > >> My naive eyes say it should be equivalent of what you have but maybe
> > >> I'm wrong?
> > >
> > > It seems correct to my naive eyes as well. Will test this out to make sure.
> >
> > So I found one small problem with fits > 0 && !preferred_core where even
> > though it is an ideal target, we don't end up preferring it because of
> > the larger "fits" value.
> >
> > Here is an updated diff:
> >
> > (Only build tested)
>
> I'm getting worse performance with this one (but better than mainline).
> I'm trying to understand why.
Nevermind...
>
> BTW, we also need to fix asym_fits_cpu() to do something like this:
>
> return (!sched_smt_active() || is_core_idle(cpu)) &&
> (util_fits_cpu(util, util_min, util_max, cpu) > 0);
>
> ...or we'd return early from select_idle_sibling() with busy SMT cores.
...I was actually missing this piece right here. So, everything looks good
with this extra change applied.
I'll repeat all my tests just in case and will send a new version with your
changes.
Thanks!
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
2026-03-27 11:34 ` K Prateek Nayak
2026-03-27 20:36 ` Andrea Righi
@ 2026-03-30 17:29 ` Andrea Righi
1 sibling, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-03-30 17:29 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Christian Loehle, Koba Ko, Felix Abecassis,
Balbir Singh, linux-kernel
On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> >>>
> >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> >>> work onto a CPU that still offers full effective capacity. Fall back to
> >>> any idle candidate if none qualify.
> >>
> >> This one isn't straightforward for me. The ilb cpu will check all
> >> other idle CPUs 1st and finish with itself so unless the next CPU in
> >> the idle_cpus_mask is a sibling, this should not make a difference
> >>
> >> Did you see any perf diff ?
> >
> > I actually see a benefit, in particular, with the first patch applied I see
> > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > which seems pretty consistent across runs (definitely not in error range).
> >
> > The intention with this change was to minimize SMT noise running the ILB
> > code on a fully-idle core when possible, but I also didn't expect to see
> > such big difference.
> >
> > I'll investigate more to better understand what's happening.
>
> Interesting! Either this "CPU-intensive workload" hates SMT turning
> busy (but to an extent where performance drops visibly?) or ILB
> keeps getting interrupted on an SMT sibling that is burdened by
> interrupts leading to slower balance (or IRQs driving the workload
> being delayed by rq_lock disabling them)
Alright, I dug a bit deeper into what's going on.
In this case, the workload showing the large benefit (the NVBLAS benchmark)
is running exactly one task per SMT core, all pinned to NUMA node 0. The
system has two nodes, so node 1 remains mostly idle.
With the SMT-aware select_idle_capacity(), tasks get distributed across SMT
cores in a way that avoids placing them on busy siblings, which is nice, and
it's the part that gives most of the speedup.
However, without this ILB patch, find_new_ilb() always picks a CPU with a
busy sibling on node 0, because for_each_cpu_and() always starts from the
lower CPU IDs. As a result, the ILB always ends up running on CPUs with a
CPU-intensive worker running on its sibling, disrupting each other's
performance.
As an experiment, I tried something silly like the following, biasing the
ILB selection toward node 1 (node0 = 0-87,176-263, node1 = 88-177,264-351):
	struct cpumask tmp;

	cpumask_and(&tmp, nohz.idle_cpus_mask, hk_mask);
	for_each_cpu_wrap(ilb_cpu, &tmp, nr_cpu_ids / 4) {
		if (ilb_cpu == smp_processor_id())
			continue;

		if (idle_cpu(ilb_cpu))
			return ilb_cpu;
	}
And I get pretty much the same speedup (slightly better actually, because I
always get an idle CPU in one step, since node 1 is always idle with this
particular benchmark).
So, in this particular scenario this patch makes sense, because we
avoid the "SMT contention" at very low cost. In general, I think the
benefit can be quite situational. It could still make sense to have it,
though: the extra overhead is limited to an additional is_core_idle() check
over idle & HK candidates (worst case), which could be worthwhile if it
reduces interference from busy SMT siblings.
What do you think?
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
` (6 preceding siblings ...)
2026-03-28 13:03 ` Balbir Singh
@ 2026-03-30 22:30 ` Dietmar Eggemann
2026-03-31 9:04 ` Andrea Righi
7 siblings, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2026-03-30 22:30 UTC (permalink / raw)
To: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot
Cc: Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Andrea,
On 26.03.26 16:02, Andrea Righi wrote:
[...]
> This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
>
> Without these patches, performance can drop up to ~2x with CPU-intensive
> workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> account for busy SMT siblings.
>
> Alternative approaches have been evaluated, such as equalizing CPU
> capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> normalizing them in the kernel by grouping CPUs within a small capacity
> window (+-5%) [1][2], or enabling asympacking [3].
>
> However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> results so far. Improving this policy also seems worthwhile in general, as
> other platforms in the future may enable SMT with asymmetric CPU
> topologies.
I still wonder whether we really need select_idle_capacity() (plus the
smt part) for asymmetric CPU capacity systems where the CPU capacity
differences are < 5% of SCHED_CAPACITY_SCALE.
The known example would be the NVIDIA Grace (!smt) server with its
slightly different perf_caps.highest_perf values.
We did run DCPerf Mediawiki on this thing with:
(1) ASYM_CPUCAPACITY (default)
(2) NO ASYM_CPUCAPACITY
We also ran on a comparable ARM64 server (!smt) for comparison:
(1) ASYM_CPUCAPACITY
(2) NO ASYM_CPUCAPACITY (default)
Both systems have 72 CPUs, run v6.8 and have a single MC sched domain
with the LLC spanning all 72 CPUs. During the tests there were ~750
tasks, among them the workload-related ones:
#hhvmworker 147
#mariadbd 204
#memcached 11
#nginx 8
#wrk 144
#ProxygenWorker 1
load_balance:
not_idle 3x more on (2)
idle 2x more on (2)
newly_idle 2-10x more on (2)
wakeup:
move_affine 2-3x more on (1)
ttwu_local 1.5-2x more on (2)
We also instrumented all the bailout conditions in select_idle_sibling()
(sis()) -> select_idle_cpu() and select_idle_capacity() (sic()).
In (1) almost all wakeups end up in select_idle_cpu() returning -1 due
to the fact that 'sd->shared->nr_idle_scan' under SIS_UTIL is 0. So
sis() in (1) almost always returns target (this_cpu or prev_cpu). sic()
doesn't do this.
What I haven't done is to try (1) with SIS_UTIL or (2) with NO_SIS_UTIL.
I wonder whether this is the underlying reason for the benefit of (1)
over (2) we see here with smt now?
So IMHO, before adding smt support to (1) for these small CPPC-based CPU
capacity differences, we should make sure that the same can't be achieved
by disabling SIS_UTIL or by softening it a bit.
So, does (2) with NO_SIS_UTIL perform worse than (1) with your smt
related add-ons in sic()?
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-30 22:30 ` Dietmar Eggemann
@ 2026-03-31 9:04 ` Andrea Righi
2026-04-01 11:57 ` Dietmar Eggemann
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-03-31 9:04 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Dietmar,
On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> Hi Andrea,
>
> On 26.03.26 16:02, Andrea Righi wrote:
>
> [...]
>
> > This patch set has been tested on the new NVIDIA Vera Rubin platform, where
> > SMT is enabled and the firmware exposes small frequency variations (+/-~5%)
> > as differences in CPU capacity, resulting in SD_ASYM_CPUCAPACITY being set.
> >
> > Without these patches, performance can drop up to ~2x with CPU-intensive
> > workloads, because the SD_ASYM_CPUCAPACITY idle selection policy does not
> > account for busy SMT siblings.
> >
> > Alternative approaches have been evaluated, such as equalizing CPU
> > capacities, either by exposing uniform values via firmware (ACPI/CPPC) or
> > normalizing them in the kernel by grouping CPUs within a small capacity
> > window (+-5%) [1][2], or enabling asympacking [3].
> >
> > However, adding SMT awareness to SD_ASYM_CPUCAPACITY has shown better
> > results so far. Improving this policy also seems worthwhile in general, as
> > other platforms in the future may enable SMT with asymmetric CPU
> > topologies.
> I still wonder whether we really need select_idle_capacity() (plus the
> smt part) for asymmetric CPU capacity systems where the CPU capacity
> differences are < 5% of SCHED_CAPACITY_SCALE.
>
> The known example would be the NVIDIA Grace (!smt) server with its
> slightly different perf_caps.highest_perf values.
>
> We did run DCPerf Mediawiki on this thing with:
>
> (1) ASYM_CPUCAPACITY (default)
>
> (2) NO ASYM_CPUCAPACITY
>
> We also ran on a comparable ARM64 server (!smt) for comparison:
>
> (1) ASYM_CPUCAPACITY
>
> (2) NO ASYM_CPUCAPACITY (default)
>
> Both systems have 72 CPUs, run v6.8 and have a single MC sched domain
> with LLC spanning over all 72 CPUs. During the tests there were ~750
> tasks among them the workload related:
>
> #hhvmworker 147
> #mariadbd 204
> #memcached 11
> #nginx 8
> #wrk 144
> #ProxygenWorker 1
>
> load_balance:
>
> not_idle 3x more on (2)
>
> idle 2x more on (2)
>
> newly_idle 2-10x more on (2)
>
> wakeup:
>
> move_affine 2-3x more on (1)
>
> ttwu_local 1.5-2 more on (2)
>
> We also instrumented all the bailout conditions in select_task_sibling()
> (sis())->select_idle_cpu() and select_idle_capacity() (sic()).
>
> In (1) almost all wakeups end up in select_idle_cpu() returning -1 due
> to the fact that 'sd->shared->nr_idle_scan' under SIS_UTIL is 0. So
> sis() in (1) almost always returns target (this_cpu or prev_cpu). sic()
> doesn't do this.
>
> What I haven't done is to try (1) with SIS_UTIL or (2) with NO_SIS_UTIL.
>
> I wonder whether this is the underlying reason for the benefit of (1)
> over (2) we see here with smt now?
>
> So IMHO before adding smt support to (1) for these small CPPC based CPU
> capacity differences we should make sure that the same can't be achieved
> by disabling SIS_UTIL or to soften it a bit.
>
> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> related add-ons in sic()?
Thanks for running these experiments and sharing the data, this is very
useful!
I did a quick test on Vera using the NVBLAS benchmark, comparing NO
ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
within error range. I'll also run DCPerf MediaWiki with all the different
configurations to see if I get similar results.
More in general, I agree that for small capacity differences (e.g., within
~5%) the benefits of using ASYM_CPUCAPACITY are questionable. And I'm also
fine to go back to the idea of grouping together CPUs within the 5%
capacity window, if we think it's a safer approach (results in your case
are quite evident; BTW, that also means we shouldn't have
ASYM_CPUCAPACITY on Grace, so in theory the 5% threshold should also
improve performance on Grace, which doesn't have SMT).
That said, I still think there's value in adding SMT awareness to
select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
small capacity deltas, we should ensure that the behavior remains
reasonable if both features are enabled, for any reason. Right now, there
are cases where the current behavior leads to significant performance
degradation (~2x), so having a mechanism to prevent clearly suboptimal task
placement still seems worthwhile. Essentially, what I'm saying is that one
thing doesn't exclude the other.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-03-31 9:04 ` Andrea Righi
@ 2026-04-01 11:57 ` Dietmar Eggemann
2026-04-01 12:08 ` Vincent Guittot
0 siblings, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2026-04-01 11:57 UTC (permalink / raw)
To: Andrea Righi
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On 31.03.26 11:04, Andrea Righi wrote:
> Hi Dietmar,
>
> On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
>> Hi Andrea,
>>
>> On 26.03.26 16:02, Andrea Righi wrote:
[...]
>> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
>> related add-ons in sic()?
>
> Thanks for running these experiments and sharing the data, this is very
> useful!
>
> I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> within error range. I'll also run DCPerf MediaWiki with all the different
I'm not familiar with the NVBLAS benchmark. Does it drive your system
into 'sd->shared->nr_idle_scan = 0' state?
We just have to understand where this benefit of using sic() instead of
sis() is coming from. I'm doubtful that this is the best_cpu thing after
if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
target) loop given that the CPU capacity diffs are so small.
> configurations to see if I get similar results.
>
> More in general, I agree that for small capacity differences (e.g., within
> ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
> fine to go back to the idea of grouping together CPUS within the 5%
> capacity window, if we think it's a safer approach (results in your case
> are quite evident, and BTW, that means we also shouldn't have
> ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
> improve performance on Grace, that doesn't have SMT).
There shouldn't be so many machines with these binning-introduced small
CPU capacity diffs out there? In fact, I only know about your Grace
(!smt) and Vera (smt) machines.
> That said, I still think there's value in adding SMT awareness to
> select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
> small capacity deltas, we should ensure that the behavior remains
> reasonable if both features are enabled, for any reason. Right now, there
> are cases where the current behavior leads to significant performance
> degradation (~2x), so having a mechanism to prevent clearly suboptimal task
> placement still seems worthwhile. Essentially, what I'm saying is that one
> thing doesn't exclude the other.
IMHO, if we knew where this improvement is coming from using
sic() instead of the default sis() (which already has smt support) then
maybe, but it's a lot of extra code in the end ... And mobile big.LITTLE
(with larger CPU capacity diffs) doesn't have smt.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-01 11:57 ` Dietmar Eggemann
@ 2026-04-01 12:08 ` Vincent Guittot
2026-04-01 12:42 ` Andrea Righi
0 siblings, 1 reply; 42+ messages in thread
From: Vincent Guittot @ 2026-04-01 12:08 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Andrea Righi, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 31.03.26 11:04, Andrea Righi wrote:
> > Hi Dietmar,
> >
> > On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> >> Hi Andrea,
> >>
> >> On 26.03.26 16:02, Andrea Righi wrote:
>
> [...]
>
> >> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> >> related add-ons in sic()?
> >
> > Thanks for running these experiments and sharing the data, this is very
> > useful!
> >
> > I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> > ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> > within error range. I'll also run DCPerf MediaWiki with all the different
>
> I'm not familiar with the NVBLAS benchmark. Does it drive your system
> into 'sd->shared->nr_idle_scan = 0' state?
>
> We just have to understand where this benefit of using sic() instead of
> sis() is coming from. I'm doubtful that this is the best_cpu thing after
> if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
> target) loop given that the CPU capacity diffs are so small.
>
> > configurations to see if I get similar results.
> >
> > More in general, I agree that for small capacity differences (e.g., within
> > ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
> > fine to go back to the idea of grouping together CPUS within the 5%
> > capacity window, if we think it's a safer approach (results in your case
> > are quite evident, and BTW, that means we also shouldn't have
> > ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
> > improve performance on Grace, that doesn't have SMT).
>
> There shouldn't be so many machines with these binning-introduced small
> CPU capacity diffs out there? In fact, I only know about your Grace
> (!smt) and Vera (smt) machines.
In any case it's always better to add the support than enabling asym_packing
>
> > That said, I still think there's value in adding SMT awareness to
> > select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
> > small capacity deltas, we should ensure that the behavior remains
> > reasonable if both features are enabled, for any reason. Right now, there
> > are cases where the current behavior leads to significant performance
> > degradation (~2x), so having a mechanism to prevent clearly suboptimal task
> > placement still seems worthwhile. Essentially, what I'm saying is that one
> > thing doesn't exclude the other.
>
> IMHO, in case we would know where this improvement is coming from using
> sic() instead of default sis() (which already as smt support) then
> maybe, it's a lot of extra code at the end ... And mobile big.LITTLE
> (with larger CPU capacity diffs) doesn't have smt.
The last proposal, based on Prateek's proposal in sic(), doesn't seem that large
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-01 12:08 ` Vincent Guittot
@ 2026-04-01 12:42 ` Andrea Righi
2026-04-01 13:12 ` Andrea Righi
2026-04-03 11:47 ` Dietmar Eggemann
0 siblings, 2 replies; 42+ messages in thread
From: Andrea Righi @ 2026-04-01 12:42 UTC (permalink / raw)
To: Vincent Guittot
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
> On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 31.03.26 11:04, Andrea Righi wrote:
> > > Hi Dietmar,
> > >
> > > On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> > >> Hi Andrea,
> > >>
> > >> On 26.03.26 16:02, Andrea Righi wrote:
> >
> > [...]
> >
> > >> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> > >> related add-ons in sic()?
> > >
> > > Thanks for running these experiments and sharing the data, this is very
> > > useful!
> > >
> > > I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> > > ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> > > within error range. I'll also run DCPerf MediaWiki with all the different
> >
> > I'm not familiar with the NVBLAS benchmark. Does it drive your system
> > into 'sd->shared->nr_idle_scan = 0' state?
It's something internal, unfortunately... it's just running a single
CPU-intensive task per SMT core (in practice, tasks on half of the CPUs).
I don't think we're hitting sd->shared->nr_idle_scan == 0 in this case.
> >
> > We just have to understand where this benefit of using sic() instead of
> > sis() is coming from. I'm doubtful that this is the best_cpu thing after
> > if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
> > target) loop given that the CPU capacity diffs are so small.
> >
> > > configurations to see if I get similar results.
> > >
> > > More in general, I agree that for small capacity differences (e.g., within
> > > ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
> > > fine to go back to the idea of grouping together CPUS within the 5%
> > > capacity window, if we think it's a safer approach (results in your case
> > > are quite evident, and BTW, that means we also shouldn't have
> > > ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
> > > improve performance on Grace, that doesn't have SMT).
> >
> > There shouldn't be so many machines with these binning-introduced small
> > CPU capacity diffs out there? In fact, I only know about your Grace
> > (!smt) and Vera (smt) machines.
>
> In any case it's always better to add the support than enabling asym_packing
>
> >
> > > That said, I still think there's value in adding SMT awareness to
> > > select_idle_capacity(). Even if we decide to avoid ASYM_CPUCAPACITY for
> > > small capacity deltas, we should ensure that the behavior remains
> > > reasonable if both features are enabled, for any reason. Right now, there
> > > are cases where the current behavior leads to significant performance
> > > degradation (~2x), so having a mechanism to prevent clearly suboptimal task
> > > placement still seems worthwhile. Essentially, what I'm saying is that one
> > > thing doesn't exclude the other.
> >
> > IMHO, in case we would know where this improvement is coming from using
> > sic() instead of default sis() (which already as smt support) then
> > maybe, it's a lot of extra code at the end ... And mobile big.LITTLE
> > (with larger CPU capacity diffs) doesn't have smt.
>
> The last proposal based on prateek proposal in sic() doesn't seems that large
Exactly, I was referring just to that patch, which would solve the big part
of the performance issue. We can ignore the ILB part for now.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-01 12:42 ` Andrea Righi
@ 2026-04-01 13:12 ` Andrea Righi
2026-04-03 11:47 ` Dietmar Eggemann
2026-04-03 11:47 ` Dietmar Eggemann
1 sibling, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-04-01 13:12 UTC (permalink / raw)
To: Vincent Guittot
Cc: Dietmar Eggemann, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Wed, Apr 01, 2026 at 02:42:34PM +0200, Andrea Righi wrote:
> On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
> > On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >
> > > On 31.03.26 11:04, Andrea Righi wrote:
> > > > Hi Dietmar,
> > > >
> > > > On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> > > >> Hi Andrea,
> > > >>
> > > >> On 26.03.26 16:02, Andrea Righi wrote:
> > >
> > > [...]
> > >
> > > >> So does (2) with NO_SIS_UTIL performs worse than (1) with your smt
> > > >> related add-ons in sic()?
> > > >
> > > > Thanks for running these experiments and sharing the data, this is very
> > > > useful!
> > > >
> > > > I did a quick test on Vera using the NVBLAS benchmark, comparing NO
> > > > ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
> > > > within error range. I'll also run DCPerf MediaWiki with all the different
> > >
> > > I'm not familiar with the NVBLAS benchmark. Does it drive your system
> > > into 'sd->shared->nr_idle_scan = 0' state?
>
> It's something internally unfortunately... it's just running a single
> CPU-intensive task for each SMT core (in practice half of the CPUs tasks).
> I don't think we're hitting sd->shared->nr_idle_scan == 0 in this case.
Just finished running some tests with DCPerf MediaWiki on Vera as well
(sorry, it took a while, I did multiple runs to rule out potential flukes):
+---------------------------------+--------+--------+--------+--------+
| Configuration | rps | p50 | p95 | p99 |
+---------------------------------+--------+--------+--------+--------+
| NO ASYM + SIS_UTIL | 8113 | 0.067 | 0.184 | 0.225 |
| NO ASYM + NO_SIS_UTIL | 8093 | 0.068 | 0.184 | 0.223 |
| | | | | |
| ASYM + SMT + SIS_UTIL | 8129 | 0.076 | 0.149 | 0.188 |
| ASYM + SMT + NO_SIS_UTIL | 8138 | 0.076 | 0.148 | 0.186 |
| | | | | |
| ASYM + ILB SMT + SIS_UTIL | 8189 | 0.075 | 0.150 | 0.189 |
| ASYM + SMT + ILB SMT + SIS_UTIL | 8185 | 0.076 | 0.151 | 0.190 |
+---------------------------------+--------+--------+--------+--------+
Looking at the data:
- SIS_UTIL doesn't seem relevant in this case (differences are within
error range),
- ASYM_CPUCAPACITY seems to provide a small throughput gain, but it seems
more beneficial for tail latency reduction,
- the ILB SMT patch seems to slightly improve throughput, but the biggest
benefit is still coming from ASYM_CPUCAPACITY.
Overall, also in this case it seems beneficial to use ASYM_CPUCAPACITY
rather than equalizing the capacities.
That said, I'm still not sure why ASYM is helping. The frequency asymmetry
is really small (~2%), so the latency improvements are unlikely to come
from prioritizing the faster cores, as that should mainly affect throughput
rather than tail latency and likely to a smaller extent.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-01 12:42 ` Andrea Righi
2026-04-01 13:12 ` Andrea Righi
@ 2026-04-03 11:47 ` Dietmar Eggemann
1 sibling, 0 replies; 42+ messages in thread
From: Dietmar Eggemann @ 2026-04-03 11:47 UTC (permalink / raw)
To: Andrea Righi, Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Christian Loehle,
Koba Ko, Felix Abecassis, Balbir Singh, linux-kernel
On 01.04.26 14:42, Andrea Righi wrote:
> On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
>> On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 31.03.26 11:04, Andrea Righi wrote:
>>>> Hi Dietmar,
>>>>
>>>> On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
>>>>> Hi Andrea,
>>>>>
>>>>> On 26.03.26 16:02, Andrea Righi wrote:
[...]
>>>> I did a quick test on Vera using the NVBLAS benchmark, comparing NO
>>>> ASYM_CPUCAPACITY with and without SIS_UTIL, but the difference seems to be
>>>> within error range. I'll also run DCPerf MediaWiki with all the different
Ah, but this benchmark with '#tasks == #cores' is tailored for this
prefer_core thing. And SIS_UTIL shouldn't close the idle CPU search.
>>> I'm not familiar with the NVBLAS benchmark. Does it drive your system
>>> into 'sd->shared->nr_idle_scan = 0' state?
>
> It's something internally unfortunately... it's just running a single
> CPU-intensive task for each SMT core (in practice half of the CPUs tasks).
> I don't think we're hitting sd->shared->nr_idle_scan == 0 in this case.
OK.
>>> We just have to understand where this benefit of using sic() instead of
>>> sis() is coming from. I'm doubtful that this is the best_cpu thing after
>>> if (!choose_idle_cpu(cpu, p)) in sic()'s for_each_cpu_wrap(cpu, cpus,
>>> target) loop given that the CPU capacity diffs are so small.
>>>
>>>> configurations to see if I get similar results.
>>>>
>>>> More in general, I agree that for small capacity differences (e.g., within
>>>> ~5%) the benefits of using ASYM_CPUCAPACITY is questionable. And I'm also
>>>> fine to go back to the idea of grouping together CPUS within the 5%
>>>> capacity window, if we think it's a safer approach (results in your case
>>>> are quite evident, and BTW, that means we also shouldn't have
>>>> ASYM_CPU_CAPACITY on Grace, so in theory the 5% threshold should also
>>>> improve performance on Grace, that doesn't have SMT).
>>>
>>> There shouldn't be so many machines with these binning-introduced small
>>> CPU capacity diffs out there? In fact, I only know about your Grace
>>> (!smt) and Vera (smt) machines.
>>
>> In any case it's always better to add the support than enabling asym_packing
Yeah, the question for me is more between existing 'sis() + smt' or this
new 'sic() + smt' with those minor CPU capacity differences.
[...]
>>> IMHO, in case we would know where this improvement is coming from using
>>> sic() instead of default sis() (which already as smt support) then
>>> maybe, it's a lot of extra code at the end ... And mobile big.LITTLE
>>> (with larger CPU capacity diffs) doesn't have smt.
>>
>> The last proposal based on prateek proposal in sic() doesn't seems that large
>
> Exactly, I was referring just to that patch, which would solve the big part
> of the performance issue. We can ignore the ILB part for now.
OK, I see. It's in your v2 you sent out earlier today so I will comment
there.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-01 13:12 ` Andrea Righi
@ 2026-04-03 11:47 ` Dietmar Eggemann
2026-04-03 14:45 ` Andrea Righi
0 siblings, 1 reply; 42+ messages in thread
From: Dietmar Eggemann @ 2026-04-03 11:47 UTC (permalink / raw)
To: Andrea Righi, Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Christian Loehle,
Koba Ko, Felix Abecassis, Balbir Singh, linux-kernel
On 01.04.26 15:12, Andrea Righi wrote:
> On Wed, Apr 01, 2026 at 02:42:34PM +0200, Andrea Righi wrote:
>> On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
>>> On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 31.03.26 11:04, Andrea Righi wrote:
>>>>> Hi Dietmar,
>>>>>
>>>>> On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
>>>>>> Hi Andrea,
>>>>>>
>>>>>> On 26.03.26 16:02, Andrea Righi wrote:
[...]
> Just finished running some tests with DCPerf MediaWiki on Vera as well
> (sorry, it took a while, I did mutliple runs to rule out potential flukes):
>
> +---------------------------------+--------+--------+--------+--------+
> | Configuration | rps | p50 | p95 | p99 |
Just to make sure: rps -> "Wrk RPS" and pXX -> "Nginx PXX time" in
run_details.json ?
> +---------------------------------+--------+--------+--------+--------+
> | NO ASYM + SIS_UTIL | 8113 | 0.067 | 0.184 | 0.225 |
> | NO ASYM + NO_SIS_UTIL | 8093 | 0.068 | 0.184 | 0.223 |
Thanks for the test results! Ok, so SIS_UTIL doesn't seem to play a role
here. This workload should have #runnable tasks > #CPUs.
Still trying to grasp why 'sic() + smt' is better than 'sis() + smt' for
NVBLAS?
There is a subtle difference in the start cpu for iterating:
sis(): for_each_cpu_wrap(cpu, cpus, target + 1)
                                           ^^^
sic(): for_each_cpu_wrap(cpu, cpus, target)
Not sure if this makes all the difference?
> | | | | | |
> | ASYM + SMT + SIS_UTIL | 8129 | 0.076 | 0.149 | 0.188 |
> | ASYM + SMT + NO_SIS_UTIL | 8138 | 0.076 | 0.148 | 0.186 |
This should be the same, right? SIS_UTIL is only for sis() so when using
sic() this shouldn't differ. Or did you code SIS_UTIL into sic()?
> | | | | | |
> | ASYM + ILB SMT + SIS_UTIL | 8189 | 0.075 | 0.150 | 0.189 |
> | ASYM + SMT + ILB SMT + SIS_UTIL | 8185 | 0.076 | 0.151 | 0.190 |
> +---------------------------------+--------+--------+--------+--------+
So with '#tasks > #CPUs' smt doesn't make a difference.
> Looking at the data:
> - SIS_UTIL doesn't seem relevant in this case (differences are within
> error range),
> - ASYM_CPU_CAPACITY seems to provide a small throughput gain, but it seems
> more beneficial for tail latency reduction,
> - the ILB SMT patch seems to slightly improve throughput, but the biggest
> benefit is still coming from ASYM_CPU_CAPACITY.
> Overall, also in this case it seems beneficial to use ASYM_CPU_CAPACITY
> rather than equalizing the capacities.
>
> That said, I'm still not sure why ASYM is helping. The frequency asymmetry
OK, I still would be more comfortable with this if I knew why
this is :-)
> is really small (~2%), so the latency improvements are unlikely to come
> from prioritizing the faster cores, as that should mainly affect throughput
> rather than tail latency and likely to a smaller extent.
[...]
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-03 11:47 ` Dietmar Eggemann
@ 2026-04-03 14:45 ` Andrea Righi
2026-04-03 20:44 ` Andrea Righi
0 siblings, 1 reply; 42+ messages in thread
From: Andrea Righi @ 2026-04-03 14:45 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
Hi Dietmar,
On Fri, Apr 03, 2026 at 01:47:17PM +0200, Dietmar Eggemann wrote:
> On 01.04.26 15:12, Andrea Righi wrote:
> > On Wed, Apr 01, 2026 at 02:42:34PM +0200, Andrea Righi wrote:
> >> On Wed, Apr 01, 2026 at 02:08:27PM +0200, Vincent Guittot wrote:
> >>> On Wed, 1 Apr 2026 at 13:57, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> On 31.03.26 11:04, Andrea Righi wrote:
> >>>>> Hi Dietmar,
> >>>>>
> >>>>> On Tue, Mar 31, 2026 at 12:30:55AM +0200, Dietmar Eggemann wrote:
> >>>>>> Hi Andrea,
> >>>>>>
> >>>>>> On 26.03.26 16:02, Andrea Righi wrote:
>
> [...]
>
> > Just finished running some tests with DCPerf MediaWiki on Vera as well
> > (sorry, it took a while, I did mutliple runs to rule out potential flukes):
> >
> > +---------------------------------+--------+--------+--------+--------+
> > | Configuration | rps | p50 | p95 | p99 |
>
> Just to make sure: rps -> "Wrk RPS" and pXX -> "Nginx PXX time" in
> run_details.json ?
Correct, rps == "Wrk RPS", p50 == "Nginx P50 time", etc.
>
> > +---------------------------------+--------+--------+--------+--------+
> > | NO ASYM + SIS_UTIL | 8113 | 0.067 | 0.184 | 0.225 |
> > | NO ASYM + NO_SIS_UTIL | 8093 | 0.068 | 0.184 | 0.223 |
>
> Thanks for the test results! Ok, so SIS_UTIL doesn't seem to play a role
> here. This workload should have #runnable tasks > #CPUs.
>
> Still trying to grasp why 'sic() + smt' is better than 'sis() + smt' for
> NVBLAS?
Same...
>
> There is a subtle difference in the start cpu for iterating:
>
> sis(): for_each_cpu_wrap(cpu, cpus, target + 1)
> ^^^
> sic(): for_each_cpu_wrap(cpu, cpus, target)
>
> Not sure if this makes all the difference?
I quickly tried matching the wrap start (both ways), but it still doesn't make
any difference: sic() is still slightly better than sis(). So the performance
gap doesn't seem to come from the wrap origin.
>
> > | | | | | |
> > | ASYM + SMT + SIS_UTIL | 8129 | 0.076 | 0.149 | 0.188 |
> > | ASYM + SMT + NO_SIS_UTIL | 8138 | 0.076 | 0.148 | 0.186 |
>
> This should be the same, right? SIS_UTIL is only for sis() so when using
> sic() this shouldn't differ. Or did you code SIS_UTIL into sic()?
No, you're right, it should be the same, SIS_UTIL is irrelevant here.
>
> > | | | | | |
> > | ASYM + ILB SMT + SIS_UTIL | 8189 | 0.075 | 0.150 | 0.189 |
> > | ASYM + SMT + ILB SMT + SIS_UTIL | 8185 | 0.076 | 0.151 | 0.190 |
> > +---------------------------------+--------+--------+--------+--------+
>
> So with '#tasks > #CPUs' smt doesn't make a difference.
Correct. At saturation there's no benefit from the SMT awareness, which makes
sense: all CPUs/siblings are busy, so there's no fully-idle SMT core left to
prioritize.
>
> > Looking at the data:
> > - SIS_UTIL doesn't seem relevant in this case (differences are within
> > error range),
> > - ASYM_CPU_CAPACITY seems to provide a small throughput gain, but it seems
> > more beneficial for tail latency reduction,
> > - the ILB SMT patch seems to slightly improve throughput, but the biggest
> > benefit is still coming from ASYM_CPU_CAPACITY.
>
> > Overall, also in this case it seems beneficial to use ASYM_CPU_CAPACITY
> > rather than equalizing the capacities.
> >
> > That said, I'm still not sure why ASYM is helping. The frequency asymmetry
>
> OK, I still would be more comfortable with this when I know why
> this is :-)
Working on this. :)
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity
2026-04-03 14:45 ` Andrea Righi
@ 2026-04-03 20:44 ` Andrea Righi
0 siblings, 0 replies; 42+ messages in thread
From: Andrea Righi @ 2026-04-03 20:44 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Christian Loehle, Koba Ko, Felix Abecassis, Balbir Singh,
linux-kernel
On Fri, Apr 03, 2026 at 04:46:03PM +0200, Andrea Righi wrote:
> On Fri, Apr 03, 2026 at 01:47:17PM +0200, Dietmar Eggemann wrote:
...
> > > Looking at the data:
> > > - SIS_UTIL doesn't seem relevant in this case (differences are within
> > > error range),
> > > - ASYM_CPU_CAPACITY seems to provide a small throughput gain, but it seems
> > > more beneficial for tail latency reduction,
> > > - the ILB SMT patch seems to slightly improve throughput, but the biggest
> > > benefit is still coming from ASYM_CPU_CAPACITY.
> >
> > > Overall, also in this case it seems beneficial to use ASYM_CPU_CAPACITY
> > > rather than equalizing the capacities.
> > >
> > > That said, I'm still not sure why ASYM is helping. The frequency asymmetry
> >
> > OK, I still would be more comfortable with this when I know why
> > this is :-)
>
> Working on this. :)
Alright, I think I found something. I tried to make sis() behave more like sic()
by adding the same SMT "full idle core" check in the fast path and removing the
extra select_idle_smt(prev) hop from the LLC idle path.
Essentially this:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7bebceb5ed9df..19fffa2df2d36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7651,29 +7651,6 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	return -1;
 }
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	int cpu;
-
-	for_each_cpu_and(cpu, cpu_smt_mask(target), p->cpus_ptr) {
-		if (cpu == target)
-			continue;
-		/*
-		 * Check if the CPU is in the LLC scheduling domain of @target.
-		 * Due to isolcpus, there is no guarantee that all the siblings are in the domain.
-		 */
-		if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
-			continue;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			return cpu;
-	}
-
-	return -1;
-}
-
 #else /* !CONFIG_SCHED_SMT: */
 
 static inline void set_idle_cores(int cpu, int val)
@@ -7690,11 +7667,6 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
 	return __select_idle_cpu(core, p);
 }
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
-
 #endif /* !CONFIG_SCHED_SMT */
 
 /*
@@ -7859,7 +7831,7 @@ static inline bool asym_fits_cpu(unsigned long util,
 		       (util_fits_cpu(util, util_min, util_max, cpu) > 0);
 	}
 
-	return true;
+	return !sched_smt_active() || is_core_idle(cpu);
 }
 
 /*
@@ -7964,16 +7936,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	if (sched_smt_active()) {
+	if (sched_smt_active())
 		has_idle_core = test_idle_cores(target);
 
-		if (!has_idle_core && cpus_share_cache(prev, target)) {
-			i = select_idle_smt(p, sd, prev);
-			if ((unsigned int)i < nr_cpumask_bits)
-				return i;
-		}
-	}
-
 	i = select_idle_cpu(p, sd, has_idle_core, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
---
With this applied, I see identical performance between NO_ASYM and ASYM+SMT.
I'm not suggesting applying this, but it seems to be the reason why ASYM+SMT
performs better in my case.
-Andrea
^ permalink raw reply related [flat|nested] 42+ messages in thread
end of thread, other threads:[~2026-04-03 20:45 UTC | newest]
Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:46 ` Andrea Righi
2026-03-27 10:44 ` K Prateek Nayak
2026-03-27 10:58 ` Andrea Righi
2026-03-27 11:14 ` K Prateek Nayak
2026-03-27 16:39 ` Andrea Righi
2026-03-30 10:17 ` K Prateek Nayak
2026-03-30 13:07 ` Vincent Guittot
2026-03-30 13:22 ` Andrea Righi
2026-03-30 13:46 ` Andrea Righi
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
2026-03-27 8:09 ` Vincent Guittot
2026-03-27 9:45 ` Andrea Righi
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
2026-03-27 8:45 ` Vincent Guittot
2026-03-27 9:44 ` Andrea Righi
2026-03-27 11:34 ` K Prateek Nayak
2026-03-27 20:36 ` Andrea Righi
2026-03-27 22:45 ` Andrea Righi
2026-03-30 17:29 ` Andrea Righi
2026-03-27 13:44 ` Shrikanth Hegde
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
2026-03-27 6:52 ` Andrea Righi
2026-03-27 16:31 ` Shrikanth Hegde
2026-03-27 17:08 ` Andrea Righi
2026-03-28 6:51 ` Shrikanth Hegde
2026-03-28 13:03 ` Balbir Singh
2026-03-28 22:50 ` Andrea Righi
2026-03-29 21:36 ` Balbir Singh
2026-03-30 22:30 ` Dietmar Eggemann
2026-03-31 9:04 ` Andrea Righi
2026-04-01 11:57 ` Dietmar Eggemann
2026-04-01 12:08 ` Vincent Guittot
2026-04-01 12:42 ` Andrea Righi
2026-04-01 13:12 ` Andrea Righi
2026-04-03 11:47 ` Dietmar Eggemann
2026-04-03 14:45 ` Andrea Righi
2026-04-03 20:44 ` Andrea Righi
2026-04-03 11:47 ` Dietmar Eggemann